Large-scale clustering of antigen receptor gene sequence data using hyper-dimensional point packing

Date

2018-08-20

Authors

Chang, Haiyang

Publisher

University of Guelph

Abstract

Lymphocytes generate abundant antigen receptor (AR) genes to recognize an almost infinite number of epitopes. One challenge is to group AR sequences based on the recognition of a common epitope. Traditional clustering methods are based on hierarchical clustering, which comes at a significant computational cost due to pairwise genetic distance comparisons. In this thesis, a point packing strategy was applied to incrementally break the data down into subsets, which limits pairwise sequence comparison to the final cluster level. Sub-setting was achieved by iteratively picking maximally spaced anchor sequences from a dataset and assigning each remaining sequence to its closest anchor. This results in an inverted tree with anchor sequences as nodes and an anchor distance that decreases with each successive layer. In addition, new sequences can be added to a clustered dataset by comparing them with existing anchor nodes, which positions them quickly and substantially reduces the computational burden.
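As a rough illustration of the sub-setting step described in the abstract, the sketch below picks well-spaced anchors and assigns every sequence to its closest anchor. It is not the thesis implementation: the abstract does not state which distance measure or selection rule was used, so Hamming distance on equal-length sequences and greedy farthest-first selection are assumptions made here, and the function names (`hamming`, `pick_anchors`, `closest_anchor`, `assign`) are hypothetical.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Count mismatched positions (assumed distance; needs equal lengths)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pick_anchors(seqs: List[str], k: int) -> List[str]:
    """Greedy farthest-first choice of up to k well-spaced anchors.

    This is an assumed stand-in for "maximally spaced" anchor picking.
    """
    anchors = [seqs[0]]  # arbitrary starting anchor
    # Distance from each sequence to its nearest anchor chosen so far.
    nearest = [hamming(s, anchors[0]) for s in seqs]
    while len(anchors) < k:
        idx = max(range(len(seqs)), key=nearest.__getitem__)
        if nearest[idx] == 0:
            break  # every sequence already coincides with an anchor
        anchors.append(seqs[idx])
        nearest = [min(d, hamming(s, seqs[idx])) for d, s in zip(nearest, seqs)]
    return anchors


def closest_anchor(seq: str, anchors: List[str]) -> str:
    """Return the anchor nearest to seq (ties go to the earlier anchor)."""
    return min(anchors, key=lambda a: hamming(seq, a))


def assign(seqs: List[str], anchors: List[str]) -> Dict[str, List[str]]:
    """Group every sequence under its closest anchor."""
    subsets: Dict[str, List[str]] = {a: [] for a in anchors}
    for s in seqs:
        subsets[closest_anchor(s, anchors)].append(s)
    return subsets
```

Applying `pick_anchors` and `assign` again inside each subset would give the next layer of the tree. Placing a new sequence only needs `closest_anchor` at each layer, so it is compared against anchors rather than against every sequence in the dataset.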

Keywords

clustering, antigen receptor, anchor, bioinformatics
