Using Anchor Clustering to Identify Associations between Codon Bias and Gene Attributes within the Human Genome

Stoodley, Matthew Alexander
University of Guelph

Codon bias describes the tendency to use certain synonymous codons to encode amino acids. It is well established that codon bias varies between organisms and plays a role in gene expression and co-translational folding. Understanding codon bias is important because a better understanding of gene expression and translation mechanics may allow for more efficient recombinant protein production, and could ultimately improve the ability to create synthetic genes. Human genes were investigated to elucidate the connection between their codon bias and its impact on structure, function, and tissue-specific expression levels. Analysis was performed by representing human genes according to their codon bias, then clustering together genes with similar codon bias. Gene clusters were studied to see whether genes that use similar codons are statistically more likely to share other properties. Clustering was performed using a novel data-driven approach to a simple clustering algorithm called anchor clustering. Anchor clustering was used because it is fast and deterministic, two qualities that other approaches can struggle with when clustering data in high-dimensional spaces. To study the connection between gene product structure and codon bias, clusters were analysed according to their likelihood of containing intrinsically disordered proteins. Because structure and function are so closely related, clusters were also analysed for GO term overrepresentation. Finally, clusters were examined through the lens of tissue-specific gene expression by incorporating expression information at the mRNA and protein levels. The analyses revealed an association between codon usage and the propensity of a gene product to be intrinsically disordered, while the functional analyses revealed that codon bias is associated with cell cycle regulation and cell type differentiation.
Expression analysis revealed that in humans there may be a codon bias associated with highly expressed genes irrespective of tissue, as well as tissue-specific codon biases in the cortex, testis, and liver. Some of the tissue-specific findings have been reported by other groups, but this investigation distinguishes between an organism-wide codon bias associated with high expression and particular codon biases associated with high expression in individual tissues. In addition, this work builds on the current knowledge of codon bias by determining whether findings previously evaluated only at the mRNA level also appear at the protein concentration level. The results suggest that codon harmonization can be improved further by seeking to replicate the codon bias of the tissue in which a gene could be highly expressed.
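
As an illustration of what "representing human genes according to their codon bias" can look like in practice, the following minimal Python sketch converts a coding sequence into a 64-element vector of relative synonymous codon frequencies (each codon's count divided by the total count of all codons encoding the same amino acid). This encoding, the function name codon_usage_vector, and the toy sequence are assumptions made for illustration only; the abstract does not state the exact representation used in the thesis.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
CODONS = sorted(CODON_TABLE)  # fixed alphabetical ordering of the 64 codons


def codon_usage_vector(cds):
    """Return relative synonymous codon frequencies for one coding sequence.

    Hypothetical helper: each entry is count(codon) divided by the count of
    all codons that encode the same amino acid; 0.0 if that amino acid
    (or stop signal) never occurs in the sequence.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Total occurrences per amino acid, summed over its synonymous codons.
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n

    return [
        counts[c] / totals[CODON_TABLE[c]] if totals[CODON_TABLE[c]] else 0.0
        for c in CODONS
    ]


# Toy example (not real data): three leucine codons, two CTG and one CTA.
vec = codon_usage_vector("CTGCTGCTA")
ctg_share = vec[CODONS.index("CTG")]
cta_share = vec[CODONS.index("CTA")]
```

Vectors of this kind place every gene in the same fixed-dimensional space, so that genes with similar synonymous codon preferences lie close together and can be grouped by a distance-based clustering method.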

Clustering, Codon bias, Tissue specific expression, Gene function, Elongation rate, Co-translational folding, Enrichment, Unsupervised Learning, Evolutionary computation, Genetic algorithm, Human gene analysis, Data driven, Bioinformatics, Anchor clustering, Codon harmonization, Human genome, Packing problems, Algorithm initialization, Protein folding, Protein structure, Coding sequence, Gene expression, Gene translation, Ribosomal elongation, Disordered proteins, Intrinsically unstructured proteins, Codon usage, mRNA expression, Protein concentration, Codon, tRNA, tRNA concentration, Amino acid, Genetic code
Stoodley, M., Ashlock, D., & Graether, S. (2018). Data driven point packing for fast clustering. In 2018 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) (pp. 1-8). IEEE. DOI: 10.1109/CIBCB.2018.8404974