The identification of nuclear mitochondrial pseudogenes using motif statistics
The analysis and classification of biological sequences is a large field with many applications. Sequence classification is difficult because of the many-to-one relationship between sequence composition and protein structure and function. Pseudogenes are non-functional genes that arise through point mutations and translocation events. Nuclear mitochondrial pseudogenes (NUMTs), or fragments, are a type of pseudogene that results from failed integrations of mitochondrial genes into the nucleus. NUMTs are characterized by premature stop codons and mutations that compromise the final protein structure, rendering them functionally inactive. DNA barcoding is a recent initiative in which organisms are identified and classified based on a particular gene. Cytochrome oxidase subunit I (COI) is a common choice because of its universality and sequence conservation. COI is known to have NUMT copies in various species. NUMTs interfere with DNA barcoding because of their similarity to their mitochondrial parents. NUMT contamination can result in incorrect species classification and spurious species relationships. In this study, we demonstrate the effectiveness of Motif Probability Profiling, which uses the normal distribution of motif frequencies to identify NUMTs. We compare and contrast our method with two others: Profile Hidden Markov Model and Context Comparison Analysis. We also lay the groundwork for establishing and further exploring shared motif patterns in NUMTs.
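As a rough illustration of what scoring motif frequencies against a normal model can look like, the sketch below builds a per-motif mean and standard deviation from a set of reference sequences and reports how far a query sequence deviates from them on average. It is not the implementation used in this study: the function names (`motif_frequencies`, `build_profile`, `mean_abs_z`), the use of overlapping k-mers as motifs, the motif length, and the mean absolute z-score as the summary statistic are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def motif_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer (assumed motif) in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {motif: n / total for motif, n in counts.items()}


def build_profile(references, k):
    """Per-motif (mean, standard deviation) of frequency across references."""
    freqs = [motif_frequencies(s, k) for s in references]
    motifs = set().union(*freqs)
    profile = {}
    for m in motifs:
        values = [f.get(m, 0.0) for f in freqs]
        profile[m] = (mean(values), pstdev(values))
    return profile


def mean_abs_z(seq, profile, k):
    """Average absolute z-score of seq's motif frequencies against the profile.

    Motifs whose reference frequency never varies (sd == 0) are skipped,
    and motifs absent from the profile are ignored (a simplification).
    """
    freqs = motif_frequencies(seq, k)
    zs = [
        abs(freqs.get(m, 0.0) - mu) / sd
        for m, (mu, sd) in profile.items()
        if sd > 0
    ]
    return sum(zs) / len(zs) if zs else 0.0


if __name__ == "__main__":
    # Toy sequences, invented for the example; not real COI data.
    references = ["ATGATGATGA", "ATGATGATGC", "ATGATGCTGA"]
    profile = build_profile(references, k=3)
    print(mean_abs_z("ATGATGATGA", profile, k=3))  # reference-like query
    print(mean_abs_z("TTTTTTTTTT", profile, k=3))  # divergent query
```

In this toy setup a query whose motif composition departs from the references receives a larger score than a reference-like one. Any threshold for flagging a candidate NUMT would have to be calibrated on real data.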