Correcting Ambiguous Base Labels in DNA Sequencing using Neural Networks and its Impact on DNA Barcoding Applications
Obtaining DNA sequences relies on software algorithms, sequencing technology, and human effort. Advances in algorithms and technology improve the rate at which sequences are obtained. In this thesis, an artificial neural network based method increases the number of bases obtained from Sanger sequencing by post-processing DNA sequences and replacing ambiguous N-labels with DNA base labels. The existing KB basecalling algorithm produces the initial sequence, which is then post-processed by the presented method. DNA Barcoding is a platform that depends on highly accurate sequences. In DNA Barcoding, species are identified by short reads of a standardized gene region (e.g. 600-700 bases for COI). Barcode of Life Datasystems (BOLD) is the largest repository and analytics platform serving the International Barcode of Life (iBOL) project. In this thesis, a novel machine-learned error correction system is developed: the System-3 N-label Editor (S3). S3 is developed and validated on DNA Barcoding data, using 850,000 ambiguous base labels across 160,000 sequences. S3 internally represents uncertainty to estimate error, and commits an N-label replacement only when the predicted error is sufficiently low. S3 maintains an observed error rate lower than 1% while disambiguating 79% of N-labels in animal barcodes, 80% of N-labels in plant barcodes, and 58% of N-labels in non-protein-coding markers. S3 is tested for its impact on bioinformatics applications using 90,000 sequences from the Canadian National Parks Malaise Project. Bioinformatics analyses are run using the KB, S3, and BOLD sequences as three treatment groups. Three applications are used for the comparison: Barcode Gap, Species Identification and Discovery, and Tree Building. The barcode gap refers to the difference between the between-species and within-species distances; S3 improves this difference in two-thirds of species as compared with KB. For species identification and discovery, S3 does not improve over KB in resolving species.
When phylogenetic trees are constructed using a region that overlaps between the KB, S3, and BOLD sequences, S3 trees are significantly more similar to BOLD trees than KB trees are. The success of S3 in the validation of its N-label replacement performance, together with the encouraging results in DNA Barcoding applications, points the way to future work that targets modern sequencing technologies and covers other error correction modes.
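
The rule of committing an N-label replacement only when predicted error is sufficiently low can be illustrated with a minimal sketch. This is not the S3 implementation: it assumes, purely for illustration, that a model emits a probability for each of the four bases at every N position and that predicted error is one minus the largest probability. The names `resolve_n` and `edit_sequence` and the threshold `max_error = 0.01` are hypothetical.

```python
from typing import Dict

BASES = "ACGT"


def resolve_n(probs: Dict[str, float], max_error: float = 0.01) -> str:
    """Return a base label if the predicted error is low enough, else keep 'N'.

    Assumption: `probs` maps each base to a model probability, and the
    predicted error of the best call is 1 - max probability.
    """
    best = max(BASES, key=lambda b: probs[b])
    predicted_error = 1.0 - probs[best]
    return best if predicted_error <= max_error else "N"


def edit_sequence(
    seq: str,
    predictions: Dict[int, Dict[str, float]],
    max_error: float = 0.01,
) -> str:
    """Replace N-labels in `seq` for which a confident prediction exists.

    `predictions` maps a 0-based position to per-base probabilities
    (hypothetical model output). Non-N positions are never altered.
    """
    out = list(seq)
    for i, label in enumerate(seq):
        if label == "N" and i in predictions:
            out[i] = resolve_n(predictions[i], max_error)
    return "".join(out)


# Position 2 is predicted confidently; position 5 is not, so it stays 'N'.
example = edit_sequence(
    "ACNGTN",
    {
        2: {"A": 0.995, "C": 0.002, "G": 0.002, "T": 0.001},
        5: {"A": 0.60, "C": 0.30, "G": 0.05, "T": 0.05},
    },
)
print(example)
```

Under these assumptions, a stricter `max_error` leaves more N-labels in place and commits fewer wrong bases, while a looser one does the reverse.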