Correcting Ambiguous Base Labels in DNA Sequencing using Neural Networks and its Impact on DNA Barcoding Applications

dc.contributor.advisor: Kremer, Stefan
dc.contributor.author: Ma, Eddie
dc.date.accessioned: 2018-02-21T14:17:13Z
dc.date.available: 2018-02-21T14:17:13Z
dc.date.copyright: 2018-02
dc.date.created: 2018-01-10
dc.date.issued: 2018-02-21
dc.degree.department: School of Computer Science [en_US]
dc.degree.grantor: University of Guelph [en_US]
dc.degree.name: Doctor of Philosophy [en_US]
dc.degree.programme: Computer Science [en_US]
dc.description.abstract: Obtaining DNA sequences relies on software algorithms, sequencing technology, and human effort. Advances in algorithms and technology improve the rate at which sequences are obtained. In this thesis, an artificial neural network based method increases the number of bases obtained from Sanger sequencing by post-processing DNA sequences and replacing ambiguous N-labels with DNA base labels. The existing KB basecalling algorithm produces the initial sequence that is post-processed by the presented method. DNA Barcoding is a platform that depends on highly accurate sequences. In DNA Barcoding, species are identified by short reads of a standardized gene region (e.g. 600-700 bases for COI). Barcode of Life Datasystems (BOLD) is the largest repository and analytics platform serving the International Barcode of Life (iBOL) project. In this thesis, a novel machine-learned error correction system is developed, the System-3 N-label Editor (S3). S3 is developed and validated on DNA Barcoding data, using 850,000 ambiguous base labels across 160,000 sequences. S3 internally represents uncertainty to estimate error and commits an N-label replacement when the predicted error is sufficiently low. S3 maintains an observed error rate lower than 1%, while disambiguating 79% of N-labels in animal barcodes, 80% of N-labels in plant barcodes, and 58% of N-labels in non-protein-coding markers. S3 is tested for its impact in bioinformatics applications on 90,000 sequences from the Canadian National Parks Malaise Project. Bioinformatics analyses are run using the KB, S3, and BOLD sequences as three treatment groups. Three applications are used for the comparison: Barcode Gap, Species Identification and Discovery, and Tree Building. The barcode gap refers to the difference between the between-species and within-species distances; S3 improves this difference in two-thirds of species as compared with KB. For species identification/discovery, S3 did not improve over KB in resolving species. When phylogenetic trees are constructed using an overlapping region between KB, S3, and BOLD sequences, S3 trees are significantly more similar to BOLD trees. The successful validation of S3's N-label replacement performance, together with encouraging results in DNA Barcoding applications, points the way to future work that targets modern sequencing technologies and covers other error correction modes. [en_US]
dc.identifier.uri: http://hdl.handle.net/10214/12555
dc.language.iso: en [en_US]
dc.publisher: University of Guelph [en_US]
dc.rights: Attribution 2.5 Canada [*]
dc.rights.uri: http://creativecommons.org/licenses/by/2.5/ca/ [*]
dc.subject: DNA Sequencing [en_US]
dc.subject: DNA Barcoding [en_US]
dc.subject: Artificial Neural Networks [en_US]
dc.subject: Machine Learning [en_US]
dc.subject: Bioinformatics [en_US]
dc.subject: Computer science [en_US]
dc.title: Correcting Ambiguous Base Labels in DNA Sequencing using Neural Networks and its Impact on DNA Barcoding Applications [en_US]
dc.type: Thesis [en_US]

Files

Original bundle
Name: Ma_Eddie_201802_PhD.pdf
Size: 6.66 MB
Format: Adobe Portable Document Format
Description: PhD dissertation