Automating tools for scanning, characterizing and analyzing large DNA sequence data sets
Progress in DNA sequencing techniques has facilitated high-throughput research by providing greater speed and accuracy at a lower cost. The sheer volume of data generated requires new methods for managing and analyzing large DNA sequence data sets. This thesis focuses on the development of two computer programs that help meet this need for improved analytical tools: 'SeqCleanR' (Chapter 2) and 'Diagnostica' (Chapter 3). 'SeqCleanR' uses profile hidden Markov models to help manage large data sets and to provide an additional level of error detection and characterization prior to sequence analysis. 'Diagnostica' employs a newly developed method for locating single and compound DNA sequence characters that are diagnostic of the data sets provided.
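To make the notion of a single diagnostic character concrete, the following minimal Python sketch may help. It is not the method implemented in 'Diagnostica'. It assumes one common definition: an alignment column is diagnostic for a group if every sequence in that group shares one nucleotide that no sequence outside the group carries at that position. The function name `single_diagnostic_sites`, the group labels, and the toy alignment are all hypothetical.

```python
from typing import Dict, List


def single_diagnostic_sites(groups: Dict[str, List[str]], target: str) -> Dict[int, str]:
    """Return {column index: nucleotide} for columns that are diagnostic of `target`.

    Assumed definition (not taken from the thesis): a column is diagnostic when
    all target sequences share one nucleotide and no other sequence has it there.
    Sequences are assumed to be aligned and of equal length.
    """
    target_seqs = groups[target]
    other_seqs = [s for name, seqs in groups.items() if name != target for s in seqs]
    diagnostic: Dict[int, str] = {}
    for col in range(len(target_seqs[0])):
        target_states = {s[col] for s in target_seqs}
        if len(target_states) != 1:
            continue  # the target group is not fixed at this column
        state = next(iter(target_states))
        if state in "-N":
            continue  # gaps and ambiguous calls are not treated as characters
        if all(s[col] != state for s in other_seqs):
            diagnostic[col] = state
    return diagnostic


# Hypothetical toy alignment with two groups of two sequences each.
alignment = {
    "species_A": ["ACGTAC", "ACGTAC"],
    "species_B": ["ACGAAC", "ACTAAC"],
}
result = single_diagnostic_sites(alignment, "species_A")
print(result)  # {3: 'T'}
```

Under the same assumed definition, a compound character would be a combination of columns whose joint states separate the group even though no single column does; the sketch does not attempt that search.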