Development Strategies for Parallelized Genetic Sequence Data Preparation and Application of Artificial Neural Networks for Sequencing Error Correction
A set of core component algorithms for a complete sequencing error correction pipeline for multiplexed read sequences was developed. The algorithms were evaluated on semi-synthetic and real-world data sets using performance measures and non-parametric statistical analyses. A synthetic sequencing error generation technique was developed and applied to simulate read sequences from reference sequences. For de-multiplexing, a pairwise alignment search based on the striped Smith-Waterman algorithm was developed; it was parallelized for multi-core CPUs, accounted for orientation, and was applied to infer the identities of read sequences. Algorithms for filtering and trimming sequences were developed for data preparation. Sequencing error correction was approached as a multi-class classification problem using several artificial neural network models, and techniques for learning from imbalanced data were investigated. A procedure of gapped segment isolation and gap placeholder character embedding was developed for the correction of deletion errors. Trained long short-term memory recurrent neural networks were shown to correct sequencing errors.
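As an illustration of what simulating read sequences from reference sequences can look like, the following is a minimal sketch and not the technique developed in this work. It assumes a uniform, position-independent error model over the alphabet ACGT; the function name `simulate_read`, the parameter names, and the default error rates are hypothetical and chosen only for the example.

```python
import random

BASES = "ACGT"  # assumed nucleotide alphabet


def simulate_read(reference, sub_rate=0.01, ins_rate=0.005, del_rate=0.005, rng=None):
    """Return a copy of `reference` with random substitution, insertion,
    and deletion errors (hypothetical uniform error model)."""
    if rng is None:
        rng = random.Random()
    read = []
    for base in reference:
        # Insertion: emit one random extra base before the current position.
        if rng.random() < ins_rate:
            read.append(rng.choice(BASES))
        roll = rng.random()
        if roll < del_rate:
            # Deletion: drop the reference base entirely.
            continue
        if roll < del_rate + sub_rate:
            # Substitution: replace the base with a different one.
            read.append(rng.choice([b for b in BASES if b != base]))
        else:
            # No error: copy the reference base unchanged.
            read.append(base)
    return "".join(read)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a run is reproducible
    reference = "ACGTTGCAACGTTGCAACGT"
    print(reference)
    print(simulate_read(reference, sub_rate=0.1, ins_rate=0.05, del_rate=0.05, rng=rng))
```

Because the sketch pairs each simulated read with its known reference, reads produced this way could serve as labelled examples for a per-position classifier; real sequencing platforms have position- and context-dependent error profiles that this uniform model does not capture.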