Development Strategies for Parallelized Genetic Sequence Data Preparation and Application of Artificial Neural Networks for Sequencing Error Correction

Authors
Cicanovski, Ivan
Publisher
University of Guelph
Abstract

Core component algorithms for a complete sequencing error correction pipeline for multiplexed read sequences were developed. The algorithms were evaluated on semi-synthetic and real-world datasets using performance measures and non-parametric statistical analyses. A synthetic sequencing error generation technique was developed and applied to simulate read sequences from reference sequences. For de-multiplexing, a pairwise alignment search based on the striped Smith-Waterman algorithm, parallelized for multi-core CPUs and accounting for orientation, was developed and applied to infer the identities of read sequences. Algorithms for filtering and trimming sequences were developed for data preparation. Sequencing error correction was approached as a multi-class classification problem using several artificial neural network models, and techniques for learning from imbalanced data were investigated. A procedure of gapped segment isolation and gap placeholder character embedding was developed for the correction of deletion errors. Trained long short-term memory recurrent neural networks demonstrated sequencing error correction.
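
The de-multiplexing step described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: it uses a plain dynamic-programming Smith-Waterman score rather than the striped (SIMD) variant, the scoring values (match 2, mismatch -1, linear gap -2) are assumptions, and the function names (`sw_score`, `assign_read`, `demultiplex`) and the barcode dictionary are hypothetical. Each read is scored against every barcode in both orientations, and the best-scoring barcode is taken as the inferred identity.

```python
from functools import partial

# Complement table for DNA bases; N stands for an unknown base.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def sw_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (plain Smith-Waterman, linear gap penalty).

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def assign_read(read, barcodes):
    """Return (barcode_name, orientation, score) for the best-scoring barcode.

    Orientation is "+" for the read as given and "-" for its reverse complement.
    """
    best = (None, None, 0)
    for name, barcode in barcodes.items():
        for orientation, seq in (("+", read), ("-", reverse_complement(read))):
            score = sw_score(barcode, seq)
            if score > best[2]:
                best = (name, orientation, score)
    return best


def demultiplex(reads, barcodes, map_fn=map):
    """Assign every read to a barcode; map_fn may be a parallel map."""
    return list(map_fn(partial(assign_read, barcodes=barcodes), reads))
```

Because reads are scored independently, the work can be spread across CPU cores by passing a parallel map, for example `map_fn=pool.map` from a `concurrent.futures.ProcessPoolExecutor`, while the default built-in `map` keeps the sketch serial.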

Keywords
machine learning, neural networks, parallel computing, sequence analysis, string-searching algorithms