A new variance-covariance structure-based statistical pattern recognition system for solving the sequence-set proximity problem under the homology-free assumption
Our goal in this dissertation is to solve the sequence-set proximity problem (SSPP) under the homology-free assumption. The SSPP is the problem of measuring the closeness between any two sets of biosequences; under the homology-free assumption, there is no prior knowledge of homology within each sequence-set or between sequence-sets (homology is the property of two or more sequences sharing a common ancestor). The SSPP is a generalization of the sequence proximity problem. Sets of biosequences are the subject matter of several applications, such as unsupervised classification (clustering), supervised classification, and observing the changes that may take place over time in a biological phenomenon. Measuring the distance between two sets of sequences requires a distance measure defined at the sequence-set level. Discriminating between sets of sequences based on their biological variation with such a measure is a complicated task, and the question is: can we observe the biological variation in sets of sequences and employ it to discriminate between them? In this dissertation we propose distance measures and metrics, defined in terms of the variance-covariance structure, that capture the biological variation and offer an alternative way to solve the sequence-set proximity problem under the homology-free assumption. This assumption prevents the effective use of all existing homology-based measures, whether alignment-based or alignment-free. The variance-covariance structure has several appealing properties that make it useful in describing biological variation. The proposed measures and metrics require no matrix inversion, so they can work with singular variance-covariance matrices, and they are computationally simpler than the existing variance-covariance-based distance measures (or models).
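As a rough illustration of what an inversion-free, variance-covariance-based distance between two sets can look like, the sketch below compares two sets of real-valued feature vectors by the Frobenius norm of the difference between their sample covariance matrices. This is an assumed stand-in chosen for simplicity, not the measure proposed in the dissertation, and the names `covariance_matrix` and `cov_frobenius_distance` are hypothetical. Because nothing is inverted, it still works when a covariance matrix is singular.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample variance-covariance matrix of a set of p-dimensional vectors.

    rows: list of equal-length lists of floats; needs at least 2 rows.
    """
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]

def cov_frobenius_distance(set_a, set_b):
    """Illustrative (assumed) set-level distance: Frobenius norm of the
    difference between the two covariance matrices. No matrix inverse is used."""
    ca = covariance_matrix(set_a)
    cb = covariance_matrix(set_b)
    return sqrt(sum((x - y) ** 2
                    for row_a, row_b in zip(ca, cb)
                    for x, y in zip(row_a, row_b)))

# Both sets below have singular covariance matrices (2 samples, 2 features).
set_a = [[1.0, 2.0], [3.0, 6.0]]
set_b = [[0.0, 0.0], [2.0, 0.0]]
distance = cov_frobenius_distance(set_a, set_b)
```

A Mahalanobis-style distance would fail on these inputs because neither covariance matrix can be inverted, whereas the difference-based form above is always defined.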
We build a number of algorithms upon the proposed distance measures and metrics to perform a variety of tasks at the sequence-set level, such as searching, unsupervised classification, supervised classification, variability detection, and visualization, all without relying on any assumption of homology. The datasets under investigation are sequence-based. Thus, feature extraction algorithms (i.e., adapters) are needed to map biosequences from the data space (D) to the p-dimensional feature space (F), so that the proposed algorithms can work with real-valued patterns. In this context, we propose (i) a new mechanism for the n-grams technique to extract features from these sets of biosequences, which can map the changes in these sequences over time (i.e., sequence time, where a sequence is treated as a one-dimensional time series) to the feature space, and (ii) a new approach for defining a feature vector, in which the features are the distances between successive recurrences of a set of words of length n. We perform a number of experiments using real datasets, and our algorithms show robustness in performing the required tasks.
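The two kinds of features mentioned above can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the dissertation's actual mechanism: it assumes a DNA alphabet, counts overlapping occurrences, and summarizes the recurrence distances of each word by their mean. The function names are hypothetical.

```python
from itertools import product

def ngram_frequencies(seq, n, alphabet="ACGT"):
    """Relative frequency of every word of length n over the alphabet,
    in lexicographic order of the alphabet string (overlapping windows)."""
    words = ["".join(w) for w in product(alphabet, repeat=n)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if word in counts:  # skip windows containing unknown symbols
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in words]

def recurrence_gaps(seq, word):
    """Distances between successive (possibly overlapping) occurrences of word."""
    positions = [i for i in range(len(seq) - len(word) + 1)
                 if seq.startswith(word, i)]
    return [b - a for a, b in zip(positions, positions[1:])]

def mean_recurrence_features(seq, n, alphabet="ACGT"):
    """One feature per word of length n: the mean recurrence gap,
    or 0.0 when the word occurs fewer than twice (an assumed convention)."""
    features = []
    for w in product(alphabet, repeat=n):
        gaps = recurrence_gaps(seq, "".join(w))
        features.append(sum(gaps) / len(gaps) if gaps else 0.0)
    return features

example = "ACGACGTACG"
gaps_acg = recurrence_gaps(example, "ACG")          # "ACG" starts at 0, 3, 7
freq_2grams = ngram_frequencies(example, 2)         # 16 features for n = 2
recur_1grams = mean_recurrence_features("AAAA", 1)  # only "A" recurs
```

Either function maps a sequence of any length to a fixed-length real-valued vector (4^n entries for DNA), which is the form that set-level algorithms working on real-valued patterns need.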