A new form of nested association pattern for data mining and class discrimination
To ease interpretation by exposing the complex internal associations among the multiple values of a pattern, we present a new form of high-order (multi-value) pattern, the Nested High-Order Pattern (NHOP). The proposed pattern has a nested granular structure that highlights a hierarchical and iterative evaluation of association. A multi-value association pattern, as defined here, also generalizes the sequential pattern commonly used in data mining, since it represents a set of associated values extracted from the sampling outcomes of a random N-tuple, and these values need not be contiguous. Furthermore, because it is a pattern of values drawn from multiple variables, it is more descriptive than the corresponding variable pattern. A pattern is detected by statistical testing when its occurrence deviates significantly from that expected under a prior model or null hypothesis. Although NHOP is important in its own right for understanding the association structure of a dataset, we further extend it to serve as a discrimination pattern for classification. The rationale is that a pattern which is both a meaningful association and predictive of class can reinforce the detection of the underlying regularity and hence deepen understanding of the data domain. We propose a Classification method based on Nested High-Order Patterns (referred to as C-NHOP). The relevance of NHOP for pattern discovery is evaluated using synthetic data, machine learning benchmark datasets, and real-life biomolecular and organismal biological datasets. The first evaluation compares two closely related types of pattern: the general high-order pattern and the proposed NHOP. The second evaluation uses data from a biomolecule family known as the SH3 domain, a model mediator of protein-protein interaction.
We propose two algorithms, the r-Tree and Best-k algorithms, to extract a set of patterns using the maximization criterion of NHOP. The goal is to identify the relationship between the primary structure and the three-dimensional structure of the molecule. The relevance of C-NHOP is also evaluated using 26 machine learning benchmark datasets. Experiments show that C-NHOP is very competitive in classification tasks. Finally, the proposed NHOP-based classifier is applied to an important problem in biotechnology: differentiating transgenic and conventional pig lines by the chemical composition of their tissues. Because these data can be used to evaluate how genetic manipulation relates to physiochemical consequences, a reliable evaluation is extremely important to this newly developed biological technology. Overall, our evaluation indicates that the proposed pattern is significant and highly useful for data mining and class discrimination tasks.
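
The statistical detection step described above can be illustrated with a minimal sketch. The abstract does not specify the test statistic or the null model, so the following assumes, purely for illustration, a null hypothesis of independence among the variables and a standardized residual compared against a normal threshold of 1.96. The function name `standardized_residual` and the data layout (a list of tuples, with a pattern given as a mapping from variable index to value) are hypothetical and not taken from the original work.

```python
import math


def standardized_residual(records, pattern):
    """Compare the observed count of a multi-value pattern with its
    expected count under an ASSUMED independence null model.

    records: list of equal-length tuples (one sampled N-tuple per record)
    pattern: dict mapping variable index -> value; the indices need not
             be contiguous
    Returns (observed, expected, z).
    """
    n = len(records)
    # Observed: records in which every value of the pattern occurs together.
    observed = sum(
        all(rec[i] == v for i, v in pattern.items()) for rec in records
    )
    # Expected under independence: n times the product of marginal frequencies.
    expected = float(n)
    for i, v in pattern.items():
        expected *= sum(rec[i] == v for rec in records) / n
    if expected == 0.0:
        return observed, expected, 0.0
    # Standardized residual (illustrative choice of statistic).
    z = (observed - expected) / math.sqrt(expected)
    return observed, expected, z


# Toy data: the values "a" and "x" co-occur more often than chance predicts.
records = (
    [("a", "x")] * 40 + [("a", "y")] * 10
    + [("b", "x")] * 10 + [("b", "y")] * 40
)
observed, expected, z = standardized_residual(records, {0: "a", 1: "x"})
significant = abs(z) > 1.96  # assumed two-sided 5% normal threshold
print(observed, expected, z, significant)  # 40 25.0 3.0 True
```

In this toy case the pattern occurs 40 times against 25 expected, so it would be flagged as a significant association under the assumed model. The nested evaluation that distinguishes NHOP from a general high-order pattern is not shown here.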