Dimensionality Reduction in the Creation of Classifiers and the Effects of Correlation, Cluster Overlap, and Modelling Assumptions.

Petrcich, William
Journal Title
Journal ISSN
Volume Title
University of Guelph

Discriminant analysis and random forests are used to create models for classification. The number of variables to be tested for inclusion in a model can be large. The goal of this work was to create an efficient and effective selection program. The first method used was based on the work of others. The resulting models were underperforming, so another approach was adopted. Models were built by adding the variable that maximized new-model accuracy. The two programs were used to generate discriminant-analysis and random forest models for three data sets. An existing software package was also used. The second program outperformed the alternatives. For the small number of runs produced in this study, it outperformed the method that inspired this work. The data sets were studied to identify determinants of performance. No definite conclusions were reached, but the results suggest topics for future study.

food authentication, classification, variable selection, BIC, mclust, random forests, clustvarsel