Generalized linear regression model with LASSO, group LASSO, and sparse group LASSO regularization methods for finding bacteria associated with colorectal cancer using microbiome data
With ever-increasing advances in microbiome sequencing technologies, the need for efficient statistical modelling of these systems has become apparent. Most microbiome data are highly sparse, which creates problems for many conventional statistical analysis methods. For example, in the study of Nakatsu et al. (2015), 16S ribosomal RNA sequencing data were collected from the colon tissue of healthy subjects, subjects with carcinoma, and subjects with adenoma. One wishes to identify bacteria that are associated with these three health states. An ordinary binomial or multinomial regression model would fail to give a meaningful analysis because of the large number of taxa and the sparsity of the taxonomic counts. In this thesis, we attempt to solve these problems by applying LASSO, group LASSO, and sparse group LASSO regularization to the multinomial and binomial regression models. Raw-read microbiome sequencing data from the study of Nakatsu et al. (2015) are obtained from the Sequence Read Archive of NCBI. The software "mothur" is used to preprocess the sequences and cluster them into Operational Taxonomic Units (OTUs), and OTU counts are obtained for each taxon. We find that, in general, similar bacteria are selected for the healthy and adenoma phenotypes, and different bacteria are selected for the carcinoma phenotype. Proteobacteria are more often selected under the healthy (normal) phenotype, whereas Fusobacterium are more often selected under the carcinoma phenotype. The bacteria selected for the adenoma phenotype generally resemble those selected for the other two phenotypes, but with different coefficients.
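To make the relationship between the three regularization methods concrete, the sketch below evaluates one common parameterization of the sparse group LASSO penalty, lam * [(1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2 + alpha * ||beta||_1], which reduces to the LASSO at alpha = 1 and to the group LASSO at alpha = 0. The function name, the sqrt(p_g) group weights, and the example grouping of coefficients are illustrative assumptions; the exact weighting and grouping used in the thesis may differ.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Evaluate lam * [(1 - alpha) * group term + alpha * L1 term].

    beta   : list of regression coefficients (e.g. one per OTU)
    groups : list of index lists partitioning beta (hypothetical grouping,
             e.g. OTUs sharing a higher taxonomic rank)
    lam    : overall regularization strength (>= 0)
    alpha  : mixing weight in [0, 1]; 1 gives LASSO, 0 gives group LASSO
    """
    # L1 term: sum of absolute coefficient values (drives individual sparsity).
    l1_term = sum(abs(b) for b in beta)

    # Group term: Euclidean norm of each group, weighted by sqrt(group size)
    # (an assumed, commonly used weighting; drives whole groups to zero).
    group_term = 0.0
    for idx in groups:
        group_norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        group_term += math.sqrt(len(idx)) * group_norm

    return lam * ((1.0 - alpha) * group_term + alpha * l1_term)


# Illustrative coefficients for five hypothetical OTUs in three groups.
beta = [3.0, 4.0, 0.0, 0.0, -2.0]
groups = [[0, 1], [2, 3], [4]]

lasso_value = sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=1.0)
group_lasso_value = sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.0)
mixed_value = sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.5)
```

In a penalized binomial or multinomial regression, a term of this form is added to the negative log-likelihood, so larger values of lam shrink more coefficients (or whole groups of coefficients) exactly to zero.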