Bayesian Clustering Approaches for Discrete Data
Unsupervised classification, or clustering, categorizes data without using a priori knowledge of the observations' labels. The research in this thesis focuses on clustering as a machine learning approach to discrete-valued gene expression datasets, with the aim of identifying gene co-expression networks. Specifically, a number of topics surrounding the use of mixture models and Markov chain Monte Carlo (MCMC) methods for clustering discrete data from high-throughput transcriptome sequencing technologies are presented. After current challenges and research gaps in clustering approaches are outlined, three mixture model-based clustering methods are presented: mixtures of multivariate Poisson-log normal distributions, mixtures of multivariate Poisson-log normal factor analyzers, and mixtures of matrix-variate Poisson-log normal distributions. The significance, innovation, and limitations of this research, along with a number of future directions stemming from it, are discussed.