Bayesian Clustering Approaches for Discrete Data

Silva, H. Anjali
University of Guelph

Unsupervised classification, or clustering, categorizes data without using a priori knowledge of the labels of the observations. The research contained in this thesis focuses on the machine learning of discrete-valued gene expression datasets using clustering, with the aim of identifying gene co-expression networks. Specifically, a number of topics surrounding the use of mixture models and Markov chain Monte Carlo (MCMC) methods in clustering discrete data from high-throughput transcriptome sequencing technologies are presented. After outlining current challenges and gaps in research with respect to clustering approaches, three mixture model-based clustering methods are presented: mixtures of multivariate Poisson-log normal distributions, mixtures of multivariate Poisson-log normal factor analyzers, and mixtures of matrix-variate Poisson-log normal distributions. The significance, innovation, and limitations of this research, along with a number of future directions stemming from it, are discussed.

Clustering, RNA sequencing, Discrete data, Multivariate Poisson-Log Normal distribution, Markov chain Monte Carlo, Factor analyzers, Matrix variate distribution, Co-expression network
Silva, A., Rothstein, S. J., McNicholas, P. D., and Subedi, S. (2017). A Multivariate Poisson-Log Normal Mixture Model for Clustering Transcriptome Sequencing Data. arXiv preprint arXiv:1711.11190.