Model-based clustering of high-dimensional binary data

Title: Model-based clustering of high-dimensional binary data
Author: Tang, Yang
Department: Department of Mathematics and Statistics
Program: Mathematics and Statistics
Advisor: McNicholas, Paul D.; Browne, Ryan P.
Abstract: We present a mixture of latent trait models with common slope parameters (MCLT) for high-dimensional binary data, a data type for which few established methods exist. Recent work on clustering binary data, based on a d-dimensional Gaussian latent variable, is extended by implementing common factor analyzers. We extend the model further by incorporating random block effects: the dependencies within each block are taken into account through block-specific parameters that are treated as random variables. A variational approximation to the likelihood is exploited to derive a fast algorithm for estimating the model parameters. The Bayesian information criterion is used to select the number of components, the covariance structure, and the dimension of the latent variables. Our approach is demonstrated on U.S. Congressional voting data and on a data set describing the sensory properties of orange juice. These examples show that the model performs well even when the number of observations is not very large relative to the data dimensionality, and in both cases the approach yields intuitive clustering results. Additionally, our dimensionality-reduction method allows the data to be displayed in low-dimensional plots.
Date: 2013-08

Files in this item

File: Tang_Yang_201308_Msc.pdf
Size: 3.894Mb
Format: PDF