Biostatistics: Bayesian Models for High-Dimensional Categorical and Mixed Domain Data

There is increasing interest in the analysis of mixed domain data. For example, data may consist not only of vectors of categorical, count, and continuous variables but also of document text, images, and curves of various types. Traditional methods of joint modeling based on Gaussian latent factors or random effects fall short in such settings. Alternatives have been proposed using nonparametric Bayes methods that place Dirichlet process priors on a joint random effects distribution. Such models induce dependence by defining a discrete mixture model for each type of data, with the mixture component allocation (cluster) for a subject being identical across all domains. Unfortunately, such global clustering is restrictive and can introduce an overabundance of clusters. We propose a new class of factor partition models that instead allow separate but dependent clustering in each domain through a class of simplex factor models for efficient nonparametric modeling of high-dimensional unordered categorical data. We describe properties of these models and develop a highly efficient MCMC algorithm for posterior computation that scales well with increasing dimension. The methods are illustrated through applications in a variety of settings, including modeling dependence in high-dimensional categorical data (contingency tables), using high-dimensional categorical data (e.g., gene sequences) to predict a response, and prediction from other high-dimensional “objects” (curves, text, images, etc.) while allowing higher-order interactions.

Event Information

Date & Time(s)

Wednesday, May 11, 2011, 4:00 PM

Speaker(s)

David Dunson
Department of Statistical Science
Duke University

Contact Information

Address

307 East 63rd Street, 3rd Floor Conference Room