MixDir: Scalable Bayesian clustering for high-dimensional categorical data

Constantin Ahlmann-Eltze, Christopher Yau

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Multivariate analysis of high-dimensional datasets with multiple categorical variables (e.g. surveys, questionnaires) is a challenging task, but it can reveal patterns of responses that are masked in univariate analyses. In this paper we propose a novel variational inference algorithm to cluster high-dimensional categorical observations into latent classes. Variational inference is an approximate Bayesian inference method that combines fast optimization with the ability to propagate uncertainty to the clustering (soft clustering). The model is robust to misspecification of the number of latent classes and can infer a reasonable number from the data. We assess the performance on synthetic and real-world data and show that our algorithm performs similarly to the best of the other tested methods if the correct number of classes is known, and outperforms the other methods if the number of classes needs to be inferred. An R package implementing our algorithm is available at the Comprehensive R Archive Network.
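
To make the idea of soft clustering by variational inference concrete, the following is a minimal sketch in Python. It is not the authors' implementation and does not reflect the API of the R package. It assumes a mixture of categorical distributions with symmetric Dirichlet priors and a mean-field approximation. The names `fit_mixture` and `digamma`, the hyperparameters `alpha` and `beta`, the shared number of categories per variable, and all default values are illustrative assumptions.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))


def fit_mixture(X, n_cat, K, alpha=0.1, beta=0.1, n_iter=50, seed=0):
    """Mean-field variational inference for a Dirichlet-categorical mixture.

    X      : list of rows, each a list of integer codes in range(n_cat)
    K      : upper bound on the number of latent classes (hypothetical name)
    alpha  : assumed symmetric Dirichlet prior on the class weights
    beta   : assumed symmetric Dirichlet prior on the category probabilities
    Returns (R, a): responsibilities per row and Dirichlet parameters of q(pi).
    """
    rng = random.Random(seed)
    N, D = len(X), len(X[0])

    # Random soft initialisation of the responsibilities q(z_i).
    R = []
    for _ in range(N):
        w = [rng.random() + 1e-3 for _ in range(K)]
        s = sum(w)
        R.append([v / s for v in w])

    for _ in range(n_iter):
        # Update q(pi): prior pseudo-count plus expected class sizes.
        a = [alpha + sum(R[i][k] for i in range(N)) for k in range(K)]

        # Update q(theta): prior pseudo-count plus expected category counts.
        b = [[[beta] * n_cat for _ in range(D)] for _ in range(K)]
        for i in range(N):
            for k in range(K):
                for j in range(D):
                    b[k][j][X[i][j]] += R[i][k]

        # Expected log-parameters under the Dirichlet factors.
        a_sum = sum(a)
        e_log_pi = [digamma(a_k) - digamma(a_sum) for a_k in a]
        e_log_theta = [
            [[digamma(v) - digamma(sum(row)) for v in row] for row in b[k]]
            for k in range(K)
        ]

        # Update q(z_i): normalised exponentiated expected log joint.
        for i in range(N):
            logp = [
                e_log_pi[k] + sum(e_log_theta[k][j][X[i][j]] for j in range(D))
                for k in range(K)
            ]
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            s = sum(w)
            R[i] = [v / s for v in w]

    return R, a
```

Calling `fit_mixture(X, n_cat=2, K=4)` on a list of integer-coded rows returns one responsibility vector per observation (the soft clustering) together with the Dirichlet parameters of the class weights. In this sketch, a small `alpha` tends to leave surplus classes nearly empty, which loosely illustrates how such a model can tolerate an overspecified number of classes.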

Original language: English
Title of host publication: Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018
Editors: Tina Eliassi-Rad, Wei Wang, Ciro Cattuto, Foster Provost, Rayid Ghani, Francesco Bonchi
Publisher: IEEE
Pages: 526-539
Number of pages: 14
ISBN (Electronic): 9781538650905
Publication status: Published - 2019
Event: 5th IEEE International Conference on Data Science and Advanced Analytics - Turin, Italy
Duration: 1 Oct 2018 to 4 Oct 2018

Publication series

Name: Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018

Conference

Conference: 5th IEEE International Conference on Data Science and Advanced Analytics
Abbreviated title: DSAA 2018
Country/Territory: Italy
City: Turin
Period: 1/10/18 to 4/10/18

Keywords

  • Bayesian
  • Categorical variables
  • Clustering
  • High-dimensional
  • Variational inference
