A latent concept topic model for robust topic inference using word embeddings

Junichi Tsujii, Weihua Hu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Uncovering the thematic structure of SNS and blog posts is a crucial yet challenging task because of the severe data sparsity induced by the short length of the texts and their diverse vocabulary. This sparsity hinders effective topic inference with traditional LDA, which infers topics from document-level word co-occurrence. To robustly infer topics in such contexts, we propose a latent concept topic model (LCTM). Unlike LDA, LCTM reveals topics via the co-occurrence of latent concepts, which we introduce as latent variables to capture the conceptual similarity of words. More specifically, LCTM models each topic as a distribution over latent concepts, where each latent concept is a localized Gaussian distribution over the word embedding space. Since the number of unique concepts in a corpus is often much smaller than the number of unique words, LCTM is less susceptible to data sparsity. Experiments on the 20 Newsgroups dataset show the effectiveness of LCTM in dealing with short texts, as well as the model's ability to handle held-out documents with a high proportion of out-of-vocabulary (OOV) words.
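The generative story described in the abstract can be illustrated with a short sketch: a document draws topic proportions, each word position draws a topic, the topic draws a latent concept, and the concept, modeled as a localized Gaussian in embedding space, draws a word in proportion to its embedding's density under that Gaussian. The following is a minimal, hypothetical Python sketch, assuming a shared isotropic concept covariance; all dimensions, hyperparameters, and variable names are illustrative rather than taken from the paper, and inference (e.g. Gibbs sampling) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

K, C, V, d = 5, 20, 1000, 50        # topics, concepts, vocab size, embedding dim
alpha, beta = 0.1, 0.1              # Dirichlet hyperparameters (illustrative)
sigma2 = 0.5                        # shared isotropic concept variance (assumption)

embeddings = rng.normal(size=(V, d))        # stand-in for pretrained word embeddings
concept_means = rng.normal(size=(C, d))     # each concept: Gaussian in embedding space
topic_concept = rng.dirichlet(beta * np.ones(C), size=K)  # topics over concepts

def generate_document(n_words, theta):
    """Draw words for one document given its topic proportions theta."""
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                 # topic assignment
        c = rng.choice(C, p=topic_concept[z])      # latent concept from the topic
        # Word probability is proportional to the Gaussian density of its
        # embedding under concept c (isotropic covariance sigma2 * I).
        diff = embeddings - concept_means[c]
        logp = -0.5 * np.sum(diff ** 2, axis=1) / sigma2
        p = np.exp(logp - logp.max())              # stabilized softmax over words
        words.append(int(rng.choice(V, p=p / p.sum())))
    return words

theta = rng.dirichlet(alpha * np.ones(K))   # per-document topic proportions
print(generate_document(10, theta))
```

Note how, in this sketch, words with nearby embeddings receive similar probability under a concept, which is the mechanism the abstract credits for robustness to sparsity and OOV words: an unseen word can still be scored through its embedding.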
Original language: English
Title of host publication: Proceedings of ACL 2016
Pages: 380-386
Number of pages: 7
Publication status: Published - Jun 2016
