Distributed Document and Phrase Co-embeddings for Descriptive Clustering

Motoki Sato, Austin Brockmeier, Georgios Kontonatsios, Tingting Mu, John Goulermas, Junichi Tsujii, Sophia Ananiadou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Descriptive document clustering aims to automatically discover groups of semantically related documents and to assign a meaningful label to characterise the content of each cluster. In this paper, we present a descriptive clustering approach that employs a distributed representation model, namely the paragraph vector model, to capture semantic similarities between documents and phrases. The proposed method uses a joint representation of phrases and documents (i.e., a co-embedding) to automatically select a descriptive phrase that best represents each document cluster. We evaluate our method by comparing its performance to an existing state-of-the-art descriptive clustering method that also uses co-embedding but relies on a bag-of-words representation. Results obtained on benchmark datasets demonstrate that the paragraph vector-based method obtains superior performance over the existing approach in both identifying clusters and assigning appropriate descriptive labels to them.
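
The idea of selecting a cluster label from a shared document and phrase space can be illustrated with a minimal sketch. It assumes that document and phrase vectors have already been learned in one space (for example by a paragraph vector model) and uses toy two-dimensional vectors instead. The k-means step, the nearest-phrase-to-centroid rule, the function names, and the example phrases are all illustrative assumptions, not the procedure reported in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(doc_vecs, k, iterations=20):
    """Toy k-means using cosine similarity (assumed clustering step).

    Centres are initialised from the first k documents so that the
    result is deterministic; a real system would initialise more carefully.
    """
    centers = [list(v) for v in doc_vecs[:k]]
    assignments = [0] * len(doc_vecs)
    for _ in range(iterations):
        # Assign each document to its most similar centre.
        assignments = [
            max(range(k), key=lambda c: cosine(v, centers[c])) for v in doc_vecs
        ]
        # Recompute each centre; keep the old one if a cluster is empty.
        for c in range(k):
            members = [v for v, a in zip(doc_vecs, assignments) if a == c]
            if members:
                centers[c] = centroid(members)
    return assignments, centers


def label_clusters(centers, phrase_vecs):
    """Pick, for each centre, the phrase whose vector is most similar to it."""
    return [
        max(phrase_vecs, key=lambda p: cosine(phrase_vecs[p], center))
        for center in centers
    ]


# Hypothetical co-embedded vectors: documents and phrases share one space.
doc_vecs = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.8]]
phrase_vecs = {
    "gene expression": [1.0, 0.0],
    "clinical trial": [0.0, 1.0],
    "data set": [0.7, 0.7],
}

assignments, centers = kmeans(doc_vecs, k=2)
labels = label_clusters(centers, phrase_vecs)
for cluster_id, label in enumerate(labels):
    print(cluster_id, label)
```

Because phrases and documents live in the same space, labelling reduces to a nearest-neighbour lookup around each cluster centre; a generic phrase such as "data set" lies between the clusters and is therefore not chosen for either one in this toy setting.
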
Original language: English
Title of host publication: Proceedings of EACL 2017
Pages: 991-1001
Number of pages: 11
Publication status: Published - Jan 2017
Event: European Chapter of the Association for Computational Linguistics - Valencia Conference Center, Valencia, Spain
Duration: 3 Apr 2017 to 7 Apr 2017
Conference number: 15
http://eacl2017.org/

Conference

Conference: European Chapter of the Association for Computational Linguistics
Abbreviated title: EACL
Country/Territory: Spain
City: Valencia
Period: 3/04/17 to 7/04/17

Keywords

  • descriptive clustering
  • co-embeddings
  • text mining
  • systematic reviews
