Cluster-Level Contrastive Learning for Emotion Recognition in Conversations

Research output: Contribution to journal › Article › peer-review


A key challenge for Emotion Recognition in Conversations (ERC) is distinguishing semantically similar emotions. Some works use Supervised Contrastive Learning (SCL), which takes categorical emotion labels as supervision signals and contrasts samples in a high-dimensional semantic space. However, categorical labels fail to provide quantitative information about the relationships between emotions. ERC also does not depend equally on all embedded features in the semantic space, which makes high-dimensional SCL inefficient. To address these issues, we propose a novel low-dimensional Supervised Cluster-level Contrastive Learning (SCCL) method, which first reduces the high-dimensional SCL space to a three-dimensional affect representation space, Valence-Arousal-Dominance (VAD), and then performs cluster-level contrastive learning to incorporate measurable emotion prototypes. To help model the dialogue and enrich the context, we leverage pre-trained knowledge adapters to infuse linguistic and factual knowledge. Experiments show that our method achieves new state-of-the-art results, with 69.81% on IEMOCAP, 65.7% on MELD, and 62.51% on DailyDialog. The analysis also shows that the VAD space is not only suitable for ERC but also interpretable, with VAD prototypes enhancing its performance and stabilising the training of SCCL. In addition, the pre-trained knowledge adapters benefit the performance of the utterance encoder and SCCL. Our code is available at:
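As a rough illustration of the idea described in the abstract, the sketch below projects utterance embeddings into a three-dimensional VAD space, averages them into one cluster centre per emotion label, and contrasts each centre against a set of emotion prototypes. It is one plausible reading, not the authors' implementation: the function names, the `tanh` projection, the squared-distance similarity, the temperature, and the prototype coordinates are all assumptions made for this example, and the prototype values are placeholders rather than figures from the paper or from any VAD lexicon.

```python
import numpy as np


def project_to_vad(hidden, weight):
    """Map (n, d) utterance embeddings to (n, 3) VAD points.

    Assumption: a single linear layer followed by tanh, so every
    coordinate lies in [-1, 1].
    """
    return np.tanh(hidden @ weight)


def cluster_centers(vad, labels):
    """Average the VAD points of each emotion label present in the batch."""
    class_ids = np.unique(labels)
    centers = np.stack([vad[labels == c].mean(axis=0) for c in class_ids])
    return centers, class_ids


def cluster_contrastive_loss(centers, class_ids, prototypes, temperature=0.5):
    """InfoNCE-style loss computed on cluster centres instead of samples.

    Each centre is pulled towards the prototype of its own emotion and
    pushed away from the other prototypes. Similarity is the negative
    squared Euclidean distance (an assumption for this sketch).
    """
    diff = centers[:, None, :] - prototypes[None, :, :]      # (k, C, 3)
    logits = -(diff ** 2).sum(axis=-1) / temperature          # (k, C)
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Pick the log-probability of each centre's own prototype.
    return float(-log_prob[np.arange(len(class_ids)), class_ids].mean())


# Placeholder prototypes for three hypothetical emotion classes (illustrative only).
prototypes = np.array([
    [0.8, 0.5, 0.4],
    [-0.6, -0.3, -0.3],
    [-0.5, 0.6, 0.3],
])

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 16))        # stand-in for encoder outputs
weight = rng.normal(size=(16, 3)) * 0.1  # stand-in for a learned projection
labels = np.array([0, 0, 1, 1, 2, 2])    # integer emotion labels

vad = project_to_vad(hidden, weight)
centers, class_ids = cluster_centers(vad, labels)
loss = cluster_contrastive_loss(centers, class_ids, prototypes)
print(f"cluster-level contrastive loss: {loss:.4f}")
```

Working on one centre per emotion rather than on every pair of samples keeps the contrast in a small, fixed-size space, which is the efficiency argument the abstract makes for the low-dimensional formulation.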
Original language: English
Number of pages: 12
Journal: IEEE Transactions on Affective Computing
Publication status: E-pub ahead of print - 8 Feb 2023


  • Adaptation models
  • Cluster-Level Contrastive Learning
  • Emotion Recognition in Conversations
  • Emotion recognition
  • Linguistics
  • Pre-trained Knowledge Adapters
  • Prototypes
  • Semantics
  • Task analysis
  • Training
  • Valence-Arousal-Dominance

