Evaluating Semantic Representations in Multimodal Word Grounding

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

Word grounding tasks aim to associate individual words with corresponding elements in visual scenes, enabling machines to link language with perception for effective human–machine interaction. However, existing grounding models struggle to generalize to synonyms or unseen lexical variants, limiting their performance in open-domain scenarios. In this paper, we present a Bayesian multimodal grounding model that incorporates word embeddings as priors within a probabilistic generative process to improve robustness under lexical variation. We compare the effects of static FastText and contextual BERT embeddings on grounding accuracy by conditioning word–visual associations on their semantic representations. Experiments use CLEVR-generated 3D scenes paired with structured compositional descriptions to test the grounding of object categories, colors, and spatial relations across lexical shifts. Results show that contextual embeddings such as BERT consistently outperform static embeddings like FastText in overall grounding accuracy and in resolving spatial relations. We demonstrate that integrating structured probabilistic inference with rich semantic embeddings offers a principled and scalable solution for robust, interpretable word grounding.
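
As a rough illustration of how an embedding-based prior can help with an unseen synonym, the toy sketch below combines a similarity-derived prior with a visual likelihood using Bayes' rule. It is not the model from the paper: the 2-D vectors, the concept labels, the softmax temperature, the likelihood values, and the function names `embedding_prior` and `posterior` are all assumptions made up for this example.

```python
import math

# Hypothetical 2-D "embeddings" (assumption); real FastText/BERT vectors
# have hundreds of dimensions and would come from a trained model.
EMBEDDINGS = {
    "sphere": (1.0, 0.1),
    "cube": (0.1, 1.0),
    "ball": (0.9, 0.2),  # treated as an unseen synonym of "sphere"
}

# Concepts the seen words are imagined to have been grounded to (assumption).
SEEN_WORD_TO_CONCEPT = {"sphere": "SPHERE", "cube": "CUBE"}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def embedding_prior(word, temperature=0.1):
    """Prior P(concept | word): softmax over cosine similarity to seen words."""
    scores = {
        concept: math.exp(cosine(EMBEDDINGS[word], EMBEDDINGS[seen]) / temperature)
        for seen, concept in SEEN_WORD_TO_CONCEPT.items()
    }
    total = sum(scores.values())
    return {concept: s / total for concept, s in scores.items()}


def posterior(word, likelihood):
    """Bayes' rule: P(concept | word, scene) is proportional to
    P(scene | concept) * P(concept | word)."""
    prior = embedding_prior(word)
    unnorm = {c: prior[c] * likelihood[c] for c in prior}
    z = sum(unnorm.values())
    return {c: p / z for c, p in unnorm.items()}


# Hypothetical visual likelihood: the object looks slightly more like a cube.
visual_likelihood = {"SPHERE": 0.4, "CUBE": 0.6}
post = posterior("ball", visual_likelihood)
best = max(post, key=post.get)
print(best, post)
```

In this toy setting the prior for "ball" leans strongly toward the concept of its nearest seen neighbour, so an ambiguous visual signal does not override it. A lower temperature makes the prior sharper; a higher one lets the visual evidence dominate.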
Original language: English
Title of host publication: 22nd Pacific Rim International Conference on Artificial Intelligence (PRICAI 2025)
Publication status: Published - 23 Feb 2026

Keywords

  • Multimodal Word Grounding
  • Semantic Embeddings
  • Bayesian inference
  • Synonym Substitution
  • CLEVR Dataset
