Abstract
Word grounding tasks aim to associate individual words with corresponding elements in visual scenes, enabling machines to link language with perception for effective human–machine interaction. However, existing grounding models struggle to generalize to synonyms or unseen lexical variants, limiting their performance in open-domain scenarios. In this paper, we present a Bayesian multimodal grounding model that incorporates word embeddings as priors within a probabilistic generative process to improve robustness under lexical variation. We compare the effects of static FastText and contextual BERT embeddings on grounding accuracy by conditioning word–visual associations on their semantic representations. Experiments use CLEVR-generated 3D scenes paired with structured compositional descriptions to test the grounding of object categories, colors, and spatial relations across lexical shifts. Results show that contextual embeddings such as BERT consistently outperform static embeddings like FastText in overall grounding accuracy and in resolving spatial relations. We demonstrate that integrating structured probabilistic inference with rich semantic embeddings offers a principled and scalable solution for robust, interpretable word grounding.
| Original language | English |
|---|---|
| Title of host publication | 22nd Pacific Rim International Conference on Artificial Intelligence (PRICAI 2025) |
| Publication status | Published - 23 Feb 2026 |
Keywords
- Multimodal Word Grounding
- Semantic Embeddings
- Bayesian inference
- Synonym Substitution
- CLEVR Dataset
Fingerprint
Dive into the research topics of 'Evaluating Semantic Representations in Multimodal Word Grounding'. Together they form a unique fingerprint.