Concept Representation Learning For Multimedia Information Retrieval

  • Ubai Sandouk

Student thesis: PhD


Modern multimedia applications aspire to achieve human-like performance in tasks such as retrieval, comparison and recommendation of media, collectively known as Multimedia Information Retrieval (MIR) tasks. MIR tasks require processing at the level of concepts rather than media features. Methods for concept acquisition and processing are dominated by ad-hoc remedies and unorganized efforts, ranging from naive and costly manual indexing to the sophisticated and coordinated analysis of multi-sourced information, and from simple supervised models to complex statistical ones. In this thesis, such efforts are organized using an overarching terminology and a fresh scheme that allows the benefits and limitations of different approaches to be studied systematically, without analysing the overwhelming number of individual models. Accordingly, a checklist of human-like cognitive tasks is developed in order to compare different approaches to concept learning. Among these tasks, descriptive, context-aware concept representation is central to MIR and is yet to be properly modelled. The absence of such a representation makes it difficult to perform real-world tasks such as semantically relating media instances, managing uncertain or unfamiliar contexts, and accommodating never-seen-before concepts. To address these limitations, the learning of contextualized semantics is formulated as a distance-based Concept Embedding (CE) representation. This learning is undertaken using a Siamese Architecture of Deep Neural Networks operating on labels and tags appearing in coherent contexts. Unsupervised training of this model leads to a Concept Embedding space in which the similarities between media-related concepts are reflected by the distances between their representations. The properties of this model are demonstrated using three image and three music datasets.
Qualitatively, the model is evaluated by visual demonstrations of the resultant embedding space, highlighting its emergent organization as well as its effective treatment of Out-of-Vocabulary (OOV) terms. Quantitatively, the model is evaluated via Semantic Priming in multiple settings, highlighting superior performance in domain-specific concept modelling, transferability across datasets, tolerance to uncertainty in context, and accommodation of OOV terms. This good performance comes at a cost in model complexity and training time. Alleviating the limitations of this model is an ongoing endeavour, part of which is attempted in this work. For example, output compression and the combination of multiple datasets are shown to partially alleviate some limitations. The presented results illustrate the advantages of the proposed CE model in capturing domain-specific semantics, as demonstrated in two further MIR applications: a) Tag Completion: the model recommends useful media tags based on those already used; and b) Multi-Label Zero-Shot Learning: the model annotates media content with labels even when no examples of such labels were observed during training. Intuitively, performing such applications requires high-level understanding of media-related concepts, i.e., bridging the semantic gap. Accordingly, the demonstrated performance highlights advantages towards the overall goal of media understanding.
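The abstract describes a Siamese architecture whose embedding distances reflect concept similarity but does not state the training objective. As a rough illustration only, the sketch below uses a standard contrastive loss, which is an assumption and not necessarily the thesis's formulation. The linear `embed` function, the weight matrix `W`, the margin, and the toy tag vectors are all hypothetical; the `similar` flag is assumed to come from whether two tags co-occur in a coherent context.

```python
import math

def embed(weights, features):
    """Shared linear embedding: the same weights serve both Siamese branches."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def euclidean(a, b):
    """Euclidean distance between two embedded concepts."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(distance, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart.

    This loss is an assumed stand-in, not the objective stated in the thesis.
    """
    if similar:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2

# Hypothetical weights mapping 3-D tag features to a 2-D embedding (made-up values).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
guitar = [1.0, 0.0, 0.5]
rock = [1.0, 0.0, 0.0]
piano = [0.0, 1.0, 0.0]

# Both inputs of each pair pass through the same embedding function.
d_similar = euclidean(embed(W, guitar), embed(W, rock))
d_dissimilar = euclidean(embed(W, guitar), embed(W, piano))

loss_similar = contrastive_loss(d_similar, similar=True)
loss_dissimilar = contrastive_loss(d_dissimilar, similar=False)
```

In a trained model of this kind, a pair judged similar contributes loss that grows with its distance, while a dissimilar pair contributes loss only when it falls inside the margin. A real system would replace the linear map with deep networks and minimise the loss over many pairs.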
Date of Award: 31 Dec 2017
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: Xiaojun Zeng (Supervisor) & Ke Chen (Supervisor)


  • Multi-Label Zero-Shot Learning
  • Zero-Shot Learning
  • Semantic Priming
  • Contextualized Concepts
  • Concept Embedding Learning
  • Multimedia Information Retrieval
  • Concept Representation
