Deep Learning Architectures for Complex Data Fusion and Integration

  • Hafiz Tayyab Rauf

Student thesis: PhD

Abstract

Data integration is a fundamental challenge in data management, especially given the rapidly growing repositories of large-scale, heterogeneous, and semantically diverse tabular data. Core data integration tasks such as schema inference, entity resolution, and domain discovery often require clustering methods that go beyond surface-level similarity and can exploit structural, semantic, and statistical signals. This thesis investigates the applicability of deep clustering (DC) as an unsupervised, representation-driven paradigm for data integration, in which clustering and representation learning are jointly optimized. We first systematically evaluate state-of-the-art DC methods on tabular data integration tasks. Through this experimentation, we identify their key limitations in representing dense numerical features, including reliance on static similarity metrics, difficulty handling overlapping semantic domains, and barriers to scaling to large integration settings. In response, we propose TableDC, a deep clustering method designed specifically for data management tasks. TableDC integrates the Mahalanobis distance to model inter-feature dependencies, a heavy-tailed Cauchy similarity kernel to manage overlapping cluster boundaries, and a BIRCH-based cluster initialization method to enhance the model's stability and convergence in high-dimensional settings. Experimental results on multiple real-world datasets demonstrate that TableDC improves clustering quality, especially in dense, ambiguous, and overlapping scenarios. TableDC builds on and refines embeddings that represent tabular data and metadata. However, existing embedding models often perform less well on numerical data than on textual data. We therefore propose Gem, a Gaussian Mixture Model (GMM)-based embedding framework tailored to numerical features. Gem combines compact signatures of the statistical distributions of numeric columns with the semantic discriminability of numerical data. Combined with structural signals such as headers, the resulting embeddings improve clustering performance in domain discovery and semantic type annotation. Overall, the contributions of this thesis validate deep clustering as a scalable, domain-adaptive paradigm for data integration and demonstrate how task-specific design of similarity metrics and embedding mechanisms can improve performance on real-world data management tasks.
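As a rough illustration of two of the ingredients named in the abstract, the sketch below computes soft cluster assignments from a squared Mahalanobis distance passed through a Cauchy-style kernel. It is not the TableDC implementation. The function names, the scale parameter `gamma`, the kernel form 1 / (1 + d² / γ²), and the synthetic data are all assumptions made for illustration. Cluster centres are taken from the known synthetic groups in place of the BIRCH-based initialization.

```python
import numpy as np


def mahalanobis_sq(X, centers, cov_inv):
    """Squared Mahalanobis distance from every row of X to every centre.

    Returns an array of shape (n_samples, n_clusters).
    """
    diff = X[:, None, :] - centers[None, :, :]  # shape (n, k, d)
    return np.einsum("nkd,de,nke->nk", diff, cov_inv, diff)


def cauchy_soft_assign(X, centers, cov_inv, gamma=1.0):
    """Soft assignments from a heavy-tailed Cauchy-style kernel.

    The kernel form and `gamma` are illustrative assumptions,
    not taken from the thesis.
    """
    d2 = mahalanobis_sq(X, centers, cov_inv)
    sim = 1.0 / (1.0 + d2 / gamma**2)
    # Normalise each row so the assignments sum to 1.
    return sim / sim.sum(axis=1, keepdims=True)


# Synthetic stand-in for learned embeddings: two groups in 3 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(4.0, 1.0, (50, 3))])
labels = np.repeat([0, 1], 50)

# Centres from the known groups (a placeholder for a real initialiser).
centers = np.array([X[labels == c].mean(axis=0) for c in (0, 1)])

# Within-cluster covariance, lightly regularised before inversion.
resid = X - centers[labels]
cov_inv = np.linalg.inv(np.cov(resid, rowvar=False) + 1e-6 * np.eye(3))

Q = cauchy_soft_assign(X, centers, cov_inv)
print(Q[:3].round(3))
```

Because the Cauchy kernel decays polynomially and not exponentially, points far from a centre keep a non-negligible assignment weight, which is one way to tolerate overlapping cluster boundaries.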
Date of Award: 4 Dec 2025
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: Andre Freitas (Co-Supervisor) & Norman Paton (Main Supervisor)

Keywords

  • Data Integration
  • Schema Inference
  • Entity Resolution
  • Domain Discovery
  • Deep Clustering
  • Embeddings
