Gem: Gaussian Mixture Model Embeddings for Numerical Feature Distributions

Hafiz Tayyab Rauf, Alex Bogatu, Norman W. Paton, Andre Freitas

Research output: Chapter in Book/Conference proceeding > Conference contribution > peer-review

Abstract

Embeddings are now used to underpin a wide variety of data management tasks, including entity resolution, dataset search and semantic type detection. Such applications often involve datasets with numerical columns, but there has been more emphasis placed on the semantics of categorical data in embeddings than on the distinctive features of numerical data. In this paper, we propose a method called Gem (Gaussian mixture model embeddings) that creates embeddings that build on numerical value distributions from columns. The proposed method specializes a Gaussian Mixture Model (GMM) to identify and cluster columns with similar value distributions. We introduce a signature mechanism that generates a probability matrix for each column, indicating its likelihood of belonging to specific Gaussian components, which can be used for different applications, such as to determine semantic types. Finally, we generate embeddings for three numerical data properties: distributional, statistical and contextual. Our core method focuses on numerical columns without using table metadata for context. However, the method can be combined with other types of evidence, and we integrate attribute names with the Gaussian embeddings to evaluate the method’s contribution to improving overall performance. We compare Gem with several baseline methods for numeric only and numeric + context tasks, showing that Gem consistently outperforms the baselines on five benchmark datasets.
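The signature idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain one-dimensional Gaussian mixture fitted by EM over the pooled values of all columns, and it assumes that a column's signature is the average of its values' component membership probabilities. The function names (fit_gmm, column_signature, cosine), the quantile-based initialisation and the synthetic columns are hypothetical and chosen only for illustration.

```python
import math
import random


def responsibilities(x, weights, means, variances):
    """Posterior probability of each Gaussian component for one value (log-space for stability)."""
    logs = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def fit_gmm(values, k, iterations=50):
    """Fit a 1-D Gaussian mixture with k components using plain EM (illustrative only)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: deterministic initialisation at evenly spaced quantiles.
    means = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(ordered) / n
    spread = sum((x - overall_mean) ** 2 for x in ordered) / n
    variances = [max(spread, 1e-6)] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: membership probabilities of every value.
        resp = [responsibilities(x, weights, means, variances) for x in ordered]
        # M-step: re-estimate each component, with small floors to avoid degenerate values.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            means[j] = sum(r[j] * x for r, x in zip(resp, ordered)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, ordered)) / nj
            variances[j] = max(var, 1e-6)
            weights[j] = max(nj / n, 1e-12)
    return weights, means, variances


def column_signature(column, model):
    """Assumed signature: mean membership probability per component over the column's values."""
    weights, means, variances = model
    rows = [responsibilities(x, weights, means, variances) for x in column]
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(weights))]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Synthetic columns (hypothetical data): two age-like columns and one price-like column.
random.seed(0)
ages_a = [random.gauss(35, 8) for _ in range(200)]
ages_b = [random.gauss(37, 9) for _ in range(200)]
prices = [random.gauss(500, 50) for _ in range(200)]

model = fit_gmm(ages_a + ages_b + prices, k=2)
sig_a = column_signature(ages_a, model)
sig_b = column_signature(ages_b, model)
sig_p = column_signature(prices, model)

sim_same = cosine(sig_a, sig_b)   # columns with similar value distributions
sim_diff = cosine(sig_a, sig_p)   # columns with different value distributions
print(sim_same, sim_diff)
```

In this sketch, columns drawn from similar distributions end up with similar signatures, so a similarity measure over signatures can group them. How Gem actually builds its probability matrix and combines distributional, statistical and contextual evidence is described in the paper itself, not here.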
Original language: English
Title of host publication: 28th International Conference on Extending Database Technology
Publication status: Accepted/In press - 5 Feb 2025
