Deep Clustering for Data Cleaning and Integration

Hafiz Tayyab Rauf, Andre Freitas, Norman W. Paton

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the potential of DC for data management tasks remains unexplored. In this paper, we address this gap by investigating the suitability of DC for data cleaning and integration tasks, specifically schema inference, entity resolution and domain discovery, from the perspective of tables, rows and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. Experiments also show consistently strong performance compared with state-of-the-art bespoke algorithms for each of the data integration tasks.
Original language: English
Title of host publication: Proceedings 27th International Conference on Extending Database Technology (EDBT 2024)
Publisher: OpenProceedings
ISBN (Print): 2367-2005
Publication status: Accepted/In press - 8 Dec 2023
