Abstract
Entity resolution, which seeks to identify records that represent
the same entity, is an important step in many data integration and
data cleaning applications. However, entity resolution is challenging both
in terms of scalability (all-against-all comparisons are computationally
impractical) and result quality (syntactic evidence on record equivalence
is often equivocal). As a result, end-to-end entity resolution proposals
involve several stages, including blocking to efficiently identify candidate
duplicates, detailed comparison to refine the conclusions from blocking,
and clustering to identify the sets of records that may represent the
same entity. However, the quality of the result is often crucially dependent
on configuration parameters in all of these stages, for which it may
be difficult for a human expert to provide suitable values. This paper
describes an approach in which a complete entity resolution process is
optimized, on the basis of feedback (such as might be obtained from
crowds) on candidate duplicates. Given such feedback, an evolutionary
search of the space of configuration parameters is carried out, with a view
to maximizing the fitness of the resulting clusters. The approach is
pay-as-you-go in that more feedback can be expected to give rise to better
outcomes. An empirical evaluation shows that the co-optimization of the
different stages in entity resolution can yield significant improvements
over default parameters, even with small amounts of feedback.
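
The idea of co-optimizing all stages against feedback can be illustrated with a minimal sketch. This is not the paper's implementation: the toy pipeline (prefix blocking, string-similarity comparison, connected-component clustering), the two parameters `prefix_len` and `threshold`, the agreement-based fitness function, and the sample records and feedback are all assumptions made for illustration.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations

def resolve(records, prefix_len, threshold):
    """Toy pipeline (assumed): blocking -> pairwise comparison -> clustering."""
    # Blocking: only records sharing a lowercase prefix are compared.
    blocks = {}
    for i, rec in enumerate(records):
        blocks.setdefault(rec.lower()[:prefix_len], []).append(i)
    # Clustering: union-find over pairs that pass the similarity threshold.
    parent = list(range(len(records)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for members in blocks.values():
        for a, b in combinations(members, 2):
            sim = SequenceMatcher(None, records[a].lower(), records[b].lower()).ratio()
            if sim >= threshold:
                parent[find(a)] = find(b)
    return [find(i) for i in range(len(records))]

def fitness(labels, feedback):
    """Fraction of feedback pairs on which the clustering agrees (assumed measure)."""
    agree = sum((labels[a] == labels[b]) == is_dup for (a, b), is_dup in feedback.items())
    return agree / len(feedback)

def evolve(records, feedback, default, generations=20, pop_size=8, seed=0):
    """Simple elitist evolutionary search over (prefix_len, threshold)."""
    rng = random.Random(seed)
    def score(cfg):
        return fitness(resolve(records, *cfg), feedback)
    def mutate(cfg):
        k, t = cfg
        k = max(0, min(5, k + rng.choice([-1, 0, 1])))
        t = max(0.0, min(1.0, t + rng.uniform(-0.15, 0.15)))
        return (k, t)
    # The default configuration is always part of the initial population.
    population = [default] + [mutate(default) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        survivors = ranked[: pop_size // 2]  # elitism: the best configurations survive
        children = [mutate(rng.choice(survivors)) for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=score)

# Hypothetical records and feedback: (index_a, index_b) -> "is this pair a duplicate?"
records = ["Jon Smith", "John Smith", "J. Smith", "Mary Jones", "Marie Jones", "Mark Johnson"]
feedback = {(0, 1): True, (3, 4): True, (3, 5): False, (0, 3): False}
default = (3, 0.95)

best = evolve(records, feedback, default)
default_fitness = fitness(resolve(records, *default), feedback)
best_fitness = fitness(resolve(records, *best), feedback)
print("default:", default, default_fitness)
print("best:   ", best, best_fitness)
```

Because the search keeps the best configurations in every generation and starts from the default, the configuration it returns never scores worse than the default on the supplied feedback. Adding more labelled pairs to `feedback` gives the search more signal, which mirrors the pay-as-you-go behaviour described above.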
Original language | English |
---|---|
Title of host publication | Large-scale Data and Knowledge-Centered Systems |
Volume | 10120 |
DOIs | |
Publication status | Published - 16 Dec 2016 |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Publisher | Springer |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |