Evolving Controllably Difficult Datasets for Clustering

Cameron Shand, Richard Allmendinger, Julia Handl, Andrew Webb, John Keane

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Synthetic datasets play an important role in evaluating clustering algorithms, as they can help shed light on consistent biases, strengths, and weaknesses of particular techniques, thereby supporting sound conclusions. Despite this, there is a surprisingly small set of established clustering benchmark data, and many of these are currently handcrafted. Even then, their difficulty is typically not quantified or considered, limiting the ability to interpret algorithmic performance on these datasets. Here, we introduce HAWKS, a new data generator that uses an evolutionary algorithm to evolve the cluster structure of a synthetic dataset. We demonstrate how such an approach can be used to produce datasets of a pre-specified difficulty and to trade off different aspects of problem difficulty, and how these interventions directly translate into changes in the clustering performance of established algorithms.
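
The abstract does not specify HAWKS's representation, variation operators, or fitness function, so the following is only a minimal sketch of the general idea of evolving cluster structure toward a pre-specified difficulty. Everything in it is an assumption made for illustration and not the authors' method: the function names (`sample_clusters`, `silhouette_width`, `evolve_centers`), the use of mean silhouette width as a difficulty proxy, the 2-D Gaussian clusters, and the simple (1+1) evolution strategy that mutates only the cluster centers.

```python
import math
import random


def sample_clusters(centers, spread, n_per_cluster, rng):
    """Draw 2-D Gaussian points around each center; return (points, labels)."""
    points, labels = [], []
    for k, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(k)
    return points, labels


def silhouette_width(points, labels):
    """Mean silhouette width in [-1, 1]; needs at least two clusters."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        sums = dict.fromkeys(clusters, 0.0)
        counts = dict.fromkeys(clusters, 0)
        for j, q in enumerate(points):
            if i != j:
                sums[labels[j]] += math.dist(p, q)
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            continue  # a singleton cluster contributes 0 by convention
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(sums[k] / counts[k] for k in clusters if k != own)  # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def evolve_centers(target, n_clusters=3, n_per_cluster=15, spread=1.0,
                   generations=100, step=0.5, seed=0):
    """(1+1) evolution strategy: mutate centers to approach a target silhouette.

    Hypothetical illustration only; returns (best_centers, error_history).
    """
    rng = random.Random(seed)

    def error(centers):
        # Re-seed the sampler so each genome is scored on the same noise draws.
        noise_rng = random.Random(seed + 1)
        points, labels = sample_clusters(centers, spread, n_per_cluster, noise_rng)
        return abs(silhouette_width(points, labels) - target)

    parent = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(n_clusters)]
    parent_err = error(parent)
    history = [parent_err]
    for _ in range(generations):
        # Gaussian mutation of every center coordinate.
        child = [(x + rng.gauss(0, step), y + rng.gauss(0, step)) for x, y in parent]
        child_err = error(child)
        if child_err <= parent_err:  # keep the child only if it is no worse
            parent, parent_err = child, child_err
        history.append(parent_err)
    return parent, history
```

Calling `evolve_centers(target=0.3)` asks for heavily overlapping clusters, while `evolve_centers(target=0.9)` asks for well-separated ones; in this sketch the target silhouette width plays the role of the pre-specified difficulty, and the returned history records how far the best candidate remains from that target after each generation.
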
Original language: English
Title of host publication: Proceedings of the Annual Conference on Genetic and Evolutionary Computation (GECCO '19)
Publication status: Published - 13 Jul 2019
Event: The Genetic and Evolutionary Computation Conference - Prague, Czech Republic
Duration: 13 Jul 2019 to 17 Jul 2019

Conference

Conference: The Genetic and Evolutionary Computation Conference
Abbreviated title: GECCO 2019
Country/Territory: Czech Republic
City: Prague
Period: 13/07/19 to 17/07/19
