Evolving Controllably Difficult Datasets for Clustering

Cameron Shand, Richard Allmendinger, Julia Handl, Andrew Webb, John Keane

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review



Synthetic datasets play an important role in evaluating clustering algorithms, as they can help shed light on the consistent biases, strengths, and weaknesses of particular techniques, thereby supporting sound conclusions. Despite this, there is a surprisingly small set of established clustering benchmark data, and many of these datasets are handcrafted. Even then, their difficulty is typically not quantified or considered, limiting the ability to interpret algorithmic performance on these datasets. Here, we introduce HAWKS, a new data generator that uses an evolutionary algorithm to evolve the cluster structure of a synthetic dataset. We demonstrate how such an approach can be used to produce datasets of a pre-specified difficulty and to trade off different aspects of problem difficulty, and we show how these interventions directly translate into changes in the clustering performance of established algorithms.
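The abstract gives no implementation details, so the following is only a toy sketch of the general idea of evolving cluster structure towards a requested difficulty. It is not the HAWKS algorithm. The function names (`silhouette`, `make_dataset`, `error`, `mutate`, `evolve`), the genome encoding (a 2-D mean and one standard deviation per cluster), the (1+1) evolutionary scheme, and the use of mean silhouette width as a stand-in for difficulty are all assumptions made for illustration.

```python
import math
import random


def silhouette(points, labels):
    """Mean silhouette width; assumes at least two clusters are present."""
    n = len(points)
    total = 0.0
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        if labels[i] not in sums:
            continue  # singleton cluster: silhouette is defined as 0
        a = sums[labels[i]] / counts[labels[i]]  # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in sums if c != labels[i])  # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


def make_dataset(genome, noise):
    """Decode a genome (one (mean_x, mean_y, std) per cluster) using fixed noise."""
    points, labels = [], []
    for label, ((mx, my, std), zs) in enumerate(zip(genome, noise)):
        for zx, zy in zs:
            points.append((mx + std * zx, my + std * zy))
            labels.append(label)
    return points, labels


def error(genome, noise, target):
    """Distance between the dataset's silhouette width and the target (assumed proxy)."""
    points, labels = make_dataset(genome, noise)
    return abs(silhouette(points, labels) - target)


def mutate(genome, rng, step=0.3):
    """Perturb every cluster's mean and spread with Gaussian noise."""
    return [(mx + rng.gauss(0, step),
             my + rng.gauss(0, step),
             max(0.05, std + rng.gauss(0, 0.2 * step)))
            for mx, my, std in genome]


def evolve(target, n_clusters=3, points_per_cluster=15, generations=200, seed=0):
    """(1+1) evolutionary loop: keep the child only if it is at least as good."""
    rng = random.Random(seed)
    # Fixed standard-normal noise makes the error a deterministic function of the genome.
    noise = [[(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(points_per_cluster)]
             for _ in range(n_clusters)]
    parent = [(rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0) for _ in range(n_clusters)]
    parent_err = error(parent, noise, target)
    history = [parent_err]
    for _ in range(generations):
        child = mutate(parent, rng)
        child_err = error(child, noise, target)
        if child_err <= parent_err:
            parent, parent_err = child, child_err
        history.append(parent_err)
    return parent, history
```

In this sketch, a call such as `genome, history = evolve(0.8)` asks for well-separated clusters, while a lower target such as `evolve(0.2)` asks for heavily overlapping ones; `history` records the best error after each generation and never increases, because a child replaces its parent only when it is at least as close to the target.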
Original language: English
Title of host publication: Proceedings of the Annual Conference on Genetic and Evolutionary Computation (GECCO '19)
Publication status: Published - 13 Jul 2019
Event: The Genetic and Evolutionary Computation Conference - Prague, Czech Republic
Duration: 13 Jul 2019 to 17 Jul 2019


Conference: The Genetic and Evolutionary Computation Conference
Abbreviated title: GECCO 2019
Country/Territory: Czech Republic

