Identifying clusters in data has a wide range of applications, yet remains a surprisingly difficult task. The subjective nature of clustering leads to difficulty in both selecting the most appropriate algorithm for a given problem and evaluating the performance of these algorithms (as typically there is no ground truth to aid evaluation). Synthetic data can help us to understand the capabilities of algorithms, but only if this data is itself well-understood and presents challenges that are reflective of real-world clustering problems. Current benchmark sets for clustering are limited in this regard. To address this issue, we propose a synthetic data generator (named HAWKS) for clustering that uses an evolutionary algorithm to optimize different challenges. We compare HAWKS against other popular generators and datasets for clustering. We find that HAWKS is able to both produce datasets that result in a wide spread of performance for clustering algorithms, and that the performance varies differently for different algorithms. We extend this generator to directly maximize the performance difference between two clustering algorithms, automatically finding their relative weaknesses without explicit parameterization of cluster properties. This can provide greater mechanistic insights and aid in algorithmic development. In the last part of the thesis, we again explore the use of evolutionary algorithms in clustering, but for the assignment of data points to clusters using multiple objectives. We extend the $\Delta$-MOCK algorithm to adapt the search space (which scales with the size of the dataset) in order to reduce computation and focus the search. By adapting the search space using the current performance and employing strategies to explore this space, at least equivalent performance is achieved for a near two-thirds reduction in computation time (compared to $\Delta$-MOCK). We then use HAWKS to obtain greater insights into the relative differences in our proposed strategies by producing datasets with properties that were better suited to previously poor strategies.
Date of Award | 1 Aug 2020 |
---|
Original language | English |
---|
Awarding Institution | - The University of Manchester
|
---|
Supervisor | John Keane (Supervisor) & Richard Allmendinger (Supervisor) |
---|
- Problem generation
- Cluster analysis
- Synthetic data
- Evolutionary algorithms
- Machine learning
Evolutionary algorithms in clustering: Challenging problem generation and search space adaptation
Shand, C. (Author). 1 Aug 2020
Student thesis: Phd