Fast aggregation-based algorithms for knowledge discovery

  • Xinye Chen

Student thesis: Phd

Abstract

This thesis introduces new knowledge discovery algorithms based on fast data aggregation, with a particular focus on coreset construction and data partitioning. A key idea pursued here is that aggregation of data points into groups can often be significantly sped up by an appropriate sorting of the data. Applications of the developed sorting-based aggregation methods include clustering, information retrieval, efficient time series representations, as well as anomaly detection. The first main topic in this thesis is clustering. Clustering has a wide variety of applications in science and engineering. We present a new explainable density-based clustering method that is well-suited for large-scale data. It is based on a sorting-based aggregation approach combined with an early-stopping policy which helps group data points quickly by reducing the number of pairwise distance computations. The clustering method attains near-linear time and linear space complexity. We also show how to generate textual interpretations of the computed clusters, an important feature in many applications. Our second main topic is information retrieval. We design a fixed-radius nearest neighbor search algorithm using ideas from our clustering method. By leveraging the first principal component of the data, our method significantly prunes the search space and achieves further acceleration by exploiting high-level Basic Linear Algebra Subprograms. An empirical study on synthetic and real-world data shows that our method compares favorably to other exact tree-based nearest neighbor search algorithms. The third main topic is time series knowledge representation and anomaly detection. We propose a fast tolerance-driven symbolic aggregation method that achieves significant speedup while maintaining similar reconstruction accuracy compared to other approaches. We also propose a joint symbolic aggregation approach for multiple time series, making use of parallel processing. By using techniques from the area of natural language processing, we explore the semantic properties of time series using the developed methods. Throughout the thesis, we compare our methods to widely used and well-established software libraries. Numerous computational comparisons and benchmarks are included, and all software and data required to reproduce our results are available at https://github.com/nla-group and the repositories corresponding to the separate chapters.
Date of Award1 Aug 2024
Original languageEnglish
Awarding Institution
  • The University of Manchester
SupervisorNicholas Higham (Supervisor) & Stefan Guettel (Supervisor)

Keywords

  • algorithms
  • aggregation
  • knowledge discovery
  • clustering
  • time series

Cite this

'