It is a common misconception in data analysis that more data equates to more knowledge. As data grows larger, however, deciphering and visualising the critical information it contains becomes an ever more cumbersome task. A central problem in data analysis is identifying and understanding relationships in complex systems. Mutual information has proven valuable in this regard and is already a crucial measure in many data analysis and machine learning tasks. Nearest-neighbour techniques are a classic approach in non-parametric statistics and have proven effective in entropy estimation. Unfortunately, it is well established that estimating entropy is fraught with difficulties, especially for discrete-continuous mixed cases.

In this thesis, we develop an ensemble method to estimate information-theoretic measures using a novel noisy resampling technique. The method is empirically shown to be asymptotically unbiased and consistent. Moreover, through artificial and real-world experiments, we show that the approach consistently outperforms the current leading k-nearest-neighbour methods, the Kozachenko-Leonenko estimator and the KSG estimator, achieving a more accurate and robust estimate for discrete and continuous random variables alike. This ability is essential in classification problems, where the class variable is often discrete, and significantly widens the applicability of mutual information measures in data analysis. In real-world domains, the proposed method successfully identifies key variables, supported by machine learning results.

The new algorithms are implemented in an exploratory analysis tool for multivariate data. We investigate the visualisations currently used for multi-dimensional data and introduce DataViewer, a visualisation software package for exploring patterns in modern data sets. We propose a variable interaction diagram for illustrating variable correlations with significant mutual information. The software is designed to aid interpretation of complex data structures, which in turn motivates intelligent feature selection. The techniques are illustrated by application to several real-world data sets.
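For background, the classical Kozachenko-Leonenko estimator named above can be sketched as follows. This is a minimal illustrative implementation of the well-known k-nearest-neighbour entropy estimator, not the thesis's proposed ensemble method; the function name `kl_entropy` and the choice of k are assumptions made for the example.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(samples, k=3):
    """Kozachenko-Leonenko estimate of differential entropy (in nats).

    samples: array of shape (n,) or (n, d) of continuous observations.
    k: which nearest neighbour to use (small k lowers bias, raises variance).
    """
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n, d = x.shape
    # Euclidean distance from each point to its k-th nearest neighbour,
    # excluding the point itself (hence k + 1 in the query).
    tree = cKDTree(x)
    eps = tree.query(x, k=k + 1)[0][:, -1]
    # log-volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1).
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    # H_hat = psi(n) - psi(k) + log(V_d) + (d / n) * sum_i log(eps_i)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))

# Sanity check against a known value: a standard normal has differential
# entropy 0.5 * log(2 * pi * e) ~= 1.4189 nats.
rng = np.random.default_rng(0)
est = kl_entropy(rng.standard_normal(4000), k=3)
true_h = 0.5 * np.log(2 * np.pi * np.e)
```

Note that `eps` can be zero when the sample contains exact duplicates, which is one concrete way the purely continuous estimator breaks down on discrete or mixed data, the failure mode the thesis addresses.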
| Date of Award | 1 Aug 2022 |
|---|---|
| Original language | English |
| Awarding Institution | The University of Manchester |
| Supervisor | Stephen Watts (Supervisor) & Cinzia Da Via (Supervisor) |
- supervised learning
- partitioning
- data visualisation
- exploratory data analysis
- classification
- explainable machine learning
- information theory
- Kullback-Leibler divergence
- entropy estimation
- entropy
- mutual information
- histogram
- Cohen's kappa
- parallel coordinates
A novel approach to estimating information-theoretic measures for exploratory data analysis and explainable machine learning
Crow, L. (Author). 1 Aug 2022
Student thesis: PhD