Classifying SDSS Data using Active Learning

  • Nathan Steer

Student thesis: Master of Science by Research


This thesis applies active learning to a dataset of spectroscopically labelled sources from the Sloan Digital Sky Survey (SDSS). The sources are selected from the photometric data in the SDSS and the Wide-field Infrared Survey Explorer (WISE). Two machine learning techniques were used: a neural network and a random forest classifier. Four different active learning methods were investigated with these data: uncertainty sampling, best-vs-second-best, variance reduction, and learning active learning, with a generic random method as a control. The uncertainty sampling was implemented using a form known as the entropy measure, for which a binary case and a multi-class case were tested separately. These machine learning techniques were also applied to different configurations of Gaussian clouds to help understand their effect on different types of data. Learning active learning received particular focus as the most expandable method. To assist in the selection of active learning methods, the average accuracy scores and feature importances, as well as the class precision, recall, and F1-scores, were all compared. These tests resulted in entropy sampling and learning active learning being selected as the most capable, requiring only 25,600 datapoints in the training set, with the latter having the most room for improvement.
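As a rough illustration of two of the query strategies named above, the sketch below scores a pool of unlabelled sources by entropy and by best-vs-second-best margin, then picks the sources a classifier is least sure about. It is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the three-class probability vectors, and the example values are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def bvsb_margin(probs):
    """Best-vs-second-best margin: gap between the two largest probabilities."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_queries(pool_probs, n_queries, strategy="entropy"):
    """Return indices of the n_queries pool items to label next.

    Lower score means more uncertain, so entropy is negated
    (high entropy = uncertain) while the margin is used directly
    (small margin = uncertain).
    """
    if strategy == "entropy":
        scores = [-entropy(p) for p in pool_probs]
    elif strategy == "bvsb":
        scores = [bvsb_margin(p) for p in pool_probs]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    order = sorted(range(len(pool_probs)), key=lambda i: scores[i])
    return order[:n_queries]


# Hypothetical class probabilities from a classifier for four unlabelled sources.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # probability spread over all three classes
    [0.50, 0.49, 0.01],  # two classes nearly tied
    [0.70, 0.20, 0.10],  # moderately confident
]

entropy_picks = select_queries(pool_probs, 2, strategy="entropy")
bvsb_picks = select_queries(pool_probs, 2, strategy="bvsb")
print("entropy:", entropy_picks)
print("bvsb:", bvsb_picks)
```

On this toy pool the two strategies disagree: entropy favours the source whose probability is spread across all classes, while best-vs-second-best favours the source whose top two classes are nearly tied.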
Date of Award: 1 Aug 2022
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Gary Fuller & Anna Scaife


  • Active Learning
  • Classification
  • Random Forest
  • Machine Learning
  • WISE
  • SDSS
  • Neural Networks
