Exploring AdaBoost and Random Forests machine learning approaches for infrared pathology on unbalanced data sets

Research output: Contribution to journal › Article › peer-review



The use of infrared spectroscopy to augment decision-making in histopathology is a promising direction for the diagnosis of many disease types. Hyperspectral images of healthy and diseased tissue, generated by infrared spectroscopy, are used to build chemometric models that can provide objective metrics of disease state. It is important to build robust and stable models to provide confidence to the end user. The data used to develop such models can have a variety of characteristics that pose problems for many model-building approaches. Here we compare the performance of two machine learning algorithms, AdaBoost and Random Forests, on a variety of non-uniform data sets. Using samples of breast cancer tissue, we devised a range of training data capable of describing the problem space. Models were constructed from these training sets and their characteristics compared. In separating infrared spectra of cancerous epithelial tissue from normal-associated tissue on the tissue microarray, both AdaBoost and Random Forests gave excellent classification performance (over 95% accuracy) in this study. AdaBoost models were more robust when data sets with large imbalance were provided. The outcomes of this work are a measure of classification accuracy as a function of the training data available, and a clear recommendation for the choice of machine learning approach.
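For readers unfamiliar with boosting, the following is a minimal sketch of the AdaBoost reweighting idea on an imbalanced two-class problem. It is not the authors' pipeline: the data are synthetic Gaussian points rather than infrared spectra, the weak learner (a one-feature threshold "stump") is an assumption, and all function names (`make_imbalanced`, `best_stump`, `adaboost_fit`, and so on) are hypothetical. Recall is measured on the training points only, so the printed values say nothing about generalisation.

```python
import math
import random


def make_imbalanced(n_major=190, n_minor=10, seed=0):
    """Synthetic 2-feature data: label -1 is the majority class, +1 the minority."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_major)]
    X += [[rng.gauss(2.5, 1.0), rng.gauss(2.5, 1.0)] for _ in range(n_minor)]
    y = [-1] * n_major + [1] * n_minor
    return X, y


def stump_predict(row, feature, threshold, polarity):
    """Weak learner: predict `polarity` above the threshold, `-polarity` otherwise."""
    return polarity if row[feature] > threshold else -polarity


def best_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted training error."""
    best = None  # (weighted error, feature, threshold, polarity)
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(
                    wi for row, yi, wi in zip(X, y, w)
                    if stump_predict(row, feature, threshold, polarity) != yi
                )
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost_fit(X, y, rounds=10):
    """Discrete AdaBoost: refit a stump each round on reweighted samples."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    model = []
    for _ in range(rounds):
        err, feature, threshold, polarity = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)  # vote strength of this stump
        model.append((alpha, feature, threshold, polarity))
        # Misclassified samples gain weight, correctly classified ones lose it.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
            for row, yi, wi in zip(X, y, w)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_predict(model, row):
    """Sign of the alpha-weighted vote over all stumps."""
    score = sum(a * stump_predict(row, f, t, p) for a, f, t, p in model)
    return 1 if score > 0 else -1


def recall(model, X, y, label):
    """Fraction of samples with the given true label that are predicted correctly."""
    rows = [row for row, yi in zip(X, y) if yi == label]
    return sum(adaboost_predict(model, row) == label for row in rows) / len(rows)


X, y = make_imbalanced()
model = adaboost_fit(X, y, rounds=10)
print("minority-class recall (training set):", recall(model, X, y, 1))
print("majority-class recall (training set):", recall(model, X, y, -1))
```

Reporting recall per class, rather than a single accuracy figure, is one common way to see whether a model trained on imbalanced data is simply favouring the majority class. The reweighting step is what lets later stumps concentrate on samples, often minority-class ones, that earlier stumps got wrong.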
Original language: English
Pages (from-to): 5880-5891
Number of pages: 12
Journal: The Analyst
Issue number: 19
Publication status: Published - 18 May 2021


  • machine learning
  • AdaBoost
  • random forests
  • breast cancer
  • infrared spectroscopy
  • boosting

Research Beacons, Institutes and Platforms

  • Cancer

