A Strategy for A Systematic Approach to Biomarker Discovery Validation - A Study on Lung Cancer Microarray data set

  • Zulkifli Dol

Student thesis: Phd


Cancer is a serious threat to human health and is now one of major causes of death worldwide. However, the complexity of the cancer makes the development of new and specific diagnostic tools particularly challenging. A number of different strategies have been developed for biomarker discovery in cancer using microarray data. The problem that typically needs to be addressed is the scale of the data sets; we simply do not have (or are likely to obtain) sufficient data for classical machine learning approaches for biomarker discovery to be properly validated. Obtaining a biomarker that is specific to a particular cancer is also very challenging. The initial promise that was held out for gene microarray work for the development of cancer biomarkers has not yet yielded the hoped for breakthroughs. This work discusses the construction of a strategy for a systematic approach to biomarker discovery validation using lung cancer gene expression microarray data based around non-small cell cancer and in patients which either stayed disease free after surgery (a five year window) or in which the disease progressed and re-occurred. As a means of assisting the validation purposes we have therefore looked at new methodologies for using existing biological knowledge to support machine learning biomarker discovery techniques. We employ text mining strategy using previously published literature for correlating biological concepts to a given phenotype. Pathway driven approaches through the use of Web Services and workflows, enabled the large-scale dataset to be analysed systematically. The results showed that it was possible, at least using this specific data set, to clearly differentiate between progressive disease and disease free patients using a set of biomarkers implicated in neuroendocrine signaling. A validation of the biomarkers identified was attempted in three separately published data sets. This analysis showed that although there was support for some of our findings in one of these data sets, this appeared to be a function of the close similarity in experimental design followed rather than through specific of the analysis method developed.
Date of Award1 Aug 2016
Original languageEnglish
Awarding Institution
  • The University of Manchester
SupervisorAndrew Brass (Supervisor)


  • Microarray data
  • Gene expression
  • Lung cancer
  • Biomarker discovery
  • Machine learning
  • Text mining

Cite this