Extraction and representation of key characteristics from epidemiological literature

  • George Karystianis

Student thesis: PhD


Epidemiological studies are rich in information that could improve understanding of the concept complexity of a health problem, and they are important sources for evidence-based medicine. However, epidemiologists experience difficulties in recognising and aggregating key characteristics in related research due to the increasing number of published articles. The main aim of this dissertation is to explore how text mining techniques can assist epidemiologists in identifying important pieces of information and in detecting and integrating key knowledge for further research and exploration via concept maps. Concept maps are widely used in medicine for exploration and representation, as they are a relatively formal knowledge representation model that is easy to design and understand.
To support this aim, we have developed a methodology for the extraction of key epidemiological characteristics from all types of epidemiological research articles in order to visualise, explore and aggregate concepts related to a health care problem. A generic rule-based approach was designed and implemented for the identification of mentions of six key characteristics: study design, population, exposure, outcome, covariate and effect size. The system also relies on automatic term recognition and biomedical dictionaries to identify concepts of interest. In order to facilitate knowledge integration and aggregation, extracted characteristics are further normalised and mapped to existing resources. Study design mentions are mapped to an expanded version of the Ontology of Clinical Research (OCRe), whereas exposure, outcome and covariate mentions are mapped to Unified Medical Language System (UMLS) semantic groups and categories. Population mentions are mapped to age groups, gender and nationality/ethnicity, and effect size mentions are normalised with regard to the metric used, the confidence interval and the related concept.
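To give a flavour of what rule-based recognition and normalisation of an effect size mention can look like, here is a minimal Python sketch. It is not the thesis's actual rule set: the regular expression, the `METRICS` lookup table and the function name `extract_effect_sizes` are all hypothetical and were chosen purely for illustration.

```python
import re

# Hypothetical lookup from surface forms to a normalised metric name.
# The rules and dictionaries used in the thesis are not reproduced here.
METRICS = {
    "odds ratio": "odds ratio",
    "or": "odds ratio",
    "relative risk": "relative risk",
    "rr": "relative risk",
    "hazard ratio": "hazard ratio",
    "hr": "hazard ratio",
}

NUM = r"\d+(?:\.\d+)?"

# Illustrative pattern: metric name, point estimate, then a confidence
# interval such as "95% CI 1.50-2.90" or "95% confidence interval: 1.1 to 1.6".
EFFECT_SIZE = re.compile(
    rf"\b(?P<metric>odds ratio|relative risk|hazard ratio|OR|RR|HR)\b"
    rf"\s*(?:=|:|of|was)?\s*(?P<value>{NUM})"
    rf"\s*[,;(\[]?\s*(?P<level>\d+)\s*%\s*(?:CI|confidence interval)"
    rf"\s*[:=,]?\s*(?P<low>{NUM})\s*(?:-|–|to|,)\s*(?P<high>{NUM})",
    re.IGNORECASE,
)


def extract_effect_sizes(text):
    """Return normalised effect size mentions found in `text`."""
    mentions = []
    for m in EFFECT_SIZE.finditer(text):
        mentions.append({
            "metric": METRICS[m.group("metric").lower()],
            "value": float(m.group("value")),
            "ci_level": int(m.group("level")),
            "ci_low": float(m.group("low")),
            "ci_high": float(m.group("high")),
        })
    return mentions


if __name__ == "__main__":
    sentence = ("Obesity was associated with type 2 diabetes "
                "(OR 2.10, 95% CI 1.50-2.90).")
    for mention in extract_effect_sizes(sentence):
        print(mention)
```

A single pattern like this only covers a few common phrasings. A realistic system would need many more rules, and linking each effect size to its related concept is not attempted here.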
The evaluation has shown reliable results, with an average micro F-score of 87% for recognition of epidemiological mentions and 91% for normalisation. Normalised concepts are further organised in an automatically generated concept map, which has three sections for exposures, outcomes and covariates.
To demonstrate the potential of the developed methodology, it was applied to a large-scale corpus of epidemiological research abstracts related to obesity. Obesity was chosen as a case study since it has emerged as one of the most important global health problems of the 21st century. Using the concepts extracted from the corpus, we have built a searchable database of key epidemiological characteristics explored in obesity, together with an automatically generated concept map that represents the normalised exposures, outcomes and covariates. An epidemiological workbench (EpiTeM) was designed to enable further exploration and inspection of the normalised extracted data, with direct links to the literature. The generated results also allow exploration of trends in obesity research and can facilitate understanding of its concept complexity. For example, we have noted the most frequent concepts and the most common pairs of characteristics that have been studied in obesity epidemiology.
Finally, this thesis also discusses a number of challenges for text mining of epidemiological literature and suggests various opportunities for future work.
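For readers unfamiliar with the micro F-score reported in the evaluation above, the sketch below shows how micro-averaging is usually computed: true positives, false positives and false negatives are pooled across all classes before precision and recall are calculated. The function name `micro_f_score` and the counts in the example are made up for illustration and are not results from the thesis.

```python
def micro_f_score(counts):
    """Micro-averaged F1 from per-class (tp, fp, fn) counts.

    `counts` maps a class name to a (true positives, false positives,
    false negatives) tuple. Counts are pooled over all classes first.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        return 0.0  # no correct predictions, or no data at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    example = {
        "exposure": (8, 2, 2),
        "outcome": (9, 1, 3),
    }
    print(f"micro F-score: {micro_f_score(example):.3f}")
```

Because the counts are pooled, classes with more mentions carry more weight in the final score than they would under macro-averaging.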
Date of Award: 1 Aug 2014
Original language: English
Awarding Institution
  • The University of Manchester
Supervisors: Goran Nenadic & Iain Buchan


Keywords
  • text mining
  • epidemiology
  • concept map
  • key characteristics
