Table mining and data curation from biomedical literature

Nikola Milosevic, Goran Nenadic

    Research output: Book/Report › Commissioned report


    Abstract

    Current text mining efforts are mostly focused on extracting information from the main body of research and scholarly articles. However, tables contain important information such as key characteristics of clinical trials, clinical outcomes, and interactions between drugs or proteins. Processing of information from tables is often limited to the textual caption, and the data presented in tables are typically ignored. The aim of this project is to examine information extraction and table mining from biomedical scientific publications. We aim to provide support for semi-automated curation of the data stored in tables and its integration with other information presented in the article. Table processing usually includes three steps: table recognition (locating tables in the document), functional analysis of the tables (recognizing each cell's function, i.e. whether the cell is part of the header, stub, sub-header or body of the table) and table understanding (semantic processing, analysis and understanding of the data in the table). In our method we split table understanding into a set of tasks, which include header and stub processing, finding the navigational path for each data cell, normalization of each cell's value, pattern analysis, pattern linking, information extraction and knowledge integration. As a pilot project, we present a case study of the extraction of body mass index, participant group names and patient weights from tables in clinical trial publications from PubMed Central (PMC). The study showed that it is possible to successfully extract information from tables, although some classes are more challenging because of the layout and the way their data are presented. A certain amount of domain knowledge appears to be necessary in order to select the right piece of information. A preliminary evaluation of our method showed an F-measure of 85% for body mass index extraction, 71.3% for participant group name extraction and 57.7% for participant weight extraction.
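    The idea of a navigational path for each data cell, mentioned in the abstract, can be illustrated with a minimal sketch. It is not the report's implementation: the function name `navigational_paths`, the record layout and the toy table values are all hypothetical, and it assumes a simple table with one header row and one stub column that have already been identified by functional analysis.

```python
from typing import Dict, List


def navigational_paths(
    header: List[str], stub: List[str], body: List[List[str]]
) -> List[Dict[str, str]]:
    """Pair every data cell with the stub and header labels that lead to it.

    Assumes (hypothetically) a flat table: one header row, one stub column,
    and a rectangular body whose rows align with the stub entries.
    """
    paths = []
    for row_label, row in zip(stub, body):
        for col_label, value in zip(header, row):
            paths.append({"stub": row_label, "header": col_label, "value": value})
    return paths


# Made-up toy table, for illustration only (not data from the report).
header = ["Intervention", "Control"]
stub = ["BMI (kg/m2)", "Weight (kg)"]
body = [["27.1", "26.8"], ["81.4", "79.9"]]

paths = navigational_paths(header, stub, body)

# Selecting cells by their path, e.g. every body mass index value.
bmi_cells = [p for p in paths if p["stub"].lower().startswith("bmi")]
```

    Once each cell carries its path, extraction of a class such as body mass index reduces to matching on the stub or header labels. Real tables with sub-headers or spanning cells would need a richer path than this flat pair of labels.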
    Original language: English
    Place of Publication: Manchester
    Publisher: University of Manchester
    Number of pages: 110
    Publication status: Published - Dec 2014

    Publication series

    Name: End of first year PhD report
    Publisher: University of Manchester

    Keywords

    • text mining
    • table mining
    • information extraction
