TY - BOOK
T1 - Table mining and data curation from biomedical literature
AU - Milosevic, Nikola
AU - Nenadic, Goran
PY - 2014/12
Y1 - 2014/12
N2 - Current text mining efforts are mostly focused on extracting information from the main body of research and scholarly articles. However, tables contain important information such as key characteristics of clinical trials, clinical outcomes, and interactions between drugs or proteins. Processing of information from tables is often limited to the textual caption, and the data presented in tables are typically ignored. The aim of this project is to examine information extraction and table mining from biomedical scientific publications. We aim to provide support for semi-automated curation of the data stored in tables and its integration with other information presented in the article. Table processing usually includes three steps: table recognition (locating tables in the document), functional analysis of the tables (recognizing each cell’s function, i.e. whether the cell is part of the header, stub, sub-header or body of the table) and table understanding (semantic processing, analysis and understanding of data in the table). In our method we split table understanding into a set of tasks which include header and stub processing, finding the navigational path for each data cell, normalization of cell values, pattern analysis, pattern linking, information extraction and knowledge integration. As a pilot project, we present a case study of the extraction of body mass index, participant group names and patient weights from tables in clinical trial publications from PubMed Central (PMC). The study showed that it is possible to successfully extract information from tables, although some classes are more challenging because of the layout and the way their data are presented. A certain amount of domain knowledge seems to be necessary in order to correctly select the right piece of information. Preliminary evaluation of our method showed an F-measure of 85% for body mass index extraction, 71.3% for participant group name extraction and 57.7% for participant weight extraction.
AB - Current text mining efforts are mostly focused on extracting information from the main body of research and scholarly articles. However, tables contain important information such as key characteristics of clinical trials, clinical outcomes, and interactions between drugs or proteins. Processing of information from tables is often limited to the textual caption, and the data presented in tables are typically ignored. The aim of this project is to examine information extraction and table mining from biomedical scientific publications. We aim to provide support for semi-automated curation of the data stored in tables and its integration with other information presented in the article. Table processing usually includes three steps: table recognition (locating tables in the document), functional analysis of the tables (recognizing each cell’s function, i.e. whether the cell is part of the header, stub, sub-header or body of the table) and table understanding (semantic processing, analysis and understanding of data in the table). In our method we split table understanding into a set of tasks which include header and stub processing, finding the navigational path for each data cell, normalization of cell values, pattern analysis, pattern linking, information extraction and knowledge integration. As a pilot project, we present a case study of the extraction of body mass index, participant group names and patient weights from tables in clinical trial publications from PubMed Central (PMC). The study showed that it is possible to successfully extract information from tables, although some classes are more challenging because of the layout and the way their data are presented. A certain amount of domain knowledge seems to be necessary in order to correctly select the right piece of information. Preliminary evaluation of our method showed an F-measure of 85% for body mass index extraction, 71.3% for participant group name extraction and 57.7% for participant weight extraction.
KW - text mining
KW - table mining
KW - information extraction
M3 - Commissioned report
T3 - End of first year PhD report
BT - Table mining and data curation from biomedical literature
PB - University of Manchester
CY - Manchester
ER -