TY - JOUR
T1 - LitSumm: large language models for literature summarization of noncoding RNAs
AU - Green, Andrew
AU - Ribas, Carlos Eduardo
AU - Ontiveros-Palacios, Nancy
AU - Griffiths-Jones, Sam
AU - Petrov, Anton I
AU - Bateman, Alex
AU - Sweeney, Blake
N1 - © The Author(s) 2025. Published by Oxford University Press.
PY - 2025/2/5
Y1 - 2025/2/5
AB - Curation of literature in life sciences is a growing challenge. The continued increase in the rate of publication, coupled with the relatively fixed number of curators worldwide, presents a major challenge to developers of biomedical knowledgebases. Very few knowledgebases have resources to scale to the whole relevant literature and all have to prioritize their efforts. In this work, we take a first step to alleviating the lack of curator time in RNA science by generating summaries of literature for noncoding RNAs using large language models (LLMs). We demonstrate that high-quality, factually accurate summaries with accurate references can be automatically generated from the literature using a commercial LLM and a chain of prompts and checks. Manual assessment was carried out for a subset of summaries, with the majority being rated extremely high quality. We apply our tool to a selection of >4600 ncRNAs and make the generated summaries available via the RNAcentral resource. We conclude that automated literature summarization is feasible with the current generation of LLMs, provided that careful prompting and automated checking are applied. Database URL: https://rnacentral.org/.
KW - RNA, Untranslated/genetics
KW - Data Curation/methods
KW - Humans
KW - Databases, Nucleic Acid
KW - Software
U2 - 10.1093/database/baaf006
DO - 10.1093/database/baaf006
M3 - Article
C2 - 39908113
SN - 1758-0463
VL - 2025
JO - Database: the journal of biological databases and curation
JF - Database: the journal of biological databases and curation
M1 - baaf006
ER -