{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T09:05:38Z","timestamp":1770627938764,"version":"3.49.0"},"reference-count":36,"publisher":"Oxford University Press (OUP)","issue":"14","license":[{"start":{"date-parts":[[2018,3,10]],"date-time":"2018-03-10T00:00:00Z","timestamp":1520640000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000268","name":"BBSRC","doi-asserted-by":"publisher","award":["BB\/L020858\/1"],"award-info":[{"award-number":["BB\/L020858\/1"]}],"id":[{"id":"10.13039\/501100000268","id-type":"DOI","asserted-by":"publisher"}]},{"name":"EU-METASPACE","award":["34402"],"award-info":[{"award-number":["34402"]}]},{"name":"Imperial College Stratified Medicine Graduate Training Programme in Systems Medicine and Spectroscopic Profiling"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,7,15]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potential non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs and metabolites), identified entity class overlap and performed leave-corpus-out cross validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model \u2018overtraining\u2019) which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis we further identified that performance is often not limited by the quantity of the annotated data.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>Compiled primary and secondary sources of the aggregated corpora are available on: https:\/\/github.com\/dterg\/biomedical_corpora\/wiki and https:\/\/bitbucket.org\/iAnalytica\/bioner.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/bty152","type":"journal-article","created":{"date-parts":[[2018,3,8]],"date-time":"2018-03-08T20:11:31Z","timestamp":1520539891000},"page":"2474-2482","source":"Crossref","is-referenced-by-count":13,"title":["Exploiting and assessing multi-source data for supervised biomedical named entity recognition"],"prefix":"10.1093","volume":"34","author":[{"given":"Dieter","family":"Galea","sequence":"first","affiliation":[{"name":"Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK"}]},{"given":"Ivan","family":"Laponogov","sequence":"additional","affiliation":[{"name":"Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK"}]},{"given":"Kirill","family":"Veselkov","sequence":"additional","affiliation":[{"name":"Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK"}]}],"member":"286","published-online":{"date-parts":[[2018,3,10]]},"reference":[{"key":"2023012713013853700_bty152-B1","doi-asserted-by":"crossref","first-page":"54.","DOI":"10.1186\/1471-2105-14-54","article-title":"Gimli: open source and high-performance biomedical name recognition","volume":"14","author":"Campos","year":"2013","journal-title":"BMC Bioinformatics"},{"key":"2023012713013853700_bty152-B2","doi-asserted-by":"crossref","first-page":"281.","DOI":"10.1186\/1471-2105-14-281","article-title":"A modular framework for biomedical concept recognition","volume":"14","author":"Campos","year":"2013","journal-title":"BMC Bioinformatics"},{"key":"2023012713013853700_bty152-B3","first-page":"640","article-title":"Intrinsic evaluation of text mining tools may not predict performance on realistic tasks","author":"Caporaso","year":"2008","journal-title":"Pac Symp Biocomput"},{"key":"2023012713013853700_bty152-B4","doi-asserted-by":"crossref","first-page":"1852","DOI":"10.1093\/bioinformatics\/btx083","article-title":"nala: text mining natural language mutation mentions","volume":"33","author":"Cejuela","year":"2017","journal-title":"Bioinformatics"},{"key":"2023012713013853700_bty152-B5","doi-asserted-by":"crossref","first-page":"bat064.","DOI":"10.1093\/database\/bat064","article-title":"Bioc: a minimalist approach to interoperability for biomedical text processing","volume":"2013","author":"Comeau","year":"2013","journal-title":"Database"},{"key":"2023012713013853700_bty152-B6","doi-asserted-by":"crossref","first-page":"368","DOI":"10.1186\/s12859-017-1776-8","article-title":"A neural network multi-task learning approach to biomedical named entity recognition","volume":"18","author":"Crichton","year":"2017","journal-title":"BMC Bioinformatics"},{"key":"2023012713013853700_bty152-B7","volume-title":"Biocomputing 2002","author":"Ding","year":"2001"},{"key":"2023012713013853700_bty152-B8","article-title":"Predicting sample size required for classification performance","volume":"12","author":"Figueroa","year":"2012","journal-title":"BMC Med. Inf. Dec. Mak"},{"key":"2023012713013853700_bty152-B9","doi-asserted-by":"crossref","first-page":"363","DOI":"10.3115\/1219840.1219885","volume-title":"Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL \u201905","author":"Finkel","year":"2005"},{"key":"2023012713013853700_bty152-B10","doi-asserted-by":"crossref","first-page":"365","DOI":"10.1093\/bioinformatics\/btl616","article-title":"RelEx\u2013relation extraction using dependency parse trees","volume":"23","author":"Fundel","year":"2007","journal-title":"Bioinformatics"},{"key":"2023012713013853700_bty152-B11","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1186\/1471-2105-9-84","article-title":"OSIRISv1.2: a named entity recognition system for sequence variants of genes in biomedical literature","volume":"9","author":"Furlong","year":"2008","journal-title":"BMC Bioinformatics"},{"key":"2023012713013853700_bty152-B12","first-page":"72","author":"Gerner","year":"2010"},{"key":"2023012713013853700_bty152-B13","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1016\/j.jbi.2017.05.002","article-title":"Character-level neural network for biomedical named entity recognition","volume":"70","author":"Gridach","year":"2017","journal-title":"J. Biomed. Inf"},{"key":"2023012713013853700_bty152-B14","first-page":"96","author":"GuoDong","year":"2004"},{"key":"2023012713013853700_bty152-B15","first-page":"96","author":"GuoDong","year":"2004"},{"key":"2023012713013853700_bty152-B16","doi-asserted-by":"crossref","first-page":"914","DOI":"10.1016\/j.jbi.2013.07.011","article-title":"The DDI corpus: an annotated corpus with pharmacological substances and drug\u2013drug interactions","volume":"46","author":"Herrero-Zazo","year":"2013","journal-title":"J. Biomed. Inf"},{"key":"2023012713013853700_bty152-B36","doi-asserted-by":"crossref","first-page":"i286","DOI":"10.1093\/bioinformatics\/btn183","article-title":"Integrating high dimensional bi-directional parsing models for gene mention tagging","volume":"24","author":"Hsu","year":"2008","journal-title":"Bioinformatics"},{"key":"2023012713013853700_bty152-B17","doi-asserted-by":"crossref","first-page":"18","DOI":"10.12688\/f1000research.3-18.v2","article-title":"Mutation extraction tools can be combined for robust recognition of genetic variants in the literature","volume":"3","author":"Jimeno Yepes","year":"2014","journal-title":"F1000Res"},{"key":"2023012713013853700_bty152-B18","doi-asserted-by":"crossref","first-page":"i180","DOI":"10.1093\/bioinformatics\/btg1023","article-title":"Genia corpus-a semantically annotated corpus for bio-textmining","volume":"19","author":"Kim","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012713013853700_bty152-B19","doi-asserted-by":"crossref","first-page":"S2.","DOI":"10.1186\/1758-2946-7-S1-S2","article-title":"The CHEMDNER corpus of chemicals and drugs and its annotation principles","volume":"7","author":"Krallinger","year":"2015","journal-title":"J. Cheminf"},{"key":"2023012713013853700_bty152-B20","doi-asserted-by":"crossref","first-page":"W535","DOI":"10.1093\/nar\/gkv383","article-title":"PolySearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more","volume":"43","author":"Liu","year":"2015","journal-title":"Nucleic Acids Res"},{"key":"2023012713013853700_bty152-B21","author":"McCallum","year":"2002"},{"key":"2023012713013853700_bty152-B22","author":"Neves","year":"2012"},{"key":"2023012713013853700_bty152-B23","first-page":"27","author":"Ohta","year":"2012"},{"key":"2023012713013853700_bty152-B24","doi-asserted-by":"crossref","first-page":"868","DOI":"10.1093\/bioinformatics\/btt580","article-title":"Anatomical entity mention recognition at literature scale","volume":"30","author":"Pyysalo","year":"2014","journal-title":"Bioinformatics"},{"key":"2023012713013853700_bty152-B25","doi-asserted-by":"crossref","first-page":"50.","DOI":"10.1186\/1471-2105-8-50","article-title":"Bioinfer: a corpus for information extraction in the biomedical domain","volume":"8","author":"Pyysalo","year":"2007","journal-title":"BMC Bioinformatics"},{"key":"2023012713013853700_bty152-B26","doi-asserted-by":"crossref","first-page":"i575","DOI":"10.1093\/bioinformatics\/bts407","article-title":"Event extraction across multiple levels of biological organization","volume":"28","author":"Pyysalo","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012713013853700_bty152-B27","doi-asserted-by":"crossref","first-page":"S2","DOI":"10.1186\/1471-2105-13-S11-S2","article-title":"Overview of the ID, EPI and REL tasks of BioNLP shared task 2011","volume":"13","author":"Pyysalo","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2023012713013853700_bty152-B28","author":"Rei","year":"2016"},{"key":"2023012713013853700_bty152-B29","doi-asserted-by":"crossref","first-page":"3191","DOI":"10.1093\/bioinformatics\/bti475","article-title":"Abner: an open source tool for automatically tagging genes, proteins and other entity names in text","volume":"21","author":"Settles","year":"2005","journal-title":"Bioinformatics"},{"key":"2023012713013853700_bty152-B30","doi-asserted-by":"crossref","first-page":"S4","DOI":"10.1186\/1471-2105-12-S4-S4","article-title":"Challenges in the association of human single nucleotide polymorphism mentions with unique database identifiers","volume":"12","author":"Thomas","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023012713013853700_bty152-B31","doi-asserted-by":"crossref","first-page":"349.","DOI":"10.1186\/1471-2105-10-349","article-title":"Construction of an annotated corpus to support biomedical information extraction","volume":"10","author":"Thompson","year":"2009","journal-title":"BMC Bioinformatics"},{"key":"2023012713013853700_bty152-B32","doi-asserted-by":"crossref","first-page":"S11.","DOI":"10.1186\/1471-2105-7-S5-S11","article-title":"NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition","volume":"7","author":"Tsai","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2023012713013853700_bty152-B33","doi-asserted-by":"crossref","first-page":"3619","DOI":"10.1093\/bioinformatics\/btw503","article-title":"Dtminer: identification of potential disease targets through biomedical literature mining","volume":"32","author":"Xu","year":"2016","journal-title":"Bioinformatics"},{"key":"2023012713013853700_bty152-B34","doi-asserted-by":"crossref","first-page":"S2.","DOI":"10.1186\/1471-2105-6-S1-S2","article-title":"Biocreative task 1a: gene mention finding evaluation","volume":"6","author":"Yeh","year":"2005","journal-title":"BMC Bioinformatics"},{"key":"2023012713013853700_bty152-B35","doi-asserted-by":"crossref","first-page":"283.","DOI":"10.3390\/e19060283","article-title":"LSTM-CRF for drug-named entity recognition","volume":"19","author":"Zeng","year":"2017","journal-title":"Entropy"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/14\/2474\/48917751\/bioinformatics_34_14_2474.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/14\/2474\/48917751\/bioinformatics_34_14_2474.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,1]],"date-time":"2023-09-01T14:19:23Z","timestamp":1693577963000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/34\/14\/2474\/4925744"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2018,3,10]]},"references-count":36,"journal-issue":{"issue":"14","published-print":{"date-parts":[[2018,7,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bty152","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2018,7,15]]},"published":{"date-parts":[[2018,3,10]]}}}