{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T02:06:14Z","timestamp":1769565974382,"version":"3.49.0"},"reference-count":41,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2025,12,27]],"date-time":"2025-12-27T00:00:00Z","timestamp":1766793600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026,1,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>High-throughput extraction and structured labeling of data from academic articles are crucial for enabling downstream machine learning applications and secondary analyses. Current approaches lack integration with the publishing process and comprehensive annotation of experimental roles and methodologies alongside bioentity recognition.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions, combining natural language processing with authors\u2019 feedback to increase annotation accuracy. The resulting dataset, SourceData-NLP, comprises over 620\u2009000 annotated biomedical entities, curated from 18\u2009689 figures in 3223 articles in molecular and cell biology. Annotations include eight classes of bioentities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), plus additional classes that delineate the entities\u2019 roles in experimental designs and methodologies. 
We evaluate the utility of the dataset for training AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task that assesses whether an entity is a controlled intervention target or a measurement object. We also demonstrate multi-modal applications for segmenting figures into panel images and their corresponding captions.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Trained models are available at https:\/\/huggingface.co\/EMBO. The SourceData-NLP dataset and code are available at https:\/\/github.com\/source-data\/soda-data, https:\/\/github.com\/source-data\/soda-model, and https:\/\/github.com\/source-data\/soda_image_segmentation.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf685","type":"journal-article","created":{"date-parts":[[2025,12,27]],"date-time":"2025-12-27T12:44:19Z","timestamp":1766839459000},"source":"Crossref","is-referenced-by-count":0,"title":["Integrating curation into scientific publishing to train AI models"],"prefix":"10.1093","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0211-6416","authenticated-orcid":false,"given":"Jorge","family":"Abreu-Vicente","sequence":"first","affiliation":[{"name":"EMBO , Heidelberg 69117,","place":["Germany"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5461-3265","authenticated-orcid":false,"given":"Hannah","family":"Sonntag","sequence":"additional","affiliation":[{"name":"EMBO , Heidelberg 69117,","place":["Germany"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-8874-3534","authenticated-orcid":false,"given":"Thomas","family":"Eidens","sequence":"additional","affiliation":[{"name":"EMBO , Heidelberg 69117,","place":["Germany"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5472-6355","authenticated-orcid":false,"given":"Cassie 
S","family":"Mitchell","sequence":"additional","affiliation":[{"name":"Department of Biomedical Engineering, Georgia Institute of Technology and Emory University School of Medicine , Atlanta, GA 30332,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2499-4025","authenticated-orcid":false,"given":"Thomas","family":"Lemberger","sequence":"additional","affiliation":[{"name":"EMBO , Heidelberg 69117,","place":["Germany"]}]}],"member":"286","published-online":{"date-parts":[[2025,12,27]]},"reference":[{"key":"2026012708121748900_btaf685-B1","author":"Abdollahi"},{"key":"2026012708121748900_btaf685-B2","author":"Adarsh"},{"key":"2026012708121748900_btaf685-B6","author":"Beeri","year":"2024"},{"key":"2026012708121748900_btaf685-B7","doi-asserted-by":"crossref","first-page":"194","DOI":"10.3115\/974557.974586","volume-title":"Fifth Conference on Applied Natural Language Processing","author":"Bikel","year":"1997"},{"key":"2026012708121748900_btaf685-B8","doi-asserted-by":"crossref","first-page":"baz085","DOI":"10.1093\/database\/baz085","article-title":"Large expert-curated database for benchmarking document similarity detection in biomedical literature search","volume":"2019","author":"Brown","year":"2019","journal-title":"Database"},{"key":"2026012708121748900_btaf685-B9","author":"de Herrera"},{"key":"2026012708121748900_btaf685-B10","author":"Deu\u00dfer"},{"key":"2026012708121748900_btaf685-B13","first-page":"1409","volume-title":"IEEE J Biomed Health Inform 2025;","author":"Ding"},{"key":"2026012708121748900_btaf685-B15","first-page":"25792","author":"Fries"},{"key":"2026012708121748900_btaf685-B17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3458754","article-title":"Domain-specific language model pretraining for biomedical natural language processing","volume":"3","author":"Gu","year":"2022","journal-title":"ACM Trans Comput 
Healthcare"},{"key":"2026012708121748900_btaf685-B18","author":"Guo","year":"2023"},{"key":"2026012708121748900_btaf685-B20","doi-asserted-by":"crossref","first-page":"S161","DOI":"10.1016\/j.amepre.2008.03.035","article-title":"The collaboration readiness of transdisciplinary research teams and centers: findings from the national cancer institute\u2019s TREC year-one evaluation study","volume":"35","author":"Hall","year":"2008","journal-title":"Am J Prev Med"},{"key":"2026012708121748900_btaf685-B22","first-page":"782","author":"Hoffart","year":"2011"},{"key":"2026012708121748900_btaf685-B23","doi-asserted-by":"crossref","first-page":"1056","DOI":"10.1162\/qss_a_00076","article-title":"Pandemic publishing: medical journals drastically speed up their publication process for Covid-19","volume":"1","author":"Horbach","year":"2020","journal-title":"Quant Sci Stud"},{"key":"2026012708121748900_btaf685-B24","author":"Jiang","year":"2021"},{"key":"2026012708121748900_btaf685-B25","author":"Jim\u00e9nez Guti\u00e9rrez"},{"key":"2026012708121748900_btaf685-B26","first-page":"56","article-title":"The open biomedical annotator","volume":"2009","author":"Jonquet","year":"2009","journal-title":"Summit Transl Bioinform"},{"key":"2026012708121748900_btaf685-B27","doi-asserted-by":"crossref","first-page":"31513","DOI":"10.1109\/ACCESS.2022.3157854","article-title":"How do your biomedical named entity recognition models generalize to novel entities?","volume":"10","author":"Kim","year":"2022","journal-title":"IEEE Access"},{"key":"2026012708121748900_btaf685-B30","doi-asserted-by":"crossref","first-page":"baw068","DOI":"10.1093\/database\/baw068","article-title":"Biocreative V CDR task corpus: a resource for chemical disease relation extraction","volume":"2016","author":"Li","year":"2016","journal-title":"Database"},{"key":"2026012708121748900_btaf685-B31","doi-asserted-by":"crossref","first-page":"i468","DOI":"10.1093\/bioinformatics\/btab331","article-title":"Utilizing image and 
caption information for biomedical document classification","volume":"37","author":"Li","year":"2021","journal-title":"Bioinformatics"},{"key":"2026012708121748900_btaf685-B32","doi-asserted-by":"crossref","first-page":"1021","DOI":"10.1038\/nmeth.4471","article-title":"Sourcedata\u2014a semantic platform for curating and searching figures","volume":"14","author":"Liechti","year":"2017","journal-title":"Nat Methods"},{"key":"2026012708121748900_btaf685-B35","doi-asserted-by":"crossref","first-page":"baq036","DOI":"10.1093\/database\/baq036","article-title":"PubMed and beyond: a survey of web tools for searching biomedical literature","volume":"2011","author":"Lu","year":"2011","journal-title":"Database"},{"key":"2026012708121748900_btaf685-B36","doi-asserted-by":"crossref","first-page":"bbac282","DOI":"10.1093\/bib\/bbac282","article-title":"BioRED: a rich biomedical relation extraction dataset","volume":"23","author":"Luo","year":"2022","journal-title":"Brief Bioinform"},{"key":"2026012708121748900_btaf685-B37","doi-asserted-by":"crossref","first-page":"e0242283","DOI":"10.1371\/journal.pone.0242283","article-title":"Interdisciplinary research maps: a new technique for visualizing research topics","volume":"15","author":"Marrone","year":"2020","journal-title":"PLoS One"},{"key":"2026012708121748900_btaf685-B38","author":"Mayhew"},{"key":"2026012708121748900_btaf685-B40","doi-asserted-by":"crossref","first-page":"327","DOI":"10.1093\/bib\/bbs084","article-title":"A survey on annotation tools for the biomedical literature","volume":"15","author":"Neves","year":"2014","journal-title":"Brief Bioinform"},{"key":"2026012708121748900_btaf685-B42","doi-asserted-by":"crossref","first-page":"673","DOI":"10.3389\/fcell.2020.00673","article-title":"Named entity recognition and relation detection for biomedical information extraction","volume":"8","author":"Perera","year":"2020","journal-title":"Front Cell Dev 
Biol"},{"key":"2026012708121748900_btaf685-B43","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"J Mach Learn Res"},{"key":"2026012708121748900_btaf685-B48","doi-asserted-by":"crossref","first-page":"554","DOI":"10.1089\/cmb.2005.12.554","article-title":"Word sense disambiguation in the biomedical domain: an overview","volume":"12","author":"Schuemie","year":"2005","journal-title":"J Comput Biol"},{"key":"2026012708121748900_btaf685-B50","doi-asserted-by":"crossref","first-page":"1251","DOI":"10.1038\/nbt1346","article-title":"The OBO foundry: coordinated evolution of ontologies to support biomedical data integration","volume":"25","author":"Smith","year":"2007","journal-title":"Nat Biotechnol"},{"key":"2026012708121748900_btaf685-B51","author":"T\u00e4nzer"},{"key":"2026012708121748900_btaf685-B54","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv Neural Inf Process Syst"},{"key":"2026012708121748900_btaf685-B56","author":"Wang"},{"key":"2026012708121748900_btaf685-B57","doi-asserted-by":"crossref","first-page":"bas041","DOI":"10.1093\/database\/bas041","article-title":"Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts","volume":"2012","author":"Wei","year":"2012","journal-title":"Database"},{"key":"2026012708121748900_btaf685-B58","doi-asserted-by":"crossref","first-page":"W587","DOI":"10.1093\/nar\/gkz389","article-title":"PubTator Central: automated concept annotation for biomedical full text articles","volume":"47","author":"Wei","year":"2019","journal-title":"Nucleic Acids 
Res"},{"key":"2026012708121748900_btaf685-B59","author":"Whitehouse"},{"key":"2026012708121748900_btaf685-B60","doi-asserted-by":"crossref","first-page":"160018","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR guiding principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci Data"},{"key":"2026012708121748900_btaf685-B62","doi-asserted-by":"crossref","first-page":"722","DOI":"10.1038\/s41597-023-02617-x","article-title":"Europe PMC annotated full-text corpus for gene\/proteins, diseases and organisms","volume":"10","author":"Yang","year":"2023","journal-title":"Sci Data"},{"key":"2026012708121748900_btaf685-B63","author":"Yasunaga"},{"key":"2026012708121748900_btaf685-B64","author":"Yuan"},{"key":"2026012708121748900_btaf685-B65","doi-asserted-by":"crossref","first-page":"bbaa057","DOI":"10.1093\/bib\/bbaa057","article-title":"Recent advances in biomedical literature mining","volume":"22","author":"Zhao","year":"2020","journal-title":"Brief 
Bioinform"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaf685\/66140480\/btaf685.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/1\/btaf685\/66140480\/btaf685.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/1\/btaf685\/66140480\/btaf685.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T13:12:30Z","timestamp":1769519550000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btaf685\/8405762"}},"subtitle":[],"editor":[{"given":"Xin","family":"Gao","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2025,12,27]]},"references-count":41,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,2]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf685","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2026,1]]},"published":{"date-parts":[[2025,12,27]]},"article-number":"btaf685"}}