{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T10:07:57Z","timestamp":1770631677635,"version":"3.49.0"},"reference-count":35,"publisher":"Oxford University Press (OUP)","issue":"2","license":[{"start":{"date-parts":[[2026,1,21]],"date-time":"2026-01-21T00:00:00Z","timestamp":1768953600000},"content-version":"vor","delay-in-days":18,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100014013","name":"UK Research and Innovation","doi-asserted-by":"publisher","award":["EP\/Y031350\/1"],"award-info":[{"award-number":["EP\/Y031350\/1"]}],"id":[{"id":"10.13039\/100014013","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Claudia Harding Foundation"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026,1,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Automatic information extraction from biomedical texts requires machine learning methodology that can recognize biomedical entities, characterize inter-entity relationships, and relate extracted information to specific research topics. Large language models (LLMs) excel in general tasks but perform less reliably in the biomedical domain, where texts are characterized by extensive technical terminology and semantic variations from general literature. There is an unmet need for annotated full-text datasets that can be used to fine-tune language models for significant biomedical applications. Here, we focus on extraction of the complex relationships between genes and diseases.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We present BioTriplex, a corpus of 100 full-length biomedical research articles (comprising 604 subsection texts) manually annotated with disease names, genes, and 21 subtypes of disease\u2013gene relationships. We employ BioTriplex to train the LLaMA 3.1 8B language model in gene\u2013disease relation extraction. Our fine-tuned model outperforms zero-shot and few-shot approaches, both within the LLaMA 3.1 architecture and across the larger state-of-the-art LLMs GPT-4 and Claude Sonnet 3.7, and classifies gene\u2013disease relation types with broader scope and greater granularity than previously described. These results validate BioTriplex as a useful full-text data resource and underscore the value of specialized datasets in fine-tuning language models for important biomedical tasks.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>https:\/\/github.com\/PanagiotisFytas\/BioTriplex<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btag037","type":"journal-article","created":{"date-parts":[[2026,1,16]],"date-time":"2026-01-16T12:35:32Z","timestamp":1768566932000},"source":"Crossref","is-referenced-by-count":0,"title":["<i>BioTriplex<\/i>\n                    : a full-text annotated corpus for fine-tuning language models in gene-disease relation extraction tasks"],"prefix":"10.1093","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8715-9035","authenticated-orcid":false,"given":"Charlotte","family":"Collins","sequence":"first","affiliation":[{"name":"Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge , Cambridge CB3 9DA,","place":["United Kingdom"]},{"name":"Centre for Human Inspired Artificial Intelligence, Institute for Technology and Humanity, University of Cambridge , Cambridge CB2 1SB,","place":["United Kingdom"]}]},{"given":"Panagiotis","family":"Fytas","sequence":"additional","affiliation":[{"name":"Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge , Cambridge CB3 9DA,","place":["United Kingdom"]}]},{"given":"\u0130lknur","family":"Karadeniz","sequence":"additional","affiliation":[{"name":"Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge , Cambridge CB3 9DA,","place":["United Kingdom"]},{"name":"Department of Artificial Intelligence and Data Engineering, \u00d6zye\u011fin University , Istanbul 34794, T\u00fcrkiye"}]},{"given":"Huiyuan","family":"Zheng","sequence":"additional","affiliation":[{"name":"Institute of Environmental Medicine, Karolinska Institutet , Stockholm 171 77,","place":["Sweden"]}]},{"given":"Simon","family":"Baker","sequence":"additional","affiliation":[{"name":"Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge , Cambridge CB3 9DA,","place":["United Kingdom"]}]},{"given":"Ulla","family":"Stenius","sequence":"additional","affiliation":[{"name":"Institute of Environmental Medicine, Karolinska Institutet , Stockholm 171 77,","place":["Sweden"]}]},{"given":"Anna","family":"Korhonen","sequence":"additional","affiliation":[{"name":"Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge , Cambridge CB3 9DA,","place":["United Kingdom"]},{"name":"Centre for Human Inspired Artificial Intelligence, Institute for Technology and Humanity, University of Cambridge , Cambridge CB2 1SB,","place":["United Kingdom"]}]}],"member":"286","published-online":{"date-parts":[[2026,1,21]]},"reference":[{"key":"2026020809213544300_btag037-B1","author":"Anthropic","year":"2024"},{"key":"2026020809213544300_btag037-B2","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1186\/1471-2105-13-161","article-title":"Concept annotation in the CRAFT corpus","volume":"13","author":"Bada","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2026020809213544300_btag037-B3","doi-asserted-by":"crossref","first-page":"14","DOI":"10.1186\/1471-2288-8-14","article-title":"Abstracts in high profile journals often fail to report harm","volume":"8","author":"Bernal-Delgado","year":"2008","journal-title":"BMC Med Res Methodol"},{"key":"2026020809213544300_btag037-B4","volume-title":"Sampling Techniques","author":"Cochran","year":"1977","edition":"3rd edn"},{"key":"2026020809213544300_btag037-B5","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1186\/s44342-024-00036-x","article-title":"Comparative analysis of generative LLMs for labeling entities in clinical notes","volume":"23","author":"del Moral-Gonz\u00e1lez","year":"2025","journal-title":"Genomics Inform"},{"key":"2026020809213544300_btag037-B7","doi-asserted-by":"crossref","first-page":"174","DOI":"10.1186\/s13643-019-1082-9","article-title":"The strong focus on positive results in abstracts may cause bias in systematic reviews: a case study on abstract reporting bias","volume":"8","author":"Duyx","year":"2019","journal-title":"Syst Rev"},{"key":"2026020809213544300_btag037-B8","doi-asserted-by":"crossref","first-page":"2474","DOI":"10.1093\/bioinformatics\/bty152","article-title":"Exploiting and assessing multi-source data for supervised biomedical named entity recognition","volume":"34","author":"Galea","year":"2018","journal-title":"Bioinformatics"},{"key":"2026020809213544300_btag037-B9","author":"Grattafiori","year":"2024"},{"key":"2026020809213544300_btag037-B10","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3458754","article-title":"Domain-specific language model pretraining for biomedical natural language processing","volume":"3","author":"Gu","year":"2022","journal-title":"ACM Trans Comput Healthcare"},{"key":"2026020809213544300_btag037-B11","author":"Hu","year":"2021"},{"key":"2026020809213544300_btag037-B12","doi-asserted-by":"crossref","first-page":"baae069","DOI":"10.1093\/database\/baae069","article-title":"The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII","volume":"2024","author":"Islamaj","year":"2024","journal-title":"Database"},{"key":"2026020809213544300_btag037-B13","doi-asserted-by":"crossref","first-page":"btae163","DOI":"10.1093\/bioinformatics\/btae163","article-title":"Advancing entity recognition in biomedicine via instruction tuning of large language models","volume":"40","author":"Keloth","year":"2024","journal-title":"Bioinformatics"},{"key":"2026020809213544300_btag037-B100","volume-title":"J Biomed Semantics","author":"K\u00fchnel L and Fluck J","year":"2022"},{"key":"2026020809213544300_btag037-B14","doi-asserted-by":"crossref","first-page":"104487","DOI":"10.1016\/j.jbi.2023.104487","article-title":"BioREx: improving biomedical relation extraction by leveraging heterogeneous datasets","volume":"146","author":"Lai","year":"2023","journal-title":"J Biomed Inform"},{"key":"2026020809213544300_btag037-B15","author":"Lai"},{"key":"2026020809213544300_btag037-B16","doi-asserted-by":"crossref","first-page":"457","DOI":"10.1038\/nj7612-457a","article-title":"Scientific literature: information overload","volume":"535","author":"Landhuis","year":"2016","journal-title":"Nature"},{"key":"2026020809213544300_btag037-B17","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"2026020809213544300_btag037-B18","doi-asserted-by":"crossref","first-page":"bbaf070","DOI":"10.1093\/bib\/bbaf070","article-title":"A large language model framework for literature-based disease\u2013gene association prediction","volume":"26","author":"Li","year":"2024","journal-title":"Brief Bioinform"},{"key":"2026020809213544300_btag037-B19","doi-asserted-by":"crossref","first-page":"btad310","DOI":"10.1093\/bioinformatics\/btad310","article-title":"AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning","volume":"39","author":"Luo","year":"2023","journal-title":"Bioinformatics"},{"key":"2026020809213544300_btag037-B20","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/bib\/bbac282","article-title":"BioRED: a rich biomedical relation extraction dataset","volume":"23","author":"Luo","year":"2022","journal-title":"Brief Bioinform"},{"key":"2026020809213544300_btag037-B21","doi-asserted-by":"crossref","first-page":"111","DOI":"10.1186\/s12859-022-04646-6","article-title":"TBGA: a large-scale gene\u2013disease association dataset for biomedical relation extraction","volume":"23","author":"Marchesin","year":"2022","journal-title":"BMC Bioinformatics"},{"key":"2026020809213544300_btag037-B22","author":"Milosevic","year":"2022"},{"key":"2026020809213544300_btag037-B23","doi-asserted-by":"crossref","first-page":"100591","DOI":"10.1016\/j.xgen.2024.100591","article-title":"Gene\u2013environment interactions within a precision environmental health framework","volume":"4","author":"Motsinger-Reif","year":"2024","journal-title":"Cell Genom"},{"key":"2026020809213544300_btag037-B24","doi-asserted-by":"crossref","first-page":"106","DOI":"10.18653\/v1\/2025.insights-1.11","volume-title":"The Sixth Workshop on Insights from Negative Results in NLP","author":"Nagar","year":"2025"},{"key":"2026020809213544300_btag037-B25","author":"OpenAI","year":"2023"},{"key":"2026020809213544300_btag037-B26","doi-asserted-by":"crossref","first-page":"1038","DOI":"10.1038\/s41431-020-00785-7","article-title":"Origins of human genetics. A personal perspective","volume":"29","author":"Passarge","year":"2021","journal-title":"Eur J Hum Genet"},{"key":"2026020809213544300_btag037-B27","doi-asserted-by":"crossref","first-page":"89","DOI":"10.1038\/s41576-021-00409-w","article-title":"A new era in functional genomics screens","volume":"23","author":"Przybyla","year":"2022","journal-title":"Nat Rev Genet"},{"key":"2026020809213544300_btag037-B28","first-page":"75","volume-title":"Proceedings 12th Joint ACL-ISO Workshop on Interoperable Semantic Annotation, 28 May 2016, Portoroz, Slovenia","author":"Rim","year":"2016"},{"key":"2026020809213544300_btag037-B29","first-page":"129","volume-title":"Proceedings of the 5th Linguistic Annotation Workshop 23rd\u201324th June 2011, Portland, Oregon, USA","author":"Stubbs","year":"2011"},{"key":"2026020809213544300_btag037-B30","doi-asserted-by":"crossref","first-page":"lqab062","DOI":"10.1093\/nargab\/lqab062","article-title":"RENET2: high-performance full-text gene\u2013disease relation extraction with iterative training data","volume":"3","author":"Su","year":"2021","journal-title":"NAR Genom Bioinform"},{"key":"2026020809213544300_btag037-B32","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1186\/1471-2105-13-207","article-title":"A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools","volume":"13","author":"Verspoor","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2026020809213544300_btag037-B33","first-page":"5784","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China","author":"Wadden","year":"2019"},{"key":"2026020809213544300_btag037-B35","doi-asserted-by":"crossref","first-page":"W540","DOI":"10.1093\/nar\/gkae235","article-title":"PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge","volume":"52","author":"Wei","year":"2024","journal-title":"Nucleic Acids Res"},{"key":"2026020809213544300_btag037-B36","first-page":"174","volume-title":"Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association","author":"Yepes","year":"2021"},{"key":"2026020809213544300_btag037-B37","author":"Zhao","year":"2025"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btag037\/66521886\/btag037.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/2\/btag037\/66521886\/btag037.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/2\/btag037\/66521886\/btag037.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,8]],"date-time":"2026-02-08T14:21:46Z","timestamp":1770560506000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btag037\/8435810"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2026,1,3]]},"references-count":35,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,1,3]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btag037","relation":{},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2026,2]]},"published":{"date-parts":[[2026,1,3]]},"article-number":"btag037"}}