{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T19:58:34Z","timestamp":1773259114936,"version":"3.50.1"},"reference-count":41,"publisher":"Oxford University Press (OUP)","funder":[{"name":"the U.S. Department of Agriculture, Agricultural Research Service","award":["2030-21000-056-00D 5030-21000-072-000D"],"award-info":[{"award-number":["2030-21000-056-00D 5030-21000-072-000D"]}]},{"name":"the U.S. Department of Agriculture, Agricultural Research Service","award":["2030-21000-056-00D 5030-21000-072-000D"],"award-info":[{"award-number":["2030-21000-056-00D 5030-21000-072-000D"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,2,17]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Curated resources at centralized repositories provide high-value service to users by enhancing data veracity. Curation, however, comes with a cost, as it requires dedicated time and effort from personnel with deep domain knowledge. In this paper, we investigate the performance of a large language model (LLM), specifically generative pre-trained transformer (GPT)-3.5 and GPT-4, in extracting and presenting data against a human curator. In order to accomplish this task, we used a small set of journal articles on wheat and barley genetics, focusing on traits, such as salinity tolerance and disease resistance, which are becoming more important. The 36 papers were then curated by a professional curator for the GrainGenes database (https:\/\/wheat.pw.usda.gov). In parallel, we developed a GPT-based retrieval-augmented generation question-answering system and compared how GPT performed in answering questions about traits and quantitative trait loci (QTLs). Our findings show that on average GPT-4 correctly categorized manuscripts 97% of the time, correctly extracted 80% of traits, and 61% of marker\u2013trait associations. Furthermore, we assessed the ability of a GPT-based DataFrame agent to filter and summarize curated wheat genetics data, showing the potential of human and computational curators working side-by-side. In one case study, our findings show that GPT-4 was able to retrieve up to 91% of disease related, human-curated QTLs across the whole genome, and up to 96% across a specific genomic region through prompt engineering. Also, we observed that across most tasks, GPT-4 consistently outperformed GPT-3.5 while generating less hallucinations, suggesting that improvements in LLM models will make generative artificial intelligence a much more accurate companion for curators in extracting information from scientific literature. Despite their limitations, LLMs demonstrated a potential to extract and present information to curators and users of biological databases, as long as users are aware of potential inaccuracies and the possibility of incomplete information extraction.<\/jats:p>","DOI":"10.1093\/database\/baaf011","type":"journal-article","created":{"date-parts":[[2025,1,31]],"date-time":"2025-01-31T15:25:39Z","timestamp":1738337139000},"source":"Crossref","is-referenced-by-count":8,"title":["Assessing the performance of generative artificial intelligence in retrieving information against manually curated genetic and genomic data"],"prefix":"10.1093","volume":"2025","author":[{"given":"Elly","family":"Poretsky","sequence":"first","affiliation":[{"name":"Crop Improvement and Genetics Research Unit, United States Department of Agriculture\u2014Agricultural Research Service, Western Regional Research Center , 800 Buchanan St, Albany, CA 94710,","place":["United States"]}]},{"given":"Victoria C","family":"Blake","sequence":"additional","affiliation":[{"name":"Crop Improvement and Genetics Research Unit, United States Department of Agriculture\u2014Agricultural Research Service, Western Regional Research Center , 800 Buchanan St, Albany, CA 94710,","place":["United States"]},{"name":"Department of Plant Sciences and Plant Pathology, Montana State University , 119 Plant Biosciences Building, Bozeman, MT 59717,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2380-6704","authenticated-orcid":false,"given":"Carson M","family":"Andorf","sequence":"additional","affiliation":[{"name":"Corn Insects and Crop Genetics Research, U.S. Department of Agriculture, Agricultural Research Service , 819 Wallace Rd, Ames, IA 50011,","place":["United States"]},{"name":"Department of Computer Science, Iowa State University , 2434 Osborn Dr, Ames, IA 50011,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5553-6190","authenticated-orcid":false,"given":"Taner Z","family":"Sen","sequence":"additional","affiliation":[{"name":"Crop Improvement and Genetics Research Unit, United States Department of Agriculture\u2014Agricultural Research Service, Western Regional Research Center , 800 Buchanan St, Albany, CA 94710,","place":["United States"]},{"name":"Department of Bioengineering, University of California , 306 Stanley Hall, Berkeley, CA 94720-1762,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2025,2,17]]},"reference":[{"key":"2025092510514454300_R1","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/database\/bay088","article-title":"AgBioData consortium recommendations for sustainable genomics and genetics databases for agriculture","volume":"2018","author":"Harper","year":"2018","journal-title":"Database"},{"key":"2025092510514454300_R2","doi-asserted-by":"publisher","first-page":"2","DOI":"10.1016\/j.cpb.2017.11.001","article-title":"The art of curation at a biological database: principles and application","volume":"11\u201312","author":"Odell","year":"2017","journal-title":"Curr Plant Biol"},{"key":"2025092510514454300_R3","first-page":"102","article-title":"The Georgetown-IBM experiment demonstrated in January 1954","author":"Hutchins","year":"2004"},{"key":"2025092510514454300_R4","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1301.3781","article-title":"Efficient estimation of word representations in vector space","author":"Mikolov","year":"2013","journal-title":"arXiv"},{"key":"2025092510514454300_R5","doi-asserted-by":"publisher","first-page":"16","DOI":"10.1038\/d41586-024-00592-w","article-title":"Is ChatGPT making scientists hyper-productive? The highs and lows of using AI","volume":"627","author":"Prillaman","year":"2024","journal-title":"Nature"},{"key":"2025092510514454300_R6","doi-asserted-by":"publisher","DOI":"10.1101\/2024.11.05.622126","article-title":"Advancing plant metabolic research by using large language models to expand databases and extract labelled data","author":"Knapp","year":"2024","journal-title":"bioRxiv"},{"key":"2025092510514454300_R7","first-page":"2436","article-title":"AgriPrompt: a method to enhance ChatGPT for agricultural question answering","author":"Chen","year":"2024"},{"key":"2025092510514454300_R8","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2401.01600","article-title":"PLLaMa: an open-source large language model for plant science","author":"Yang","year":"2024","journal-title":"arXiv"},{"key":"2025092510514454300_R9","doi-asserted-by":"publisher","first-page":"1930","DOI":"10.1038\/s41591-023-02448-8","article-title":"Large language models in medicine","volume":"29","author":"Thirunavukarasu","year":"2023","journal-title":"Nat Med"},{"key":"2025092510514454300_R10","doi-asserted-by":"publisher","first-page":"1302","DOI":"10.1681\/ASN.0000000000000166","article-title":"Retrieve, summarize, and verify: how will ChatGPT affect information seeking from the medical literature?","volume":"34","author":"Jin","year":"2023","journal-title":"J Am Soc Nephrol"},{"key":"2025092510514454300_R11","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/JBHI.2024.3505955","article-title":"BioMedGPT: an open multimodal large language model for BioMedicine","volume":"2024","author":"Luo","year":"2024","journal-title":"IEEE J Biomed Health Inform"},{"key":"2025092510514454300_R12","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btae075","article-title":"GeneGPT: augmenting large language models with domain tools for improved access to biomedical information","volume":"40","author":"Jin","year":"2024","journal-title":"Bioinformatics"},{"key":"2025092510514454300_R13","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2307.16789","article-title":"ToolLLM: facilitating large language models to master 16000+ real-world APIs","author":"Qin","year":"2023","journal-title":"arXiv"},{"key":"2025092510514454300_R14","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2005.11401","article-title":"Retrieval-augmented generation for knowledge-intensive NLP tasks","author":"Lewis","year":"2021","journal-title":"arXiv"},{"key":"2025092510514454300_R15","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2202.01110","article-title":"A survey on retrieval-augmented text generation","author":"Li","year":"2022","journal-title":"arXiv"},{"key":"2025092510514454300_R16","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2310.03025","article-title":"Retrieval meets long context large language models","author":"Xu","year":"2024","journal-title":"arXiv"},{"key":"2025092510514454300_R17","doi-asserted-by":"publisher","first-page":"1122","DOI":"10.1109\/JAS.2023.123618","article-title":"A brief overview of ChatGPT: the history, status quo and potential future development","volume":"10","author":"Wu","year":"2023","journal-title":"IEEE\/CAA J Autom Sinica"},{"key":"2025092510514454300_R18","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2023.101861","article-title":"ChatGPT: jack of all trades, master of none","volume":"99","author":"Koco\u0144","year":"2023","journal-title":"Inf Fusion"},{"key":"2025092510514454300_R19","doi-asserted-by":"publisher","first-page":"W518","DOI":"10.1093\/nar\/gkt441","article-title":"PubTator: a web-based text mining tool for assisting biocuration","volume":"41","author":"Wei","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2025092510514454300_R20","doi-asserted-by":"publisher","first-page":"W540","DOI":"10.1093\/nar\/gkae235","article-title":"PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge","volume":"52","author":"Wei","journal-title":"Nucleic Acids Res"},{"key":"2025092510514454300_R21","doi-asserted-by":"publisher","DOI":"10.1101\/2023.10.16.562533","article-title":"GenePT: a simple but effective foundation model for genes and cells built from ChatGPT","author":"Chen","year":"2023","journal-title":"bioRxiv"},{"key":"2025092510514454300_R22","doi-asserted-by":"publisher","DOI":"10.1101\/2023.11.08.566195","article-title":"ChatGPT usage in the reactome curation process","author":"Tiwari","year":"2023","journal-title":"bioRxiv"},{"key":"2025092510514454300_R23","doi-asserted-by":"publisher","first-page":"18048","DOI":"10.1021\/jacs.3c05819","article-title":"ChatGPT chemistry assistant for text mining and the prediction of MOF synthesis","volume":"145","author":"Zheng","year":"2023","journal-title":"J Am Chem Soc"},{"key":"2025092510514454300_R24","doi-asserted-by":"publisher","DOI":"10.1186\/s13326-024-00320-3","article-title":"Dynamic Retrieval Augmented Generation of Ontologies using Artificial Intelligence (DRAGON-AI)","volume":"15","author":"Toro","year":"2024","journal-title":"J Biomed Semant"},{"key":"2025092510514454300_R25","doi-asserted-by":"publisher","first-page":"000790","DOI":"10.1099\/acmi.0.000790.v2","article-title":"An evaluation of ChatGPT and Bard (Gemini) in the context of biological knowledge retrieval","volume":"6","author":"Caspi","year":"2024","journal-title":"Access Microbiol"},{"key":"2025092510514454300_R26","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3571730","article-title":"Survey of hallucination in natural language generation","volume":"55","author":"Ji","year":"2023","journal-title":"ACM Comput Surv"},{"key":"2025092510514454300_R27","doi-asserted-by":"publisher","DOI":"10.1093\/database\/bar059","article-title":"Biocurators and biocuration: surveying the twenty-first century challenges","volume":"2012","author":"Burge","year":"2012","journal-title":"Database"},{"key":"2025092510514454300_R28","volume-title":"The Coming Wave","author":"Suleyman","year":"2023","edition":"5th"},{"key":"2025092510514454300_R29","doi-asserted-by":"publisher","first-page":"1","DOI":"10.52631\/jemds.v3i1.175","article-title":"ChatGPT and academic research: a review and recommendations based on practical examples","volume":"3","author":"Rahman","year":"2023","journal-title":"J Educ Manag Dev Stud"},{"key":"2025092510514454300_R30","doi-asserted-by":"publisher","DOI":"10.1093\/database\/baac034","article-title":"GrainGenes: a data-rich repository for small grains genetics and genomics","volume":"2022","author":"Yao","year":"2022","journal-title":"Database"},{"key":"2025092510514454300_R31","doi-asserted-by":"publisher","DOI":"10.1126\/science.aar7191","article-title":"Shifting the limits in wheat research and breeding using a fully annotated reference genome","volume":"361","author":"The International Wheat Genome Sequencing Consortium (IWGSC)","year":"2018","journal-title":"Science"},{"key":"2025092510514454300_R32","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J Mol Biol"},{"key":"2025092510514454300_R33","first-page":"645","article-title":"Table meets LLM: can large language models understand structured table data? A benchmark and empirical study","author":"Sui","year":"2024"},{"key":"2025092510514454300_R34","doi-asserted-by":"publisher","DOI":"10.20944\/preprints202303.0422.v1","article-title":"GPT-4 vs. GPT-3.5: a concise showdown","author":"Koubaa","year":"2023","journal-title":"Preprints"},{"key":"2025092510514454300_R35","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2311.01964","article-title":"Don\u2019t make your LLM an evaluation benchmark cheater","author":"Zhou","year":"2023","journal-title":"arXiv"},{"key":"2025092510514454300_R36","doi-asserted-by":"crossref","first-page":"6233","DOI":"10.18653\/v1\/2024.findings-acl.372","article-title":"Benchmarking retrieval-augmented generation for medicine","volume-title":"Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand","author":"Xiong","year":"2024"},{"key":"2025092510514454300_R37","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2411.03538","article-title":"Long context RAG performance of large language models","author":"Leng","year":"2024","journal-title":"arXiv"},{"key":"2025092510514454300_R38","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2407.01370","article-title":"Summary of a Haystack: a challenge to long-context LLMs and RAG systems","author":"Laban","year":"2024","journal-title":"arXiv"},{"key":"2025092510514454300_R39","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2406.15319","article-title":"LongRAG: enhancing retrieval-augmented generation with long-context LLMs","author":"Jiang","year":"2024","journal-title":"arXiv"},{"key":"2025092510514454300_R40","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2407.10670","article-title":"Enhancing retrieval and managing retrieval: a four-module synergy for improved quality and efficiency in RAG systems","author":"Shi","year":"2024","journal-title":"arXiv"},{"key":"2025092510514454300_R41","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1142\/9789819807024_0015","volume-title":"Biocomputing 2025","author":"Xiong","year":"2024"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaf011\/61935148\/baaf011.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaf011\/61935148\/baaf011.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T14:51:54Z","timestamp":1758811914000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baaf011\/8019548"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":41,"URL":"https:\/\/doi.org\/10.1093\/database\/baaf011","relation":{},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]},"article-number":"baaf011"}}