{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T16:48:50Z","timestamp":1776185330927,"version":"3.50.1"},"reference-count":39,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2024,12,23]],"date-time":"2024-12-23T00:00:00Z","timestamp":1734912000000},"content-version":"vor","delay-in-days":31,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["2328140"],"award-info":[{"award-number":["2328140"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,11,22]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Reusing massive collections of publicly available biomedical data can significantly impact knowledge discovery. However, these public samples and studies are typically described using unstructured plain text, hindering the findability and further reuse of the data. To combat this problem, we propose txt2onto 2.0, a general-purpose method based on natural language processing and machine learning for annotating biomedical unstructured metadata to controlled vocabularies of diseases and tissues. Compared to the previous version (txt2onto 1.0), which uses numerical embeddings as features, this new version uses words as features, resulting in improved interpretability and performance, especially when few positive training instances are available. Txt2onto 2.0 uses embeddings from a large language model during prediction to deal with unseen-yet-relevant words related to each disease and tissue term being predicted from the input text, thereby explaining the basis of every annotation. We demonstrate the generalizability of txt2onto 2.0 by accurately predicting disease annotations for studies from independent datasets, using proteomics and clinical trials as examples. Overall, our approach can annotate biomedical text regardless of experimental types or sources. Code, data, and trained models are available at https:\/\/github.com\/krishnanlab\/txt2onto2.0.<\/jats:p>","DOI":"10.1093\/bib\/bbae652","type":"journal-article","created":{"date-parts":[[2024,12,23]],"date-time":"2024-12-23T00:29:11Z","timestamp":1734913751000},"source":"Crossref","is-referenced-by-count":1,"title":["Annotating publicly-available samples and studies using interpretable modeling of unstructured metadata"],"prefix":"10.1093","volume":"26","author":[{"given":"Hao","family":"Yuan","sequence":"first","affiliation":[{"name":"Genetics and Genome Sciences Program, Michigan State University , East Lansing, MI 48823 ,","place":["United States"]},{"name":"Ecology, Evolution, and Behavior Program, Michigan State University , East Lansing, MI 48823 ,","place":["United States"]}]},{"given":"Parker","family":"Hicks","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus , Aurora, CO 80045 ,","place":["United States"]}]},{"given":"Mansooreh","family":"Ahmadian","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus , Aurora, CO 80045 ,","place":["United States"]}]},{"given":"Kayla A","family":"Johnson","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus , Aurora, CO 80045 ,","place":["United States"]}]},{"given":"Lydia","family":"Valtadoros","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus , Aurora, CO 80045 ,","place":["United States"]}]},{"given":"Arjun","family":"Krishnan","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus , Aurora, CO 80045 ,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2024,12,22]]},"reference":[{"key":"2024122300285779800_ref1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4939-3578-9_5","article-title":"The gene expression omnibus database","volume":"1418","author":"Clough","year":"2016","journal-title":"Stat Genom: Methods Protoc"},{"key":"2024122300285779800_ref2","doi-asserted-by":"publisher","first-page":"D19","DOI":"10.1093\/nar\/gkq1019","article-title":"The sequence read archive","volume":"39","author":"Leinonen","year":"2010","journal-title":"Nucleic Acids Res"},{"key":"2024122300285779800_ref3","doi-asserted-by":"publisher","first-page":"D543","DOI":"10.1093\/nar\/gkab1038","article-title":"The pride database resources in 2022: a hub for mass spectrometry-based proteomics evidences","volume":"50","author":"Perez-Riverol","year":"2022","journal-title":"Nucleic Acids Res"},{"key":"2024122300285779800_ref4","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1002\/0471250953.bi1413s53","article-title":"Metabolights: an open-access database repository for metabolomics data","volume":"53","author":"Kale","year":"2016","journal-title":"Curr Protoc Bioinform"},{"key":"2024122300285779800_ref5","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1007\/s12551-018-0490-8","article-title":"Mining data and metadata from the gene expression omnibus","volume":"11","author":"Wang","year":"2019","journal-title":"Biophys Rev"},{"key":"2024122300285779800_ref6","doi-asserted-by":"publisher","first-page":"12846","DOI":"10.1038\/ncomms12846","article-title":"Extraction and analysis of signatures from the gene expression omnibus by the crowd","volume":"7","author":"Wang","year":"2016","journal-title":"Nat Commun"},{"key":"2024122300285779800_ref7","doi-asserted-by":"publisher","first-page":"229","DOI":"10.1136\/jamia.2009.002733","article-title":"An overview of metamap: historical perspective and recent advances","volume":"17","author":"Aronson","year":"2010","journal-title":"J Am Med Inform Assoc"},{"key":"2024122300285779800_ref8","first-page":"546","article-title":"The conceptmapper approach to named entity recognition","volume-title":"International Conference on Language Resources and Evaluation","author":"Tanenblatt","year":"2010"},{"key":"2024122300285779800_ref9","doi-asserted-by":"publisher","first-page":"2914","DOI":"10.1093\/bioinformatics\/btx334","article-title":"MetaSRA: normalized human sample-specific metadata for the sequence read archive","volume":"33","author":"Bernstein","year":"2017","journal-title":"Bioinformatics"},{"key":"2024122300285779800_ref10","doi-asserted-by":"publisher","first-page":"23","DOI":"10.1007\/s10916-024-02043-5","article-title":"Transformer models in healthcare: a survey and thematic analysis of potentials, shortcomings and risks","volume":"48","author":"Denecke","year":"2024","journal-title":"J Med Syst"},{"key":"2024122300285779800_ref11","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2021.103982","article-title":"AMMU: a survey of transformer-based biomedical pretrained language models","volume":"126","author":"Kalyan","year":"2022","journal-title":"J Biomed Inform"},{"key":"2024122300285779800_ref12","doi-asserted-by":"publisher","first-page":"187","DOI":"10.1007\/978-3-030-67670-4_12","article-title":"Automated integration of genomic metadata with sequence-to-sequence models","volume-title":"Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14\u201318, 2020, 12461","author":"Cannizzaro","year":"2021"},{"key":"2024122300285779800_ref13","doi-asserted-by":"publisher","first-page":"baac036","DOI":"10.1093\/database\/baac036","article-title":"GEMI: interactive interface for transformer-based genomic metadata integration","volume":"2022","author":"Serna Garcia","year":"2022","journal-title":"Database"},{"key":"2024122300285779800_ref14","article-title":"Google\u2019s neural machine translation system: Bridging the gap between human and machine translation","author":"Wu","year":"2016"},{"key":"2024122300285779800_ref15","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3530811","article-title":"Efficient transformers: a survey","volume":"55","author":"Tay","year":"2022","journal-title":"ACM Comput Surv"},{"key":"2024122300285779800_ref16","doi-asserted-by":"publisher","first-page":"baw080","DOI":"10.1093\/database\/baw080","article-title":"Predicting structured metadata from unstructured metadata","volume":"2016","author":"Posch","year":"2016","journal-title":"Database"},{"key":"2024122300285779800_ref17","first-page":"1","article-title":"Pre-trained language models in biomedical domain: a systematic survey","volume":"56","author":"Wang","year":"2023","journal-title":"ACM Comput Surv"},{"key":"2024122300285779800_ref18","doi-asserted-by":"publisher","first-page":"6736","DOI":"10.1038\/s41467-022-34435-x","article-title":"Systematic tissue annotations of genomics samples by modeling unstructured metadata","volume":"13","author":"Hawkins","year":"2022","journal-title":"Nat Commun"},{"key":"2024122300285779800_ref19","doi-asserted-by":"publisher","first-page":"96ra77\u201396ra77","DOI":"10.1126\/scitranslmed.3001318","article-title":"Discovery and preclinical validation of drug indications using compendia of public gene expression data","volume":"3","author":"Sirota","year":"2011","journal-title":"Sci Transl Med"},{"key":"2024122300285779800_ref20","doi-asserted-by":"publisher","first-page":"101913","DOI":"10.1016\/j.isci.2020.101913","article-title":"Cello: comprehensive and hierarchical cell type classification of human cells with the cell ontology","volume":"24","author":"Bernstein","year":"2021","journal-title":"Iscience"},{"key":"2024122300285779800_ref21","doi-asserted-by":"publisher","first-page":"baab006","DOI":"10.1093\/database\/baab006","article-title":"Curation of over 10 000 transcriptomic studies to enable data reuse","volume":"2021","author":"Lim","year":"2021","journal-title":"Database"},{"key":"2024122300285779800_ref22","doi-asserted-by":"publisher","first-page":"164","DOI":"10.1016\/j.gpb.2021.08.017","article-title":"Comprehensive analysis of ubiquitously expressed genes in humans from a data-driven perspective","volume":"21","author":"Gu","year":"2023","journal-title":"Genom Proteom Bioinform"},{"key":"2024122300285779800_ref23","doi-asserted-by":"publisher","first-page":"D710","DOI":"10.1093\/nar\/gkab1133","article-title":"TissueNexus: a database of human tissue functional gene networks built with a large compendium of curated RNA-seq data","volume":"50","author":"Lin","year":"2022","journal-title":"Nucleic Acids Res"},{"key":"2024122300285779800_ref24","doi-asserted-by":"publisher","first-page":"152","DOI":"10.1016\/j.cels.2018.12.010","article-title":"A computational framework for genome-wide characterization of the human disease landscape","volume":"8","author":"Lee","year":"2019","journal-title":"Cell Syst"},{"key":"2024122300285779800_ref25","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/gb-2012-13-1-r5","article-title":"Uberon, an integrative multi-species anatomy ontology","volume":"13","author":"Mungall","year":"2012","journal-title":"Genome Biol"},{"key":"2024122300285779800_ref26","doi-asserted-by":"crossref","DOI":"10.1101\/2022.04.13.22273750","article-title":"Mondo: unifying diseases for the world, by the world","author":"Vasilevsky","year":"2022"},{"key":"2024122300285779800_ref27","doi-asserted-by":"crossref","first-page":"625","DOI":"10.1145\/1102351.1102430","article-title":"Predicting good probabilities with supervised learning","volume-title":"Proceedings of the 22nd international conference on Machine learning","author":"Niculescu-Mizil","year":"2005"},{"key":"2024122300285779800_ref28","doi-asserted-by":"publisher","first-page":"301","DOI":"10.1111\/j.1467-9868.2005.00503.x","article-title":"Regularization and variable selection via the elastic net","volume":"67","author":"Zou","year":"2005","journal-title":"J R Stat Soc Series B Stat Methodol"},{"key":"2024122300285779800_ref29","doi-asserted-by":"crossref","DOI":"10.1145\/3458754","article-title":"Domain-specific language model pretraining for biomedical natural language processing","volume-title":"ACM Transactions on Computing for Healthcare (HEALTH)","author":"Gu"},{"key":"2024122300285779800_ref30","author":"U.S. National Library of Medicine"},{"key":"2024122300285779800_ref31","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1038\/nrg3394","article-title":"Reuse of public genome-wide gene expression data","volume":"14","author":"Rung","year":"2013","journal-title":"Nat Rev Genet"},{"key":"2024122300285779800_ref32","doi-asserted-by":"publisher","first-page":"e9954","DOI":"10.7717\/peerj.9954","article-title":"The reuse of public datasets in the life sciences: potential risks and rewards","volume":"8","author":"Sielemann","year":"2020","journal-title":"PeerJ"},{"key":"2024122300285779800_ref33","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/sdata.2016.18","article-title":"The fair guiding principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci Data"},{"key":"2024122300285779800_ref34","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s10916-016-0561-y","article-title":"Applying data mining techniques to improve breast cancer diagnosis","volume":"40","author":"Diz","year":"2016","journal-title":"J Med Syst"},{"key":"2024122300285779800_ref35","doi-asserted-by":"publisher","first-page":"1525","DOI":"10.1093\/jamia\/ocac093","article-title":"The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression","volume":"29","author":"van den Goorbergh","year":"2022","journal-title":"J Am Med Inform Assoc"},{"key":"2024122300285779800_ref36","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2022.emnlp-main.414","article-title":"A survey of active learning for natural language processing.","author":"Zhang","year":"2022"},{"key":"2024122300285779800_ref37","article-title":"Semi-supervised classification for natural language processing","author":"Shams","year":"2014"},{"key":"2024122300285779800_ref38","doi-asserted-by":"publisher","first-page":"1138","DOI":"10.1162\/tacl_a_00511","article-title":"Causal inference in natural language processing: estimation, prediction, interpretation and beyond","volume":"10","author":"Feder","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024122300285779800_ref39","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1109\/MIC.2021.3133551","article-title":"CausalKG: causal knowledge graph explainability using interventional and counterfactual reasoning","volume":"26","author":"Jaimini","year":"2022","journal-title":"IEEE Internet Comput"}],"container-title":["Briefings in Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/26\/1\/bbae652\/61254046\/bbae652.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/26\/1\/bbae652\/61254046\/bbae652.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,23]],"date-time":"2024-12-23T00:29:15Z","timestamp":1734913755000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/doi\/10.1093\/bib\/bbae652\/7930339"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,22]]},"references-count":39,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,11,22]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbae652","relation":{},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,1]]},"published":{"date-parts":[[2024,11,22]]},"article-number":"bbae652"}}