{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T08:06:15Z","timestamp":1772697975786,"version":"3.50.1"},"reference-count":58,"publisher":"Oxford University Press (OUP)","issue":"3","license":[{"start":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T00:00:00Z","timestamp":1769990400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"USA National Institutes of Health","doi-asserted-by":"crossref","award":["R01MH111099"],"award-info":[{"award-number":["R01MH111099"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000002","name":"USA National Institutes of Health","doi-asserted-by":"crossref","award":["R03HL168983"],"award-info":[{"award-number":["R03HL168983"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Millions of high-throughput, molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, a major challenge is finding relevant datasets due to the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge is first among the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing 100\u2009000\u2009s of data series. 
GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing these results is time-consuming and tedious, and it often misses relevant datasets.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO\u2019s search engine. The top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. 
Our findings suggest that language models have the potential to improve dataset discovery, likely in combination with existing search tools.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Our analysis code and a Web-based tool that enables others to use our methodology are available from https:\/\/github.com\/srp33\/GEO_NLP and https:\/\/github.com\/srp33\/GEOfinder3.0, respectively.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btag053","type":"journal-article","created":{"date-parts":[[2026,1,29]],"date-time":"2026-01-29T12:48:05Z","timestamp":1769690885000},"source":"Crossref","is-referenced-by-count":0,"title":["Using semantic search to find publicly available gene-expression datasets"],"prefix":"10.1093","volume":"42","author":[{"given":"Grace S","family":"Brown","sequence":"first","affiliation":[{"name":"Department of Biology, Brigham Young University , Provo, UT, 84602,","place":["United States"]}]},{"given":"James","family":"Wengler","sequence":"additional","affiliation":[{"name":"Department of Biology, Brigham Young University , Provo, UT, 84602,","place":["United States"]},{"name":"Institute of Biosciences and Technology, Texas A&M Health Science Center , Houston, TX, 77030,","place":["United States"]}]},{"given":"Aaron Joyce S","family":"Fabelico","sequence":"additional","affiliation":[{"name":"Department of Biology, Brigham Young University , Provo, UT, 84602,","place":["United States"]}]},{"given":"Abigail","family":"Muir","sequence":"additional","affiliation":[{"name":"Department of Biology, Brigham Young University , Provo, UT, 84602,","place":["United States"]}]},{"given":"Anna","family":"Tubbs","sequence":"additional","affiliation":[{"name":"Department of Biology, Brigham Young University , Provo, UT, 84602,","place":["United 
States"]}]},{"given":"Amanda","family":"Warren","sequence":"additional","affiliation":[{"name":"Department of Biology, Brigham Young University , Provo, UT, 84602,","place":["United States"]}]},{"given":"Alexandra N","family":"Millett","sequence":"additional","affiliation":[{"name":"Department of Psychiatry, University of British Columbia , Vancouver, BC, V6T 0A6,","place":["Canada"]},{"name":"Michael Smith Laboratories, University of British Columbia , Vancouver, BC, V6T 1Z4,","place":["Canada"]}]},{"given":"Xinrui","family":"Xiang Yu","sequence":"additional","affiliation":[{"name":"Department of Psychiatry, University of British Columbia , Vancouver, BC, V6T 0A6,","place":["Canada"]},{"name":"Michael Smith Laboratories, University of British Columbia , Vancouver, BC, V6T 1Z4,","place":["Canada"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0426-5028","authenticated-orcid":false,"given":"Paul","family":"Pavlidis","sequence":"additional","affiliation":[{"name":"Department of Psychiatry, University of British Columbia , Vancouver, BC, V6T 0A6,","place":["Canada"]},{"name":"Michael Smith Laboratories, University of British Columbia , Vancouver, BC, V6T 1Z4,","place":["Canada"]}]},{"given":"Sanja","family":"Rogic","sequence":"additional","affiliation":[{"name":"Department of Psychiatry, University of British Columbia , Vancouver, BC, V6T 0A6,","place":["Canada"]},{"name":"Michael Smith Laboratories, University of British Columbia , Vancouver, BC, V6T 1Z4,","place":["Canada"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2001-5640","authenticated-orcid":false,"given":"Stephen R","family":"Piccolo","sequence":"additional","affiliation":[{"name":"Department of Biology, Brigham Young University , Provo, UT, 84602,","place":["United 
States"]}]}],"member":"286","published-online":{"date-parts":[[2026,2,2]]},"reference":[{"key":"2026030502114692200_btag053-B1","doi-asserted-by":"crossref","first-page":"1761","DOI":"10.1093\/bioinformatics\/btab852","article-title":"geoCancerPrognosticDatasetsRetriever: a bioinformatics tool to easily identify cancer prognostic datasets on gene expression omnibus (GEO)","volume":"38","author":"Alameer","year":"2022","journal-title":"Bioinformatics"},{"key":"2026030502114692200_btag053-B2","first-page":"72","author":"Alsentzer","year":"2019"},{"key":"2026030502114692200_btag053-B3","doi-asserted-by":"crossref","first-page":"48","DOI":"10.18637\/jss.v067.i01","article-title":"Fitting linear mixed-effects models using lme4","volume":"67","author":"Bates","year":"2015","journal-title":"J Stat Soft"},{"key":"2026030502114692200_btag053-B4","doi-asserted-by":"crossref","first-page":"2914","DOI":"10.1093\/bioinformatics\/btx334","article-title":"MetaSRA: normalized human sample-specific metadata for the sequence read archive","volume":"33","author":"Bernstein","year":"2017","journal-title":"Bioinformatics"},{"key":"2026030502114692200_btag053-B5","volume-title":"Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit","author":"Bird","year":"2009"},{"key":"2026030502114692200_btag053-B6","first-page":"135","author":"Bojanowski"},{"key":"2026030502114692200_btag053-B8","doi-asserted-by":"publisher","author":"Cao","year":"2024","DOI":"10.48550\/arXiv.2406.01607"},{"key":"2026030502114692200_btag053-B9","doi-asserted-by":"crossref","first-page":"bay145","DOI":"10.1093\/database\/bay145","article-title":"Restructured GEO: restructuring gene expression omnibus metadata for genome dynamics analysis","volume":"2019","author":"Chen","year":"2019","journal-title":"Database"},{"key":"2026030502114692200_btag053-B10","doi-asserted-by":"crossref","first-page":"e1007617","DOI":"10.1371\/journal.pcbi.1007617","article-title":"BioConceptVec: creating 
and evaluating literature-based biomedical concept embeddings on a large scale","volume":"16","author":"Chen","year":"2020","journal-title":"PLoS Comput Biol"},{"key":"2026030502114692200_btag053-B11","doi-asserted-by":"crossref","first-page":"300","DOI":"10.1093\/jamia\/ocx121","article-title":"DataMed\u2013an open source discovery index for finding biomedical datasets","volume":"25","author":"Chen","year":"2018","journal-title":"J Am Med Inform Assoc"},{"key":"2026030502114692200_btag053-B12","doi-asserted-by":"crossref","first-page":"166","DOI":"10.18653\/v1\/W16-2922","volume-title":"Proceedings of the 15th Workshop on Biomedical Natural Language Processing","author":"Chiu","year":"2016"},{"key":"2026030502114692200_btag053-B13","first-page":"1","volume-title":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","author":"Chua","year":"2022"},{"key":"2026030502114692200_btag053-B14","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1007\/978-1-4939-3578-9_5","article-title":"The gene expression omnibus database","volume":"1418","author":"Clough","year":"2016","journal-title":"Methods Mol Biol"},{"key":"2026030502114692200_btag053-B15","first-page":"4171","author":"Devlin","year":"2019"},{"key":"2026030502114692200_btag053-B16","doi-asserted-by":"crossref","first-page":"152","DOI":"10.1016\/j.compbiolchem.2019.03.014","article-title":"Discovery of perturbation gene targets via free text metadata mining in gene expression omnibus","volume":"80","author":"Djordjevic","year":"2019","journal-title":"Comput Biol Chem"},{"key":"2026030502114692200_btag053-B17","doi-asserted-by":"crossref","first-page":"474","DOI":"10.1186\/1471-2164-10-474","article-title":"The l1-l2 regularization framework unmasks the hypoxia signature hidden in the transcriptome of a set of heterogeneous neuroblastoma cell lines","volume":"10","author":"Fardin","year":"2009","journal-title":"BMC 
Genomics"},{"key":"2026030502114692200_btag053-B18","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1038\/s41597-020-00696-8","article-title":"Manually curated and harmonised transcriptomics datasets of psoriasis and atopic dermatitis patients","volume":"7","author":"Federico","year":"2020","journal-title":"Sci Data"},{"key":"2026030502114692200_btag053-B19","doi-asserted-by":"crossref","first-page":"bat013","DOI":"10.1093\/database\/bat013","article-title":"curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome","volume":"2013","author":"Ganzfried","year":"2013","journal-title":"Database (Oxford)"},{"key":"2026030502114692200_btag053-B20","doi-asserted-by":"crossref","first-page":"8770","DOI":"10.1038\/s41598-019-45165-4","article-title":"MetaGxData: clinically annotated breast, ovarian and pancreatic cancer datasets and their use in generating a multi-cancer gene signature","volume":"9","author":"Gendoo","year":"2019","journal-title":"Sci Rep"},{"key":"2026030502114692200_btag053-B21","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1186\/s12859-017-1888-1","article-title":"ALE: automated label extraction from GEO metadata","volume":"18","author":"Giles","year":"2017","journal-title":"BMC Bioinformatics"},{"key":"2026030502114692200_btag053-B22","doi-asserted-by":"crossref","first-page":"1925","DOI":"10.1093\/bioinformatics\/btt333","article-title":"Crowdsourcing for bioinformatics","volume":"29","author":"Good","year":"2013","journal-title":"Bioinformatics"},{"key":"2026030502114692200_btag053-B23","doi-asserted-by":"crossref","first-page":"170125","DOI":"10.1038\/sdata.2017.125","article-title":"Precision annotation of digital samples in NCBI\u2019s gene expression omnibus","volume":"4","author":"Hadley","year":"2017","journal-title":"Sci Data"},{"key":"2026030502114692200_btag053-B24","doi-asserted-by":"crossref","first-page":"100042","DOI":"10.1016\/j.cmpbup.2021.100042","article-title":"BERT based clinical 
knowledge extraction for biomedical knowledge graph construction and analysis","volume":"1","author":"Harnoune","year":"2021","journal-title":"Comput Methods Programs Biomed Update"},{"key":"2026030502114692200_btag053-B25","doi-asserted-by":"crossref","first-page":"6736","DOI":"10.1038\/s41467-022-34435-x","article-title":"Systematic tissue annotations of genomics samples by modeling unstructured metadata","volume":"13","author":"Hawkins","year":"2022","journal-title":"Nat Commun"},{"key":"2026030502114692200_btag053-B26","first-page":"28","volume-title":"European Conference on Information Retrieval","author":"Kamphuis","year":"2020"},{"key":"2026030502114692200_btag053-B27","doi-asserted-by":"crossref","first-page":"btad069","DOI":"10.1093\/bioinformatics\/btad069","article-title":"GEOfetch: a command-line tool for downloading data and standardized metadata from GEO and SRA","volume":"39","author":"Khoroshevskyi","year":"2023","journal-title":"Bioinformatics"},{"key":"2026030502114692200_btag053-B28","doi-asserted-by":"crossref","first-page":"1366","DOI":"10.1038\/s41467-018-03751-6","article-title":"Massive mining of publicly available RNA-seq data from human and mouse","volume":"9","author":"Lachmann","year":"2018","journal-title":"Nat Commun"},{"key":"2026030502114692200_btag053-B29","doi-asserted-by":"crossref","first-page":"401","DOI":"10.1037\/rev0000297","article-title":"Word meaning in minds and machines","volume":"130","author":"Lake","year":"2023","journal-title":"Psychol Rev"},{"key":"2026030502114692200_btag053-B30","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"2026030502114692200_btag053-B31","doi-asserted-by":"crossref","first-page":"giae033","DOI":"10.1093\/gigascience\/giae033","article-title":"PEPhub: a database, 
web interface, and API for editing, sharing, and validating biological sample metadata","volume":"13","author":"LeRoy","year":"2024","journal-title":"GigaScience"},{"key":"2026030502114692200_btag053-B32","doi-asserted-by":"crossref","first-page":"baab006","DOI":"10.1093\/database\/baab006","article-title":"Curation of over 10\u2009000 transcriptomic studies to enable data reuse","volume":"2021","author":"Lim","year":"2021","journal-title":"Database"},{"key":"2026030502114692200_btag053-B33","first-page":"265","article-title":"Medical subject headings (MeSH)","volume":"88","author":"Lipscomb","year":"2000","journal-title":"Bull Med Libr Assoc"},{"key":"2026030502114692200_btag053-B34","author":"Liu","year":"2019"},{"key":"2026030502114692200_btag053-B35","doi-asserted-by":"crossref","first-page":"1103","DOI":"10.1001\/jama.1994.03510380059038","article-title":"Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches","volume":"271","author":"Lowe","year":"1994","journal-title":"Jama"},{"key":"2026030502114692200_btag053-B36","doi-asserted-by":"publisher","author":"L\u00f9","year":"2024","DOI":"10.48550\/arXiv.2407.03618"},{"key":"2026030502114692200_btag053-B37","doi-asserted-by":"crossref","first-page":"416","DOI":"10.1038\/nrrheum.2011.68","article-title":"Pathogenesis of systemic juvenile idiopathic arthritis: some answers, more questions","volume":"7","author":"Mellins","year":"2011","journal-title":"Nat Rev Rheumatol"},{"key":"2026030502114692200_btag053-B38","author":"Mikolov","year":"2013"},{"key":"2026030502114692200_btag053-B39","author":"Mikolov","year":"2018"},{"key":"2026030502114692200_btag053-B40","doi-asserted-by":"crossref","first-page":"W170","DOI":"10.1093\/nar\/gkp440","article-title":"BioPortal: ontologies and integrated data resources at the click of a mouse","volume":"37","author":"Noy","year":"2009","journal-title":"Nucleic Acids 
Res"},{"key":"2026030502114692200_btag053-B41","doi-asserted-by":"crossref","first-page":"816","DOI":"10.1038\/ng.3864","article-title":"Finding useful data across multiple biomedical data repositories using DataMed","volume":"49","author":"Ohno-Machado","year":"2017","journal-title":"Nat Genet"},{"key":"2026030502114692200_btag053-B42","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/database\/baaa064","article-title":"A content-based dataset recommendation system for researchers\u2014a case study on gene expression omnibus (GEO) repository","volume":"2020","author":"Patra","year":"2020","journal-title":"Database (Oxford)"},{"key":"2026030502114692200_btag053-B43","first-page":"1532","author":"Pennington","year":"2014"},{"key":"2026030502114692200_btag053-B44","doi-asserted-by":"crossref","first-page":"3512","DOI":"10.1038\/s41467-019-11461-w","article-title":"Quantifying the impact of public omics data","volume":"10","author":"Perez-Riverol","year":"2019","journal-title":"Nat Commun"},{"key":"2026030502114692200_btag053-B45","doi-asserted-by":"crossref","first-page":"30","DOI":"10.1186\/s13742-016-0135-4","article-title":"Tools and techniques for computational reproducibility","volume":"5","author":"Piccolo","year":"2016","journal-title":"Gigascience"},{"key":"2026030502114692200_btag053-B46","doi-asserted-by":"crossref","first-page":"414","DOI":"10.12688\/f1000research.8375.1","article-title":"A curated transcriptome dataset collection to investigate the functional programming of human hematopoietic cells in early life","volume":"5","author":"Rahman","year":"2016","journal-title":"F1000Res"},{"key":"2026030502114692200_btag053-B47","doi-asserted-by":"crossref","first-page":"1844","DOI":"10.1093\/jamia\/ocae029","article-title":"BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights","volume":"31","author":"Remy","year":"2024","journal-title":"J Am Med Inform 
Assoc"},{"key":"2026030502114692200_btag053-B48","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1561\/1500000019","article-title":"The probabilistic relevance framework: BM25 and beyond","volume":"3","author":"Robertson","year":"2009","journal-title":"FNT in Information Retrieval"},{"key":"2026030502114692200_btag053-B49","doi-asserted-by":"crossref","first-page":"740","DOI":"10.1002\/ajmg.b.32571","article-title":"The regulation of tetraspanin 8 gene expression\u2014a potential new mechanism in the pathogenesis of bipolar disorder","volume":"174","author":"Schartner","year":"2017","journal-title":"Am J Med Genet B Neuropsychiatr Genet"},{"key":"2026030502114692200_btag053-B50","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1097\/PPO.0b013e3181cf04be","article-title":"What is the difference between triple-negative and basal breast cancers?","volume":"16","author":"Seal","year":"2010","journal-title":"Cancer J"},{"key":"2026030502114692200_btag053-B51","doi-asserted-by":"crossref","first-page":"803","DOI":"10.1038\/nbt.3603","article-title":"A crowdsourcing approach for reusing and meta-analyzing gene expression data","volume":"34","author":"Shah","year":"2016","journal-title":"Nat Biotechnol"},{"key":"2026030502114692200_btag053-B52","article-title":"Mondo disease","volume":"2807","author":"Vasilevsky","year":"2020"},{"key":"2026030502114692200_btag053-B53","doi-asserted-by":"crossref","first-page":"514","DOI":"10.1038\/s41598-019-56339-5","article-title":"PulmonDB: a curated lung disease gene expression database","volume":"10","author":"Villase\u00f1or-Altamirano","year":"2020","journal-title":"Sci Rep"},{"key":"2026030502114692200_btag053-B54","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1016\/j.jbi.2018.09.008","article-title":"A comparison of word embeddings for the biomedical natural language processing","volume":"87","author":"Wang","year":"2018","journal-title":"J Biomed 
Inform"},{"key":"2026030502114692200_btag053-B55","doi-asserted-by":"crossref","first-page":"12846","DOI":"10.1038\/ncomms12846","article-title":"Extraction and analysis of signatures from the gene expression omnibus by the crowd","volume":"7","author":"Wang","year":"2016","journal-title":"Nat Commun"},{"key":"2026030502114692200_btag053-B56","doi-asserted-by":"crossref","first-page":"1686","DOI":"10.21105\/joss.01686","article-title":"Welcome to the tidyverse","volume":"4","author":"Wickham","year":"2019","journal-title":"JOSS"},{"key":"2026030502114692200_btag053-B57","doi-asserted-by":"crossref","first-page":"160018","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR guiding principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci Data"},{"key":"2026030502114692200_btag053-B58","first-page":"38","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Wolf","year":"2020"},{"key":"2026030502114692200_btag053-B59","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1186\/1756-0381-7-18","article-title":"ExpressionData - a public resource of high quality curated datasets representing gene expression across anatomy, development and experimental conditions","volume":"7","author":"Zimmermann","year":"2014","journal-title":"BioData 
Min"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btag053\/66709526\/btag053.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/3\/btag053\/66709526\/btag053.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/3\/btag053\/66709526\/btag053.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T07:11:58Z","timestamp":1772694718000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btag053\/8455039"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2026,2,2]]},"references-count":58,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,2,28]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btag053","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2026,3]]},"published":{"date-parts":[[2026,2,2]]},"article-number":"btag053"}}