{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T05:18:10Z","timestamp":1772515090942,"version":"3.50.1"},"reference-count":47,"publisher":"Oxford University Press (OUP)","issue":"3","license":[{"start":{"date-parts":[[2018,1,13]],"date-time":"2018-01-13T00:00:00Z","timestamp":1515801600000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"NIH","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000060","name":"National Institute of Allergy and Infectious Diseases","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000060","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,3,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Objective<\/jats:title><jats:p>Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain.<\/jats:p><\/jats:sec><jats:sec><jats:title>Materials and Methods<\/jats:title><jats:p>DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health\u2013funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results and Conclusion<\/jats:title><jats:p>Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the number of relevant results in the top 10 search results) of 0.6022, by implementing advanced natural language processing and terminology services. Currently, we have made the DataMed system publically available as an open source package for the biomedical community.<\/jats:p><\/jats:sec>","DOI":"10.1093\/jamia\/ocx121","type":"journal-article","created":{"date-parts":[[2017,9,28]],"date-time":"2017-09-28T19:12:57Z","timestamp":1506625977000},"page":"300-308","source":"Crossref","is-referenced-by-count":62,"title":["DataMed \u2013 an open source discovery index for finding biomedical datasets"],"prefix":"10.1093","volume":"25","author":[{"given":"Xiaoling","family":"Chen","sequence":"first","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]},{"given":"Anupama E","family":"Gururaj","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]},{"given":"Burak","family":"Ozyurt","sequence":"additional","affiliation":[{"name":"Center for Research in Biological Systems"}]},{"given":"Ruiling","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]},{"given":"Ergin","family":"Soysal","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]},{"given":"Trevor","family":"Cohen","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]},{"given":"Firat","family":"Tiryaki","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]},{"given":"Yueling","family":"Li","sequence":"additional","affiliation":[{"name":"Center for Research in Biological Systems"}]},{"given":"Nansu","family":"Zong","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA"}]},{"given":"Min","family":"Jiang","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]},{"given":"Deevakar","family":"Rogith","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]},{"given":"Mandana","family":"Salimi","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]},{"given":"Hyeon-eui","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA"}]},{"given":"Philippe","family":"Rocca-Serra","sequence":"additional","affiliation":[{"name":"e-Research Centre, University of Oxford, Oxford, UK"}]},{"given":"Alejandra","family":"Gonzalez-Beltran","sequence":"additional","affiliation":[{"name":"e-Research Centre, University of Oxford, Oxford, UK"}]},{"given":"Claudiu","family":"Farcas","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA"}]},{"given":"Todd","family":"Johnson","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]},{"given":"Ron","family":"Margolis","sequence":"additional","affiliation":[{"name":"National Institutes of Health, Bethesda, MD, USA"}]},{"given":"George","family":"Alter","sequence":"additional","affiliation":[{"name":"University of Michigan, Ann Arbor, MI, USA"}]},{"given":"Susanna-Assunta","family":"Sansone","sequence":"additional","affiliation":[{"name":"e-Research Centre, University of Oxford, Oxford, UK"}]},{"given":"Ian M","family":"Fore","sequence":"additional","affiliation":[{"name":"National Institutes of Health, Bethesda, MD, USA"}]},{"given":"Lucila","family":"Ohno-Machado","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA"}]},{"given":"Jeffrey S","family":"Grethe","sequence":"additional","affiliation":[{"name":"Center for Research in Biological Systems"}]},{"given":"Hua","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA"}]}],"member":"286","published-online":{"date-parts":[[2018,1,13]]},"reference":[{"key":"2020110612232276200_ocx121-B1","doi-asserted-by":"crossref","first-page":"160018","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR Guiding Principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Scientific Data."},{"key":"2020110612232276200_ocx121-B2","volume-title":"bioCADDIE White Paper \u2013 Data Discovery Index","author":"Lucila","year":"2015"},{"key":"2020110612232276200_ocx121-B3","doi-asserted-by":"crossref","first-page":"816","DOI":"10.1038\/ng.3864","article-title":"DataMed: Finding useful data across multiple biomedical data repositories","volume":"49","author":"Ohno-Machado","year":"2017","journal-title":"Nature Genet."},{"key":"2020110612232276200_ocx121-B4","author":"NIH Data Sharing Repositories"},{"issue":"1","key":"2020110612232276200_ocx121-B5","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1093\/nar\/30.1.207","article-title":"Gene expression Omnibus: NCBI gene expression and hybridization array data repository","volume":"30","author":"Edgar","year":"2002","journal-title":"Nucleic Acids Res."},{"issue":"1","key":"2020110612232276200_ocx121-B6","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1093\/nar\/28.1.235","article-title":"The Protein Data Bank","volume":"28","author":"Berman","year":"2000","journal-title":"Nucleic Acids Res."},{"issue":"2\u20133","key":"2020110612232276200_ocx121-B7","doi-asserted-by":"crossref","first-page":"234","DOI":"10.1007\/s12026-014-8516-1","article-title":"ImmPort: disseminating data to the public for the future of immunology","volume":"58","author":"Bhattacharya","year":"2014","journal-title":"Immunol Res."},{"issue":"5","key":"2020110612232276200_ocx121-B8","doi-asserted-by":"crossref","first-page":"406","DOI":"10.1038\/nbt.3790","article-title":"Discovering and linking public omics data sets using the Omics Discovery Index","volume":"35","author":"Perez-Riverol","year":"2017","journal-title":"Nat Biotechnol."},{"key":"2020110612232276200_ocx121-B9","doi-asserted-by":"crossref","DOI":"10.1109\/COINFO.2009.66","article-title":"DataCite \u2013 A Global Registration Agency for Research Data","volume-title":"2009 Fourth International Conference on Cooperation and Promotion of Information Resources in Science and Technology","author":"Brase","year":"2009"},{"key":"2020110612232276200_ocx121-B10","doi-asserted-by":"crossref","DOI":"10.1093\/database\/bas005","article-title":"A hybrid human and machine resource curation pipeline for the Neuroscience Information Framework","author":"Bandrowski","year":"2012","journal-title":"Database"},{"issue":"9","key":"2020110612232276200_ocx121-B11","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0136206","article-title":"The NIDDK information network: a community portal for finding data, materials, and tools for researchers studying diabetes, digestive, and kidney diseases","volume":"10","author":"Whetzel","year":"2015","journal-title":"PLOS ONE."},{"key":"2020110612232276200_ocx121-B12","doi-asserted-by":"crossref","first-page":"134","DOI":"10.12688\/f1000research.6555.1","article-title":"The resource identification initiative: a cultural shift in publishing","volume":"4","author":"Bandrowski","year":"2015","journal-title":"F1000Res."},{"key":"2020110612232276200_ocx121-B13","doi-asserted-by":"crossref","first-page":"173","DOI":"10.1177\/0049124107306660","article-title":"An Introduction to the Dataverse Network as an Infrastructure for Data Sharing","volume":"36","author":"King","year":"2007","journal-title":"Soc Methods Res."},{"key":"2020110612232276200_ocx121-B14","doi-asserted-by":"crossref","first-page":"170059","DOI":"10.1038\/sdata.2017.59","article-title":"DATS, the data tag suite to enable discoverability of datasets","volume":"4","author":"Sansone","year":"2017","journal-title":"Sci Data."},{"key":"2020110612232276200_ocx121-B15","volume-title":"ElasticSearch Server","author":"Ku\u0107","year":"2013"},{"key":"2020110612232276200_ocx121-B16","doi-asserted-by":"crossref","DOI":"10.1109\/ICCIT.2009.130","article-title":"The Research of PHP Development Framework Based on MVC Pattern","volume-title":"2009 Fourth International Conference on Computer Sciences and Convergence Information Technology","author":"Cui","year":"2009"},{"key":"2020110612232276200_ocx121-B17","author":"PubMed Entrez Programming Utilities","year":"2017"},{"key":"2020110612232276200_ocx121-B18","author":"Research Portfolio Online Reporting Tools (RePORT)","year":"2015"},{"issue":"4","key":"2020110612232276200_ocx121-B19","doi-asserted-by":"crossref","first-page":"841","DOI":"10.1093\/jamia\/ocw177","article-title":"MetaMap Lite: an evaluation of a new Java implementation of MetaMap","volume":"24","author":"Demner-Fushman","year":"2017","journal-title":"J Am Med Inform Assoc."},{"key":"2020110612232276200_ocx121-B20","first-page":"254","article-title":"UTH-CCB@BioCreative V CDR Task: Identifying Chemical-induced Disease Relations in Biomedical Text","volume-title":"Fifth BioCreative Challenge Evaluation Workshop","author":"Xu","year":"2015"},{"issue":"22","key":"2020110612232276200_ocx121-B21","doi-asserted-by":"crossref","first-page":"3045","DOI":"10.1093\/bioinformatics\/btp536","article-title":"QuickGO: a web-based tool for Gene Ontology searching","volume":"25","author":"Binns","year":"2009","journal-title":"Bioinformatics."},{"issue":"2","key":"2020110612232276200_ocx121-B22","doi-asserted-by":"crossref","first-page":"276","DOI":"10.1093\/bioinformatics\/btv570","article-title":"Cell line name recognition in support of the identification of synthetic lethality in cancer from text","volume":"32","author":"Kaewphan","year":"2016","journal-title":"Bioinformatics."},{"key":"2020110612232276200_ocx121-B23","first-page":"114","article-title":"Medical subject headings","volume":"51","author":"Rogers","year":"1963","journal-title":"Bull Med Libr Assoc."},{"key":"2020110612232276200_ocx121-B24","author":"International Health Terminology Standards Development Organisation"},{"issue":"Database issue","key":"2020110612232276200_ocx121-B25","first-page":"D258","article-title":"The Gene Ontology (GO) database and informatics resource","volume":"32","author":"Harris","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"2020110612232276200_ocx121-B26","volume-title":"Foundational Model of Anatomy","author":"Structural Informatics Group","year":"2017"},{"issue":"Database issue","key":"2020110612232276200_ocx121-B27","first-page":"D136","volume":"40","author":"Federhen","year":"2012","journal-title":"The NCBI Taxonomy database"},{"issue":"Database issue","key":"2020110612232276200_ocx121-B28","doi-asserted-by":"crossref","first-page":"D1079","DOI":"10.1093\/nar\/gku1071","article-title":"Genenames.org: the HGNC resources in 2015","volume":"43","author":"Gray","year":"2015","journal-title":"Nucleic Acids Res."},{"key":"2020110612232276200_ocx121-B29","author":"Elasticsearch"},{"key":"2020110612232276200_ocx121-B30","doi-asserted-by":"crossref","DOI":"10.1093\/database\/bax068","article-title":"Information retrieval for biomedical datasets: the 2016 bioCADDIE dataset retrieval challenge","author":"Roberts","year":"2017","journal-title":"Database"},{"issue":"2","key":"2020110612232276200_ocx121-B31","doi-asserted-by":"crossref","first-page":"240","DOI":"10.1016\/j.jbi.2009.09.003","article-title":"Reflective Random Indexing and indirect inference: a scalable method for discovery of implicit connections","volume":"43","author":"Cohen","year":"2010","journal-title":"J Biomed Inform."},{"key":"2020110612232276200_ocx121-B32","article-title":"Random indexing of text samples for latent semantic analysis","volume":"22","author":"Kanerva","year":"2000","journal-title":"Proc 22nd Annual Conf Cogn Sci Soc."},{"issue":"2","key":"2020110612232276200_ocx121-B33","doi-asserted-by":"crossref","first-page":"390","DOI":"10.1016\/j.jbi.2009.02.002","article-title":"Empirical distributional semantics: methods and biomedical applications","volume":"42","author":"Cohen","year":"2009","journal-title":"J Biomed Inform."},{"key":"2020110612232276200_ocx121-B34","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1007\/978-3-662-45912-6_4","article-title":"Orthogonality and Orthography: Introducing Measured Distance into Semantic Space","volume-title":"Quantum Interaction: 7th International Conference","author":"Cohen","year":"2014"},{"key":"2020110612232276200_ocx121-B35","doi-asserted-by":"crossref","first-page":"231","DOI":"10.1007\/978-3-319-28675-4_18","article-title":"Graded semantic vectors: an approach to representing graded quantities in generalized quantum models","volume-title":"Quantum Interaction: 9th International Conference","author":"Widdows","year":"2016"},{"key":"2020110612232276200_ocx121-B36","doi-asserted-by":"crossref","DOI":"10.1109\/ICSC.2010.94","article-title":"The Semantic Vectors Package: New Algorithms and Public Tools for Distributional Semantics","volume-title":"2010 IEEE Fourth International Conference on Semantic Computing","author":"Widdows","year":"2010"},{"key":"2020110612232276200_ocx121-B37","article-title":"A Publicly Available Benchmark for Biomedical Dataset Retrieval: The Reference Standard for the 2016 bioCADDIE Dataset Retrieval Challenge","author":"Cohen","journal-title":"Database"},{"issue":"3","key":"2020110612232276200_ocx121-B38","doi-asserted-by":"crossref","first-page":"337","DOI":"10.1093\/jamia\/ocx134","article-title":"User needs analysis and usability assessment of DataMed\u2013a biomedical data discovery index","volume":"25","author":"Dixit","year":"2017","journal-title":"J Am Med Inform Assoc."},{"issue":"1","key":"2020110612232276200_ocx121-B39","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1016\/j.jbi.2006.02.007","article-title":"Data integration and genomic medicine","volume":"40","author":"Louie","year":"2007","journal-title":"J Biomed Inform."},{"issue":"5","key":"2020110612232276200_ocx121-B40","doi-asserted-by":"crossref","first-page":"706","DOI":"10.1016\/j.jbi.2008.03.004","article-title":"Bio2RDF: towards a mashup to build bioinformatics knowledge systems","volume":"41","author":"Belleau","year":"2008","journal-title":"J Biomed Inform."},{"issue":"Web Server issue","key":"2020110612232276200_ocx121-B41","doi-asserted-by":"crossref","first-page":"W170","DOI":"10.1093\/nar\/gkp440","article-title":"BioPortal: ontologies and integrated data resources at the click of a mouse","volume":"37","author":"Noy","year":"2009","journal-title":"Nucleic Acids Res."},{"key":"2020110612232276200_ocx121-B42","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1186\/1471-2105-11-255","article-title":"Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data","volume":"11","author":"Chen","year":"2010","journal-title":"BMC Bioinformatics."},{"issue":"Database issue","key":"2020110612232276200_ocx121-B43","doi-asserted-by":"crossref","first-page":"D267","DOI":"10.1093\/nar\/gkh061","article-title":"The Unified Medical Language System (UMLS): integrating biomedical terminology","volume":"32","author":"Bodenreider","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"2020110612232276200_ocx121-B44","doi-asserted-by":"crossref","first-page":"144","DOI":"10.1007\/978-3-540-69828-9_14","article-title":"A system for ontology-based annotation of biomedical data","author":"Jonquet","year":"2008","journal-title":"Proceedings of the 5th International Workshop on Data Integration in the Life Sciences."},{"key":"2020110612232276200_ocx121-B45","doi-asserted-by":"crossref","first-page":"S1","DOI":"10.1186\/1471-2105-10-S2-S1","article-title":"Ontology-driven indexing of public datasets for translational bioinformatics","volume":"10","author":"Shah","year":"2009","journal-title":"BMC Bioinformatics."},{"issue":"1","key":"2020110612232276200_ocx121-B46","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1136\/amiajnl-2013-001882","article-title":"PhenDisco: phenotype discovery system for the database of genotypes and phenotypes","volume":"21","author":"Doan","year":"2014","journal-title":"J Am Med Inform Assoc."},{"issue":"1","key":"2020110612232276200_ocx121-B47","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1038\/nbt1150","article-title":"Creation and implications of a phenome-genome network","volume":"24","author":"Butte","year":"2006","journal-title":"Nat Biotechnol."}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/25\/3\/300\/34149967\/ocx121.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/25\/3\/300\/34149967\/ocx121.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,8,26]],"date-time":"2023-08-26T13:35:31Z","timestamp":1693056931000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/25\/3\/300\/4807508"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,1,13]]},"references-count":47,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2018,1,13]]},"published-print":{"date-parts":[[2018,3,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocx121","relation":{},"ISSN":["1067-5027","1527-974X"],"issn-type":[{"value":"1067-5027","type":"print"},{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2018,3]]},"published":{"date-parts":[[2018,1,13]]}}}