{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T15:27:35Z","timestamp":1761060455520,"version":"3.41.2"},"reference-count":55,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2017,12,21]],"date-time":"2017-12-21T00:00:00Z","timestamp":1513814400000},"content-version":"vor","delay-in-days":354,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01LM011934","R01GM102282","U24AI117966"],"award-info":[{"award-number":["R01LM011934","R01GM102282","U24AI117966"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2017,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The recent movement towards open data in the biomedical domain has generated a large number of datasets that are publicly accessible. The Big Data to Knowledge data indexing project, biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), has gathered these datasets in a one-stop portal aiming at facilitating their reuse for accelerating scientific advances. However, as the number of biomedical datasets stored and indexed increases, it becomes more and more challenging to retrieve the relevant datasets according to researchers\u2019 queries. In this article, we propose an information retrieval (IR) system to tackle this problem and implement it for the bioCADDIE Dataset Retrieval Challenge. The system leverages the unstructured texts of each dataset including the title and description for the dataset, and utilizes a state-of-the-art IR model, medical named entity extraction techniques, query expansion with deep learning-based word embeddings and a re-ranking strategy to enhance the retrieval performance. In empirical experiments, we compared the proposed system with 11 baseline systems using the bioCADDIE Dataset Retrieval Challenge datasets. The experimental results show that the proposed system outperforms other systems in terms of inference Average Precision and inference normalized Discounted Cumulative Gain, implying that the proposed system is a viable option for biomedical dataset retrieval.<\/jats:p><jats:p>Database URL: https:\/\/github.com\/yanshanwang\/biocaddie2016mayodata<\/jats:p>","DOI":"10.1093\/database\/bax091","type":"journal-article","created":{"date-parts":[[2017,11,16]],"date-time":"2017-11-16T20:27:57Z","timestamp":1510864077000},"source":"Crossref","is-referenced-by-count":14,"title":["Leveraging word embeddings and medical entity extraction for biomedical dataset retrieval using unstructured texts"],"prefix":"10.1093","volume":"2017","author":[{"given":"Yanshan","family":"Wang","sequence":"first","affiliation":[{"name":"Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA"}]},{"given":"Majid","family":"Rastegar-Mojarad","sequence":"first","affiliation":[{"name":"Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA"}]},{"given":"Ravikumar","family":"Komandur-Elayavilli","sequence":"first","affiliation":[{"name":"Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA"}]},{"given":"Hongfang","family":"Liu","sequence":"first","affiliation":[{"name":"Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA"}]}],"member":"286","published-online":{"date-parts":[[2017,12,20]]},"reference":[{"key":"2020053110300678000_bax091-B1","doi-asserted-by":"crossref","first-page":"816","DOI":"10.1038\/ng.3864","article-title":"Finding useful data across multiple biomedical data repositories using DataMed","volume":"49","author":"Ohno-Machado","year":"2017","journal-title":"Nature Genet."},{"key":"2020053110300678000_bax091-B2","doi-asserted-by":"crossref","first-page":"612.","DOI":"10.1038\/505612a","article-title":"NIH plans to enhance reproducibility","volume":"505","author":"Collins","year":"2014","journal-title":"Nature"},{"key":"2020053110300678000_bax091-B3","doi-asserted-by":"crossref","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR Guiding Principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci. Data"},{"key":"2020053110300678000_bax091-B4","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1007\/s00799-016-0174-6","article-title":"Experiences in integrated data and research object publishing using GigaDB","volume":"18","author":"Edmunds","year":"2017","journal-title":"Int. J. Digital Lib"},{"key":"2020053110300678000_bax091-B5","doi-asserted-by":"crossref","first-page":"1114","DOI":"10.1093\/jamia\/ocv136","article-title":"The NIH big data to knowledge (BD2K) initiative","volume":"22","author":"Bourne","year":"2015","journal-title":"J. Am. Med. Inform. Assoc"},{"volume-title":"Proceedings of the 15th International Semantic Web Conference (ISWC)","year":"2016","author":"Solbrig","key":"2020053110300678000_bax091-B6"},{"key":"2020053110300678000_bax091-B7","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/database\/baq036","article-title":"PubMed and beyond: a survey of web tools for searching biomedical literature","volume":"2011","author":"Lu","year":"2011","journal-title":"Database"},{"key":"2020053110300678000_bax091-B8","article-title":"DataMed by BioCADDIE\u2013a data discovery index prototype to unleash biomedical research data","author":"Hua Xu","year":"2016","journal-title":"Sci. Data Con"},{"key":"2020053110300678000_bax091-B9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/database\/bax068","article-title":"Information retrieval for biomedical datasets: the 2016 bioCADDIE dataset retrieval challenge","volume":"2017","author":"Roberts","year":"2017","journal-title":"Database"},{"year":"2009","author":"Croft","key":"2020053110300678000_bax091-B10"},{"year":"1968","author":"Salton","key":"2020053110300678000_bax091-B11"},{"year":"1983","author":"Salton","key":"2020053110300678000_bax091-B12"},{"key":"2020053110300678000_bax091-B13","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1613\/jair.2934","article-title":"From frequency to meaning: vector space models of semantics","volume":"37","author":"Turney","year":"2010","journal-title":"J. Artif. Intel. Res"},{"key":"2020053110300678000_bax091-B14","doi-asserted-by":"crossref","first-page":"391","DOI":"10.1002\/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9","article-title":"Indexing by latent semantic analysis","volume":"41","author":"Deerwester","year":"1990","journal-title":"J. Am. Soc. Inform. Sci"},{"first-page":"289","year":"1999","author":"Hofmann","key":"2020053110300678000_bax091-B15"},{"key":"2020053110300678000_bax091-B16","doi-asserted-by":"crossref","first-page":"1736","DOI":"10.1002\/asi.23444","article-title":"Indexing by latent dirichlet allocation and an ensemble model","volume":"67","author":"Wang","year":"2016","journal-title":"J. Assoc. Inform. Sci. Technol"},{"key":"2020053110300678000_bax091-B17","first-page":"993","article-title":"Latent dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Machine Learn. Res"},{"key":"2020053110300678000_bax091-B18","first-page":"472","article-title":"A Markov random field model for term dependencies. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Metzler","year":"2005"},{"first-page":"311","year":"2007","author":"Metzler","key":"2020053110300678000_bax091-B19"},{"key":"2020053110300678000_bax091-B20","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1016\/j.jbi.2016.08.026","article-title":"A Part-Of-Speech term weighting scheme for biomedical information retrieval","volume":"63","author":"Wang","year":"2016","journal-title":"J. Biomed. Inform"},{"key":"2020053110300678000_bax091-B21","first-page":"198","volume-title":"Proceedings of the Conference and Labs of the Evaluation Forum (CLEF)","author":"Wang","year":"2016"},{"key":"2020053110300678000_bax091-B22","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511809071","volume-title":"Introduction to Information Retrieval","author":"Manning","year":"2008"},{"first-page":"4","year":"1996","author":"Xu","key":"2020053110300678000_bax091-B23"},{"first-page":"600","year":"2011","author":"Andrzejewski","key":"2020053110300678000_bax091-B24"},{"year":"2013","author":"Mikolov","key":"2020053110300678000_bax091-B25"},{"volume-title":"Proceedings of the 2016 Text Retrieval Conference","year":"2016","key":"2020053110300678000_bax091-B26"},{"key":"2020053110300678000_bax091-B27","article-title":"NKU at TREC 2016: Clinical Decision Support Track.","author":"Zhang","year":"2016","journal-title":"Proceedings of the 2016 Text Retrieval Conference"},{"key":"2020053110300678000_bax091-B28","doi-asserted-by":"crossref","DOI":"10.6028\/NIST.SP.500-321.clinical-ETH","article-title":"ETH Zurich at TREC clinical decision support 2016","author":"Greuter,S","year":"2016","journal-title":"Proceedings of the 2016 Text Retrieval Conference"},{"key":"2020053110300678000_bax091-B29","doi-asserted-by":"crossref","DOI":"10.6028\/NIST.SP.500-321.clinical-MERCKKGAA","article-title":"Semi-supervised information retrieval system for clinical decision support","author":"Gurulingappa","year":"2016","journal-title":"Proceedings of the 2016 Text Retrieval Conference"},{"key":"2020053110300678000_bax091-B30","first-page":"367","article-title":"Query expansion with locally-trained word embeddings","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics","author":"Diaz","year":"2016"},{"key":"2020053110300678000_bax091-B31","first-page":"109","article-title":"Okapi at TREC-3","volume":"109","author":"Robertson","year":"1995","journal-title":"Nist. Special Publ. Sp"},{"first-page":"403","year":"2001","author":"Zhai","key":"2020053110300678000_bax091-B32"},{"first-page":"334","year":"2001","author":"Zhai","key":"2020053110300678000_bax091-B33"},{"key":"2020053110300678000_bax091-B34","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1007\/s10791-015-9259-x","article-title":"State-of-the-art in biomedical literature retrieval for clinical cases: a survey of the TREC 2014 CDS track","volume":"19","author":"Roberts","year":"2016","journal-title":"Inform. Retrieval J"},{"key":"2020053110300678000_bax091-B35","first-page":"41","article-title":"The unified medical language system","author":"Lindberg","year":"1993","journal-title":"IMIA Yearbook"},{"key":"2020053110300678000_bax091-B36","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1136\/jamia.1998.0050001","article-title":"The unified medical language system","volume":"5","author":"Humphreys","year":"1998","journal-title":"J. Am. Med. Inf. Assoc"},{"key":"2020053110300678000_bax091-B37","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1136\/jamia.1998.0050012","article-title":"The unified medical language system","volume":"5","author":"Campbell","year":"1998","journal-title":"J. Am. Med. Inf. Assoc"},{"key":"2020053110300678000_bax091-B38","doi-asserted-by":"crossref","DOI":"10.6028\/NIST.SP.500-319.clinical-DUTH","article-title":"DUTH at TREC 2015 clinical decision support track","author":"George Drosatos","year":"2015","journal-title":"Proceedings of the 2015 Text Retrieval Conference"},{"key":"2020053110300678000_bax091-B39","first-page":"265.","article-title":"Medical subject headings (MeSH)","volume":"88","author":"Lipscomb","year":"2000","journal-title":"Bull. Med. Lib. Assoc"},{"article-title":"NovaSearch at TREC 2015 clinical decision support track","year":"2015","author":"Mourao","key":"2020053110300678000_bax091-B40"},{"key":"2020053110300678000_bax091-B41","doi-asserted-by":"crossref","DOI":"10.6028\/NIST.SP.500-319.clinical-DBNET_AUEB","article-title":"AUEB at TREC 2015: clinical decision support track","author":"Giannis Nikolentzos","year":"2015","journal-title":"Proceedings of the 2015 Text Retrieval Conference"},{"key":"2020053110300678000_bax091-B42","doi-asserted-by":"crossref","first-page":"W518.","DOI":"10.1093\/nar\/gkt441","article-title":"PubTator: a web-based text mining tool for assisting biocuration","volume":"41","author":"Wei","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2020053110300678000_bax091-B43","doi-asserted-by":"crossref","first-page":"1915","DOI":"10.1093\/bioinformatics\/btt317","article-title":"BeCAS: biomedical concept recognition services and visualization","volume":"29","author":"Nunes","year":"2013","journal-title":"Bioinformatics"},{"key":"2020053110300678000_bax091-B44","doi-asserted-by":"crossref","first-page":"D54","DOI":"10.1093\/nar\/gki031","article-title":"Entrez Gene: gene-centered information at NCBI","volume":"33","author":"Maglott","year":"2005","journal-title":"Nucleic Acids Res"},{"key":"2020053110300678000_bax091-B45","doi-asserted-by":"crossref","first-page":"D115","DOI":"10.1093\/nar\/gkh131","article-title":"UniProt: the universal protein knowledgebase","volume":"32","author":"Apweiler","year":"2004","journal-title":"Nucleic Acids Res"},{"key":"2020053110300678000_bax091-B46","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene Ontology: tool for the unification of biology","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat. Genet"},{"key":"2020053110300678000_bax091-B47","doi-asserted-by":"crossref","first-page":"793.","DOI":"10.1289\/ehp.6028","article-title":"The comparative toxicogenomics database (CTD)","volume":"111","author":"Mattingly","year":"2003","journal-title":"Environ. Health Perspect"},{"key":"2020053110300678000_bax091-B48","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/database\/baw156","article-title":"BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences","volume":"2017","author":"Ravikumar","year":"2017","journal-title":"Database"},{"key":"2020053110300678000_bax091-B49","first-page":"3111","article-title":"Distributed representations of words and phrases and their compositionality","author":"Mikolov","year":"2013","journal-title":"Adv. Neural Inf. Process. Syst"},{"volume-title":"Proceedings of the 2015 Text Retrieval Conference","year":"2015","author":"Palotti","key":"2020053110300678000_bax091-B50"},{"first-page":"1","year":"2017","author":"Cohen","key":"2020053110300678000_bax091-B51"},{"first-page":"603","year":"2008","author":"Yilmaz","key":"2020053110300678000_bax091-B52"},{"key":"2020053110300678000_bax091-B53","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/database\/bax062","article-title":"Multi-field query expansion is effective for biomedical dataset retrieval","volume":"2017","author":"Bouadjenek","year":"2017","journal-title":"Database"},{"year":"2016","author":"Wang","key":"2020053110300678000_bax091-B54"},{"key":"2020053110300678000_bax091-B55","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/database\/bax056","article-title":"Elsevier\u2019s approach to the bioCADDIE 2016 dataset retrieval challenge","volume":"2017","author":"Scerri","year":"2017","journal-title":"Database"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/bax091\/33329474\/bax091.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/bax091\/33329474\/bax091.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,27]],"date-time":"2025-06-27T02:51:42Z","timestamp":1750992702000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/bax091\/4769380"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,1,1]]},"references-count":55,"URL":"https:\/\/doi.org\/10.1093\/database\/bax091","relation":{},"ISSN":["1758-0463"],"issn-type":[{"type":"electronic","value":"1758-0463"}],"subject":[],"published-other":{"date-parts":[[2017]]},"published":{"date-parts":[[2017,1,1]]},"article-number":"bax091"}}