{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T05:18:11Z","timestamp":1772515091269,"version":"3.50.1"},"reference-count":25,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2020,10,1]],"date-time":"2020-10-01T00:00:00Z","timestamp":1601510400000},"content-version":"vor","delay-in-days":274,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Center for Big Data in Health Sciences"},{"DOI":"10.13039\/100004917","name":"Cancer Research and Prevention Institute of Texas","doi-asserted-by":"crossref","award":["RP170668"],"award-info":[{"award-number":["RP170668"]}],"id":[{"id":"10.13039\/100004917","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R00LM012104"],"award-info":[{"award-number":["R00LM012104"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>It is a growing trend among researchers to make their data publicly available for experimental reproducibility and data reusability. Sharing data with fellow researchers helps in increasing the visibility of the work. On the other hand, there are researchers who are inhibited by the lack of data resources. To overcome this challenge, many repositories and knowledge bases have been established to date to ease data sharing. Further, in the past two decades, there has been an exponential increase in the number of datasets added to these dataset repositories. However, most of these repositories are domain-specific, and none of them can recommend datasets to researchers\/users. 
Naturally, it is challenging for a researcher to keep track of all the relevant repositories for potential use. Thus, a dataset recommender system that recommends datasets to a researcher based on previous publications can enhance their productivity and expedite further research. This work adopts an information retrieval (IR) paradigm for dataset recommendation. We hypothesize that two fundamental differences exist between dataset recommendation and PubMed-style biomedical IR beyond the corpus. First, instead of keywords, the query is the researcher, embodied by his or her publications. Second, to filter the relevant datasets from non-relevant ones, researchers are better represented by a set of interests, as opposed to the entire body of their research. This second approach is implemented using a non-parametric clustering technique. These clusters are used to recommend datasets for each researcher using the cosine similarity between the vector representations of publication clusters and datasets. In a manual evaluation by five researchers, the proposed method achieved a maximum normalized discounted cumulative gain at 10 (NDCG@10), partial precision at 10 (p@10) and strict p@10 of 0.89, 0.78 and 0.61, respectively. To the best of our knowledge, this is the first study of its kind on content-based dataset recommendation. 
We hope that this system will further promote data sharing, offset the researchers\u2019 workload in identifying the right dataset and increase the reusability of biomedical datasets.<\/jats:p><jats:p>Database URL: http:\/\/genestudy.org\/recommends\/#\/<\/jats:p>","DOI":"10.1093\/database\/baaa064","type":"journal-article","created":{"date-parts":[[2020,8,4]],"date-time":"2020-08-04T19:16:59Z","timestamp":1596568619000},"source":"Crossref","is-referenced-by-count":23,"title":["A content-based dataset recommendation system for researchers\u2014a case study on Gene Expression Omnibus (GEO) repository"],"prefix":"10.1093","volume":"2020","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2997-5314","authenticated-orcid":false,"given":"Braja Gopal","family":"Patra","sequence":"first","affiliation":[{"name":"Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston\/1200 Pressler Street, Suite E-833, Houston, TX, 77030, USA and"}]},{"given":"Kirk","family":"Roberts","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston\/7000 Fannin st. Suite 600, Houston, TX, 77030, USA"}]},{"given":"Hulin","family":"Wu","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston\/1200 Pressler Street, Suite E-833, Houston, TX, 77030, USA"},{"name":"School of Biomedical Informatics, The University of Texas Health Science Center at Houston\/7000 Fannin st. 
Suite 600, Houston, TX, 77030, USA"}]}],"member":"286","published-online":{"date-parts":[[2020,11,12]]},"reference":[{"key":"2020111216460961100_R1","doi-asserted-by":"publisher","first-page":"300","DOI":"10.1093\/jamia\/ocx121","article-title":"Datamed\u2013an open source discovery index for finding biomedical datasets","volume":"25","author":"Chen","year":"2018","journal-title":"Journal of the American Medical Informatics Association"},{"key":"2020111216460961100_R2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/database\/bax068","article-title":"Information retrieval for biomedical datasets: the 2016 biocaddie dataset retrieval challenge","volume":"2017","author":"Roberts","year":"2017","journal-title":"Database"},{"key":"2020111216460961100_R3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/database\/bax061","article-title":"A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 biocaddie dataset retrieval challenge","volume":"2017","author":"Cohen","year":"2017","journal-title":"Database"},{"key":"2020111216460961100_R4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/database\/bax104","article-title":"Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval","volume":"2018","author":"Karisani","year":"2018","journal-title":"Database"},{"key":"2020111216460961100_R5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/database\/bax065","article-title":"Query expansion using mesh terms for dataset retrieval: Ohsu at the biocaddie 2016 dataset retrieval challenge","volume":"2017","author":"Wright","year":"2017","journal-title":"Database"},{"key":"2020111216460961100_R6","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/database\/bax056","article-title":"Elsevier\u2019s approach to the biocaddie 2016 dataset retrieval 
challenge","volume":"2017","author":"Scerri","year":"2017","journal-title":"Database"},{"key":"2020111216460961100_R7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/database\/bay017","article-title":"Finding relevant biomedical datasets: the UC San Diego solution for the biocaddie retrieval challenge","volume":"2018","author":"Wei","year":"2018","journal-title":"Database"},{"key":"2020111216460961100_R8","doi-asserted-by":"publisher","first-page":"W445","DOI":"10.1093\/nar\/gkx258","article-title":"Omicseq: a web-based search engine for exploring omics datasets","volume":"45","author":"Sun","year":"2017","journal-title":"Nucleic acids research"},{"key":"2020111216460961100_R9","doi-asserted-by":"crossref","first-page":"pp. 1149","DOI":"10.1145\/1242572.1242739","article-title":"Determining the user intent of web search engine queries","author":"Jansen","year":"2007","journal-title":"In Proceedings of the 16th international conference on World Wide Web"},{"key":"2020111216460961100_R10","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0158423","article-title":"Science concierge: A fast content-based recommendation system for scientific publications","volume":"11","author":"Achakulvisut","year":"2016","journal-title":"PloS one"},{"key":"2020111216460961100_R11","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2020.103399","article-title":"A content-based literature recommendation system for datasets to improve data reusability. A case study on Gene Expression Omnibus (GEO) datasets","volume":"104","author":"Patra","year":"2020","journal-title":"Journal of Biomedical Informatics"},{"key":"2020111216460961100_R12","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2017.59","article-title":"Dats, the data tag suite to enable discoverability of datasets","volume":"4","author":"Sansone","year":"2017","journal-title":"Scientific data"},{"key":"2020111216460961100_R13","first-page":"pp. 
36","article-title":"Dataset recommendation for data linking: An intensional approach","author":"Ellefi","year":"2016","journal-title":"In European Semantic Web Conference"},{"key":"2020111216460961100_R14","first-page":"548","article-title":"Combining a co-occurrence-based and a semantic measure for entity linking","author":"Nunes","year":"2013","journal-title":"In Extended Semantic Web Conference"},{"key":"2020111216460961100_R15","article-title":"Predicting and recommending relevant datasets in complex environments","author":"Srivastava","year":"2018"},{"key":"2020111216460961100_R16","doi-asserted-by":"publisher","first-page":"105","DOI":"10.3233\/978-1-61499-649-1-105","article-title":"Identifying and improving dataset references in social sciences full texts","author":"Ghavimi","year":"2016"},{"key":"2020111216460961100_R17","article-title":"Identifying data sharing in biomedical literature","volume":"Vol. 2008","author":"Piwowar","year":"2008","journal-title":"In AMIA Annual Symposium Proceedings"},{"key":"2020111216460961100_R18","doi-asserted-by":"crossref","first-page":"pp 31","DOI":"10.18653\/v1\/W19-2604","article-title":"Dataset mention extraction and classification","author":"Prasad","year":"2019","journal-title":"In Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications"},{"key":"2020111216460961100_R19","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/database\/bay019","article-title":"Geometacuration: a web-based application for accurate manual curation of gene expression omnibus metadata","volume":"2018","author":"Li","year":"2018","journal-title":"Database"},{"key":"2020111216460961100_R20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/database\/bay145","article-title":"Restructured geo: restructuring gene expression omnibus metadata for genome dynamics 
analysis","volume":"2019","author":"Chen","year":"2019","journal-title":"Database"},{"key":"2020111216460961100_R21","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1080\/10618600.2000.10474879","article-title":"Markov chain sampling methods for dirichlet process mixture models","volume":"9","author":"Neal","year":"2000","journal-title":"Journal of computational and graphical statistics"},{"key":"2020111216460961100_R22","doi-asserted-by":"crossref","first-page":"pp 625","DOI":"10.1109\/ICDE.2016.7498276","article-title":"A model-based approach for text clustering with outlier detection","author":"Yin","year":"2016","journal-title":"In Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering (ICDE)"},{"key":"2020111216460961100_R23","doi-asserted-by":"publisher","DOI":"10.1186\/1747-5333-1-11","article-title":"The emergence and diffusion of dna microarray technology","volume":"1","author":"Lenoir","year":"2006","journal-title":"Journal of biomedical discovery and collaboration"},{"key":"2020111216460961100_R24","article-title":"A theoretical analysis of ndcg ranking measures","volume":"Vol. 
8","author":"Wang","year":"2013","journal-title":"In 26th Annual Conference on Learning Theory (COLT 2013)"},{"key":"2020111216460961100_R25","doi-asserted-by":"publisher","first-page":"1930","DOI":"10.1177\/0962280217746719","article-title":"A big data pipeline: Identifying dynamic gene regulatory networks from time-course gene expression omnibus data with applications to influenza infection","volume":"27","author":"Carey","year":"2018","journal-title":"Statistical methods in medical research"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaa064\/34283872\/baaa064.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaa064\/34283872\/baaa064.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,11]],"date-time":"2024-08-11T09:18:17Z","timestamp":1723367897000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baaa064\/5909105"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,1,1]]},"references-count":25,"URL":"https:\/\/doi.org\/10.1093\/database\/baaa064","relation":{},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020]]},"published":{"date-parts":[[2020,1,1]]},"article-number":"baaa064"}}