{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T04:16:23Z","timestamp":1779164183713,"version":"3.51.4"},"reference-count":167,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T00:00:00Z","timestamp":1736380800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T00:00:00Z","timestamp":1736380800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"European Union \u2013 Next Generation EU programme","award":["Project Age-It (Ageing Well in an Ageing Society)"],"award-info":[{"award-number":["Project Age-It (Ageing Well in an Ageing Society)"]}]},{"DOI":"10.13039\/501100021856","name":"Ministero dell'Universit\u00e0 e della Ricerca","doi-asserted-by":"publisher","award":["\u201cDipartimenti di Eccellenza 2023-2027\u201d ReGAInS"],"award-info":[{"award-number":["\u201cDipartimenti di Eccellenza 2023-2027\u201d ReGAInS"]}],"id":[{"id":"10.13039\/501100021856","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100019180","name":"HORIZON EUROPE European Research Council","doi-asserted-by":"publisher","award":["FINDHR 01070212"],"award-info":[{"award-number":["FINDHR 01070212"]}],"id":[{"id":"10.13039\/100019180","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100005156","name":"Alexander von Humboldt-Stiftung","doi-asserted-by":"publisher","award":["Humboldt Research Fellowship"],"award-info":[{"award-number":["Humboldt Research Fellowship"]}],"id":[{"id":"10.13039\/100005156","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BioData 
Mining"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Biomedical datasets are the mainstays of computational biology and health informatics projects, and can be found on multiple data platforms online or obtained from wet-lab biologists and physicians. The quality and the trustworthiness of these datasets, however, can sometimes be poor, producing bad results in turn, which can harm patients and data subjects. To address this problem, policy-makers, researchers, and consortia have proposed diverse regulations, guidelines, and scores to assess the quality and increase the reliability of datasets. Although generally useful, however, they are often incomplete and impractical. The guidelines of <jats:italic>Datasheets for Datasets<\/jats:italic>, in particular, are too numerous; the requirements of the <jats:italic>Kaggle Dataset Usability Score<\/jats:italic> focus on non-scientific requisites (for example, including a cover image); and the <jats:italic>European Union Artificial Intelligence Act<\/jats:italic> (EU AI Act) sets forth sparse and general data governance requirements, which we tailored to datasets for biomedical AI. Against this backdrop, we introduce our new Venus score to assess the data quality and trustworthiness of biomedical datasets. Our score ranges from 0 to 10 and consists of ten questions that anyone developing a bioinformatics, medical informatics, or cheminformatics dataset should answer before the release. In this study, we first describe the <jats:italic>EU AI Act<\/jats:italic>, <jats:italic>Datasheets for Datasets<\/jats:italic>, and the <jats:italic>Kaggle Dataset Usability Score<\/jats:italic>, presenting their requirements and their drawbacks. To do so, we reverse-engineer the weights of the influential Kaggle Score for the first time and report them in this study. We distill the most important data governance requirements into ten questions tailored to the biomedical domain, comprising the Venus score. 
We apply the Venus score to twelve datasets from multiple subdomains, including electronic health records, medical imaging, microarray and bulk RNA-seq gene expression, cheminformatics, physiologic electrogram signals, and medical text. Analyzing the results, we surface fine-grained strengths and weaknesses of popular datasets, as well as aggregate trends. Most notably, we find a widespread tendency to gloss over sources of data inaccuracy and noise, which may hinder the reliable exploitation of data and, consequently, research results. Overall, our results confirm the applicability and utility of the Venus score to assess the trustworthiness of biomedical data.<\/jats:p>","DOI":"10.1186\/s13040-024-00412-x","type":"journal-article","created":{"date-parts":[[2025,1,8]],"date-time":"2025-01-08T23:02:00Z","timestamp":1736377320000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["The Venus score for the assessment of the quality and trustworthiness of biomedical datasets"],"prefix":"10.1186","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9655-7142","authenticated-orcid":false,"given":"Davide","family":"Chicco","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6108-9940","authenticated-orcid":false,"given":"Alessandro","family":"Fabris","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2705-5728","authenticated-orcid":false,"given":"Giuseppe","family":"Jurman","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,1,9]]},"reference":[{"key":"412_CR1","unstructured":"The Hammond Times. Work with new electronic \u2018brains\u2019 opens field for army math experts. 
https:\/\/web.archive.org\/web\/20230410183037\/, https:\/\/www.newspapers.com\/clip\/50687334\/the-times\/. Article published on 10th November 1957; saved on NewsPapers.com on 10th April 2023; Wayback Machine URL visited on 8th November 2024."},{"key":"412_CR2","volume-title":"Passages from the Life of a Philosopher","author":"C Babbage","year":"1864","unstructured":"Babbage C. Passages from the Life of a Philosopher. London: Longman; 1864."},{"issue":"107366","key":"412_CR3","doi-asserted-by":"publisher","first-page":"107366","DOI":"10.1016\/j.asoc.2021.107366","volume":"106","author":"G Fenza","year":"2021","unstructured":"Fenza G, Gallo M, Loia V, Orciuoli F, Herrera-Viedma E. Data set quality in Machine Learning: consistency measure based on Group Decision Making. Appl Soft Comput. 2021;106(107366):107366.","journal-title":"Appl Soft Comput."},{"issue":"2","key":"412_CR4","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1016\/j.gpb.2018.11.006","volume":"18","author":"Q Chen","year":"2020","unstructured":"Chen Q, Britto R, Erill I, Jeffery CJ, Liberzon A, Magrane M, et al. Quality matters: biocuration experts on the impact of duplication and other data quality issues in biological databases. Genomics Proteomics Bioinforma. 2020;18(2):91\u2013103.","journal-title":"Genomics Proteomics Bioinforma."},{"key":"412_CR5","unstructured":"Budach L, Feuerpfeil M, Ihde N, Nathansen A, Noack N, Patzlaff H, et\u00a0al. The effects of data quality on machine learning performance. 2022. arXiv:2207.14529."},{"key":"412_CR6","doi-asserted-by":"crossref","unstructured":"Simson J, Fabris A, Kern C. Lazy data practices harm fairness research. 2024. arXiv:2404.17293.","DOI":"10.1145\/3630106.3658931"},{"key":"412_CR7","doi-asserted-by":"crossref","unstructured":"Foidl H, Felderer M, Ramler R. Data smells. In: Proceedings of CAIN \u201922 \u2013 the 1st International Conference on AI Engineering: Software Engineering for AI. 
New York City: ACM; 2022.","DOI":"10.1145\/3522664.3528590"},{"key":"412_CR8","doi-asserted-by":"crossref","unstructured":"Pasquetto IV, Cullen Z, Thomer A, Wofford M. What is research data \u201cmisuse\u201d? And how can it be prevented or mitigated? J Assoc Inf Sci Technol. 2024.","DOI":"10.1002\/asi.24944"},{"issue":"1","key":"412_CR9","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1038\/s41597-023-01969-8","volume":"10","author":"LD Hughes","year":"2023","unstructured":"Hughes LD, Tsueng G, DiGiovanna J, Horvath TD, Rasmussen LV, Savidge TC, et al. Addressing barriers in FAIR data practices for biomedical data. Sci Data. 2023;10(1):98. https:\/\/doi.org\/10.1038\/s41597-023-01969-8.","journal-title":"Sci Data."},{"issue":"1","key":"412_CR10","doi-asserted-by":"publisher","first-page":"e6","DOI":"10.1016\/s2589-7500(23)00222-4","volume":"6","author":"A Saenz","year":"2024","unstructured":"Saenz A, Chen E, Marklund H, Rajpurkar P. The MAIDA initiative: establishing a framework for global medical-imaging data sharing. Lancet Digit Health. 2024;6(1):e6\u20138. https:\/\/doi.org\/10.1016\/s2589-7500(23)00222-4.","journal-title":"Lancet Digit Health."},{"key":"412_CR11","doi-asserted-by":"publisher","unstructured":"Sambasivan N, Kapania S, Highfill H, Akrong D, Paritosh P, Aroyo LM. \u201cEveryone wants to do the model work, not the data work\u201d: data cascades in High-Stakes AI. In: Proceedings of CHI\u00a0\u201921 \u2013 the 2021 CHI Conference on Human Factors in Computing Systems. ACM; 2021. pp. 1\u201315. https:\/\/doi.org\/10.1145\/3411764.3445518.","DOI":"10.1145\/3411764.3445518"},{"key":"412_CR12","doi-asserted-by":"publisher","first-page":"e41446","DOI":"10.2196\/41446","volume":"25","author":"FA Bernardi","year":"2023","unstructured":"Bernardi FA, Alves D, Crepaldi N, Yamada DB, Lima VC, Rijo R. Data quality in health research: integrative literature review. J Med Internet Res. 2023;25:e41446. 
https:\/\/doi.org\/10.2196\/41446.","journal-title":"J Med Internet Res."},{"issue":"1\u20132","key":"412_CR13","doi-asserted-by":"publisher","first-page":"88","DOI":"10.1177\/1833358319887743","volume":"50","author":"B Ehsani-Moghaddam","year":"2019","unstructured":"Ehsani-Moghaddam B, Martin K, Queenan JA. Data quality in healthcare: A report of practical experience with the Canadian primary care sentinel surveillance network data. Health Inf Manag J. 2019;50(1\u20132):88\u201392. https:\/\/doi.org\/10.1177\/1833358319887743.","journal-title":"Health Inf Manag J."},{"key":"412_CR14","doi-asserted-by":"crossref","unstructured":"Cychnerski J, Dziubich T. Process of medical dataset construction for machine learning-multifield study and guidelines. In: Proceedings of ADBIS 2021 \u2013 the 23rd European Conference on Advances in Databases and Information Systems. Springer; 2021. pp. 217\u2013229.","DOI":"10.1007\/978-3-030-85082-1_20"},{"key":"412_CR15","doi-asserted-by":"crossref","unstructured":"Rostamzadeh N, Mincu D, Roy S, Smart A, Wilcox L, Pushkarna M, et\u00a0al. Healthsheet: development of a Transparency Artifact for Health Datasets. In: Proceedings of FAccT \u201922 \u2013 the 5th Annual ACM Conference on Fairness, Accountability, and Transparency, Seoul, South Korea. ACM; 2022. pp. 1943\u20131961.","DOI":"10.1145\/3531146.3533239"},{"issue":"1","key":"412_CR16","doi-asserted-by":"publisher","first-page":"302","DOI":"10.1186\/s12911-021-01656-x","volume":"21","author":"E Tute","year":"2021","unstructured":"Tute E, Ganapathy N, Wulff A. A data driven learning approach for the assessment of data quality. BMC Med Inf Decis Mak. 2021;21(1):302. https:\/\/doi.org\/10.1186\/s12911-021-01656-x.","journal-title":"BMC Med Inf Decis Mak."},{"issue":"7","key":"412_CR17","doi-asserted-by":"publisher","first-page":"e1011224","DOI":"10.1371\/journal.pcbi.1011224","volume":"19","author":"D Chicco","year":"2023","unstructured":"Chicco D, Cumbo F, Angione C. 
Ten quick tips for avoiding pitfalls in multi-omics data integration analyses. PLoS Comput Biol. 2023;19(7):e1011224.","journal-title":"PLoS Comput Biol."},{"key":"412_CR18","unstructured":"Johnson SG, Speedie S, Simon G, Kumar V, Westra BL. A data quality ontology for the secondary use of EHR data. In: Proceedings of the AMIA 2015 Annual Symposium, vol. 2015. American Medical Informatics Association; 2015. p. 1937."},{"issue":"1","key":"412_CR19","doi-asserted-by":"publisher","first-page":"144","DOI":"10.1136\/amiajnl-2011-000681","volume":"20","author":"NG Weiskopf","year":"2013","unstructured":"Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):144\u201351.","journal-title":"J Am Med Inform Assoc."},{"issue":"10","key":"412_CR20","doi-asserted-by":"publisher","first-page":"1730","DOI":"10.1093\/jamia\/ocad120","volume":"30","author":"AE Lewis","year":"2023","unstructured":"Lewis AE, Weiskopf NG, Abrams ZB, Foraker R, Lai AM, Payne PRO, et al. Electronic health record data quality assessment and tools: a systematic review. J Am Med Inform Assoc. 2023;30(10):1730\u201340.","journal-title":"J Am Med Inform Assoc."},{"issue":"1","key":"412_CR21","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12887-016-0592-z","volume":"16","author":"LA Knake","year":"2016","unstructured":"Knake LA, Ahuja M, McDonald EL, Ryckman KK, Weathers N, Burstain T, et al. Quality of EHR data extractions for studies of preterm birth in a tertiary care center: guidelines for obtaining reliable data. BMC Pediatr. 2016;16(1):1\u20138.","journal-title":"BMC Pediatr"},{"key":"412_CR22","first-page":"1","volume":"2010","author":"T Botsis","year":"2010","unstructured":"Botsis T, Hartvigsen G, Chen F, Weng C. Secondary use of EHR: data quality issues and informatics opportunities. Summit Transl Bioinforma. 
2010;2010:1.","journal-title":"Summit Transl Bioinforma."},{"key":"412_CR23","doi-asserted-by":"crossref","unstructured":"Weiskopf NG, Bakken S, Hripcsak G, Weng C. A Data Quality Assessment guideline for electronic health record data reuse. eGEMs J. 2017;5(1):14.","DOI":"10.5334\/egems.218"},{"key":"412_CR24","unstructured":"Stirling K. Development of a multi-factorial data quality score for primary care electronic medical records [Master of Science thesis]. London: the University of Western Ontario; 2022."},{"issue":"3","key":"412_CR25","doi-asserted-by":"publisher","first-page":"e024722","DOI":"10.1136\/bmjopen-2018-024722","volume":"9","author":"KP Fadahunsi","year":"2019","unstructured":"Fadahunsi KP, Akinlua JT, O\u2019Connor S, Wark PA, Gallagher J, Carroll C, et al. Protocol for a systematic review and qualitative synthesis of information quality frameworks in eHealth. BMJ Open. 2019;9(3):e024722. https:\/\/doi.org\/10.1136\/bmjopen-2018-024722.","journal-title":"BMJ Open."},{"key":"412_CR26","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1471-2105-5-80","volume":"5","author":"BR Zeeberg","year":"2004","unstructured":"Zeeberg BR, Riss J, Kane DW, Bussey KJ, Uchio E, Linehan WM, et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004;5:1\u20136.","journal-title":"BMC Bioinformatics."},{"key":"412_CR27","doi-asserted-by":"crossref","unstructured":"Lewis D. Autocorrect errors in Excel still creating genomics headache. Nature. 2021.","DOI":"10.1038\/d41586-021-02211-4"},{"issue":"1","key":"412_CR28","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41598-022-17104-3","volume":"12","author":"CWT Koh","year":"2022","unstructured":"Koh CWT, Ooi JSG, Joly GLC, Chan KR. Gene Updater: a web tool that autocorrects and updates for Excel misidentified gene names. Sci Rep. 
2022;12(1):1\u20137.","journal-title":"Sci Rep"},{"key":"412_CR29","unstructured":"Figshare. Store, share, discover research. https:\/\/www.figshare.com. URL visited on 8th November 2024."},{"key":"412_CR30","unstructured":"Zenodo. Research, shared. https:\/\/www.zenodo.org. URL visited on 8th November 2024."},{"key":"412_CR31","unstructured":"University of California Irvine. Machine Learning Repository. https:\/\/archive.ics.uci.edu\/. URL visited on 8th November 2024."},{"key":"412_CR32","unstructured":"Iglovikov V, Mushinskiy S, Osin V. Satellite imagery feature detection using deep convolutional neural network: a Kaggle competition. 2017. arXiv preprint arXiv:1706.06169."},{"key":"412_CR33","doi-asserted-by":"crossref","unstructured":"Quaranta L, Calefato F, Lanubile F. KGTorrent: A dataset of python jupyter notebooks from kaggle. In: Proceedings of MSR 2021 \u2013 the 18th IEEE\/ACM International Conference on Mining Software Repositories. IEEE; 2021. pp. 550\u2013554.","DOI":"10.1109\/MSR52588.2021.00072"},{"issue":"9","key":"412_CR34","first-page":"1","volume":"22","author":"B Graham","year":"2015","unstructured":"Graham B. Kaggle diabetic retinopathy detection competition report. Univ Warwick. 2015;22(9):1\u20139.","journal-title":"Univ Warwick"},{"key":"412_CR35","unstructured":"Hugging Face. The AI community building the future. https:\/\/huggingface.co\/datasets. URL visited on 8th November 2024."},{"key":"412_CR36","unstructured":"re3data. Registry of research data repositories. https:\/\/www.re3data.org\/. URL visited on 8th November 2024."},{"key":"412_CR37","unstructured":"Google. Dataset search. https:\/\/datasetsearch.research.google.com\/. URL visited on 8th November 2024."},{"issue":"1","key":"412_CR38","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1016\/j.jalz.2015.06.1896","volume":"12","author":"AW Toga","year":"2016","unstructured":"Toga AW, Neu SC, Bhatt P, Crawford KL, Ashish N. 
The global Alzheimer\u2019s association interactive network. Alzheimers Dement. 2016;12(1):49\u201354.","journal-title":"Alzheimers Dement."},{"key":"412_CR39","unstructured":"The Global Alzheimer\u2019s Association Interactive Network. GAAIN data: 523,957 subjects online from 66 GAAIN data partners. https:\/\/www.gaaindata.org\/partners\/online.html. URL visited on 8th November 2024."},{"issue":"1","key":"412_CR40","doi-asserted-by":"publisher","first-page":"207","DOI":"10.1093\/nar\/30.1.207","volume":"30","author":"R Edgar","year":"2002","unstructured":"Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30(1):207\u201310.","journal-title":"Nucleic Acids Res."},{"issue":"1","key":"412_CR41","doi-asserted-by":"publisher","first-page":"68","DOI":"10.1093\/nar\/gkg091","volume":"31","author":"A Brazma","year":"2003","unstructured":"Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, et al. ArrayExpress\u2013a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2003;31(1):68\u201371.","journal-title":"Nucleic Acids Res."},{"issue":"D1","key":"412_CR42","doi-asserted-by":"publisher","first-page":"D54","DOI":"10.1093\/nar\/gkr854","volume":"40","author":"Y Kodama","year":"2012","unstructured":"Kodama Y, Shumway M, Leinonen R. The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 2012;40(D1):D54\u20136.","journal-title":"Nucleic Acids Res."},{"issue":"10","key":"412_CR43","doi-asserted-by":"publisher","first-page":"1113","DOI":"10.1038\/ng.2764","volume":"45","author":"JN Weinstein","year":"2013","unstructured":"Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, Ellrott K, et al. The Cancer Genome Atlas pan-cancer analysis project. Nat Genetics. 
2013;45(10):1113\u201320.","journal-title":"Nat Genetics."},{"key":"412_CR44","doi-asserted-by":"publisher","first-page":"1045","DOI":"10.1007\/s10278-013-9622-7","volume":"26","author":"K Clark","year":"2013","unstructured":"Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging. 2013;26:1045\u201357.","journal-title":"J Digit Imaging."},{"issue":"1","key":"412_CR45","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/sdata.2017.124","volume":"4","author":"F Prior","year":"2017","unstructured":"Prior F, Smith K, Sharma A, Kirby J, Tarbox L, Clark K, et al. The public cancer radiology imaging collections of the Cancer Imaging Archive. Sci Data. 2017;4(1):1\u20137.","journal-title":"Sci Data."},{"issue":"3","key":"412_CR46","doi-asserted-by":"publisher","first-page":"70","DOI":"10.1109\/51.932728","volume":"20","author":"GB Moody","year":"2001","unstructured":"Moody GB, Mark RG, Goldberger AL. PhysioNet: a web-based resource for the study of physiologic signals. IEEE Eng Med Biol Mag. 2001;20(3):70\u20135.","journal-title":"IEEE Eng Med Biol Mag."},{"key":"412_CR47","doi-asserted-by":"publisher","unstructured":"Johnson AEW, Pollard TJ, Shen L, Lehman LWH, Feng M, Ghassemi M, et\u00a0al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3(1). https:\/\/doi.org\/10.1038\/sdata.2016.35.","DOI":"10.1038\/sdata.2016.35"},{"issue":"1","key":"412_CR48","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41597-022-01899-x","volume":"10","author":"AEW Johnson","year":"2023","unstructured":"Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 
2023;10(1):1.","journal-title":"Sci Data."},{"issue":"1","key":"412_CR49","doi-asserted-by":"publisher","first-page":"e51","DOI":"10.1016\/S2589-7500(20)30240-5","volume":"3","author":"SM Khan","year":"2021","unstructured":"Khan SM, Liu X, Nath S, Korot E, Faes L, Wagner SK, et al. A global review of publicly available datasets for ophthalmological imaging: barriers to access, usability, and generalisability. Lancet Digit Health. 2021;3(1):e51\u201366.","journal-title":"Lancet Digit Health."},{"issue":"1","key":"412_CR50","doi-asserted-by":"publisher","first-page":"17","DOI":"10.5334\/dsj-2022-017","volume":"21","author":"D Chicco","year":"2022","unstructured":"Chicco D, Cerono G, Cangelosi D. A survey on publicly available open datasets derived from electronic health records (EHRs) of patients with neuroblastoma. Data Sci J. 2022;21(1):17.","journal-title":"Data Sci J."},{"issue":"5","key":"412_CR51","doi-asserted-by":"publisher","first-page":"1470","DOI":"10.1093\/ejcts\/ezv385","volume":"49","author":"M Salati","year":"2015","unstructured":"Salati M, Falcoz PE, Decaluwe H, Rocco G, Van Raemdonck D, Varela G, et al. The European thoracic data quality project: an aggregate Data Quality score to measure the quality of international multi-institutional databases. Eur J Cardiothorac Surg. 2015;49(5):1470\u20135. https:\/\/doi.org\/10.1093\/ejcts\/ezv385.","journal-title":"Eur J Cardiothorac Surg."},{"issue":"10","key":"412_CR52","doi-asserted-by":"publisher","first-page":"2686","DOI":"10.1093\/humrep\/del231","volume":"21","author":"G Jones","year":"2006","unstructured":"Jones G, Jenkinson C, Taylor N, Mills A, Kennedy S. Measuring quality of life in women with endometriosis: tests of data quality, score reliability, response rate and scaling assumptions of the Endometriosis Health Profile Questionnaire. Hum Reprod. 
2006;21(10):2686\u201393.","journal-title":"Hum Reprod."},{"key":"412_CR53","doi-asserted-by":"crossref","unstructured":"Gupta N, Patel H, Afzal S, Panwar N, Mittal RS, Guttula S, et\u00a0al. Data quality toolkit: automatic assessment of data quality and remediation for machine learning datasets. 2021. arXiv:2108.05935.","DOI":"10.1145\/3447548.3470817"},{"key":"412_CR54","doi-asserted-by":"crossref","unstructured":"Hickey D, Connor R, McCormack P, Kearney P, Rosti R, Brennan R. The data quality index: improving data quality in Irish healthcare records. In: Proceedings of ICEIS \u201921 \u2013 the 24th International Conference on Enterprise Information Systems. Cham, Switzerland: Springer; 2021.","DOI":"10.5220\/0010441906250636"},{"key":"412_CR55","unstructured":"Open Data Toronto. Towards an updated Data Quality Score in open data. https:\/\/open.toronto.ca\/towards-an-updated-data-quality-score-in-open-data\/. Published on 21st August 2023. URL visited on 8th November 2024."},{"key":"412_CR56","unstructured":"Hernandez C. Towards a Data Quality Score in open data (part 1). https:\/\/medium.com\/open-data-toronto\/towards-a-data-quality-score-in-open-data-part-1-525e59f729e9. Published on 15th January 2020. URL visited on 8th November 2024."},{"key":"412_CR57","unstructured":"Hernandez C. Towards a Data Quality Score in open data (part 2). https:\/\/medium.com\/open-data-toronto\/towards-a-data-quality-score-in-open-data-part-2-3f193eb9e21d. Published on 11th February 2020. URL visited on 8th November 2024."},{"issue":"10","key":"412_CR58","doi-asserted-by":"publisher","first-page":"e1001885","DOI":"10.1371\/journal.pmed.1001885","volume":"12","author":"EI Benchimol","year":"2015","unstructured":"Benchimol EI, Smeeth L, Guttmann A, Harron K, Moher D, Petersen I, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Med. 2015;12(10):e1001885. 
https:\/\/doi.org\/10.1371\/journal.pmed.1001885.","journal-title":"PLoS Med."},{"key":"412_CR59","first-page":"1","volume":"363","author":"SM Langan","year":"2018","unstructured":"Langan SM, Schmidt SAJ, Wing K, Ehrenstein V, Nicholls SG, Filion KB, et al. The reporting of studies conducted using observational routinely collected health data statement for pharmacoepidemiology (RECORD-PE). Br Med J. 2018;363:1\u201319.","journal-title":"Br Med J"},{"key":"412_CR60","unstructured":"Mokrane M, Cepinskas L, \u00c5kerman V, de\u00a0Vries J, von Stein I, Verburg M. FAIR Aware. 2024. https:\/\/fairaware.dans.knaw.nl\/. URL visited on 8th November."},{"key":"412_CR61","unstructured":"Institute of Accelerating Systems and Applications All. FAIRness score. 2024. https:\/\/wiki.appdb.egi.eu\/docs\/faq\/general\/fairscore\/. URL visited on 8th November."},{"key":"412_CR62","unstructured":"European Parliament. Artificial intelligence act. https:\/\/www.europarl.europa.eu\/doceo\/document\/TA-9-2024-0138_EN.pdf. Resolution of 13th March 2024. URL visited on 8th November 2024."},{"key":"412_CR63","unstructured":"European Parliament News. Press release: Artificial Intelligence Act, MEPs adopt landmark law. https:\/\/www.europarl.europa.eu\/news\/en\/press-room\/20240308IPR19015\/artificial-intelligence-act-meps-adopt-landmark-law. URL visited on 8th November 2024."},{"key":"412_CR64","doi-asserted-by":"crossref","unstructured":"Gebru T, Morgenstern J, Vecchione B, Vaughan JW, Wallach H, III HD, et al. Datasheets for datasets. Commun ACM. 2021;64(12):86\u201392.","DOI":"10.1145\/3458723"},{"key":"412_CR65","unstructured":"European Commission. Shaping Europe\u2019s digital future. https:\/\/digital-strategy.ec.europa.eu\/en\/policies\/regulatory-framework-ai. URL visited on 8th November 2024."},{"key":"412_CR66","unstructured":"Kaggle. Kaggle datasets \u2013 Explore, analyze, and share quality data. https:\/\/www.kaggle.com\/datasets. 
URL visited on 8th November 2024."},{"key":"412_CR67","unstructured":"Hu C, the Kaggle Team. [Request for input] Improving the dataset usability rating design. https:\/\/www.kaggle.com\/discussions\/product-feedback\/354788. Published on 22nd September 2022. URL visited on 8th November 2024."},{"key":"412_CR68","unstructured":"Hu C, the Kaggle Team. [Product update] New usability rating user experience. https:\/\/www.kaggle.com\/discussions\/product-feedback\/372061. Published on 15th December 2022. URL visited on 8th November 2024."},{"key":"412_CR69","unstructured":"Kaggle Datasets. Fitbitdata. https:\/\/www.kaggle.com\/datasets\/panfordofori\/fitbitdata. URL visited on 8th November 2024."},{"key":"412_CR70","unstructured":"Kaggle Datasets. A hotel\u2019s customers dataset. https:\/\/www.kaggle.com\/datasets\/nantonio\/a-hotels-customers-dataset. URL visited on 8th November 2024."},{"key":"412_CR71","unstructured":"Kaggle Datasets. 1980s Album covers. https:\/\/www.kaggle.com\/datasets\/ronanpickell\/1980s-album-covers. URL visited on 8th November 2024."},{"key":"412_CR72","unstructured":"Kaggle Datasets. LFW \u2013 Facial recognition. https:\/\/www.kaggle.com\/datasets\/quadeer15sh\/lfw-facial-recognition. URL visited on 8th November 2024."},{"issue":"12","key":"412_CR73","first-page":"1","volume":"12","author":"S Holland","year":"2020","unstructured":"Holland S, Hosny A, Newman S, Joseph J, Chmielinski K. The dataset nutrition label. Data Protect Priv. 2020;12(12):1.","journal-title":"Data Protect Priv."},{"key":"412_CR74","doi-asserted-by":"publisher","first-page":"587","DOI":"10.1162\/tacl_a_00041","volume":"6","author":"EM Bender","year":"2018","unstructured":"Bender EM, Friedman B. Data Statements for Natural Language Processing: toward mitigating System Bias and Enabling Better Science. Trans Assoc Comput Linguist. 2018;6:587\u2013604. 
https:\/\/doi.org\/10.1162\/tacl_a_00041.","journal-title":"Trans Assoc Comput Linguist."},{"issue":"6","key":"412_CR75","doi-asserted-by":"publisher","first-page":"2074","DOI":"10.1007\/S10618-022-00854-Z","volume":"36","author":"A Fabris","year":"2022","unstructured":"Fabris A, Messina S, Silvello G, Susto GA. Algorithmic fairness datasets: the story so far. Data Min Knowl Disc. 2022;36(6):2074\u2013152. https:\/\/doi.org\/10.1007\/S10618-022-00854-Z.","journal-title":"Data Min Knowl Disc."},{"key":"412_CR76","doi-asserted-by":"crossref","unstructured":"Bertino E. Data trustworthiness\u2014approaches and research challenges. In: International Workshop on Data Privacy Management. Springer; 2014. pp. 17\u201325.","DOI":"10.1007\/978-3-319-17016-9_2"},{"key":"412_CR77","unstructured":"University of California Irvine Machine Learning Repository. Arrhythmia. https:\/\/doi.org\/10.24432\/C5BS32. URL visited on 8th November 2024."},{"key":"412_CR78","unstructured":"Stanford ML Group. CheXpert, a Large Chest X-ray Dataset And Competition. https:\/\/stanfordmlgroup.github.io\/competitions\/chexpert\/. URL visited on 8th November 2024."},{"key":"412_CR79","doi-asserted-by":"crossref","unstructured":"Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et\u00a0al. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. In: Proceedings of AAAI 2019 \u2013 the 33rd Conference on Artificial Intelligence. AAAI Press; 2019. pp. 590\u2013597.","DOI":"10.1609\/aaai.v33i01.3301590"},{"key":"412_CR80","unstructured":"Rajpurkar P, Joshi A, Pareek A, Chen P, Kiani A, Irvin J, et\u00a0al. CheXpedition: investigating generalization challenges for translation of chest X-ray algorithms to the clinical setting. 2020. arXiv:2002.11379."},{"issue":"1","key":"412_CR81","doi-asserted-by":"publisher","first-page":"37","DOI":"10.1038\/s41591-018-0272-7","volume":"25","author":"WN Price","year":"2019","unstructured":"Price WN, Cohen IG. 
Privacy in the age of medical big data. Nat Med. 2019;25(1):37\u201343.","journal-title":"Nat Med."},{"issue":"1","key":"412_CR82","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1186\/s13040-023-00326-0","volume":"16","author":"D Chicco","year":"2023","unstructured":"Chicco D, Jurman G. Ten simple rules for providing bioinformatics support within a hospital. BioData Min. 2023;16(1):6.","journal-title":"BioData Min."},{"key":"412_CR83","unstructured":"Creative Commons. CC BY 4.0 DEED Attribution 4.0 International. https:\/\/creativecommons.org\/licenses\/by\/4.0\/deed.en. URL visited on 8th November 2024."},{"key":"412_CR84","unstructured":"MIT Laboratory for Computational Physiology. MIMIC, Medical Information Mart for Intensive Care. https:\/\/mimic.mit.edu\/. URL visited on 8th November 2024."},{"key":"412_CR85","unstructured":"PhysioNet. MIMIC-IV. https:\/\/physionet.org\/content\/mimiciv\/2.0\/. URL visited on 8th November 2024."},{"issue":"1","key":"412_CR86","first-page":"1","volume":"4","author":"JL Harenza","year":"2017","unstructured":"Harenza JL, Diamond MA, Adams RN, Song MM, Davidson HL, Hart LS, et al. Transcriptomic profiling of 39 commonly-used neuroblastoma cell lines. Sci Data. 2017;4(1):1\u20139.","journal-title":"Sci Data"},{"key":"412_CR87","unstructured":"Omnibus GE. GSE89413 \u2013 Transcriptomic profiling of 39 neuroblastoma cell lines. https:\/\/www.ncbi.nlm.nih.gov\/geo\/query\/acc.cgi?acc=GSE89413. URL visited on 8th November 2024."},{"issue":"8","key":"412_CR88","doi-asserted-by":"publisher","first-page":"e0201991","DOI":"10.1371\/journal.pone.0201991","volume":"13","author":"G Le Gall","year":"2018","unstructured":"Le Gall G, Kirchgesner J, Bejaoui M, Landman C, Nion-Larmurier I, Bourrier A, et al. Clinical activity is an independent risk factor of ischemic heart and cerebrovascular arterial disease in patients with inflammatory bowel disease. PLoS ONE. 
2018;13(8):e0201991.","journal-title":"PLoS ONE."},{"key":"412_CR89","unstructured":"Le\u00a0Gall G, Kirchgesner J, Bejaoui M, Landman C, Nion-Larmurier I, Bourrier A, et\u00a0al. Dataset for \u201cClinical activity is an independent risk factor of ischemic heart and cerebrovascular arterial disease in patients with inflammatory bowel disease\u201d. https:\/\/figshare.com\/articles\/dataset\/Clinical_activity_is_an_independent_risk_factor_of_ischemic_heart_and_cerebrovascular_arterial_disease_in_patients_with_inflammatory_bowel_disease\/7036235. URL visited on 8th November 2024."},{"key":"412_CR90","unstructured":"Tanrikulu A, Er O. Mesothelioma\u2019s disease data set. https:\/\/archive.ics.uci.edu\/dataset\/351\/mesothelioma+s+disease+data+set. Dataset donated on 10th January 2016. URL visited on 8th November 2024."},{"issue":"1","key":"412_CR91","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1016\/j.compeleceng.2011.09.001","volume":"38","author":"O Er","year":"2012","unstructured":"Er O, Tanrikulu AC, Abakay A, Temurtas F. An approach based on probabilistic neural network for diagnosis of Mesothelioma\u2019s disease. Comput Electr Eng. 2012;38(1):75\u201381.","journal-title":"Comput Electr Eng."},{"key":"412_CR92","doi-asserted-by":"publisher","first-page":"85","DOI":"10.2147\/DMSO.S390857","volume":"16","author":"Z Yan","year":"2023","unstructured":"Yan Z, Cai M, Han X, Chen Q, Lu H. The interaction between age and risk factors for diabetes and prediabetes: a community-based cross-sectional study. Diabetes Metab Syndr Obes. 2023;16:85\u201393.","journal-title":"Diabetes Metab Syndr Obes."},{"key":"412_CR93","unstructured":"European Parliament. General Data Protection Regulation. https:\/\/eur-lex.europa.eu\/eli\/reg\/2016\/679\/oj. Resolution of 13th March 2024. 
URL visited on 8th November 2024."},{"issue":"3","key":"412_CR94","doi-asserted-by":"publisher","first-page":"269","DOI":"10.1016\/j.jclinepi.2004.07.006","volume":"58","author":"ACM Jansen","year":"2005","unstructured":"Jansen ACM, van Aalst-Cohen ES, Hutten BA, B\u00fcller HR, Kastelein JJP, Prins MH. Guidelines were developed for data collection from medical records for use in retrospective analyses. J Clin Epidemiol. 2005;58(3):269\u201374.","journal-title":"J Clin Epidemiol."},{"issue":"2","key":"412_CR95","doi-asserted-by":"publisher","first-page":"217","DOI":"10.1111\/j.1553-2712.2011.01275.x","volume":"19","author":"CD Newgard","year":"2012","unstructured":"Newgard CD, Zive D, Jui J, Weathers C, Daya M. Electronic versus manual data processing: evaluating the use of electronic health records in out-of-hospital clinical research. Acad Emerg Med. 2012;19(2):217\u201327.","journal-title":"Acad Emerg Med."},{"issue":"4","key":"412_CR96","first-page":"385","volume":"20","author":"C Pagel","year":"2009","unstructured":"Pagel C, Gallivan S. Exploring potential consequences on mortality estimates of errors in clinical databases. IMA J Manag Math. 2009;20(4):385\u201393.","journal-title":"IMA J Manag Math."},{"issue":"4","key":"412_CR97","doi-asserted-by":"publisher","first-page":"497","DOI":"10.1177\/009885881303900401","volume":"39","author":"S Hoffman","year":"2013","unstructured":"Hoffman S, Podgurski A. The use and misuse of biomedical data: is bigger really better? Am J Law Med. 2013;39(4):497\u2013538.","journal-title":"Am J Law Med."},{"key":"412_CR98","unstructured":"Goldberg SI, Niemierko A, Turchin A. Analysis of data errors in clinical research databases. In: AMIA Annual Symposium Proceedings, vol. 2008. American Medical Informatics Association; 2008. pp. 
242\u2013246."},{"issue":"9","key":"412_CR99","doi-asserted-by":"publisher","first-page":"1522","DOI":"10.1109\/TIP.2008.2001398","volume":"17","author":"JM Sanches","year":"2008","unstructured":"Sanches JM, Nascimento JC, Marques JS. Medical image noise reduction using the Sylvester-Lyapunov equation. IEEE Trans Image Process. 2008;17(9):1522\u201339.","journal-title":"IEEE Trans Image Process."},{"issue":"1","key":"412_CR100","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1038\/s41592-018-0254-1","volume":"16","author":"M B\u00fcttner","year":"2019","unstructured":"B\u00fcttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nat Methods. 2019;16(1):43\u20139.","journal-title":"Nat Methods."},{"issue":"Suppl 6","key":"412_CR101","doi-asserted-by":"publisher","first-page":"279","DOI":"10.1186\/s12859-022-04775-y","volume":"23","author":"M Sprang","year":"2022","unstructured":"Sprang M, Andrade-Navarro MA, Fontaine JF. Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality. BMC Bioinformatics. 2022;23(Suppl 6):279.","journal-title":"BMC Bioinformatics."},{"issue":"1","key":"412_CR102","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.compbiomed.2007.06.003","volume":"38","author":"M Blanco-Velasco","year":"2008","unstructured":"Blanco-Velasco M, Weng B, Barner KE. ECG signal denoising and baseline wander correction based on the empirical mode decomposition. Comput Biol Med. 2008;38(1):1\u201313.","journal-title":"Comput Biol Med."},{"issue":"5","key":"412_CR103","doi-asserted-by":"publisher","first-page":"371","DOI":"10.1016\/0013-4694(91)90202-F","volume":"79","author":"JP Pijn","year":"1991","unstructured":"Pijn JP, Van Neerven J, Noest A, da Silva FHL. Chaos or noise in EEG signals; dependence on state and brain site. Electroencephalogr Clin Neurophysiol. 
1991;79(5):371\u201381.","journal-title":"Electroencephalogr Clin Neurophysiol."},{"key":"412_CR104","doi-asserted-by":"publisher","first-page":"295","DOI":"10.1002\/0471780367.ch5","volume":"22","author":"M Sundling","year":"2006","unstructured":"Sundling M, Sukumar N, Zhang H, Embrechts MJ, Breneman CM. Wavelets in chemistry and cheminformatics. Rev Comput Chem. 2006;22:295\u2013329.","journal-title":"Rev Comput Chem."},{"issue":"11","key":"412_CR105","doi-asserted-by":"publisher","first-page":"e77089","DOI":"10.1371\/journal.pone.0077089","volume":"8","author":"M Welvaert","year":"2013","unstructured":"Welvaert M, Rosseel Y. On the definition of signal-to-noise ratio and contrast-to-noise ratio for fMRI data. PLoS ONE. 2013;8(11):e77089.","journal-title":"PLoS ONE."},{"key":"412_CR106","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12874-015-0022-1","volume":"15","author":"P Hayati Rezvan","year":"2015","unstructured":"Hayati Rezvan P, Lee KJ, Simpson JA. The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Med Res Methodol. 2015;15:1\u201314.","journal-title":"BMC Med Res Methodol."},{"key":"412_CR107","doi-asserted-by":"publisher","unstructured":"Groh M, Harris C, Soenksen L, Lau F, Han R, Kim A, et\u00a0al. Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology With the Fitzpatrick 17k Dataset. In: Proceedings of CVPR 2021 \u2013 the 2021 IEEE Conference on Computer Vision and Pattern Recognition Workshops, virtual, June 19-25, 2021. Computer Vision Foundation \/ IEEE; 2021. pp. 1820\u20131828. https:\/\/doi.org\/10.1109\/CVPRW53098.2021.00201.","DOI":"10.1109\/CVPRW53098.2021.00201"},{"issue":"1","key":"412_CR108","first-page":"1","volume":"2","author":"K Canese","year":"2013","unstructured":"Canese K, Weis S. PubMed: the bibliographic database. NCBI Handbook. 2013;2(1):1\u20139.","journal-title":"NCBI Handbook"},{"key":"412_CR109","unstructured":"Ranking SJ. 
Health informatics open access journals. https:\/\/www.scimagojr.com\/journalrank.php?category=2718&type=j&openaccess=true. URL visited on 8th November 2024."},{"key":"412_CR110","unstructured":"Ranking SJ. Molecular biology open access journals. https:\/\/www.scimagojr.com\/journalrank.php?openaccess=true&type=j&category=1301. URL visited on 8th November 2024."},{"key":"412_CR111","doi-asserted-by":"publisher","first-page":"e175","DOI":"10.7717\/peerj.175","volume":"1","author":"HA Piwowar","year":"2013","unstructured":"Piwowar HA, Vision TJ. Data reuse and the open data citation advantage. PeerJ. 2013;1:e175.","journal-title":"PeerJ."},{"key":"412_CR112","unstructured":"Peng K, Mathur A, Narayanan A. Mitigating dataset harms requires stewardship: lessons from 1000 papers. In: Vanschoren J, Yeung S, editors. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual. 2021. https:\/\/datasets-benchmarks-proceedings.neurips.cc\/paper\/2021\/hash\/077e29b11be80ab57e1a2ecabb7da330-Abstract-round2.html. Accessed 15 Sept 2024."},{"key":"412_CR113","doi-asserted-by":"crossref","unstructured":"Cohen JP, Lo HZ. Academic Torrents: a community-maintained distributed repository. In: Proceedings of XSEDE \u201914 \u2013 the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment. Atlanta: ACM; 2014. p. 1\u20132.","DOI":"10.1145\/2616498.2616528"},{"issue":"6464","key":"412_CR114","doi-asserted-by":"publisher","first-page":"447","DOI":"10.1126\/science.aax2342","volume":"366","author":"Z Obermeyer","year":"2019","unstructured":"Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 
2019;366(6464):447\u201353.","journal-title":"Science."},{"issue":"1","key":"412_CR115","doi-asserted-by":"publisher","first-page":"123","DOI":"10.1146\/annurev-biodatasci-092820-114757","volume":"4","author":"IY Chen","year":"2021","unstructured":"Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M. Ethical machine learning in healthcare. Ann Rev Biomed Data Sci. 2021;4(1):123\u201344.","journal-title":"Ann Rev Biomed Data Sci."},{"key":"412_CR116","unstructured":"Johnson A, Pollard T, Mark R. MIMIC-III Clinical Database. https:\/\/physionet.org\/content\/mimiciii\/1.4\/. URL visited on 8th November 2024."},{"key":"412_CR117","unstructured":"Letenneur L, Commenges D, Dartigues JF, Barberger-Gateau P. Longitudinal data on cognitive and physical aging in the elderly. https:\/\/search.r-project.org\/CRAN\/refmans\/lcmm\/html\/paquid.html. URL visited on 8th November 2024."},{"issue":"6","key":"412_CR118","doi-asserted-by":"publisher","first-page":"1256","DOI":"10.1093\/ije\/23.6.1256","volume":"23","author":"L Letenneur","year":"1994","unstructured":"Letenneur L, Commenges D, Dartigues JF, Barberger-Gateau P. Incidence of dementia and Alzheimer\u2019s disease in elderly community residents of south-western France. Int J Epidemiol. 1994;23(6):1256\u201361.","journal-title":"Int J Epidemiol."},{"key":"412_CR119","doi-asserted-by":"crossref","unstructured":"Moody GB, Mark RG. A database to support development and evaluation of intelligent intensive care monitoring. In: Computers in Cardiology 1996. IEEE; 1996. pp. 657\u2013660.","DOI":"10.1109\/CIC.1996.542622"},{"key":"412_CR120","unstructured":"Saeed M, Lieu C, Raber G, Mark RG. MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring. In: Computers in Cardiology. IEEE; 2002. pp. 641\u2013644."},{"key":"412_CR121","unstructured":"PhysioNet. The research resource for complex physiologic signals. https:\/\/www.physionet.org\/. 
URL visited on 8th November 2024."},{"key":"412_CR122","unstructured":"Proust-Lima C, Liquet B. lcmm: an R package for estimation of latent class mixed models and joint latent class models. In: Proceedings of useR! 2011 \u2013 the 2011 R User Conference, 16-18 August 2011, University of Warwick, Coventry; 2011. p.\u00a066."},{"key":"412_CR123","unstructured":"Versteeg R, Volckmann R. Integrated bioinformatic and wet-lab approach to identify potential oncogenic networks in neuroblastoma. https:\/\/www.ncbi.nlm.nih.gov\/geo\/query\/acc.cgi?acc=GSE16476. URL visited on 8th November 2024."},{"issue":"7391","key":"412_CR124","doi-asserted-by":"publisher","first-page":"589","DOI":"10.1038\/nature10910","volume":"483","author":"JJ Molenaar","year":"2012","unstructured":"Molenaar JJ, Koster J, Zwijnenburg DA, van Sluis P, Valentijn LJ, van der Ploeg I, et al. Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. Nature. 2012;483(7391):589\u201393.","journal-title":"Nature."},{"key":"412_CR125","unstructured":"Bartenhagen C. Telomerase is a prognostic marker of poor outcome and a therapeutic target in neuroblastoma. https:\/\/www.ebi.ac.uk\/biostudies\/arrayexpress\/studies\/E-MTAB-8248. URL visited on 8th November 2024."},{"key":"412_CR126","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1200\/PO.19.00072","volume":"3","author":"A Roderwieser","year":"2019","unstructured":"Roderwieser A, Sand F, Walter E, Fischer J, Gecht J, Bartenhagen C, et al. Telomerase is a prognostic marker of poor outcome and a therapeutic target in neuroblastoma. JCO Precis Oncol. 2019;3:1\u201320.","journal-title":"JCO Precis Oncol."},{"key":"412_CR127","unstructured":"Beane J, Tassinari AM. Airway epithelial cells from smokers with and without bronchial premalignant lesions. https:\/\/www.ncbi.nlm.nih.gov\/geo\/query\/acc.cgi?acc=GSE79209. 
URL visited on 8th November 2024."},{"issue":"17","key":"412_CR128","doi-asserted-by":"publisher","first-page":"5091","DOI":"10.1158\/1078-0432.CCR-16-2540","volume":"23","author":"J Beane","year":"2017","unstructured":"Beane J, Mazzilli SA, Tassinari AM, Liu G, Zhang X, Liu H, et al. Detecting the presence and progression of premalignant lung lesions via airway gene expression. Clin Cancer Res. 2017;23(17):5091\u2013100.","journal-title":"Clin Cancer Res."},{"key":"412_CR129","unstructured":"Schalk G, McFarland DJ, Hinterberger T, Birbaumer N, Wolpaw JR. EEG Motor Movement\/Imagery Dataset. https:\/\/doi.org\/10.13026\/C28G6P. URL visited on 8th November 2024."},{"issue":"6","key":"412_CR130","doi-asserted-by":"publisher","first-page":"1034","DOI":"10.1109\/TBME.2004.827072","volume":"51","author":"G Schalk","year":"2004","unstructured":"Schalk G, McFarland DJ, Hinterberger T, Birbaumer N, Wolpaw JR. BCI2000: a general-purpose brain-computer interface (BCI) system. IEEE Trans Biomed Eng. 2004;51(6):1034\u201343.","journal-title":"IEEE Trans Biomed Eng."},{"key":"412_CR131","unstructured":"Moody GB, Mark RG. MIT-BIH Arrhythmia Database. https:\/\/doi.org\/10.13026\/C2F305. URL visited on 8th November 2024."},{"key":"412_CR132","doi-asserted-by":"crossref","unstructured":"Moody GB, Mark RG. The MIT-BIH arrhythmia database on CD-ROM and software for use with it. In: Proceedings of CinC 1990 \u2013 Computers in Cardiology. IEEE; 1990. pp. 185\u2013188.","DOI":"10.1109\/CIC.1990.144205"},{"issue":"3","key":"412_CR133","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1109\/51.932724","volume":"20","author":"GB Moody","year":"2001","unstructured":"Moody GB, Mark RG. The impact of the MIT-BIH arrhythmia database. IEEE Eng Med Biol Mag. 2001;20(3):45\u201350.","journal-title":"IEEE Eng Med Biol Mag."},{"key":"412_CR134","unstructured":"Konz N, Buda M, Gu H, Saha A, Yang J, Chledowski J, et\u00a0al. 
Breast-Cancer-Screening-DBT \u2013 Breast Cancer Screening - Digital Breast Tomosynthesis. https:\/\/www.cancerimagingarchive.net\/collection\/breast-cancer-screening-dbt\/. URL visited on 8th November 2024."},{"issue":"2","key":"412_CR135","doi-asserted-by":"publisher","first-page":"e230524","DOI":"10.1001\/jamanetworkopen.2023.0524","volume":"6","author":"N Konz","year":"2023","unstructured":"Konz N, Buda M, Gu H, Saha A, Yang J, Chledowski J, et al. A competition, benchmark, code, and data for using artificial intelligence to detect lesions in digital breast tomosynthesis. JAMA Netw Open. 2023;6(2):e230524.","journal-title":"JAMA Netw Open."},{"key":"412_CR136","doi-asserted-by":"publisher","unstructured":"Tang EK, Ghazoui Z, Barrett R, Edvardsson U, Vincent J, Garnett M, et\u00a0al. AstraZeneca-Sanger drug combination prediction DREAM Challenge. https:\/\/doi.org\/10.7303\/syn4231880. URL visited on 8th November 2024.","DOI":"10.7303\/syn4231880"},{"issue":"1","key":"412_CR137","doi-asserted-by":"publisher","first-page":"2674","DOI":"10.1038\/s41467-019-09799-2","volume":"10","author":"MP Menden","year":"2019","unstructured":"Menden MP, Wang D, Mason MJ, Szalai B, Bulusu KC, Guan Y, et al. Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen. Nat Commun. 2019;10(1):2674.","journal-title":"Nat Commun."},{"key":"412_CR138","unstructured":"Ben\u00a0Abacha A, Demner-Fushman D. MedQuAD: Medical Question Answering Dataset. https:\/\/github.com\/abachaa\/MedQuAD. URL visited on 8th November 2024."},{"key":"412_CR139","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12859-019-3119-4","volume":"20","author":"A Ben Abacha","year":"2019","unstructured":"Ben Abacha A, Demner-Fushman D. A question-entailment approach to question answering. BMC Bioinformatics. 
2019;20:1\u201323.","journal-title":"BMC Bioinformatics."},{"issue":"3","key":"412_CR140","doi-asserted-by":"publisher","first-page":"300","DOI":"10.1093\/jamia\/ocx121","volume":"25","author":"X Chen","year":"2018","unstructured":"Chen X, Gururaj AE, Ozyurt B, Liu R, Soysal E, Cohen T, et al. DataMed-an open source discovery index for finding biomedical datasets. J Am Med Inform Assoc. 2018;25(3):300\u20138.","journal-title":"J Am Med Inform Assoc."},{"issue":"6","key":"412_CR141","doi-asserted-by":"publisher","first-page":"2557","DOI":"10.1109\/TIP.2016.2544703","volume":"25","author":"TA Lampert","year":"2016","unstructured":"Lampert TA, Stumpf A, Gan\u00e7arski P. An Empirical Study Into Annotator Agreement, Ground Truth Estimation, and Algorithm Evaluation. IEEE Trans Image Process. 2016;25(6):2557\u201372.","journal-title":"IEEE Trans Image Process."},{"key":"412_CR142","doi-asserted-by":"crossref","unstructured":"Amidei J, Piwek P, Willis A. Agreement is overrated: A plea for correlation to assess human evaluation reliability. In: van Deemter K, Lin C, Takamura H, editors. Proceedings of INLG\u00a02019 \u2013 the 12th International Conference on Natural Language Generation, Tokyo, Japan, 29 October \u2013 1 November 2019. Association for Computational Linguistics; 2019. pp. 344\u2013354.","DOI":"10.18653\/v1\/W19-8642"},{"key":"412_CR143","doi-asserted-by":"crossref","unstructured":"Popovi\u0107 M, Belz A. On reporting scores and agreement for error annotation tasks. In: Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM).\u00a0ACL Anthology; 2022. p. 306\u201315.","DOI":"10.18653\/v1\/2022.gem-1.26"},{"key":"412_CR144","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1471-2105-8-194","volume":"8","author":"H Yu","year":"2007","unstructured":"Yu H, Wang F, Tu K, Xie L, Li YY, Li YX. Transcript-level annotation of Affymetrix probesets improves the interpretation of gene expression data. 
BMC Bioinformatics. 2007;8:1\u201315.","journal-title":"BMC Bioinformatics."},{"key":"412_CR145","doi-asserted-by":"crossref","unstructured":"Albers MJ. Signal to noise ratio of information in documentation. In: Proceedings of SIGDOC \u201904 \u2013 the 22nd Annual International Conference on Design of Communication. New York City: ACM; 2004. p. 41\u20134.","DOI":"10.1145\/1026533.1026546"},{"issue":"7","key":"412_CR146","doi-asserted-by":"publisher","first-page":"646","DOI":"10.1038\/s41588-020-0651-0","volume":"52","author":"L Bonomi","year":"2020","unstructured":"Bonomi L, Huang Y, Ohno-Machado L. Privacy challenges and research opportunities for genomic data sharing. Nat Genet. 2020;52(7):646\u201354.","journal-title":"Nat Genet."},{"key":"412_CR147","first-page":"1243","volume":"20","author":"M Oestreich","year":"2021","unstructured":"Oestreich M, Chen D, Schultze JL, Fritz M, Becker M. Privacy considerations for sharing genomics data. EXCLI J. 2021;20:1243.","journal-title":"EXCLI J."},{"key":"412_CR148","doi-asserted-by":"crossref","unstructured":"Fabris A, Messina S, Silvello G, Susto GA. Tackling Documentation Debt: A Survey on Algorithmic Fairness Datasets. In: Proceedings of EAAMO 2022 \u2013 the 2nd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, Arlington, Virginia, USA, 6-9\u00a0October 2022. ACM; 2022. pp. 2:1\u20132:13.","DOI":"10.1145\/3551624.3555286"},{"key":"412_CR149","unstructured":"Gene Expression Omnibus. MIAME and MINSEQE guidelines. https:\/\/www.ncbi.nlm.nih.gov\/geo\/info\/MIAME.html. URL visited on 8th November 2024."},{"key":"412_CR150","unstructured":"PhysioNet. Author guidelines. https:\/\/physionet.org\/about\/publish\/#author_guidelines. URL visited on 8th November 2024."},{"key":"412_CR151","unstructured":"American Medical Informatics Association. Secondary use of health data. https:\/\/web.archive.org\/web\/20080724171701\/. Webpage of 24th July 2008 saved on Wayback Machine. 
URL visited on 8th November 2024."},{"issue":"9","key":"412_CR152","doi-asserted-by":"publisher","first-page":"1337","DOI":"10.1038\/s41591-019-0548-6","volume":"25","author":"J Wiens","year":"2019","unstructured":"Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, Doshi-Velez F, et al. Do no harm: a roadmap for responsible machine learning for health care. Nat Med. 2019;25(9):1337\u201340.","journal-title":"Nat Med."},{"key":"412_CR153","doi-asserted-by":"crossref","unstructured":"Alex Philippidis. Top 10 U.S. Biopharma Clusters. https:\/\/www.genengnews.com\/topics\/drug-discovery\/top-10-u-s-biopharma-clusters-10\/. URL visited on 8th November 2024.","DOI":"10.1089\/gen.44.08.10"},{"issue":"2","key":"412_CR154","doi-asserted-by":"publisher","first-page":"15","DOI":"10.3390\/DATA6020015","volume":"6","author":"A Trisovic","year":"2021","unstructured":"Trisovic A, Mika K, Boyd C, Feger SS, Crosas M. Repository Approaches to Improving the Quality of Shared Data and Code. Data. 2021;6(2):15. https:\/\/doi.org\/10.3390\/DATA6020015.","journal-title":"Data."},{"key":"412_CR155","doi-asserted-by":"crossref","unstructured":"Feger SS, Dallmeier-Tiessen S, Wozniak PW, Schmidt A. Gamification in science: a study of requirements in the context of reproducible research. In: Brewster SA, Fitzpatrick G, Cox AL, Kostakos V, editors. Proceedings of CHI 2019 \u2013 the 2019 Conference on Human Factors in Computing Systems, Glasgow, Scotland, United Kingdom, 4-9 May 2019. ACM; 2019. pp. 460.","DOI":"10.1145\/3290605.3300690"},{"key":"412_CR156","unstructured":"Giner-Miguelez J, G\u00f3mez A, Cabot J. Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning. 2024. arXiv:2404.15320."},{"issue":"7","key":"412_CR157","doi-asserted-by":"publisher","first-page":"968","DOI":"10.1038\/s41592-023-01881-4","volume":"20","author":"J Chen","year":"2023","unstructured":"Chen J, Viana MP, Rafelski SM. 
When seeing is not believing: application-appropriate validation matters for quantitative bioimage analysis. Nat Methods. 2023;20(7):968\u201370.","journal-title":"Nat Methods."},{"issue":"11","key":"412_CR158","doi-asserted-by":"publisher","first-page":"167505","DOI":"10.1016\/j.jmb.2022.167505","volume":"434","author":"M Hartley","year":"2022","unstructured":"Hartley M, Kleywegt GJ, Patwardhan A, Sarkans U, Swedlow JR, Brazma A. The Bioimage archive-building a home for life-sciences microscopy data. J Mol Biol. 2022;434(11):167505.","journal-title":"J Mol Biol."},{"key":"412_CR159","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1471-2105-15-306","volume":"15","author":"A Dander","year":"2014","unstructured":"Dander A, Baldauf M, Sperk M, Pabinger S, Hiltpolt B, Trajanoski Z. Personalized Oncology Suite: integrating next-generation sequencing data and whole-slide bioimages. BMC Bioinformatics. 2014;15:1\u20138.","journal-title":"BMC Bioinformatics."},{"issue":"11","key":"412_CR160","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.stem.2024.09.011","volume":"31","author":"A Migliorini","year":"2024","unstructured":"Migliorini A, Ge S, Atkins MH, Oakie A, Sambathkumar R, Kent G, et al. Embryonic macrophages support endocrine commitment during human pancreatic differentiation. Cell Stem Cell. 2024;31(11):1\u201321.","journal-title":"Cell Stem Cell."},{"issue":"9","key":"412_CR161","doi-asserted-by":"publisher","first-page":"1642","DOI":"10.1093\/jamia\/ocac105","volume":"29","author":"SN Duda","year":"2022","unstructured":"Duda SN, Kennedy N, Conway D, Cheng AC, Nguyen V, Zayas-Cab\u00e1n T, et al. HL7 FHIR-based tools and initiatives to support clinical research: a scoping review. J Am Med Inform Assoc. 
2022;29(9):1642\u201353.","journal-title":"J Am Med Inform Assoc."},{"issue":"03","key":"412_CR162","doi-asserted-by":"publisher","first-page":"675","DOI":"10.1055\/s-0041-1732423","volume":"12","author":"BJ Douthit","year":"2021","unstructured":"Douthit BJ, Del Fiol G, Staes CJ, Docherty SL, Richesson RL. A conceptual framework of data readiness: the contextual intersection of quality, availability, interoperability, and provenance. Appl Clin Inform. 2021;12(03):675\u201385.","journal-title":"Appl Clin Inform."},{"key":"412_CR163","doi-asserted-by":"crossref","unstructured":"Castelijns LA, Maas Y, Vanschoren J. The ABC of data: a classifying framework for data readiness. In: Machine Learning and Knowledge Discovery in Databases: International Workshops of ECML PKDD 2019, W\u00fcrzburg, Germany, 16\u201320 September 2019, Proceedings, Part I. Springer; 2020. pp. 3\u201316.","DOI":"10.1007\/978-3-030-43823-4_1"},{"key":"412_CR164","doi-asserted-by":"crossref","unstructured":"Afzal S, Rajmohan C, Kesarwani M, Mehta S, Patel H. Data readiness report. In: Proceedings of IEEE SMDS 2021 \u2013 the 7th IEEE International Conference on Smart Data Services. IEEE; 2021. pp. 42\u201351.","DOI":"10.1109\/SMDS53860.2021.00016"},{"key":"412_CR165","unstructured":"Lawrence ND. Data readiness levels. 2017. arXiv:1705.02245."},{"issue":"1","key":"412_CR166","doi-asserted-by":"publisher","first-page":"152","DOI":"10.1186\/s12911-024-02544-w","volume":"24","author":"M Ahangaran","year":"2024","unstructured":"Ahangaran M, Zhu H, Li R, Yin L, Jang J, Chaudhry AP, et al. DREAMER: a computational framework to evaluate readiness of datasets for machine learning. BMC Med Inform Decis Mak. 2024;24(1):152.","journal-title":"BMC Med Inform Decis Mak."},{"key":"412_CR167","doi-asserted-by":"crossref","unstructured":"Clark T, Caufield H, Parker JA, Al\u00a0Manir S, Amorim E, Eddy J, et\u00a0al. AI-readiness for Biomedical Data: Bridge2AI recommendations. bioRxiv. 
2024;2024(1):1\u201321.","DOI":"10.1101\/2024.10.23.619844"}],"container-title":["BioData Mining"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-024-00412-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13040-024-00412-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-024-00412-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T00:02:31Z","timestamp":1736380951000},"score":1,"resource":{"primary":{"URL":"https:\/\/biodatamining.biomedcentral.com\/articles\/10.1186\/s13040-024-00412-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,9]]},"references-count":167,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["412"],"URL":"https:\/\/doi.org\/10.1186\/s13040-024-00412-x","relation":{},"ISSN":["1756-0381"],"issn-type":[{"value":"1756-0381","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,9]]},"assertion":[{"value":"1 August 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 December 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 January 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The consents for the usage of the patients\u2019 data employed in our analysis were obtained by the original curators of those datasets and listed in their 
references\u00a0(Table\u00a0).","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"1"}}