{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,22]],"date-time":"2026-05-22T17:05:33Z","timestamp":1779469533584,"version":"3.53.1"},"reference-count":53,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2026,5,22]],"date-time":"2026-05-22T00:00:00Z","timestamp":1779408000000},"content-version":"vor","delay-in-days":141,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000054","name":"National Cancer Institute","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000054","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["U24CA289073"],"award-info":[{"award-number":["U24CA289073"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["3U24CA180996-10S1"],"award-info":[{"award-number":["3U24CA180996-10S1"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026,1,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Public omics repositories contain vast amounts of valuable data, but their metadata suffers from extreme heterogeneity, unstandardized terminologies, and quality issues that severely limit data reusability and cross-study integration. While prospective metadata standards exist, the majority of published omics data remain in non-standardized formats requiring retrospective harmonization. 
We performed comprehensive manual curation and harmonization of metadata, such as participant characteristics and study conditions, from 212\u2009027 omics samples across 468 studies in two repositories: curatedMetagenomicData (93 studies, 22\u2009588 samples) and cBioPortal (375 studies, 189\u2009438 samples). Through systematic ontology mapping, we consolidated redundant, dispersed information into far fewer harmonized columns, reduced unique values, and increased the completeness of major attributes. This curation process revealed common metadata quality issues, including typos, inconsistent terminologies, misplaced values, conflicting annotations, and inappropriately merged information across attributes. We document the challenges, decisions, and solutions during this large-scale metadata harmonization. The harmonized metadata, accessible through the OmicsMLRepoR Bioconductor package, enables repository-wide queries and cross-study analyses previously challenging with heterogeneous metadata. 
Our experience provides practical guidance for similar curation efforts and demonstrates the value of investing in retrospective metadata improvement for existing public omics resources.<\/jats:p>","DOI":"10.1093\/database\/baag027","type":"journal-article","created":{"date-parts":[[2026,5,9]],"date-time":"2026-05-09T11:44:09Z","timestamp":1778327049000},"source":"Crossref","is-referenced-by-count":0,"title":["Large-scale manual curation and harmonization of metadata from metagenomic and cancer genomic repositories: challenges and solutions"],"prefix":"10.1093","volume":"2026","author":[{"given":"Kaelyn","family":"Long","sequence":"first","affiliation":[{"name":"Institute for Implementation Science in Population Health, City University of New York School of Public Health , New York, NY ,","place":["United States"]},{"name":"City University of New York School of Public Health Department of Epidemiology and Biostatistics, , New York, NY ,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kai","family":"Gravel-Pucillo","sequence":"additional","affiliation":[{"name":"Institute for Implementation Science in Population Health, City University of New York School of Public Health , New York, NY ,","place":["United States"]},{"name":"City University of New York School of Public Health Department of Epidemiology and Biostatistics, , New York, NY ,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2725-0694","authenticated-orcid":false,"given":"Levi","family":"Waldron","sequence":"additional","affiliation":[{"name":"Institute for Implementation Science in Population Health, City University of New York School of Public Health , New York, NY ,","place":["United States"]},{"name":"City University of New York School of Public Health Department of Epidemiology and Biostatistics, , New York, NY ,","place":["United 
States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Sean","family":"Davis","sequence":"additional","affiliation":[{"name":"University of Colorado Anschutz School of Medicine Departments of Biomedical Informatics and Medicine, , Denver, CO ,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9490-3061","authenticated-orcid":false,"given":"Sehyun","family":"Oh","sequence":"additional","affiliation":[{"name":"Institute for Implementation Science in Population Health, City University of New York School of Public Health , New York, NY ,","place":["United States"]},{"name":"City University of New York School of Public Health Department of Epidemiology and Biostatistics, , New York, NY ,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"286","published-online":{"date-parts":[[2026,5,22]]},"reference":[{"key":"2026052212143309400_bib1","doi-asserted-by":"publisher","first-page":"e009938","DOI":"10.1161\/CIRCOUTCOMES.123.009938","article-title":"Facilitating harmonization of variables in Framingham, MESA, ARIC, and REGARDS studies through a metadata repository","volume":"16","author":"Mallya","year":"2023","journal-title":"Circ Cardiovasc Qual Outcomes"},{"key":"2026052212143309400_bib2","doi-asserted-by":"publisher","first-page":"e15199","DOI":"10.2196\/15199","article-title":"Fast Healthcare Interoperability Resources (FHIR) as a meta model to integrate common data models: development of a tool and quantitative validation study","volume":"7","author":"Pfaff","year":"2019","journal-title":"JMIR Med Inform"},{"key":"2026052212143309400_bib3","doi-asserted-by":"publisher","first-page":"5419","DOI":"10.1038\/s41467-023-41185-x","article-title":"Demonstrating paths for unlocking the value of cloud genomics through cross cohort analysis","volume":"14","author":"Deflaux","year":"2023","journal-title":"Nat 
Commun"},{"key":"2026052212143309400_bib4","doi-asserted-by":"publisher","first-page":"293","DOI":"10.3389\/fgene.2019.00293","article-title":"A novel joint gene set analysis framework improves identification of enriched pathways in cross disease transcriptomic analysis","volume":"10","author":"Qin","year":"2019","journal-title":"Front Genet"},{"key":"2026052212143309400_bib5","doi-asserted-by":"publisher","first-page":"396","DOI":"10.3389\/fgene.2019.00396","article-title":"Integrative analysis of DiseaseLand omics database for disease signatures and treatments: a bipolar case study","volume":"10","author":"Wu","year":"2019","journal-title":"Front Genet"},{"key":"2026052212143309400_bib6","doi-asserted-by":"publisher","first-page":"843","DOI":"10.1038\/s41592-019-0509-5","article-title":"Assessment of network module identification across complex diseases","volume":"16","author":"Choobdar","year":"2019","journal-title":"Nat Methods"},{"key":"2026052212143309400_bib7","doi-asserted-by":"crossref","first-page":"238","DOI":"10.3390\/genes10030238","article-title":"Challenges in the integration of omics and non-omics data","volume":"10","author":"L\u00f3pez\u00a0de\u00a0Maturana","year":"2019","journal-title":"Genes"},{"key":"2026052212143309400_bib8","doi-asserted-by":"publisher","first-page":"20230211","DOI":"10.1259\/bjr.20230211","article-title":"Artificial intelligence (AI) and machine learning (ML) in precision oncology: a review on enhancing discoverability through multiomics integration","volume":"96","author":"Wei","year":"2023","journal-title":"Br J Radiol"},{"key":"2026052212143309400_bib9","doi-asserted-by":"publisher","first-page":"107739","DOI":"10.1016\/j.biotechadv.2021.107739","article-title":"Using machine learning approaches for multi-omics data analysis: a review","volume":"49","author":"Reel","year":"2021","journal-title":"Biotechnol 
Adv"},{"key":"2026052212143309400_bib10","doi-asserted-by":"publisher","first-page":"144","DOI":"10.1038\/s41597-020-0486-7","article-title":"The TRUST principles for digital repositories","volume":"7","author":"Lin","year":"2020","journal-title":"Sci Data"},{"key":"2026052212143309400_bib11","doi-asserted-by":"publisher","first-page":"358","DOI":"10.1038\/s41587-019-0080-8","article-title":"FAIRsharing as a community approach to standards, repositories and policies","volume":"37","author":"Sansone","year":"2019","journal-title":"Nat Biotechnol"},{"key":"2026052212143309400_bib12","doi-asserted-by":"publisher","first-page":"160018","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR guiding principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci Data"},{"key":"2026052212143309400_bib13","doi-asserted-by":"crossref","first-page":"659","DOI":"10.1038\/s41597-022-01792-7","article-title":"Harvesting metadata in clinical care: a crosswalk between FHIR, OMOP, CDISC and openEHR metadata","volume":"9","author":"B\u00f6nisch","year":"2022","journal-title":"Sci Data"},{"key":"2026052212143309400_bib14","doi-asserted-by":"publisher","first-page":"190021","DOI":"10.1038\/sdata.2019.21","article-title":"The variable quality of metadata about biological samples used in biomedical experiments","volume":"6","author":"Gon\u00e7alves","year":"2019","journal-title":"Sci Data"},{"key":"2026052212143309400_bib15","doi-asserted-by":"publisher","first-page":"592","DOI":"10.1038\/s41597-022-01707-6","article-title":"Machine actionable metadata models","volume":"9","author":"Batista","year":"2022","journal-title":"Sci Data"},{"key":"2026052212143309400_bib16","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1186\/s40793-022-00425-1","article-title":"Metadata harmonization-Standards are the key for a better usage of omics data for integrative microbiome 
analysis","volume":"17","author":"Cernava","year":"2022","journal-title":"Environ Microbiome"},{"key":"2026052212143309400_bib17","doi-asserted-by":"publisher","first-page":"e100953","DOI":"10.1136\/bmjhci-2023-100953","article-title":"Seamless EMR data access: integrated governance, digital health and the OMOP-CDM","volume":"31","author":"Hallinan","year":"2024","journal-title":"BMJ Health Care Inform"},{"key":"2026052212143309400_bib18","first-page":"574","article-title":"Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers","volume":"216","author":"Hripcsak","year":"2015","journal-title":"Stud Health Technol Inform"},{"key":"2026052212143309400_bib19","doi-asserted-by":"publisher","first-page":"btaf279","DOI":"10.1093\/bioinformatics\/btaf279","article-title":"OLS4: a new Ontology Lookup Service for a growing interdisciplinary knowledge ecosystem","volume":"41","author":"McLaughlin","year":"2025","journal-title":"Bioinformatics"},{"key":"2026052212143309400_bib20","doi-asserted-by":"publisher","first-page":"2354","DOI":"10.1093\/bioinformatics\/btq415","article-title":"ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level","volume":"26","author":"Rocca-Serra","year":"2010","journal-title":"Bioinformatics"},{"key":"2026052212143309400_bib21","doi-asserted-by":"publisher","first-page":"365","DOI":"10.1038\/ng1201-365","article-title":"Minimum information about a microarray experiment (MIAME)-toward standards for microarray data","volume":"29","author":"Brazma","year":"2001","journal-title":"Nat Genet"},{"key":"2026052212143309400_bib22","doi-asserted-by":"publisher","first-page":"415","DOI":"10.1038\/nbt.1823","article-title":"Minimum information about a marker gene sequence (MIMARKS) and minimum information about any sequence (MIxS) specifications","volume":"29","author":"Yilmaz","year":"2011","journal-title":"Nat 
Biotechnol"},{"key":"2026052212143309400_bib23","doi-asserted-by":"publisher","first-page":"696","DOI":"10.1038\/s41597-022-01815-3","article-title":"Modeling community standards for metadata as templates makes data FAIR","volume":"9","author":"Musen","year":"2022","journal-title":"Sci Data"},{"key":"2026052212143309400_bib24","doi-asserted-by":"crossref","unstructured":"Sundaram SS, Gon\u00e7alves RS, Musen MA. Toward total recall: enhancing data FAIRness through AI-driven metadata standardization. Gigascience. 2026;15:giag019. 10.1093\/gigascience\/giag019","DOI":"10.1093\/gigascience\/giag019"},{"key":"2026052212143309400_bib25","first-page":"1050","article-title":"Structured knowledge base enhances effective use of large language models for metadata curation","volume":"2024","author":"Sundaram","year":"2024","journal-title":"AMIA Annu Symp Proc"},{"key":"2026052212143309400_bib26","doi-asserted-by":"publisher","first-page":"S7","DOI":"10.1186\/1471-2105-9-S1-S7","article-title":"Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses","volume":"9","author":"Miotto","year":"2008","journal-title":"BMC Bioinf"},{"key":"2026052212143309400_bib27","doi-asserted-by":"publisher","first-page":"e0322365","DOI":"10.1371\/journal.pone.0322365","article-title":"Harmonizing CT scanner acquisition variability in an anthropomorphic phantom: a comparative study of image-level and feature-level harmonization using GAN, ComBat, and their combination","volume":"20","author":"Mali","year":"2025","journal-title":"PLoS One"},{"key":"2026052212143309400_bib28","doi-asserted-by":"publisher","first-page":"842","DOI":"10.3390\/jpm11090842","article-title":"Making radiomics more reproducible across scanner and imaging protocol variations: a review of harmonization methods","volume":"11","author":"Mali","year":"2021","journal-title":"J Pers Med"},{"key":"2026052212143309400_bib29","doi-asserted-by":"crossref","unstructured":"Ikeda S, Zou Z, Bono H 
\u00a0et al. \u00a0Extraction of biological terms using large language models enhances the usability of metadata in the BioSample database. Gigascience. 2025;14:giaf070. 10.1093\/gigascience\/giaf070","DOI":"10.1093\/gigascience\/giaf070"},{"key":"2026052212143309400_bib30","doi-asserted-by":"publisher","first-page":"20210","DOI":"10.1038\/s41598-025-06447-2","article-title":"Evaluating language model embeddings for Parkinson\u2019s disease cohort harmonization using a novel manually curated variable mapping schema","volume":"15","author":"Salimi","year":"2025","journal-title":"Sci Rep"},{"key":"2026052212143309400_bib31","doi-asserted-by":"crossref","unstructured":"Verbitsky A, Boutet P, Eslami M. Metadata harmonization from biological datasets with language models. Bioinform Adv. 2025;5:vbaf241. 10.1093\/bioadv\/vbaf241","DOI":"10.1101\/2025.01.15.633281"},{"key":"2026052212143309400_bib32","first-page":"1","article-title":"A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions","volume":"43","author":"Huang","year":"2025","journal-title":"ACM Trans Inf Syst"},{"key":"2026052212143309400_bib33","doi-asserted-by":"crossref","first-page":"1373","DOI":"10.1162\/COLI.a.16","article-title":"Siren\u2019s song in the AI ocean: a survey on hallucination in large language models","volume":"51","author":"Zhang","year":"2025","journal-title":"Comput Linguist"},{"key":"2026052212143309400_bib34","doi-asserted-by":"publisher","first-page":"34188","DOI":"10.52202\/079017-1077","article-title":"LLM-check: investigating detection of hallucinations in large language models","volume-title":"Advances in Neural Information Processing Systems 37","author":"Bharti","year":"2024"},{"key":"2026052212143309400_bib35","doi-asserted-by":"publisher","DOI":"10.1101\/2024.11.04.24316718","article-title":"Two-phase framework clinical question-answering\u2014autocorrection for guideline-concordance",
"author":"Tariq","year":"2024","journal-title":"medRxiv"},{"key":"2026052212143309400_bib36","doi-asserted-by":"publisher","first-page":"191","DOI":"10.12688\/openreseurope.20839.1","article-title":"Comparison of explainability methods for hallucination analysis in LLMs","volume":"5","author":"Papagiannopoulos","year":"2025","journal-title":"Open Res Eur"},{"key":"2026052212143309400_bib37","doi-asserted-by":"publisher","first-page":"e75608","DOI":"10.2196\/75608","article-title":"Automated data harmonization in clinical research: natural language processing approach","volume":"9","author":"Mallya","year":"2025","journal-title":"JMIR Form Res"},{"key":"2026052212143309400_bib38","doi-asserted-by":"publisher","first-page":"e0328262","DOI":"10.1371\/journal.pone.0328262","article-title":"A natural language processing approach to support biomedical data harmonization: leveraging large language models","volume":"20","author":"Li","year":"2025","journal-title":"PLoS One"},{"key":"2026052212143309400_bib39","first-page":"1680","article-title":"Data harmonization and data pooling from cohort studies: a practical approach for data management","volume":"6","author":"Adhikari","year":"2021","journal-title":"Int J Popul Data Sci"},{"key":"2026052212143309400_bib40","doi-asserted-by":"publisher","first-page":"238","DOI":"10.1186\/s12874-021-01434-3","article-title":"Standardizing registry data to the OMOP Common Data Model: experience from three pulmonary hypertension databases","volume":"21","author":"Biedermann","year":"2021","journal-title":"BMC Med Res Methodol"},{"key":"2026052212143309400_bib41","doi-asserted-by":"publisher","first-page":"1023","DOI":"10.1038\/nmeth.4468","article-title":"Accessible, curated metagenomic data through ExperimentHub","volume":"14","author":"Pasolli","year":"2017","journal-title":"Nat 
Methods"},{"key":"2026052212143309400_bib42","doi-asserted-by":"publisher","first-page":"958","DOI":"10.1200\/CCI.19.00119","article-title":"Multiomic integration of public oncology databases in bioconductor","volume":"4","author":"Ramos","year":"2020","journal-title":"JCO Clin Cancer Inform"},{"key":"2026052212143309400_bib43","doi-asserted-by":"publisher","first-page":"401","DOI":"10.1158\/2159-8290.CD-12-0095","article-title":"The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data","volume":"2","author":"Cerami","year":"2012","journal-title":"Cancer Discov"},{"key":"2026052212143309400_bib44","doi-asserted-by":"publisher","first-page":"l1","DOI":"10.1126\/scisignal.2004088","article-title":"Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal","volume":"6","author":"Gao","year":"2013","journal-title":"Sci Signal"},{"key":"2026052212143309400_bib45","doi-asserted-by":"publisher","first-page":"3861","DOI":"10.1158\/0008-5472.CAN-23-0816","article-title":"Analysis and visualization of longitudinal genomic and clinical data from the AACR Project GENIE Biopharma Collaborative in cBioPortal","volume":"83","author":"de\u00a0Bruijn","year":"2023","journal-title":"Cancer Res"},{"key":"2026052212143309400_bib46","volume-title":"Welcome to the EMBL-EBI Ontology Lookup Service.","author":"Ontology Lookup Service (OLS)"},{"key":"2026052212143309400_bib47","year":"2017"},{"key":"2026052212143309400_bib48","doi-asserted-by":"publisher","first-page":"1104","DOI":"10.1093\/bioinformatics\/btw763","article-title":"ontologyX: a suite of R packages for working with ontological data","volume":"33","author":"Greene","year":"2017","journal-title":"Bioinformatics"},{"key":"2026052212143309400_bib49","year":"2017"},{"key":"2026052212143309400_bib50","first-page":"133","article-title":"Linking Data to Ontologies","volume-title":"Journal on Data Semantics X, Lecture notes in computer 
science","author":"Poggi","year":"2008"},{"key":"2026052212143309400_bib51","author":"Sehyun\u00a0Oh","year":"2024"},{"key":"2026052212143309400_bib52","doi-asserted-by":"publisher","first-page":"114","DOI":"10.1038\/s41568-021-00408-3","article-title":"Harnessing multimodal data integration to advance precision oncology","volume":"22","author":"Boehm","year":"2022","journal-title":"Nat Rev Cancer"},{"key":"2026052212143309400_bib53","doi-asserted-by":"publisher","DOI":"10.1101\/2025.06.10.658658","article-title":"Multi-agent AI system for high quality metadata curation at scale","author":"Mondal","year":"2025","journal-title":"bioRxiv"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baag027\/68369007\/baag027.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baag027\/68369007\/baag027.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,22]],"date-time":"2026-05-22T16:14:42Z","timestamp":1779466482000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baag027\/8690724"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026]]},"references-count":53,"URL":"https:\/\/doi.org\/10.1093\/database\/baag027","relation":{},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2026]]},"published":{"date-parts":[[2026]]},"article-number":"baag027"}}