{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T11:48:08Z","timestamp":1753876088901,"version":"3.41.2"},"reference-count":61,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2024,9,27]],"date-time":"2024-09-27T00:00:00Z","timestamp":1727395200000},"content-version":"vor","delay-in-days":270,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,9,27]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find, curate, and search them effectively. The lack of metadata poses a significant challenge in the utilization of these data sets. Machine learning (ML)\u2013based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific data sets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming and not always feasible; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining data sets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information that is only available for select documents within a corpus to validate ML models, which can then be used to describe the remaining documents in the corpus. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches in the context of environmental genomics research for ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching with those suggested by a ML keyword extraction algorithm.<\/jats:p>","DOI":"10.1093\/database\/baae093","type":"journal-article","created":{"date-parts":[[2024,9,27]],"date-time":"2024-09-27T15:20:17Z","timestamp":1727450417000},"source":"Crossref","is-referenced-by-count":2,"title":["Automated annotation of scientific texts for ML-based keyphrase extraction and validation"],"prefix":"10.1093","volume":"2024","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4569-5729","authenticated-orcid":false,"given":"Oluwamayowa O","family":"Amusat","sequence":"first","affiliation":[{"name":"Scientific Data Division, Lawrence Berkeley National Laboratory , 1 Cyclotron road, Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Harshad","family":"Hegde","sequence":"additional","affiliation":[{"name":"Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory , 1 Cyclotron road, Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6601-2165","authenticated-orcid":false,"given":"Christopher J","family":"Mungall","sequence":"additional","affiliation":[{"name":"Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory , 1 Cyclotron road, Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Anna","family":"Giannakou","sequence":"additional","affiliation":[{"name":"Scientific Data Division, Lawrence Berkeley National Laboratory , 1 Cyclotron road, Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Neil P","family":"Byers","sequence":"additional","affiliation":[{"name":"DOE Joint Genome Institute, Lawrence Berkeley National Laboratory , 1 Cyclotron road, Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dan","family":"Gunter","sequence":"additional","affiliation":[{"name":"Scientific Data Division, Lawrence Berkeley National Laboratory , 1 Cyclotron road, Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kjiersten","family":"Fagnan","sequence":"additional","affiliation":[{"name":"DOE Joint Genome Institute, Lawrence Berkeley National Laboratory , 1 Cyclotron road, Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lavanya","family":"Ramakrishnan","sequence":"additional","affiliation":[{"name":"Scientific Data Division, Lawrence Berkeley National Laboratory , 1 Cyclotron road, Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2024,9,27]]},"reference":[{"key":"2025040107470752700_R1","doi-asserted-by":"publisher","first-page":"pp. 26","DOI":"10.1109\/MLHPC.2018.8638633","article-title":"Automated labeling of electron microscopy images using deep learning","author":"Weber","year":"2018"},{"key":"2025040107470752700_R2","doi-asserted-by":"publisher","DOI":"10.1002\/widm.1339","article-title":"A review of keyphrase extraction","volume":"10","author":"Papagiannopoulou","year":"2019","journal-title":"WIREs Data Mining and Knowledge Discovery"},{"key":"2025040107470752700_R3","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1016\/j.ins.2019.09.013","article-title":"YAKE! keyword extraction from single documents using multiple local features","volume":"509","author":"Campos","year":"2020","journal-title":"Information Sciences"},{"key":"2025040107470752700_R4","doi-asserted-by":"publisher","first-page":"pp.11173","DOI":"10.18653\/v1\/2023.acl-long.626","article-title":"Is GPT3-3 a good data annotator?","author":"Ding","year":"2023"},{"key":"2025040107470752700_R5","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-13-161","article-title":"Concept annotation in the CRAFT corpus","volume":"13","author":"Bada","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2025040107470752700_R6","doi-asserted-by":"publisher","first-page":"e00936","DOI":"10.1128\/msystems.00936-23","article-title":"Multiple microbial guilds mediate soil methane cycling along a wetland salinity gradient","volume":"9","author":"Hartman","year":"2024","journal-title":"Msystems"},{"key":"2025040107470752700_R7","doi-asserted-by":"publisher","DOI":"10.1126\/sciadv.adg7888","article-title":"Reproducible growth of Brachypodium in EcoFAB 2.0 reveals that nitrogen form and starvation modulate root exudation","volume":"10","author":"Novak","year":"2024","journal-title":"Science Advances"},{"key":"2025040107470752700_R8","doi-asserted-by":"publisher","DOI":"10.1093\/database\/baab069","article-title":"OBO foundry in 2021: operationalizing open data principles to evaluate ontologies","volume":"2021","author":"Jackson","year":"2021","journal-title":"Database"},{"key":"2025040107470752700_R9","doi-asserted-by":"crossref","DOI":"10.1109\/eScience.2018.00025","article-title":"ScienceSearch: enabling search through automatic metadata generation","author":"Rodrigo","year":"2018"},{"article-title":"Sci-key: a keyword extraction pipeline for scientific documents","year":"2024","author":"Giannakou","key":"2025040107470752700_R10"},{"key":"2025040107470752700_R11","first-page":"pp.404","article-title":"Textrank: Bringing order into text","author":"Mihalcea","year":"2004"},{"key":"2025040107470752700_R12","first-page":"1","volume-title":"Text Mining","author":"R","year":"2010"},{"key":"2025040107470752700_R13","doi-asserted-by":"publisher","DOI":"10.1186\/s13326-017-0157-6","article-title":"Entity recognition in the biomedical domain using a hybrid approach","volume":"8","author":"Basaldella","year":"2017","journal-title":"Journal of Biomedical Semantics"},{"key":"2025040107470752700_R14","doi-asserted-by":"publisher","DOI":"10.1186\/s13321-018-0326-3","article-title":"OGER++: hybrid multi-type entity recognition","volume":"11","author":"Furrer","year":"2019","journal-title":"Journal of Cheminformatics"},{"key":"2025040107470752700_R15","first-page":"319","article-title":"ScispaCy: Fast and robust models for biomedical natural language processing","author":"Neumann","year":"2019"},{"key":"2025040107470752700_R16","doi-asserted-by":"publisher","DOI":"10.1186\/2041-1480-4-43","article-title":"The environment ontology: contextualising biological and biomedical entities","volume":"4","author":"Buttigieg","year":"2013","journal-title":"Journal of Biomedical Semantics"},{"key":"2025040107470752700_R17","doi-asserted-by":"publisher","DOI":"10.1186\/s13326-016-0097-6","article-title":"The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation","volume":"7","author":"Luigi Buttigieg","year":"2016","journal-title":"Journal of Biomedical Semantics"},{"key":"2025040107470752700_R18","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene ontology: tool for the unification of biology","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nature Genetics"},{"key":"2025040107470752700_R19","doi-asserted-by":"crossref","first-page":"D325","DOI":"10.1093\/nar\/gkaa1113","article-title":"The Gene Ontology resource: enriching a GOld mine","volume":"49","author":"The Gene Ontology Consortium","year":"2020","journal-title":"Nucleic Acids Research"},{"key":"2025040107470752700_R20","doi-asserted-by":"publisher","first-page":"D1214","DOI":"10.1093\/nar\/gkv1031","article-title":"ChEBI in 2016: Improved services and an expanding collection of metabolites","volume":"44","author":"Hastings","year":"2016","journal-title":"Nucleic Acids Research"},{"key":"2025040107470752700_R21","doi-asserted-by":"crossref","first-page":"D136","DOI":"10.1093\/nar\/gkr1178","article-title":"The NCBI taxonomy database","volume":"40","author":"Federhen","year":"2011","journal-title":"Nucleic Acids Research"},{"key":"2025040107470752700_R22","doi-asserted-by":"publisher","first-page":"1325","DOI":"10.1093\/bioinformatics\/btt113","article-title":"EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats","volume":"29","author":"Ison","year":"2013","journal-title":"Bioinformatics"},{"key":"2025040107470752700_R23","doi-asserted-by":"publisher","DOI":"10.1093\/pcp\/pcs163","article-title":"The plant ontology as a tool for comparative plant anatomy and genomic analyses","volume":"54","author":"Cooper","year":"2013","journal-title":"Plant and Cell Physiology"},{"key":"2025040107470752700_R24","doi-asserted-by":"crossref","first-page":"D1168","DOI":"10.1093\/nar\/gkx1152","article-title":"The planteome database: an integrated resource for reference ontologies, plant genomics and phenomics","volume":"46","author":"Cooper","year":"2017","journal-title":"Nucleic Acids Research"},{"article-title":"Chemical reactions ontology (RXNO)","year":"2020","author":"Batchelor","key":"2025040107470752700_R25"},{"key":"2025040107470752700_R26"},{"key":"2025040107470752700_R27","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0154556","article-title":"Whetzel, and Jie Zheng. The ontology for biomedical investigations","volume":"11","author":"Bandrowski","year":"2016","journal-title":"PLoS One"},{"key":"2025040107470752700_R28","doi-asserted-by":"publisher","first-page":"1008","DOI":"10.1093\/bib\/bbx035","article-title":"The anatomy of phenotype ontologies: principles, properties and applications","volume":"19","author":"Gkoutos","year":"2018","journal-title":"Briefings in Bioinformatics"},{"key":"2025040107470752700_R29","doi-asserted-by":"publisher","DOI":"10.1186\/gb-2004-6-1-r8","article-title":"Using ontologies to describe mouse phenotypes","volume":"6","author":"Gkoutos","year":"2004","journal-title":"Genome Biology"},{"key":"2025040107470752700_R30","doi-asserted-by":"crossref","first-page":"684","DOI":"10.1007\/978-3-319-76941-7_63","volume-title":"Advances in Information Retrieval","author":"Campos","year":"2018"},{"key":"2025040107470752700_R31"},{"key":"2025040107470752700_R32","first-page":"pp.4140","article-title":"Revisiting the gold standard: grounding summarization evaluation with robust human evaluation","author":"Liu","year":"2023"},{"key":"2025040107470752700_R33","doi-asserted-by":"publisher","first-page":"971","DOI":"10.1111\/ajps.12291","article-title":"Computer-assisted keyword and document set discovery from unstructured text","volume":"61","author":"King","year":"2017","journal-title":"American Journal of Political Science"},{"article-title":"Vicuna: an open-source chatbot impressing gpt-4 with 90%* chatgpt quality","year":"2023","author":"Chiang","key":"2025040107470752700_R34"},{"key":"2025040107470752700_R35","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2307.09288","article-title":"Llama 2: Open foundation and fine-tuned chat models","author":"Touvron","year":"2023","journal-title":"arXiv"},{"key":"2025040107470752700_R36","first-page":"1","article-title":"Ontology development 101: a guide to creating your first ontology","author":"Noy","year":"2001"},{"key":"2025040107470752700_R37","doi-asserted-by":"publisher","DOI":"10.17226\/26755","volume-title":"Ontologies in the Behavioral Sciences: Accelerating research and the spread of knowledge","author":"National Academies of Sciences, Engineering, and Medicine; Division of Behavioral and Social Sciences and Education; Board on Behavioral, Cognitive, and Sensory Sciences; Committee on Accelerating Behavioral Science through Ontology Development and Use.","year":"2022"},{"article-title":"spaCy2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing","year":"2017","author":"Honnibal","key":"2025040107470752700_R38"},{"key":"2025040107470752700_R39","first-page":"pp. 481","article-title":"Probase: a probabilistic taxonomy for text understanding","author":"Wentao","year":"2012"},{"key":"2025040107470752700_R40","first-page":"522","article-title":"Extending biology models with deep nlp over scientific articles","author":"McDonald","year":"2016"},{"key":"2025040107470752700_R41","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2019.102088","article-title":"Textual keyword extraction and summarization: state-of-the-art","volume":"56","author":"Nasar","year":"2019","journal-title":"Information Processing and Management"},{"key":"2025040107470752700_R42","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2006.05477","article-title":"Unsupervised paraphrase generation using pre-trained language models","author":"Hegde","year":"2020","journal-title":"arXiv"},{"key":"2025040107470752700_R43","doi-asserted-by":"publisher","first-page":"232","DOI":"10.1016\/j.eswa.2016.03.045","article-title":"Ensemble of keyword extraction methods and classifiers in text classification","volume":"57","author":"Onan","year":"2016","journal-title":"Expert Systems With Applications"},{"key":"2025040107470752700_R44","first-page":"1","article-title":"Rapid automatic keyword extraction and word frequency in scientific article keywords extraction","author":"Rinartha","year":"2021"},{"volume-title":"Keyphrase Extraction Techniques.","year":"2021","author":"Papagiannopoulou","key":"2025040107470752700_R45"},{"first-page":"1262","article-title":"Automatic keyphrase extraction: a survey of the state of the art","author":"Hasan","key":"2025040107470752700_R46"},{"key":"2025040107470752700_R47","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2102.01335","article-title":"Neural Data Augmentation via Example Extrapolation","author":"Lee","year":"2021","journal-title":"arXiv"},{"key":"2025040107470752700_R48","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2003.02245","article-title":"Data augmentation using pre-trained transformer models","author":"Kumar","year":"2020","journal-title":"arXiv"},{"key":"2025040107470752700_R49","first-page":"2991","article-title":"A little goes a long way: Improving toxic language classification despite data scarcity","author":"Juuti","year":"2020"},{"key":"2025040107470752700_R50","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2004.13845","article-title":"DARE: data augmented relation extraction with GPT-2","author":"Papanikolaou","year":"2020","journal-title":"arXiv"},{"key":"2025040107470752700_R51","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2302.13007","article-title":"Auggpt: leveraging chatgpt for text data augmentation","author":"Dai","year":"2023","journal-title":"arXiv"},{"key":"2025040107470752700_R52","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2104.08826","article-title":"Gpt3mix: leveraging large-scale language models for text augmentation","author":"Min Yoo","year":"2021","journal-title":"arXiv"},{"key":"2025040107470752700_R53","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1911.03118","article-title":"Not enough data? Deep learning to the rescue!","author":"Anaby-Tavor","year":"2019","journal-title":"arXiv"},{"key":"2025040107470752700_R54","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI blog"},{"key":"2025040107470752700_R55","first-page":"1877","article-title":"Language Models are Few-Shot Learners","author":"Brown","year":"2020"},{"key":"2025040107470752700_R56","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2109.09193","article-title":"Towards zero-label language learning","author":"Wang","year":"2021","journal-title":"arXiv"},{"key":"2025040107470752700_R57","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2303.08774","article-title":"GPT-4 Technical report","author":"OpenAI","year":"2023","journal-title":"arXiv"},{"key":"2025040107470752700_R58","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2307.09288","article-title":"Llama: Open and Efficient Foundation Language Models","author":"Touvron","year":"2023","journal-title":"arXiv"},{"key":"2025040107470752700_R59","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1712.04621","article-title":"The effectiveness of data augmentation in image classification using deep learning","author":"Perez","year":"2017","journal-title":"arXiv"},{"key":"2025040107470752700_R60","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btae075","article-title":"Genegpt: Augmenting large language models with domain tools for improved access to biomedical information","volume":"40","author":"Jin","year":"2024","journal-title":"Bioinformatics"},{"key":"2025040107470752700_R61","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2401.11817","article-title":"Hallucination is inevitable: an innate limitation of large language models","author":"Ziwei","year":"2024","journal-title":"arXiv"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baae093\/59376480\/baae093.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baae093\/59376480\/baae093.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,1]],"date-time":"2025-04-01T07:52:23Z","timestamp":1743493943000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baae093\/7780645"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":61,"URL":"https:\/\/doi.org\/10.1093\/database\/baae093","relation":{},"ISSN":["1758-0463"],"issn-type":[{"type":"electronic","value":"1758-0463"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]},"article-number":"baae093"}}