{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T23:39:02Z","timestamp":1773790742069,"version":"3.50.1"},"reference-count":54,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2022,6,3]],"date-time":"2022-06-03T00:00:00Z","timestamp":1654214400000},"content-version":"vor","delay-in-days":153,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,6,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>The Gene Expression Omnibus (GEO) is a public archive containing &gt;4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http:\/\/gmql.eu\/gemi\/), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments. The extracted information can then be indexed for structured search and used for various downstream data mining activities. GeMI works in continuous interaction with its users. The natural language processing transformer-based model at the core of our system is a fine-tuned version of the Generative Pre-trained Transformer 2 (GPT2) model that is able to learn continuously from the feedback of the users thanks to an active learning framework designed for the purpose. 
As a part of such a framework, a machine learning interpretation mechanism (that exploits saliency maps) allows the users to understand easily and quickly whether the predictions of the model are correct and improves the overall usability. GeMI\u2019s ability to extract attributes not explicitly mentioned (such as sex, tissue type, cell type, ethnicity and disease) allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation. The usefulness of GeMI is demonstrated on practical research use cases.<\/jats:p>\n               <jats:p>Database URL<\/jats:p>\n               <jats:p>http:\/\/gmql.eu\/gemi\/<\/jats:p>","DOI":"10.1093\/database\/baac036","type":"journal-article","created":{"date-parts":[[2022,4,27]],"date-time":"2022-04-27T11:24:29Z","timestamp":1651058669000},"source":"Crossref","is-referenced-by-count":14,"title":["GeMI: interactive interface for transformer-based Genomic Metadata Integration"],"prefix":"10.1093","volume":"2022","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5465-6182","authenticated-orcid":false,"given":"Giuseppe","family":"Serna Garcia","sequence":"first","affiliation":[{"name":"Department of Electronics, Information, and Bioengineering, Politecnico di Milano , Via Ponzio 34\/5, Milano 20133, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2618-5985","authenticated-orcid":false,"given":"Michele","family":"Leone","sequence":"additional","affiliation":[{"name":"Department of Electronics, Information, and Bioengineering, Politecnico di Milano , Via Ponzio 34\/5, Milano 20133, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8016-5750","authenticated-orcid":false,"given":"Anna","family":"Bernasconi","sequence":"additional","affiliation":[{"name":"Department of Electronics, Information, and Bioengineering, Politecnico di Milano , Via Ponzio 34\/5, Milano 20133, 
Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6575-9737","authenticated-orcid":false,"given":"Mark J","family":"Carman","sequence":"additional","affiliation":[{"name":"Department of Electronics, Information, and Bioengineering, Politecnico di Milano , Via Ponzio 34\/5, Milano 20133, Italy"}]}],"member":"286","published-online":{"date-parts":[[2022,6,3]]},"reference":[{"key":"2022060311141246800_R1","doi-asserted-by":"crossref","first-page":"D991","DOI":"10.1093\/nar\/gks1193","article-title":"NCBI GEO: archive for functional genomics data sets\u2013update","volume":"41","author":"Barrett","year":"2012","journal-title":"Nucleic Acids Res."},{"key":"2022060311141246800_R2","doi-asserted-by":"crossref","first-page":"D54","DOI":"10.1093\/nar\/gkr854","article-title":"The Sequence Read Archive: explosive growth of sequencing data","volume":"40","author":"Kodama","year":"2011","journal-title":"Nucleic Acids Res."},{"key":"2022060311141246800_R3","doi-asserted-by":"crossref","first-page":"987","DOI":"10.1093\/nar\/gks1174","article-title":"ArrayExpress update\u2013trends in database growth and links to data analysis tools","volume":"41","author":"Rustici","year":"2013","journal-title":"Nucleic Acids Res."},{"key":"2022060311141246800_R4","doi-asserted-by":"crossref","first-page":"30","DOI":"10.1093\/bib\/bbaa080","article-title":"The road towards data integration in human genomics: players, steps and interactions","volume":"22","author":"Bernasconi","year":"2021","journal-title":"Brief. Bioinform."},{"key":"2022060311141246800_R5","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1038\/nmeth1156","article-title":"Next-generation sequencing transforms today\u2019s biology","volume":"5","author":"Schuster","year":"2008","journal-title":"Nat. 
Methods"},{"key":"2022060311141246800_R6","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1007\/s12551-018-0490-8","article-title":"Mining data and metadata from the gene expression omnibus","volume":"11","author":"Wang","year":"2019","journal-title":"Biophys. Rev."},{"key":"2022060311141246800_R7","first-page":"187","article-title":"Automated integration of genomic metadata with sequence-to-sequence models","author":"Cannizzaro","year":"2020"},{"key":"2022060311141246800_R8","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1007\/BF00993277","article-title":"Improving generalization with active learning","volume":"15","author":"Cohn","year":"1994","journal-title":"Mach. Learn."},{"key":"2022060311141246800_R9","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2020.emnlp-main.263","article-title":"A diagnostic study of explainability techniques for text classification","volume-title":"arXiv preprint arXiv:200913295","author":"Atanasova","year":"2020"},{"key":"2022060311141246800_R10","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI Blog"},{"key":"2022060311141246800_R11","doi-asserted-by":"crossref","first-page":"D729","DOI":"10.1093\/nar\/gky1094","article-title":"Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis","volume":"47","author":"Zheng","year":"2018","journal-title":"Nucleic Acids Res."},{"key":"2022060311141246800_R12","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/nature11247","article-title":"An integrated encyclopedia of DNA elements in the human genome","volume":"489","author":"Consortium ENCODE","year":"2012","journal-title":"Nature"},{"key":"2022060311141246800_R13","first-page":"5998","article-title":"Attention is all you 
need","author":"Vaswani","year":"2017"},{"key":"2022060311141246800_R14","doi-asserted-by":"crossref","first-page":"52138","DOI":"10.1109\/ACCESS.2018.2870052","article-title":"Peeking inside the black-box: a survey on explainable artificial intelligence (XAI)","volume":"6","author":"Adadi","year":"2018","journal-title":"IEEE Access"},{"key":"2022060311141246800_R15","article-title":"Investigating the influence of noise and distractors on the interpretation of neural networks","volume-title":"arXiv preprint arXiv:161107270","author":"Kindermans","year":"2016"},{"key":"2022060311141246800_R16","doi-asserted-by":"crossref","first-page":"325","DOI":"10.1007\/978-3-319-69904-2_26","volume-title":"Conceptual Modeling","author":"Bernasconi","year":"2017"},{"key":"2022060311141246800_R17","first-page":"46","article-title":"Overview of GeCo: a project for exploring and integrating signals from the genome","author":"Ceri","year":"2017"},{"key":"2022060311141246800_R18","doi-asserted-by":"crossref","first-page":"543","DOI":"10.1109\/TCBB.2020.2998954","article-title":"META-BASE: a novel architecture for large-scale genomic metadata integration","volume":"19","author":"Bernasconi","year":"2020","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform."},{"key":"2022060311141246800_R19","doi-asserted-by":"publisher","DOI":"10.1093\/database\/baz132","article-title":"GenoSurf: metadata driven semantic search system for integrated genomic datasets","author":"Canakoglu","year":"2019","journal-title":"Database"},{"key":"2022060311141246800_R20","article-title":"Ontology-based annotations and semantic relations in large-scale (epi) genomics data","volume":"18","author":"Galeota","year":"2017","journal-title":"Brief. 
Bioinformatics"},{"key":"2022060311141246800_R21","doi-asserted-by":"crossref","first-page":"1183","DOI":"10.1093\/bioinformatics\/btab815","article-title":"Identification, semantic annotation and comparison of combinations of functional elements in multiple biological conditions","volume":"38","author":"Leone","year":"2021","journal-title":"Bioinformatics"},{"key":"2022060311141246800_R22","doi-asserted-by":"crossref","DOI":"10.1038\/sdata.2017.125","article-title":"Precision annotation of digital samples in NCBI\u2019s gene expression omnibus","volume":"4","author":"Hadley","year":"2017","journal-title":"Sci. Data."},{"key":"2022060311141246800_R23","doi-asserted-by":"crossref","DOI":"10.1186\/s12859-017-1888-1","article-title":"ALE: automated label extraction from GEO metadata","volume":"18","author":"Giles","year":"2017","journal-title":"BMC Bioinform."},{"key":"2022060311141246800_R24","doi-asserted-by":"publisher","DOI":"10.1093\/database\/bay145","article-title":"Restructured GEO: restructuring Gene Expression Omnibus metadata for genome dynamics analysis","volume":"2019","author":"Chen","year":"2019","journal-title":"Database"},{"key":"2022060311141246800_R25","article-title":"AMMU: a survey of transformer-based biomedical pretrained language models","volume":"126","author":"Kalyan","year":"2021","journal-title":"J. Biomed. Inform."},{"key":"2022060311141246800_R26","doi-asserted-by":"crossref","DOI":"10.1007\/s10916-018-1003-9","article-title":"A survey of data mining and deep learning in bioinformatics","volume":"42","author":"Lan","year":"2018","journal-title":"J. Med. 
Syst."},{"key":"2022060311141246800_R27","doi-asserted-by":"crossref","first-page":"1750","DOI":"10.1016\/j.csbj.2021.03.022","article-title":"The language of proteins: NLP, machine learning & protein sequences","volume":"19","author":"Ofer","year":"2021","journal-title":"Comput Struct Biotechnol J"},{"key":"2022060311141246800_R28","doi-asserted-by":"crossref","DOI":"10.1007\/s00439-021-02411-y","article-title":"Embeddings from protein language models predict conservation and variant effects","author":"Marquet","year":"2021","journal-title":"Hum. Genet."},{"key":"2022060311141246800_R29","doi-asserted-by":"crossref","first-page":"2112","DOI":"10.1093\/bioinformatics\/btab083","article-title":"DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome","volume":"37","author":"Ji","year":"2021","journal-title":"Bioinformatics"},{"key":"2022060311141246800_R30","doi-asserted-by":"crossref","DOI":"10.1093\/bib\/bbab005","article-title":"A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information","volume":"22","author":"Le","year":"2021","journal-title":"Brief. Bioinformatics"},{"key":"2022060311141246800_R31","article-title":"Deep transformers and convolutional neural network in identifying DNA N6-methyladenine sites in cross-species genomes","author":"Le","year":"2021","journal-title":"Methods"},{"key":"2022060311141246800_R32","doi-asserted-by":"crossref","first-page":"1191","DOI":"10.1093\/bioinformatics\/btab823","article-title":"miRe2e: a full end-to-end deep model based on transformers for prediction of pre-miRNAs","volume":"38","author":"Raad","year":"2021","journal-title":"Bioinformatics"},{"key":"2022060311141246800_R33","doi-asserted-by":"crossref","DOI":"10.1093\/bib\/bbab200","article-title":"A novel antibacterial peptide recognition algorithm based on BERT","volume":"22","author":"Zhang","year":"2021","journal-title":"Brief. 
Bioinformatics"},{"key":"2022060311141246800_R34","doi-asserted-by":"crossref","first-page":"2556","DOI":"10.1093\/bioinformatics\/btab133","article-title":"BERT4Bitter: a bidirectional encoder representations from transformers (BERT)-based model for improving the prediction of bitter peptides","volume":"37","author":"Charoenkwan","year":"2021","journal-title":"Bioinformatics"},{"key":"2022060311141246800_R35","doi-asserted-by":"crossref","DOI":"10.1093\/bib\/bbab060","article-title":"Explainability in transformer models for functional genomics","volume":"22","author":"Clauwaert","year":"2021","journal-title":"Brief. Bioinformatics"},{"key":"2022060311141246800_R36","doi-asserted-by":"crossref","first-page":"404","DOI":"10.1093\/bioinformatics\/btaa721","article-title":"LBERT: lexically aware transformer-based bidirectional encoder representation model for learning universal bio-entity relations","volume":"37","author":"Warikoo","year":"2021","journal-title":"Bioinformatics"},{"key":"2022060311141246800_R37","doi-asserted-by":"crossref","first-page":"5678","DOI":"10.1093\/bioinformatics\/btaa1087","article-title":"BERT-GT: cross-sentence n-ary relation extraction with BERT and Graph Transformer","volume":"36","author":"Lai","year":"2020","journal-title":"Bioinformatics"},{"key":"2022060311141246800_R38","article-title":"Deep inside convolutional networks: visualising image classification models and saliency maps","volume-title":"arXiv preprint arXiv:13126034","author":"Simonyan","year":"2013"},{"key":"2022060311141246800_R39","first-page":"1135","article-title":"\u2018Why should i trust you?\u2019 Explaining the predictions of any classifier","author":"Ribeiro","year":"2016"},{"key":"2022060311141246800_R40","article-title":"Neural machine translation by jointly learning to align and translate","author":"Bahdanau","year":"2014"},{"key":"2022060311141246800_R41","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2021.acl-demo.30","article-title":"Ecco: an open source 
library for the explainability of transformer language models","author":"Alammar","year":"2021"},{"key":"2022060311141246800_R42","doi-asserted-by":"crossref","first-page":"2798","DOI":"10.1093\/bioinformatics\/btn520","article-title":"GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus","volume":"24","author":"Zhu","year":"2008","journal-title":"Bioinformatics"},{"key":"2022060311141246800_R43","doi-asserted-by":"crossref","DOI":"10.7171\/jbt.18-2902-002","article-title":"The Cellosaurus, a cell-line knowledge resource","volume":"29","author":"Bairoch","year":"2018","journal-title":"J. Biomol. Tech."},{"key":"2022060311141246800_R44","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41598-020-57716-1","article-title":"Ontology-driven integrative analysis of omics data through OnASSiS","volume":"10","author":"Galeota","year":"2020","journal-title":"Sci Rep"},{"key":"2022060311141246800_R45","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13326-016-0088-7","article-title":"The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability","volume":"7","author":"Diehl","year":"2016","journal-title":"J. Biomed. 
Semantics"},{"key":"2022060311141246800_R46","doi-asserted-by":"crossref","first-page":"D955","DOI":"10.1093\/nar\/gky1032","article-title":"Human Disease Ontology 2018 update: classification, content and workflow expansion","volume":"47","author":"Schriml","year":"2019","journal-title":"Nucleic Acids Res."},{"key":"2022060311141246800_R47","doi-asserted-by":"crossref","first-page":"838","DOI":"10.1016\/B978-0-12-809633-8.20397-X","article-title":"Biological and Medical Ontologies: Disease Ontology (DO)","volume":"1","author":"Bernasconi","year":"2019","journal-title":"Reference Module in Life Sciences, Encyclopedia of Bioinformatics and Computational Biology"},{"key":"2022060311141246800_R48","doi-asserted-by":"crossref","DOI":"10.1186\/s12859-019-3159-9","article-title":"PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets","volume":"20","author":"Nanni","year":"2019","journal-title":"BMC Bioinform."},{"key":"2022060311141246800_R49","doi-asserted-by":"crossref","first-page":"729","DOI":"10.1093\/bioinformatics\/bty688","article-title":"Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data","volume":"35","author":"Masseroli","year":"2019","journal-title":"Bioinformatics"},{"key":"2022060311141246800_R50","doi-asserted-by":"crossref","first-page":"1881","DOI":"10.1093\/bioinformatics\/btv048","article-title":"GenoMetric Query Language: a novel approach to large-scale genomic data management","volume":"31","author":"Masseroli","year":"2015","journal-title":"Bioinformatics"},{"key":"2022060311141246800_R51","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1038\/nmeth.1906","article-title":"ChromHMM: automating chromatin-state discovery and characterization","volume":"9","author":"Ernst","year":"2012","journal-title":"Nat. 
Methods"},{"key":"2022060311141246800_R52","first-page":"83","article-title":"Exploiting conceptual modeling for searching genomic metadata: A quantitative and qualitative empirical study","author":"Bernasconi","year":"2019","journal-title":"International Conference on Conceptual Modeling"},{"key":"2022060311141246800_R53","first-page":"1","article-title":"Ontology-driven metadata enrichment for genomic datasets","author":"Bernasconi","year":"2018","journal-title":"11th International Conference Semantic Web Applications and Tools for Life Sciences, SWAT4LS 2018"},{"key":"2022060311141246800_R54","article-title":"BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension","author":"Lewis","year":"2019"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baac036\/43944508\/baac036.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baac036\/43944508\/baac036.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,6,3]],"date-time":"2022-06-03T11:15:01Z","timestamp":1654254901000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baac036\/6600540"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,1]]},"references-count":54,"URL":"https:\/\/doi.org\/10.1093\/database\/baac036","relation":{},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,1,1]]},"published":{"date-parts":[[2022,1,1]]},"article-number":"baac036"}}