{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T06:17:17Z","timestamp":1768285037651,"version":"3.49.0"},"reference-count":24,"publisher":"Oxford University Press (OUP)","issue":"12","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2015,6,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: Improvement of sequencing technologies and data processing pipelines is rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. They allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art \u2018big data\u2019 computing strategies, with abstraction levels beyond available tool capabilities.<\/jats:p>\n               <jats:p>Results: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic \u2018big data\u2019 analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data and associated experimental, biological and clinical metadata and interoperability between many data formats. 
Based on Hadoop framework and Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets.<\/jats:p>\n               <jats:p>Availability and implementation: The GMQL toolkit is freely available for non-commercial use at http:\/\/www.bioinformatics.deib.polimi.it\/GMQL\/.<\/jats:p>\n               <jats:p>Contact: \u00a0marco.masseroli@polimi.it<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btv048","type":"journal-article","created":{"date-parts":[[2015,2,4]],"date-time":"2015-02-04T01:18:07Z","timestamp":1423012687000},"page":"1881-1888","source":"Crossref","is-referenced-by-count":83,"title":["GenoMetric Query Language: a novel approach to large-scale genomic data management"],"prefix":"10.1093","volume":"31","author":[{"given":"Marco","family":"Masseroli","sequence":"first","affiliation":[{"name":"1 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133, Milan and 2Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), 20139 Milan, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pietro","family":"Pinoli","sequence":"additional","affiliation":[{"name":"1 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133, Milan and 2Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), 20139 Milan, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Francesco","family":"Venco","sequence":"additional","affiliation":[{"name":"1 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133, Milan and 2Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), 20139 Milan, 
Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Abdulrahman","family":"Kaitoua","sequence":"additional","affiliation":[{"name":"1 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133, Milan and 2Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), 20139 Milan, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vahid","family":"Jalili","sequence":"additional","affiliation":[{"name":"1 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133, Milan and 2Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), 20139 Milan, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fernando","family":"Palluzzi","sequence":"additional","affiliation":[{"name":"1 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133, Milan and 2Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), 20139 Milan, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Heiko","family":"Muller","sequence":"additional","affiliation":[{"name":"1 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133, Milan and 2Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), 20139 Milan, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Stefano","family":"Ceri","sequence":"additional","affiliation":[{"name":"1 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133, Milan and 2Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), 20139 Milan, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2015,2,3]]},"reference":[{"key":"2023020115111492200_btv048-B1","doi-asserted-by":"crossref","first-page":"1061","DOI":"10.1038\/nature09534","article-title":"A map of human genome 
variation from population-scale sequencing","volume":"467","author":"1000 Genomes Project Consortium et\u00a0al.","year":"2010","journal-title":"Nature"},{"key":"2023020115111492200_btv048-B2","doi-asserted-by":"crossref","first-page":"1113","DOI":"10.1038\/ng.2764","article-title":"The Cancer Genome Atlas Pan-Cancer analysis project","volume":"45","author":"Cancer Genome Atlas Research Network et\u00a0al","year":"2013","journal-title":"Nat. Genet."},{"key":"2023020115111492200_btv048-B3","doi-asserted-by":"crossref","first-page":"72","DOI":"10.1145\/1629175.1629198","article-title":"MapReduce: a flexible data processing tool","volume":"53","author":"Dean","year":"2010","journal-title":"Commun. ACM"},{"key":"2023020115111492200_btv048-B4","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/nature11247","article-title":"An integrated encyclopedia of DNA elements in the human genome","volume":"489","author":"ENCODE Project Consortium","year":"2012","journal-title":"Nature"},{"key":"2023020115111492200_btv048-B5","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1145\/2063176.2063195","article-title":"Creating languages in Racket","volume":"55","author":"Flatt","year":"2012","journal-title":"Commun. 
ACM"},{"key":"2023020115111492200_btv048-B6","doi-asserted-by":"crossref","first-page":"1451","DOI":"10.1101\/gr.4086505","article-title":"Galaxy: a platform for interactive large-scale genome analysis","volume":"15","author":"Giardine","year":"2005","journal-title":"Genome Res."},{"key":"2023020115111492200_btv048-B7","doi-asserted-by":"crossref","first-page":"R86","DOI":"10.1186\/gb-2010-11-8-r86","article-title":"Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences","volume":"11","author":"Goecks","year":"2010","journal-title":"Genome Biol."},{"key":"2023020115111492200_btv048-B8","doi-asserted-by":"crossref","first-page":"294","DOI":"10.1038\/507294a","article-title":"Technology: The $1,000 genome","volume":"507","author":"Hayden","year":"2014","journal-title":"Nature"},{"key":"2023020115111492200_btv048-B9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/bioinformatics\/btt250","article-title":"Using Genome Query Language to uncover genetic variation","volume":"30","author":"Kozanitis","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020115111492200_btv048-B10","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The Sequence Alignment\/Map format and SAMtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023020115111492200_btv048-B11","doi-asserted-by":"crossref","first-page":"1297","DOI":"10.1101\/gr.107524.110","article-title":"The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data","volume":"20","author":"McKenna","year":"2010","journal-title":"Genome Res."},{"key":"2023020115111492200_btv048-B12","doi-asserted-by":"crossref","first-page":"1919","DOI":"10.1093\/bioinformatics\/bts277","article-title":"BEDOPS: high-performance genomic feature 
operations","volume":"28","author":"Neph","year":"2012","journal-title":"Bioinformatics"},{"key":"2023020115111492200_btv048-B13","doi-asserted-by":"crossref","first-page":"3014","DOI":"10.1093\/bioinformatics\/btt528","article-title":"BioPig: a Hadoop-based analytic toolkit for large-scale sequence data","volume":"29","author":"Nordberg","year":"2013","journal-title":"Bioinformatics"},{"key":"2023020115111492200_btv048-B14","doi-asserted-by":"crossref","first-page":"774","DOI":"10.1016\/j.jbi.2013.07.001","article-title":"\u2018Big data\u2019, Hadoop and cloud computing in genomics","volume":"46","author":"O'Driscoll","year":"2013","journal-title":"J. Biomed. Inf."},{"key":"2023020115111492200_btv048-B15","doi-asserted-by":"crossref","first-page":"1099","DOI":"10.1145\/1376616.1376726","article-title":"Pig Latin: a not-so-foreign language for data processing","volume-title":"Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data","author":"Olston","year":"2008"},{"key":"2023020115111492200_btv048-B16","first-page":"200","article-title":"Genomic region operation kit for extensible processing of deep sequencing data","volume":"10","author":"Ovaska","year":"2013","journal-title":"IEEE\/ACM Trans. Comput. Biol. 
Bioinform."},{"key":"2023020115111492200_btv048-B17","doi-asserted-by":"crossref","first-page":"841","DOI":"10.1093\/bioinformatics\/btq033","article-title":"BEDTools: a flexible suite of utilities for comparing genomic features","volume":"26","author":"Quinlan","year":"2010","journal-title":"Bioinformatics"},{"key":"2023020115111492200_btv048-B18","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1093\/bioinformatics\/btt601","article-title":"SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop","volume":"30","author":"Schumacher","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020115111492200_btv048-B19","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1038\/nbt0214-115a","article-title":"Illumina claims $1,000 genome win","volume":"32","author":"Sheridan","year":"2014","journal-title":"Nat. Biotechnol."},{"key":"2023020115111492200_btv048-B20","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/MSST.2010.5496972","article-title":"The Hadoop distributed file system","volume-title":"Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)","author":"Shvachko","year":"2010"},{"key":"2023020115111492200_btv048-B21","doi-asserted-by":"crossref","first-page":"S1","DOI":"10.1186\/1471-2105-11-S12-S1","article-title":"An overview of the Hadoop\/MapReduce\/HBase framework and its current applications in bioinformatics","volume":"11","author":"Taylor","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023020115111492200_btv048-B22","doi-asserted-by":"crossref","first-page":"2652","DOI":"10.1093\/bioinformatics\/btu343","article-title":"SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision","volume":"30","author":"Wiewiorka","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020115111492200_btv048-B23","first-page":"15","article-title":"Resilient distributed datasets: a fault-tolerant 
abstraction for in-memory cluster computing","volume-title":"Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation","author":"Zaharia","year":"2012"},{"key":"2023020115111492200_btv048-B24","doi-asserted-by":"crossref","first-page":"637","DOI":"10.1093\/bib\/bbs088","article-title":"Survey of MapReduce frame operation in bioinformatics","volume":"15","author":"Zou","year":"2014","journal-title":"Brief. Bioinform."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/31\/12\/1881\/49013828\/bioinformatics_31_12_1881.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/31\/12\/1881\/49013828\/bioinformatics_31_12_1881.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,2]],"date-time":"2023-02-02T00:02:50Z","timestamp":1675296170000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/31\/12\/1881\/213797"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,2,3]]},"references-count":24,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2015,6,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btv048","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2015,6,15]]},"published":{"date-parts":[[2015,2,3]]}}}