{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:17Z","timestamp":1772138057839,"version":"3.50.1"},"reference-count":42,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2022,12,14]],"date-time":"2022-12-14T00:00:00Z","timestamp":1670976000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Research University"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay since the straightforward pileup implementations involve processing of all base calls from all alignments sequentially. On the other hand, a distributed version of the algorithm faces the intrinsic challenge of splitting reads-oriented file formats into self-contained partitions to avoid costly data exchange between computational nodes.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>Here, we present a scalable, distributed and efficient implementation of a pileup algorithm that is suitable for deploying in cloud computing environments. In particular, we implemented: (i) our custom data-partitioning algorithm optimized to work with the alignment reads, (ii) a novel and unique approach to process alignment events from sequencing reads using the MD tags, (iii) the source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We have proven that our novel approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5\u00d7 faster) and memory usage (up to 2\u00d7 less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, like Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range join and coverage calculations, our package provides end-users with a unified SQL interface for convenient analyses of population-scale genomic data in an interactive way.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>https:\/\/biodatageeks.github.io\/sequila\/<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac804","type":"journal-article","created":{"date-parts":[[2022,12,14]],"date-time":"2022-12-14T08:31:22Z","timestamp":1671006682000},"source":"Crossref","is-referenced-by-count":1,"title":["Cloud-native distributed genomic pileup operations"],"prefix":"10.1093","volume":"39","author":[{"given":"Marek","family":"Wiewi\u00f3rka","sequence":"first","affiliation":[{"name":"Institute of Computer Science, Warsaw University of Technology, Warsaw , Warsaw 00-661, Poland"}]},{"given":"Agnieszka","family":"Szmur\u0142o","sequence":"additional","affiliation":[{"name":"Institute of Computer Science, Warsaw University of Technology, Warsaw , Warsaw 00-661, Poland"}]},{"given":"Pawe\u0142","family":"Stankiewicz","sequence":"additional","affiliation":[{"name":"Department of Molecular and Human Genetics, Baylor College of Medicine , Houston, TX 77030, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0941-4571","authenticated-orcid":false,"given":"Tomasz","family":"Gambin","sequence":"additional","affiliation":[{"name":"Institute of Computer Science, Warsaw University of Technology, Warsaw , Warsaw 00-661, Poland"}]}],"member":"286","published-online":{"date-parts":[[2022,12,14]]},"reference":[{"key":"2023011906414359200_btac804-B1","doi-asserted-by":"crossref","DOI":"10.1093\/gigascience\/giab057","article-title":"VC@scale: scalable and high-performance variant calling on cluster environments","volume":"10","author":"Ahmad","year":"2021","journal-title":"GigaScience"},{"key":"2023011906414359200_btac804-B2","first-page":"1383","author":"Armbrust","year":"2015"},{"key":"2023011906414359200_btac804-B3","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1145\/2723872.2723882","article-title":"An introduction to Docker for reproducible research","volume":"49","author":"Boettiger","year":"2015","journal-title":"SIGOPS Oper. Syst. Rev"},{"key":"2023011906414359200_btac804-B4","doi-asserted-by":"crossref","first-page":"337","DOI":"10.1093\/bioinformatics\/bty608","article-title":"Crumble: reference free lossy compression of sequence quality values","volume":"35","author":"Bonfield","year":"2019","journal-title":"Bioinformatics"},{"key":"2023011906414359200_btac804-B5","doi-asserted-by":"crossref","DOI":"10.1093\/gigascience\/giaa042","article-title":"MaRe: processing big data with application containers on apache spark","volume":"9","author":"Capuccini","year":"2020","journal-title":"GigaScience"},{"key":"2023011906414359200_btac804-B6","doi-asserted-by":"crossref","first-page":"07020","DOI":"10.1051\/epjconf\/201921407020","article-title":"Apache spark usage and deployment models for scientific computing","volume":"214","author":"Castro","year":"2019","journal-title":"EPJ Web Conf"},{"key":"2023011906414359200_btac804-B7","doi-asserted-by":"crossref","DOI":"10.1093\/gigascience\/giab008","article-title":"Twelve years of SAMtools and BCFtools","volume":"10","author":"Danecek","year":"2021","journal-title":"GigaScience"},{"key":"2023011906414359200_btac804-B8","first-page":"580","author":"Guerriero","year":"2019"},{"key":"2023011906414359200_btac804-B9","article-title":"Bioinformatics applications on apache spark","volume":"7","author":"Guo","year":"2018","journal-title":"GigaScience"},{"key":"2023011906414359200_btac804-B10","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1007\/978-1-4842-4517-0_8","volume-title":"Pro Oracle SQL Development","author":"Heller","year":"2019"},{"key":"2023011906414359200_btac804-B11","doi-asserted-by":"crossref","DOI":"10.1002\/cpe.5523","article-title":"The impact of columnar file formats on SQL-on-hadoop engine performance: a study on ORC and parquet","volume":"32","author":"Ivanov","year":"2020","journal-title":"Concurr. Comput. Pract. Exper"},{"key":"2023011906414359200_btac804-B12","doi-asserted-by":"crossref","first-page":"568","DOI":"10.1101\/gr.129684.111","article-title":"VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing","volume":"22","author":"Koboldt","year":"2012","journal-title":"Genome Res"},{"key":"2023011906414359200_btac804-B13","doi-asserted-by":"crossref","first-page":"11779322211035921","DOI":"10.1177\/11779322211035921","article-title":"Cloud computing enabled big multi-omics data analytics","volume":"15","author":"Koppad","year":"2021","journal-title":"Bioinform. Biol. Insights"},{"key":"2023011906414359200_btac804-B14","doi-asserted-by":"crossref","first-page":"1425","DOI":"10.1093\/jamia\/ocaa068","article-title":"Scalability and cost-effectiveness analysis of whole genome-wide association studies on Google Cloud platform and Amazon Web Services","volume":"27","author":"Krissaane","year":"2020","journal-title":"J. Am. Med. Inform. Assoc"},{"key":"2023011906414359200_btac804-B15","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/gigascience\/giaa063","article-title":"The democratization of bioinformatics: a software engineering perspective","volume":"9","author":"Lawlor","year":"2020","journal-title":"GigaScience"},{"key":"2023011906414359200_btac804-B16","doi-asserted-by":"crossref","first-page":"2987","DOI":"10.1093\/bioinformatics\/btr509","article-title":"A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data","volume":"27","author":"Li","year":"2011","journal-title":"Bioinformatics"},{"key":"2023011906414359200_btac804-B17","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The sequence alignment\/map format and SAMtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023011906414359200_btac804-B18","doi-asserted-by":"crossref","first-page":"731424","DOI":"10.3389\/fcell.2021.731424","article-title":"Psi-Caller: a lightweight short read-based variant caller with high speed and accuracy","volume":"9","author":"Liu","year":"2021","journal-title":"Front. Cell Dev. Biol"},{"key":"2023011906414359200_btac804-B19","doi-asserted-by":"crossref","first-page":"220","DOI":"10.1038\/s42256-020-0167-4","article-title":"Exploring the limit of using a deep neural network on pileup data for germline variant calling","volume":"2","author":"Luo","year":"2020","journal-title":"Nat. Mach. Intell"},{"key":"2023011906414359200_btac804-B20","author":"Massie","year":"2013"},{"key":"2023011906414359200_btac804-B21","doi-asserted-by":"crossref","first-page":"1297","DOI":"10.1101\/gr.107524.110","article-title":"The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data","volume":"20","author":"McKenna","year":"2010","journal-title":"Genome Res"},{"key":"2023011906414359200_btac804-B22","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1007\/978-1-4842-7328-9_4","volume-title":"Deep-Dive Terraform on Azure","author":"Modi","year":"2021"},{"key":"2023011906414359200_btac804-B23","doi-asserted-by":"crossref","first-page":"876","DOI":"10.1093\/bioinformatics\/bts054","article-title":"Hadoop-BAM: directly manipulating next generation sequencing data in the cloud","volume":"28","author":"Niemenmaa","year":"2012","journal-title":"Bioinformatics"},{"key":"2023011906414359200_btac804-B24","first-page":"119","author":"Nisbet","year":"2019"},{"key":"2023011906414359200_btac804-B25","doi-asserted-by":"crossref","first-page":"867","DOI":"10.1093\/bioinformatics\/btx699","article-title":"Mosdepth: quick coverage calculation for genomes and exomes","volume":"34","author":"Pedersen","year":"2018","journal-title":"Bioinformatics"},{"key":"2023011906414359200_btac804-B26","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1186\/s12920-015-0084-2","article-title":"ASEQ: fast allele-specific studies from next-generation sequencing data","volume":"8","author":"Romanel","year":"2015","journal-title":"BMC Med. Genomics"},{"key":"2023011906414359200_btac804-B27","doi-asserted-by":"crossref","first-page":"2270","DOI":"10.1016\/j.csbj.2020.08.011","article-title":"UMI-gen: A UMI-based read simulator for variant calling evaluation in paired-end sequencing NGS libraries","volume":"18","author":"Sater","year":"2020","journal-title":"Comput. Struct. Biotechnol. J."},{"key":"2023011906414359200_btac804-B28","first-page":"1802","author":"Sethi","year":"2019"},{"key":"2023011906414359200_btac804-B29","first-page":"0184","author":"Shah","year":"2019"},{"key":"2023011906414359200_btac804-B30","first-page":"1","author":"Shen","year":"2021"},{"key":"2023011906414359200_btac804-B31","first-page":"1746","author":"Sipek","year":"2020"},{"key":"2023011906414359200_btac804-B32","doi-asserted-by":"crossref","DOI":"10.1093\/gigascience\/giab058","article-title":"Scalable analysis of multi-modal biomedical data","volume":"10","author":"Smith","year":"2021","journal-title":"GigaScience"},{"key":"2023011906414359200_btac804-B33","doi-asserted-by":"crossref","DOI":"10.1093\/gigascience\/giy052","article-title":"Optimized distributed systems achieve significant performance improvement on sorted merging of massive VCF files","volume":"7","author":"Sun","year":"2018","journal-title":"GigaScience"},{"key":"2023011906414359200_btac804-B34","doi-asserted-by":"crossref","first-page":"2032","DOI":"10.1093\/bioinformatics\/btv098","article-title":"Sambamba: fast processing of NGS alignment formats","volume":"31","author":"Tarasov","year":"2015","journal-title":"Bioinformatics"},{"key":"2023011906414359200_btac804-B35","first-page":"311","author":"Vaillancourt","year":"2020"},{"key":"2023011906414359200_btac804-B36","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12864-019-6386-6","article-title":"PaCBAM: fast and scalable processing of whole exome and targeted sequencing data","volume":"20","author":"Valentini","year":"2019","journal-title":"BMC Genomics"},{"key":"2023011906414359200_btac804-B37","doi-asserted-by":"crossref","first-page":"2156","DOI":"10.1093\/bioinformatics\/bty940","article-title":"SeQuiLa: an elastic, fast and scalable SQL-oriented solution for processing and querying genomic intervals","volume":"35","author":"Wiewi\u00f3rka","year":"2018","journal-title":"Bioinformatics"},{"key":"2023011906414359200_btac804-B38","doi-asserted-by":"crossref","DOI":"10.1093\/gigascience\/giz094","article-title":"SeQuiLa-cov: a fast and scalable library for depth of coverage calculations","volume":"8","author":"Wiewi\u00f3rka","year":"2019","journal-title":"GigaScience"},{"key":"2023011906414359200_btac804-B39","doi-asserted-by":"crossref","DOI":"10.1093\/database\/bax049","article-title":"Benchmarking distributed data warehouse solutions for storing genomic variant information","volume":"2017","author":"Wiewi\u00f3rka","year":"2017","journal-title":"Database"},{"key":"2023011906414359200_btac804-B40","doi-asserted-by":"crossref","first-page":"3014","DOI":"10.1093\/bioinformatics\/btab152","article-title":"Megadepth: efficient coverage quantification for BigWigs and BAMs","volume":"37","author":"Wilks","year":"2021","journal-title":"Bioinformatics"},{"key":"2023011906414359200_btac804-B41","first-page":"355","volume-title":"Bioinformatics Application with Kubeflow for Batch Processing in Clouds","author":"Yuan","year":"2020"},{"key":"2023011906414359200_btac804-B42","first-page":"10","author":"Zaharia","year":"2010"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/39\/1\/btac804\/48759261\/btac804.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/1\/btac804\/48763654\/btac804.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/1\/btac804\/48763654\/btac804.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,19]],"date-time":"2023-01-19T01:43:36Z","timestamp":1674092616000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btac804\/6900922"}},"subtitle":[],"editor":[{"given":"Peter","family":"Robinson","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,12,14]]},"references-count":42,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac804","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2022.08.27.475646","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023,1,1]]},"published":{"date-parts":[[2022,12,14]]},"article-number":"btac804"}}