{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,4,8]],"date-time":"2025-04-08T05:23:29Z","timestamp":1744089809780,"version":"3.37.3"},"reference-count":28,"publisher":"Oxford University Press (OUP)","issue":"10","license":[{"start":{"date-parts":[[2017,11,23]],"date-time":"2017-11-23T00:00:00Z","timestamp":1511395200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/about_us\/legal\/notices"}],"funder":[{"DOI":"10.13039\/100000002","name":"NIH","doi-asserted-by":"publisher","award":["1 U01 CA198943-01"],"award-info":[{"award-number":["1 U01 CA198943-01"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,5,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM\/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. 
By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We analyze the performance of several lossy compressors for quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>CALQ is written in C++ and can be downloaded from https:\/\/github.com\/voges\/calq.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btx737","type":"journal-article","created":{"date-parts":[[2017,11,22]],"date-time":"2017-11-22T20:13:53Z","timestamp":1511381633000},"page":"1650-1658","source":"Crossref","is-referenced-by-count":18,"title":["CALQ: compression of quality values of aligned sequencing 
data"],"prefix":"10.1093","volume":"34","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6080-660X","authenticated-orcid":false,"given":"Jan","family":"Voges","sequence":"first","affiliation":[{"name":"Fakult\u00e4t f\u00fcr Elektrotechnik und Informatik, Institut f\u00fcr Informationsverarbeitung (TNT), Leibniz Universit\u00e4t Hannover, Hannover, Germany"}]},{"given":"J\u00f6rn","family":"Ostermann","sequence":"additional","affiliation":[{"name":"Fakult\u00e4t f\u00fcr Elektrotechnik und Informatik, Institut f\u00fcr Informationsverarbeitung (TNT), Leibniz Universit\u00e4t Hannover, Hannover, Germany"}]},{"given":"Mikel","family":"Hernaez","sequence":"additional","affiliation":[{"name":"Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, IL, USA"}]}],"member":"286","published-online":{"date-parts":[[2017,11,23]]},"reference":[{"year":"2016","author":"Alberti","key":"2023012713460817100_btx737-B1"},{"key":"2023012713460817100_btx737-B2","doi-asserted-by":"crossref","first-page":"2818","DOI":"10.1093\/bioinformatics\/btu390","article-title":"The Scramble conversion tool","volume":"30","author":"Bonfield","year":"2014","journal-title":"Bioinformatics"},{"key":"2023012713460817100_btx737-B3","doi-asserted-by":"crossref","first-page":"2130","DOI":"10.1093\/bioinformatics\/btu183","article-title":"Lossy compression of quality scores in genomic data","volume":"30","author":"C\u00e1novas","year":"2014","journal-title":"Bioinformatics"},{"key":"2023012713460817100_btx737-B4","doi-asserted-by":"crossref","first-page":"3709","DOI":"10.1093\/bioinformatics\/btw543","article-title":"CSAM: Compressed SAM format","volume":"32","author":"C\u00e1novas","year":"2016","journal-title":"Bioinformatics"},{"key":"2023012713460817100_btx737-B5","doi-asserted-by":"crossref","first-page":"1767","DOI":"10.1093\/nar\/gkp1137","article-title":"The Sanger FASTQ file format for sequences with quality scores, and the Solexa\/Illumina FASTQ 
variants","volume":"38","author":"Cock","year":"2010","journal-title":"Nucleic Acids Res"},{"key":"2023012713460817100_btx737-B6","doi-asserted-by":"crossref","first-page":"860","DOI":"10.1093\/bioinformatics\/btr014","article-title":"Compression of DNA sequence reads in FASTQ format","volume":"27","author":"Deorowicz","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012713460817100_btx737-B7","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1038\/ng.806","article-title":"A framework for variation discovery and genotyping using next-generation DNA sequencing data","volume":"43","author":"DePristo","year":"2011","journal-title":"Nat. Genet"},{"key":"2023012713460817100_btx737-B8","doi-asserted-by":"crossref","first-page":"186","DOI":"10.1101\/gr.8.3.186","article-title":"Base-calling of automated sequencer traces using phred. II. Error probabilities","volume":"8","author":"Ewing","year":"1998","journal-title":"Genome Res"},{"key":"2023012713460817100_btx737-B9","doi-asserted-by":"crossref","first-page":"1082","DOI":"10.1038\/nmeth.3133","article-title":"DeeZ: reference-based compression by local assembly","volume":"11","author":"Hach","year":"2014","journal-title":"Nat. Methods"},{"first-page":"261","year":"2016","author":"Hernaez","key":"2023012713460817100_btx737-B10"},{"key":"2023012713460817100_btx737-B11","doi-asserted-by":"crossref","first-page":"734","DOI":"10.1101\/gr.114819.110","article-title":"Efficient storage of high throughput DNA sequencing data using reference-based compression","volume":"21","author":"Hsi-Yang Fritz","year":"2011","journal-title":"Genome Res"},{"key":"2023012713460817100_btx737-B12","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/nmeth.1923","article-title":"Fast gapped-read alignment with Bowtie 2","volume":"9","author":"Langmead","year":"2012","journal-title":"Nat. 
Methods"},{"key":"2023012713460817100_btx737-B13","doi-asserted-by":"crossref","first-page":"R25","DOI":"10.1186\/gb-2009-10-3-r25","article-title":"Ultrafast and memory-efficient alignment of short DNA sequences to the human genome","volume":"10","author":"Langmead","year":"2009","journal-title":"Genome Biol"},{"key":"2023012713460817100_btx737-B14","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The Sequence Alignment\/Map format and SAMtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012713460817100_btx737-B15","doi-asserted-by":"crossref","first-page":"3122","DOI":"10.1093\/bioinformatics\/btv330","article-title":"QVZ: lossy compression of quality values","volume":"31","author":"Malysa","year":"2015","journal-title":"Bioinformatics"},{"key":"2023012713460817100_btx737-B16","doi-asserted-by":"crossref","first-page":"1185","DOI":"10.1038\/nmeth.2221","article-title":"The GEM mapper: fast, accurate and versatile alignment by filtration","volume":"9","author":"Marco-Sola","year":"2012","journal-title":"Nat. Methods"},{"key":"2023012713460817100_btx737-B17","doi-asserted-by":"crossref","first-page":"198","DOI":"10.1038\/nature09796","article-title":"A decade\u2019s perspective on DNA sequencing technology","volume":"470","author":"Mardis","year":"2011","journal-title":"Nature"},{"key":"2023012713460817100_btx737-B18","doi-asserted-by":"crossref","first-page":"1297","DOI":"10.1101\/gr.107524.110","article-title":"The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data","volume":"20","author":"McKenna","year":"2010","journal-title":"Genome Res"},{"key":"2023012713460817100_btx737-B19","doi-asserted-by":"crossref","first-page":"1005","DOI":"10.1038\/nmeth.4037","article-title":"Comparison of high-throughput sequencing data compression tools","volume":"13","author":"Numanagi\u0107","year":"2016","journal-title":"Nat. 
Methods"},{"key":"2023012713460817100_btx737-B20","doi-asserted-by":"crossref","first-page":"1442002","DOI":"10.1142\/S0219720014420025","article-title":"Aligned genomic data compression via improved modeling","volume":"12","author":"Ochoa","year":"2014","journal-title":"J. Bioinf. Comput. Biol"},{"key":"2023012713460817100_btx737-B21","first-page":"183","article-title":"Effect of lossy compression of quality scores on variant calling","volume":"18","author":"Ochoa","year":"2016","journal-title":"Brief. Bioinf"},{"key":"2023012713460817100_btx737-B22","doi-asserted-by":"crossref","first-page":"912","DOI":"10.1038\/ng.3036","article-title":"Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications","volume":"46","author":"Rimmer","year":"2014","journal-title":"Nat. Genet"},{"key":"2023012713460817100_btx737-B23","doi-asserted-by":"crossref","first-page":"e114","DOI":"10.1093\/nar\/gkw318","article-title":"CARGO: effective format-free compressed storage of genomic information","volume":"44","author":"Roguski","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023012713460817100_btx737-B24","doi-asserted-by":"crossref","first-page":"e1002195","DOI":"10.1371\/journal.pbio.1002195","article-title":"Big data: astronomical or genomical?","volume":"13","author":"Stephens","year":"2015","journal-title":"PLOS Biol"},{"year":"2016","author":"Voges","key":"2023012713460817100_btx737-B25"},{"key":"2023012713460817100_btx737-B26","doi-asserted-by":"crossref","first-page":"520","DOI":"10.1145\/214762.214771","article-title":"Arithmetic coding for data compression","volume":"30","author":"Witten","year":"1987","journal-title":"Commun. ACM"},{"key":"2023012713460817100_btx737-B27","doi-asserted-by":"crossref","first-page":"240","DOI":"10.1038\/nbt.3170","article-title":"Quality score compression improves genotyping accuracy","volume":"33","author":"Yu","year":"2015","journal-title":"Nat. 
Biotechnol"},{"key":"2023012713460817100_btx737-B28","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1038\/nbt.2835","article-title":"Integrating human sequence data sets provides a resource of benchmark snp and indel genotype calls","volume":"32","author":"Zook","year":"2014","journal-title":"Nat. Biotechnol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/10\/1650\/48935993\/bioinformatics_34_10_1650.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/10\/1650\/48935993\/bioinformatics_34_10_1650.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T14:25:09Z","timestamp":1674829509000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/34\/10\/1650\/4653693"}},"subtitle":[],"editor":[{"given":"Bonnie","family":"Berger","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2017,11,23]]},"references-count":28,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2018,5,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btx737","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2018,5,15]]},"published":{"date-parts":[[2017,11,23]]}}}