{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T00:09:44Z","timestamp":1773274184228,"version":"3.50.1"},"reference-count":29,"publisher":"Oxford University Press (OUP)","issue":"23","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,12,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: The high throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated world wide. Currently, most HTS data are compressed through general purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a \u2018boosting\u2019 scheme based on Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.<\/jats:p>\n               <jats:p>Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19\u2014when the goal is to compress the reads alone. In fact, on SCALCE reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly even the running time of SCALCE + gzip improves that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names, in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip through the reordering SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to 3.34 improvement in the compression rate and 1.26 improvement in running time.<\/jats:p>\n               <jats:p>Availability: Our algorithm, SCALCE (Sequence Compression Algorithm using Locally Consistent Encoding), is implemented in C++ with both gzip and bzip2 compression options. It also supports multithreading when gzip option is selected, and the pigz binary is available. It is available at http:\/\/scalce.sourceforge.net.<\/jats:p>\n               <jats:p>Contact: fhach@cs.sfu.ca or cenk@cs.sfu.ca<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/bts593","type":"journal-article","created":{"date-parts":[[2012,10,10]],"date-time":"2012-10-10T08:45:27Z","timestamp":1349858727000},"page":"3051-3057","source":"Crossref","is-referenced-by-count":130,"title":["SCALCE: boosting sequence compression algorithms using locally consistent encoding"],"prefix":"10.1093","volume":"28","author":[{"given":"Faraz","family":"Hach","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ibrahim","family":"Numanagi\u0107","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Can","family":"Alkan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"S Cenk","family":"Sahinalp","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2012,10,9]]},"reference":[{"key":"2023062411492688600_bts593-B1","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1145\/360825.360855","article-title":"Efficient string matching: an aid to bibliographic search","volume":"18","author":"Aho","year":"1975","journal-title":"Commun. ACM"},{"key":"2023062411492688600_bts593-B2","doi-asserted-by":"crossref","first-page":"363","DOI":"10.1038\/nrg2958","article-title":"Genome structural variation discovery and genotyping","volume":"12","author":"Alkan","year":"2011","journal-title":"Nat. Rev. Genet."},{"key":"2023062411492688600_bts593-B3","doi-asserted-by":"crossref","first-page":"792","DOI":"10.1145\/1109557.1109644","article-title":"Oblivious string embeddings and edit distance approximations","volume-title":"SODA","author":"Batu","year":"2006"},{"key":"2023062411492688600_bts593-B4","first-page":"147","article-title":"No-reference compression of genomic data stored in fastq format","volume-title":"BIBM","author":"Bhola","year":"2011"},{"key":"2023062411492688600_bts593-B5","article-title":"A block-sorting lossless data compression algorithm","author":"Burrows","year":"1994","journal-title":"Technical report 124."},{"key":"2023062411492688600_bts593-B6","first-page":"197","article-title":"Communication complexity of document exchange","volume-title":"SODA","author":"Cormode","year":"2000"},{"key":"2023062411492688600_bts593-B7","doi-asserted-by":"crossref","first-page":"1415","DOI":"10.1093\/bioinformatics\/bts173","article-title":"Large-scale compression of genomic sequence databases with the burrows-wheeler transform","volume":"28","author":"Cox","year":"2012","journal-title":"Bioinformatics"},{"key":"2023062411492688600_bts593-B8","doi-asserted-by":"crossref","first-page":"860","DOI":"10.1093\/bioinformatics\/btr014","article-title":"Compression of DNA sequence reads in fastq format","volume":"27","author":"Deorowicz","year":"2011","journal-title":"Bioinformatics"},{"key":"2023062411492688600_bts593-B9","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1038\/ng.806","article-title":"A framework for variation discovery and genotyping using next-generation DNA sequencing data","volume":"43","author":"DePristo","year":"2011","journal-title":"Nat. Genet."},{"key":"2023062411492688600_bts593-B10","doi-asserted-by":"crossref","first-page":"186","DOI":"10.1101\/gr.8.3.186","article-title":"Base-calling of automated sequencer traces using phred. II. Error probabilities","volume":"8","author":"Ewing","year":"1998","journal-title":"Genome Res."},{"key":"2023062411492688600_bts593-B11","first-page":"655","article-title":"Compression boosting in optimal linear time using the burrows-wheeler transform","volume-title":"SODA","author":"Ferragina","year":"2004"},{"key":"2023062411492688600_bts593-B12","doi-asserted-by":"crossref","first-page":"688","DOI":"10.1145\/1082036.1082043","article-title":"Boosting textual compression in optimal linear time","volume":"52","author":"Ferragina","year":"2005","journal-title":"J. ACM"},{"key":"2023062411492688600_bts593-B13","first-page":"756","article-title":"The engineering of a compression boosting library: theory vs practice in bwt compression","volume-title":"ESA","author":"Ferragina","year":"2006"},{"key":"2023062411492688600_bts593-B14","doi-asserted-by":"crossref","first-page":"659","DOI":"10.1093\/jhered\/esp086","article-title":"Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species","volume":"100","author":"Haussler","year":"2009","journal-title":"J. Hered."},{"key":"2023062411492688600_bts593-B15","doi-asserted-by":"crossref","first-page":"734","DOI":"10.1101\/gr.114819.110","article-title":"Efficient storage of high throughput DNA sequencing data using reference-based compression","volume":"21","author":"Hsi-Yang Fritz","year":"2011","journal-title":"Genome Res."},{"key":"2023062411492688600_bts593-B16","first-page":"1098","article-title":"A Method for the Construction of Minimum-Redundancy Codes","volume-title":"Proceedings of the IRE","author":"Huffman","year":"1952"},{"key":"2023062411492688600_bts593-B17","doi-asserted-by":"crossref","first-page":"D54","DOI":"10.1093\/nar\/gkr854","article-title":"The sequence read archive: explosive growth of sequencing data","volume":"40","author":"Kodama","year":"2011","journal-title":"Nucleic Acids Res"},{"key":"2023062411492688600_bts593-B18","first-page":"310","article-title":"Compressing genomic sequence fragments using SlimGene","volume-title":"RECOMB","author":"Kozanitis","year":"2010"},{"key":"2023062411492688600_bts593-B19","doi-asserted-by":"crossref","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with Burrows-Wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023062411492688600_bts593-B20","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The sequence alignment\/map format and SAMtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023062411492688600_bts593-B21","first-page":"300","article-title":"Symmetry breaking for suffix tree construction","volume-title":"STOC","author":"Sahinalp","year":"1994"},{"key":"2023062411492688600_bts593-B22","first-page":"320","article-title":"Efficient approximate and dynamic matching of patterns using a labeling paradigm","volume-title":"FOCS","author":"Sahinalp","year":"1996"},{"key":"2023062411492688600_bts593-B23","doi-asserted-by":"crossref","first-page":"647","DOI":"10.1038\/nrg2857","article-title":"Computational solutions to large-scale data management and analysis","volume":"11","author":"Schadt","year":"2010","journal-title":"Nat. Rev. Genet."},{"key":"2023062411492688600_bts593-B24","doi-asserted-by":"crossref","first-page":"308","DOI":"10.1093\/nar\/29.1.308","article-title":"dbSNP: the NCBI database of genetic variation","volume":"29","author":"Sherry","year":"2001","journal-title":"Nucleic Acids Res."},{"key":"2023062411492688600_bts593-B25","doi-asserted-by":"crossref","first-page":"2192","DOI":"10.1093\/bioinformatics\/btq346","article-title":"G-sqz: compact encoding of genomic sequence and quality data","volume":"26","author":"Tembe","year":"2010","journal-title":"Bioinformatics"},{"key":"2023062411492688600_bts593-B26","doi-asserted-by":"crossref","first-page":"628","DOI":"10.1093\/bioinformatics\/btr689","article-title":"Transformations for the compression of fastq quality scores of next-generation sequencing data","volume":"28","author":"Wan","year":"2012","journal-title":"Bioinformatics"},{"key":"2023062411492688600_bts593-B27","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1186\/1748-7188-6-23","article-title":"ReCoil\u2014an Algorithm for compression of extremely large datasets of DNA data","volume":"6","author":"Yanovsky","year":"2011","journal-title":"Algorithms Mol. Biol."},{"key":"2023062411492688600_bts593-B28","first-page":"337","article-title":"A universal algorithm for sequential data compression","volume":"23","author":"Ziv","year":"1977","journal-title":"IEEE Trans Image Process"},{"key":"2023062411492688600_bts593-B29","doi-asserted-by":"crossref","first-page":"530","DOI":"10.1109\/TIT.1978.1055934","article-title":"Compression of individual sequences via variable-rate coding","volume":"24","author":"Ziv","year":"1978","journal-title":"IEEE Trans Inf Theory"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/23\/3051\/50695292\/bioinformatics_28_23_3051.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/23\/3051\/50695292\/bioinformatics_28_23_3051.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,25]],"date-time":"2023-06-25T00:25:59Z","timestamp":1687652759000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/28\/23\/3051\/195414"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,10,9]]},"references-count":29,"journal-issue":{"issue":"23","published-print":{"date-parts":[[2012,12,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bts593","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2012,12]]},"published":{"date-parts":[[2012,10,9]]}}}