{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,31]],"date-time":"2026-01-31T09:54:38Z","timestamp":1769853278546,"version":"3.49.0"},"reference-count":29,"publisher":"Oxford University Press (OUP)","issue":"9","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2015,5,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk based, where the better of these two, from Cox et\u00a0al. (2012), is based on the Burrows\u2013Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0\u2009Gbp human genome sequencing collection with almost 45-fold coverage.<\/jats:p>\n               <jats:p>Results: We propose overlapping reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only). 
Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, achieving a compression ratio of 0.317 bits per base, which lets the 134.0\u2009Gbp dataset fit into only 5.31\u2009GB of space.<\/jats:p>\n               <jats:p>Availability and implementation: \u00a0http:\/\/sun.aei.polsl.pl\/orcom under a free license.<\/jats:p>\n               <jats:p>Contact: \u00a0sebastian.deorowicz@polsl.pl<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btu844","type":"journal-article","created":{"date-parts":[[2014,12,24]],"date-time":"2014-12-24T01:14:33Z","timestamp":1419383673000},"page":"1389-1395","source":"Crossref","is-referenced-by-count":58,"title":["Disk-based compression of data from genome sequencing"],"prefix":"10.1093","volume":"31","author":[{"given":"Szymon","family":"Grabowski","sequence":"first","affiliation":[{"name":"1 Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 \u0141\u00f3d\u017a, 2Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, 3Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland and 4Centro Nacional de An\u00e1lisis Gen\u00f3mico (CNAG), 08-028 Barcelona, Spain"}]},{"given":"Sebastian","family":"Deorowicz","sequence":"additional","affiliation":[{"name":"1 Institute of Applied Computer Science, Lodz University of Technology, Al. 
Politechniki 11, 90-924 \u0141\u00f3d\u017a, 2Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, 3Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland and 4Centro Nacional de An\u00e1lisis Gen\u00f3mico (CNAG), 08-028 Barcelona, Spain"}]},{"given":"\u0141ukasz","family":"Roguski","sequence":"additional","affiliation":[{"name":"1 Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 \u0141\u00f3d\u017a, 2Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, 3Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland and 4Centro Nacional de An\u00e1lisis Gen\u00f3mico (CNAG), 08-028 Barcelona, Spain"}]}],"member":"286","published-online":{"date-parts":[[2014,12,22]]},"reference":[{"key":"2023051308505191300_btu844-B1","doi-asserted-by":"crossref","first-page":"e59190","DOI":"10.1371\/journal.pone.0059190","article-title":"Compression of FASTQ and SAM format sequencing data","volume":"8","author":"Bonfield","year":"2013","journal-title":"PLoS One"},{"key":"2023051308505191300_btu844-B2","doi-asserted-by":"crossref","first-page":"e79871","DOI":"10.1371\/journal.pone.0079871","article-title":"Compression of structured high-throughput sequencing data","volume":"8","author":"Campagne","year":"2013","journal-title":"PLoS One"},{"key":"2023051308505191300_btu844-B3","first-page":"51","article-title":"Practical compression for multi-alignment genomic files","volume-title":"Proceeding 
ACSC\u201913 Proceedings of the Thirty-Sixth Australasian Computer Science Conference","author":"C\u00e1novas","year":"2013"},{"key":"2023051308505191300_btu844-B4","doi-asserted-by":"crossref","first-page":"2130","DOI":"10.1093\/bioinformatics\/btu183","article-title":"Lossy compression of quality scores in genomic data","volume":"30","author":"C\u00e1novas","year":"2014","journal-title":"Bioinformatics"},{"key":"2023051308505191300_btu844-B5","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-319-05269-4_4","volume-title":"On the representation of de Bruijn graphs","author":"Chikhi","year":"2014"},{"key":"2023051308505191300_btu844-B6","doi-asserted-by":"crossref","first-page":"1415","DOI":"10.1093\/bioinformatics\/bts173","article-title":"Large-scale compression of genomic sequence databases with the Burrows\u2013Wheeler transform","volume":"28","author":"Cox","year":"2012","journal-title":"Bioinformatics"},{"key":"2023051308505191300_btu844-B7","doi-asserted-by":"crossref","first-page":"860","DOI":"10.1093\/bioinformatics\/btr014","article-title":"Compression of DNA sequence reads in FASTQ format","volume":"27","author":"Deorowicz","year":"2011","journal-title":"Bioinformatics"},{"key":"2023051308505191300_btu844-B8","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1186\/1748-7188-8-25","article-title":"Data compression for sequencing data","volume":"8","author":"Deorowicz","year":"2013","journal-title":"Algorithms Mol. 
Biol."},{"key":"2023051308505191300_btu844-B9","volume-title":"KMC 2: Fast and resource-frugal k-mer counting","author":"Deorowicz","year":"2014"},{"key":"2023051308505191300_btu844-B10","doi-asserted-by":"crossref","first-page":"734","DOI":"10.1101\/gr.114819.110","article-title":"Efficient storage of high throughput DNA sequencing data using reference-based compression","volume":"21","author":"Fritz","year":"2011","journal-title":"Genome Res."},{"key":"2023051308505191300_btu844-B11","doi-asserted-by":"crossref","first-page":"3051","DOI":"10.1093\/bioinformatics\/bts593","article-title":"SCALCE: boosting sequence compression algorithms using locally consistent encoding","volume":"28","author":"Hach","year":"2012","journal-title":"Bioinformatics"},{"key":"2023051308505191300_btu844-B12","doi-asserted-by":"crossref","first-page":"1082","DOI":"10.1038\/nmeth.3133","article-title":"Deez: reference-based compression by local assembly","volume":"11","author":"Hach","year":"2014","journal-title":"Nat. 
Methods"},{"key":"2023051308505191300_btu844-B13","article-title":"Reducing whole-genome data storage footprint","volume-title":"Technical report","author":"Illumina","year":"2012"},{"key":"2023051308505191300_btu844-B14","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1093\/bioinformatics\/btt257","article-title":"Adaptive reference-free compression of sequence quality scores","volume":"30","author":"Janin","year":"2014","journal-title":"Bioinformatics"},{"key":"2023051308505191300_btu844-B15","doi-asserted-by":"crossref","first-page":"e171","DOI":"10.1093\/nar\/gks754","article-title":"Compression of next-generation sequencing reads aided by highly efficient de novo assembly","volume":"40","author":"Jones","year":"2012","journal-title":"Nucleic Acids Res."},{"key":"2023051308505191300_btu844-B16","doi-asserted-by":"crossref","first-page":"728","DOI":"10.1126\/science.1197891","article-title":"On the future of genomic data","volume":"331","author":"Kahn","year":"2011","journal-title":"Science(Washington)"},{"key":"2023051308505191300_btu844-B17","doi-asserted-by":"crossref","first-page":"401","DOI":"10.1089\/cmb.2010.0253","article-title":"Compressing genomic sequence fragments using SlimGene","volume":"18","author":"Kozanitis","year":"2011","journal-title":"J. Comput. 
Biol."},{"key":"2023051308505191300_btu844-B18","first-page":"169","article-title":"Memory efficient minimum substring partitioning","volume-title":"Proceedings of the 39th International Conference on Very Large Data Bases","author":"Li","year":"2013"},{"key":"2023051308505191300_btu844-B19","first-page":"1","article-title":"De\u00a0novo co-assembly of bacterial genomes from multiple single cells","volume-title":"BIBM","author":"Movahedi","year":"2012"},{"key":"2023051308505191300_btu844-B20","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2023051308505191300_btu844-B21","doi-asserted-by":"crossref","first-page":"2213","DOI":"10.1093\/bioinformatics\/btu208","article-title":"DSRC 2\u2014industry-oriented compression of FASTQ files","volume":"30","author":"Roguski","year":"2014","journal-title":"Bioinformatics"},{"key":"2023051308505191300_btu844-B22","doi-asserted-by":"crossref","DOI":"10.1007\/978-1-84882-903-9","volume-title":"Handbook of Data Compression","author":"Salomon","year":"2010"},{"key":"2023051308505191300_btu844-B23","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0081414","article-title":"SRComp: Short read sequence compression using burstsort and elias omega coding","volume":"8","author":"Selva","year":"2013","journal-title":"PLoS One"},{"key":"2023051308505191300_btu844-B24","first-page":"202","article-title":"PPM: one step to practicality","volume-title":"Data Compression Conference (DCC)","author":"Shkarin","year":"2002"},{"key":"2023051308505191300_btu844-B25","doi-asserted-by":"crossref","first-page":"628","DOI":"10.1093\/bioinformatics\/btr689","article-title":"Transformations for the compression of FASTQ quality scores of next-generation sequencing 
data","volume":"28","author":"Wan","year":"2012","journal-title":"Bioinformatics"},{"key":"2023051308505191300_btu844-B26","doi-asserted-by":"crossref","first-page":"R46","DOI":"10.1186\/gb-2014-15-3-r46","article-title":"Kraken: ultrafast metagenomic sequence classification using exact alignments","volume":"15","author":"Wood","year":"2014","journal-title":"Genome Biol."},{"key":"2023051308505191300_btu844-B27","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1186\/1748-7188-6-23","article-title":"Recoil\u2014an algorithm for compression of extremely large datasets of DNA data","volume":"6","author":"Yanovsky","year":"2011","journal-title":"Algorithms Mol. Biol."},{"key":"2023051308505191300_btu844-B28","first-page":"385","article-title":"Traversing the k-mer landscape of NGS read datasets for quality score sparsification","volume-title":"Research in Computational Molecular Biology, Vol. 8394 Lecture Notes in Computer Science","author":"Yu","year":"2014"},{"key":"2023051308505191300_btu844-B29","first-page":"127","article-title":"FQZip: lossless reference-based compression of next generation sequencing data in FASTQ format","volume-title":"18th Asia Pacific Symposium on Intelligent and Evolutionary 
Systems","author":"Zhang","year":"2015"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/31\/9\/1389\/50306347\/bioinformatics_31_9_1389.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/31\/9\/1389\/50306347\/bioinformatics_31_9_1389.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,13]],"date-time":"2023-05-13T08:52:47Z","timestamp":1683967967000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/31\/9\/1389\/200464"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,12,22]]},"references-count":29,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2015,5,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btu844","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2015,5,1]]},"published":{"date-parts":[[2014,12,22]]}}}