{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T03:53:34Z","timestamp":1772078014120,"version":"3.50.1"},"reference-count":22,"publisher":"Oxford University Press (OUP)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,6,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: The Burrows\u2013Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.<\/jats:p>\n               <jats:p>Results: We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm.<\/jats:p>\n               <jats:p>We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel \u2018implicit sorting\u2019 strategy that enables these benefits to be realized without the overhead of sorting the reads. With these techniques, a 45\u00d7 coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3 Gb of sequence to fit into only 8.2 GB of space (trimming a small proportion of low-quality bases from the reads improves the compression still further).<\/jats:p>\n               <jats:p>This is &amp;gt;4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full text indexes such as the FM-index on large-scale DNA sequence collections.<\/jats:p>\n               <jats:p>Availability: Code to construct the BWT and SAP-array on large genomic datasets is part of the BEETL library, available as a github repository at https:\/\/github.com\/BEETL\/BEETL.<\/jats:p>\n               <jats:p>Contact: \u00a0acox@illumina.com<\/jats:p>","DOI":"10.1093\/bioinformatics\/bts173","type":"journal-article","created":{"date-parts":[[2012,5,4]],"date-time":"2012-05-04T00:59:48Z","timestamp":1336093188000},"page":"1415-1419","source":"Crossref","is-referenced-by-count":116,"title":["Large-scale compression of genomic sequence databases with the Burrows\u2013Wheeler transform"],"prefix":"10.1093","volume":"28","author":[{"given":"Anthony J.","family":"Cox","sequence":"first","affiliation":[{"name":"1 Computational Biology Group, Illumina Cambridge Ltd., Chesterford Research Park, Little Chesterford, Essex CB10 1XL, UK, 2Computational Genomics, CeBiTec, Bielefeld University, 33615 Bielefeld, Germany and 3University of Palermo, Dipartimento di Matematica e Informatica,Via Archirafi 34, 90123 Palermo,Italy"}]},{"given":"Markus J.","family":"Bauer","sequence":"additional","affiliation":[{"name":"1 Computational Biology Group, Illumina Cambridge Ltd., Chesterford Research Park, Little Chesterford, Essex CB10 1XL, UK, 2Computational Genomics, CeBiTec, Bielefeld University, 33615 Bielefeld, Germany and 3University of Palermo, Dipartimento di Matematica e Informatica,Via Archirafi 34, 90123 Palermo,Italy"}]},{"given":"Tobias","family":"Jakobi","sequence":"additional","affiliation":[{"name":"1 Computational Biology Group, Illumina Cambridge Ltd., Chesterford Research Park, Little Chesterford, Essex CB10 1XL, UK, 2Computational Genomics, CeBiTec, Bielefeld University, 33615 Bielefeld, Germany and 3University of Palermo, Dipartimento di Matematica e Informatica,Via Archirafi 34, 90123 Palermo,Italy"}]},{"given":"Giovanna","family":"Rosone","sequence":"additional","affiliation":[{"name":"1 Computational Biology Group, Illumina Cambridge Ltd., Chesterford Research Park, Little Chesterford, Essex CB10 1XL, UK, 2Computational Genomics, CeBiTec, Bielefeld University, 33615 Bielefeld, Germany and 3University of Palermo, Dipartimento di Matematica e Informatica,Via Archirafi 34, 90123 Palermo,Italy"}]}],"member":"286","published-online":{"date-parts":[[2012,5,3]]},"reference":[{"key":"2023012512312122700_B1","doi-asserted-by":"crossref","DOI":"10.1007\/978-0-387-78909-5","volume-title":"The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching.","author":"Adjeroh","year":"2008","edition":"1st"},{"key":"2023012512312122700_B2","first-page":"219","article-title":"Lightweight BWT construction for very large string collections","volume-title":"CPM 2011","author":"Bauer","year":"2011"},{"key":"2023012512312122700_B3","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2012.02.002","article-title":"Lightweight algorithms for constructing and inverting the BWT of string collections","author":"Bauer","year":"2012","journal-title":"Theor. Comput. Sci."},{"key":"2023012512312122700_B4","volume-title":"A block sorting data compression algorithm.","author":"Burrows","year":"1994"},{"key":"2023012512312122700_B5","doi-asserted-by":"crossref","first-page":"1696","DOI":"10.1093\/bioinformatics\/18.12.1696","article-title":"DNACompress: fast and effective DNA sequence compression","volume":"18","author":"Chen","year":"2002","journal-title":"Bioinformatics"},{"key":"2023012512312122700_B6","doi-asserted-by":"crossref","first-page":"860","DOI":"10.1093\/bioinformatics\/btr014","article-title":"Compression of genomic sequences in FASTQ format","volume":"27","author":"Deorowicz","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012512312122700_B7","doi-asserted-by":"crossref","first-page":"e1002280","DOI":"10.1371\/journal.pgen.1002280","article-title":"Phased whole-genome genetic risk in a family quartet using a major allele reference sequence","volume":"7","author":"Dewey","year":"2011","journal-title":"PLoS Genet."},{"key":"2023012512312122700_B8","doi-asserted-by":"crossref","first-page":"390","DOI":"10.1109\/SFCS.2000.892127","article-title":"Opportunistic data structures with applications","volume-title":"Proceedings of the 41st Annual Symposium on Foundations of Computer Science.","author":"Ferragina","year":"2000"},{"key":"2023012512312122700_B9","doi-asserted-by":"crossref","first-page":"552","DOI":"10.1145\/1082036.1082039","article-title":"Indexing compressed text","volume":"52","author":"Ferragina","year":"2005","journal-title":"J. ACM"},{"issue":"2","key":"2023012512312122700_B10","article-title":"Compressed representations of sequences and full-text indexes","volume":"3","author":"Ferragina","year":"2007","journal-title":"ACM Trans. Algor."},{"key":"2023012512312122700_B11","doi-asserted-by":"crossref","first-page":"734","DOI":"10.1101\/gr.114819.110","article-title":"Efficient storage of high throughput DNA sequencing data using reference-based compression","volume":"21","author":"Fritz","year":"2011","journal-title":"Genome Res."},{"key":"2023012512312122700_B12","doi-asserted-by":"crossref","first-page":"1575","DOI":"10.1093\/bioinformatics\/btp117","article-title":"Textual data compression in computational biology: a synopsis","volume":"25","author":"Giancarlo","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012512312122700_B13","doi-asserted-by":"crossref","first-page":"875","DOI":"10.1016\/0306-4573(94)90014-0","article-title":"A new challenge for compression algorithms: genetic sequences","volume":"30","author":"Grumbach","year":"1994","journal-title":"Inf. Process. Manage."},{"key":"2023012512312122700_B14","first-page":"310","article-title":"Compressing genomic sequence fragments using SlimGene","volume-title":"RECOMB.","author":"Kozanitis","year":"2010"},{"key":"2023012512312122700_B15","doi-asserted-by":"crossref","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with Burrows-Wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012512312122700_B16","first-page":"178\u2014189","article-title":"An extension of the Burrows Wheeler transform and applications to sequence comparison and data compression","volume-title":"CPM 2005","author":"Mantaci","year":"2005"},{"key":"2023012512312122700_B17","first-page":"407","article-title":"Discovering simple DNA sequences by the algorithmic significance method","volume":"9","author":"Milosavljevic","year":"1993","journal-title":"Comput. Appl. Biosci. CABIOS"},{"key":"2023012512312122700_B18","doi-asserted-by":"crossref","first-page":"315","DOI":"10.1016\/0300-9084(96)84763-8","article-title":"Compression and genetic sequence analysis","volume":"78","author":"Rivals","year":"1996","journal-title":"Biochimie"},{"key":"2023012512312122700_B19","doi-asserted-by":"crossref","first-page":"i367","DOI":"10.1093\/bioinformatics\/btq217","article-title":"Efficient construction of an assembly string graph using the FM-index","volume":"26","author":"Simpson","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012512312122700_B20","doi-asserted-by":"crossref","first-page":"549","DOI":"10.1101\/gr.126953.111","article-title":"Efficient de novo assembly of large genomes using compressed data structures","volume":"22","author":"Simpson","year":"2012","journal-title":"Genome Res"},{"key":"2023012512312122700_B21","doi-asserted-by":"crossref","first-page":"2192","DOI":"10.1093\/bioinformatics\/btq346","article-title":"G-SQZ: compact encoding of genomic sequence and quality data","volume":"26","author":"Tembe","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012512312122700_B22","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1186\/1748-7188-6-23","article-title":"ReCoil - an algorithm for compression of extremely large datasets of DNA data","volume":"6","author":"Yanovsky","year":"2011","journal-title":"Algor. Mol. Biol."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/11\/1415\/48868379\/bioinformatics_28_11_1415.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/11\/1415\/48868379\/bioinformatics_28_11_1415.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,25]],"date-time":"2023-01-25T15:59:52Z","timestamp":1674662392000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/28\/11\/1415\/266914"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,5,3]]},"references-count":22,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2012,6,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bts173","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2012,6,1]]},"published":{"date-parts":[[2012,5,3]]}}}