{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,8]],"date-time":"2026-02-08T04:20:36Z","timestamp":1770524436927,"version":"3.49.0"},"reference-count":61,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2024,5,24]],"date-time":"2024-05-24T00:00:00Z","timestamp":1716508800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000780","name":"European Union","doi-asserted-by":"publisher","award":["REGINDEX, 101039208"],"award-info":[{"award-number":["REGINDEX, 101039208"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>In recent years, the focus of bioinformatics research has moved from individual sequences to collections of sequences. Given the fundamental role of the Burrows\u2013Wheeler transform (BWT) in string processing, a number of dedicated tools have been developed for computing the BWT of string collections. While the focus has been on improving efficiency, both in space and time, the exact definition of the BWT used has not been at the center of attention. As we show in this paper, the different tools in use often compute non-equivalent BWT variants: the resulting transforms can differ from each other significantly, including the number r of runs, a central parameter of the BWT. Moreover, with many tools, the transform depends on the input order of the collection. 
In other words, on the same dataset, the same tool may output different transforms if the dataset is given in a different order.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We studied 18 dedicated tools for computing the BWT of string collections and were able to identify 6 different BWT variants computed by these tools. We review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on eight real-life biological datasets with different characteristics. We find that the differences can be extensive, depending on the datasets, and are largest on collections of many similar short sequences. The parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>Source code and scripts to replicate the results and download the data used in the article are available at https:\/\/github.com\/davidecenzato\/BWT-variants-for-string-collections.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae333","type":"journal-article","created":{"date-parts":[[2024,5,24]],"date-time":"2024-05-24T21:35:06Z","timestamp":1716586506000},"source":"Crossref","is-referenced-by-count":6,"title":["A survey of BWT variants for string collections"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0098-3620","authenticated-orcid":false,"given":"Davide","family":"Cenzato","sequence":"first","affiliation":[{"name":"Department of Environmental Sciences, Informatics and Statistics, Ca\u2019 Foscari University , Venice, 30123, 
Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3233-0691","authenticated-orcid":false,"given":"Zsuzsanna","family":"Lipt\u00e1k","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Verona , Verona, 37134, Italy"}]}],"member":"286","published-online":{"date-parts":[[2024,5,24]]},"reference":[{"key":"2024072922455154000_btae333-B1","doi-asserted-by":"crossref","first-page":"104999","DOI":"10.1016\/j.ic.2022.104999","article-title":"Sensitivity of string compressors and repetitiveness measures","volume":"291","author":"Akagi","year":"2023","journal-title":"Inf Comput"},{"key":"2024072922455154000_btae333-B2","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","author":"Auton","year":"2015","journal-title":"Nature"},{"key":"2024072922455154000_btae333-B3","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1016\/j.tcs.2019.08.005","article-title":"Refining the r-index","volume":"812","author":"Bannai","year":"2020","journal-title":"Theor Comput Sci"},{"key":"2024072922455154000_btae333-B4","doi-asserted-by":"crossref","first-page":"134","DOI":"10.1016\/j.tcs.2012.02.002","article-title":"Lightweight algorithms for constructing and inverting the BWT of string collections","volume":"483","author":"Bauer","year":"2013","journal-title":"Theor Comput Sci"},{"key":"2024072922455154000_btae333-B5","first-page":"1","author":"Bentley"},{"key":"2024072922455154000_btae333-B6","doi-asserted-by":"crossref","first-page":"948","DOI":"10.1089\/cmb.2018.0230","article-title":"Multithread multistring Burrows\u2013Wheeler transform and longest common prefix array","volume":"26","author":"Bonizzoni","year":"2019","journal-title":"J Comput Biol"},{"key":"2024072922455154000_btae333-B7","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1186\/s13015-019-0148-5","article-title":"Prefix-free parsing for building big 
BWTs","volume":"14","author":"Boucher","year":"2019","journal-title":"Algorithms Mol Biol"},{"key":"2024072922455154000_btae333-B8","first-page":"129","author":"Boucher"},{"key":"2024072922455154000_btae333-B9","first-page":"60","author":"Boucher","year":"2021"},{"key":"2024072922455154000_btae333-B10","doi-asserted-by":"crossref","first-page":"105155","DOI":"10.1016\/j.ic.2024.105155","article-title":"Indexing the eBWT","volume":"298","author":"Boucher","year":"2024","journal-title":"Inf Comput"},{"key":"2024072922455154000_btae333-B11","author":"Burrows","year":"1994"},{"key":"2024072922455154000_btae333-B12","first-page":"1","author":"Cazaux"},{"key":"2024072922455154000_btae333-B13","first-page":"1","author":"Cenzato"},{"key":"2024072922455154000_btae333-B14","author":"Cenzato"},{"key":"2024072922455154000_btae333-B15","first-page":"1","author":"Cobas"},{"key":"2024072922455154000_btae333-B16","doi-asserted-by":"crossref","first-page":"1415","DOI":"10.1093\/bioinformatics\/bts173","article-title":"Large-scale compression of genomic sequence databases with the Burrows\u2013Wheeler transform","volume":"28","author":"Cox","year":"2012","journal-title":"Bioinformatics"},{"key":"2024072922455154000_btae333-B17","author":"D\u00edaz-Dom\u00ednguez","year":"2021"},{"key":"2024072922455154000_btae333-B18","doi-asserted-by":"crossref","first-page":"105088","DOI":"10.1016\/j.ic.2023.105088","article-title":"Efficient construction of the BWT for repetitive text using string compression","volume":"294","author":"D\u00edaz-Dom\u00ednguez","year":"2023","journal-title":"Inf Comput"},{"key":"2024072922455154000_btae333-B19","doi-asserted-by":"crossref","first-page":"2371","DOI":"10.1093\/bioinformatics\/bty113","article-title":"Updating the 97% identity threshold for 16S ribosomal RNA 
OTUs","volume":"34","author":"Edgar","year":"2018","journal-title":"Bioinformatics"},{"key":"2024072922455154000_btae333-B20","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1186\/s13015-019-0140-0","article-title":"External memory BWT and LCP computation for sequence collections with applications","volume":"14","author":"Egidi","year":"2019","journal-title":"Algorithms Mol Biol"},{"key":"2024072922455154000_btae333-B21","first-page":"184","author":"Ferragina","year":"2005"},{"key":"2024072922455154000_btae333-B22","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1613676.1613680","article-title":"Compressing and indexing labeled trees, with applications","volume":"57","author":"Ferragina","year":"2009","journal-title":"J ACM"},{"key":"2024072922455154000_btae333-B23","doi-asserted-by":"crossref","first-page":"707","DOI":"10.1007\/s00453-011-9535-0","article-title":"Lightweight data indexing and compression in external memory","volume":"63","author":"Ferragina","year":"2012","journal-title":"Algorithmica"},{"key":"2024072922455154000_btae333-B24","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3375890","article-title":"Fully functional suffix trees and optimal text searching in BWT-runs bounded space","volume":"67","author":"Gagie","year":"2020","journal-title":"J ACM"},{"key":"2024072922455154000_btae333-B25","first-page":"1","author":"Gagie"},{"key":"2024072922455154000_btae333-B26","doi-asserted-by":"crossref","first-page":"659","DOI":"10.1093\/jhered\/esp086","article-title":"A proposal to obtain whole-genome sequence for 10,000 vertebrate species","volume":"100","author":"Genome 10K Community of Scientists","year":"2009","journal-title":"J 
Hered"},{"key":"2024072922455154000_btae333-B27","author":"Gil","year":"2012"},{"key":"2024072922455154000_btae333-B28","first-page":"249","author":"Giuliani"},{"key":"2024072922455154000_btae333-B29","doi-asserted-by":"crossref","first-page":"e1010248","DOI":"10.1371\/journal.ppat.1010248","article-title":"A SARS-CoV-2 variant elicits an antibody response with a shifted immunodominance hierarchy","volume":"18","author":"Greaney","year":"2022","journal-title":"PLoS Pathog"},{"key":"2024072922455154000_btae333-B30","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511574931","volume-title":"Algorithms on Strings, Trees, and Sequences\u2014Computer Science and Computational Biology","author":"Gusfield","year":"1997"},{"key":"2024072922455154000_btae333-B31","doi-asserted-by":"crossref","first-page":"3524","DOI":"10.1093\/bioinformatics\/btu584","article-title":"Merging of multi-string BWTs with applications","volume":"30","author":"Holt","year":"2014","journal-title":"Bioinformatics"},{"key":"2024072922455154000_btae333-B32","doi-asserted-by":"crossref","first-page":"492","DOI":"10.1016\/j.cell.2016.06.044","article-title":"Epigenomic diversity in a global collection of Arabidopsis thaliana accessions","volume":"166","author":"Kawakatsu","year":"2016","journal-title":"Cell"},{"key":"2024072922455154000_btae333-B33","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1145\/3531445","article-title":"Resolution of the Burrows\u2013Wheeler transform conjecture","volume":"65","author":"Kempa","year":"2022","journal-title":"Commun ACM"},{"key":"2024072922455154000_btae333-B34","first-page":"1","author":"K\u00f6ppl"},{"key":"2024072922455154000_btae333-B35","doi-asserted-by":"crossref","first-page":"500","DOI":"10.1089\/cmb.2019.0309","article-title":"Efficient construction of a complete index for pan-genomics read alignment","volume":"27","author":"Kuhnle","year":"2020","journal-title":"J Comput 
Biol"},{"key":"2024072922455154000_btae333-B36","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/nmeth.1923","article-title":"Fast gapped-read alignment with Bowtie 2","volume":"9","author":"Langmead","year":"2012","journal-title":"Nat Methods"},{"key":"2024072922455154000_btae333-B37","doi-asserted-by":"crossref","first-page":"R25","DOI":"10.1186\/gb-2009-10-3-r25","article-title":"Ultrafast and memory-efficient alignment of short DNA sequences to the human genome","volume":"10","author":"Langmead","year":"2009","journal-title":"Genome Biol"},{"key":"2024072922455154000_btae333-B38","doi-asserted-by":"crossref","first-page":"3274","DOI":"10.1093\/bioinformatics\/btu541","article-title":"Fast construction of FM-index for long sequence reads","volume":"30","author":"Li","year":"2014","journal-title":"Bioinformatics"},{"key":"2024072922455154000_btae333-B39","doi-asserted-by":"crossref","first-page":"589","DOI":"10.1093\/bioinformatics\/btp698","article-title":"Fast and accurate long-read alignment with Burrows\u2013Wheeler transform","volume":"26","author":"Li","year":"2010","journal-title":"Bioinformatics"},{"key":"2024072922455154000_btae333-B40","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1186\/s13015-017-0117-9","article-title":"Generalized enhanced suffix array construction in external memory","volume":"12","author":"Louza","year":"2017","journal-title":"Algorithms Mol Biol"},{"key":"2024072922455154000_btae333-B41","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1186\/s13015-020-00177-y","article-title":"gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections","volume":"15","author":"Louza","year":"2020","journal-title":"Algorithms Mol Biol"},{"key":"2024072922455154000_btae333-B42","first-page":"40","article-title":"Succinct suffix arrays based on run-length encoding","volume":"12","author":"M\u00e4kinen","year":"2005","journal-title":"Nordic J 
Comput"},{"key":"2024072922455154000_btae333-B43","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1038\/nature18964","article-title":"The Simons genome diversity project: 300 genomes from 142 diverse populations","volume":"538","author":"Mallick","year":"2016","journal-title":"Nature"},{"key":"2024072922455154000_btae333-B44","doi-asserted-by":"crossref","first-page":"298","DOI":"10.1016\/j.tcs.2007.07.014","article-title":"An extension of the Burrows\u2013Wheeler transform","volume":"387","author":"Mantaci","year":"2007","journal-title":"Theor Comput Sci"},{"key":"2024072922455154000_btae333-B45","first-page":"80","author":"Manzini"},{"key":"2024072922455154000_btae333-B46","first-page":"1","author":"Masillo"},{"key":"2024072922455154000_btae333-B47","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3409371","article-title":"Indexing highly repetitive string collections, part I: repetitiveness measures","volume":"54","author":"Navarro","year":"2021","journal-title":"ACM Comput Surv"},{"key":"2024072922455154000_btae333-B48","volume-title":"Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction","author":"Ohlebusch","year":"2013"},{"key":"2024072922455154000_btae333-B49","first-page":"325","author":"Ohlebusch"},{"key":"2024072922455154000_btae333-B50","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3641854","article-title":"Generic non-recursive suffix array construction","volume":"20","author":"Olbrich","year":"2024","journal-title":"ACM Trans 
Algorithms"},{"key":"2024072922455154000_btae333-B51","first-page":"203","author":"Oliva","year":"2021"},{"key":"2024072922455154000_btae333-B52","first-page":"62","author":"Oliva","year":"2023"},{"key":"2024072922455154000_btae333-B53","author":"Pantaleoni","year":"2014"},{"key":"2024072922455154000_btae333-B54","first-page":"1","author":"Puglisi"},{"key":"2024072922455154000_btae333-B55","first-page":"211","author":"Sir\u00e9n","year":"2016"},{"key":"2024072922455154000_btae333-B56","doi-asserted-by":"crossref","first-page":"1295","DOI":"10.1016\/j.cell.2020.08.012","article-title":"Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding","volume":"182","author":"Starr","year":"2020","journal-title":"Cell"},{"key":"2024072922455154000_btae333-B57","doi-asserted-by":"crossref","first-page":"597","DOI":"10.1093\/nar\/gkw958","article-title":"RPAN: rice pan-genome browser for 3000 rice genomes","volume":"45","author":"Sun","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2024072922455154000_btae333-B58","doi-asserted-by":"crossref","first-page":"k1687","DOI":"10.1136\/bmj.k1687","article-title":"The 100,000 genomes project: bringing whole genome sequencing to the NHS","volume":"361","author":"Turnbull","year":"2018","journal-title":"Br Med J"},{"key":"2024072922455154000_btae333-B59","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1186\/s12864-015-1284-z","article-title":"Analysis of the genetic diversity of influenza A viruses using next-generation DNA sequencing","volume":"16","author":"Van den Hoecke","year":"2015","journal-title":"BMC Genomics"},{"key":"2024072922455154000_btae333-B60","doi-asserted-by":"crossref","first-page":"298","DOI":"10.3390\/ijms21010298","article-title":"Targeting the 16S rRNA gene for bacterial identification in complex mixed samples: comparative evaluation of second (Illumina) and third (Oxford Nanopore Technologies) generation sequencing 
technologies","volume":"21","author":"Winand","year":"2019","journal-title":"IJMS"},{"key":"2024072922455154000_btae333-B61","doi-asserted-by":"crossref","first-page":"677","DOI":"10.1089\/mdr.2018.0408","article-title":"Sentinel case of Candida auris in the Western United States following prolonged occult colonization in a returned traveler from India","volume":"25","author":"Woodworth","year":"2019","journal-title":"Microb Drug Resist"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae333\/57890132\/btae333.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/7\/btae333\/58682567\/btae333.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/7\/btae333\/58682567\/btae333.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,30]],"date-time":"2024-07-30T01:31:52Z","timestamp":1722303112000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae333\/7681884"}},"subtitle":[],"editor":[{"given":"Peter","family":"Robinson","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,5,24]]},"references-count":61,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2024,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae333","relation":{},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,7]]},"published":{"date-parts":[[2024,5,24]]},"article-number":"btae333"}}