{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,31]],"date-time":"2026-01-31T02:45:29Z","timestamp":1769827529843,"version":"3.49.0"},"reference-count":27,"publisher":"Oxford University Press (OUP)","issue":"9","license":[{"start":{"date-parts":[[2023,9,9]],"date-time":"2023-09-09T00:00:00Z","timestamp":1694217600000},"content-version":"vor","delay-in-days":8,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"name":"European Union\u2019s Horizon 2020 ITN programme"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,9,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>The Positional Burrows\u2013Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as \u03bc-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the \u03bc-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, \u03bc-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. \u03bc-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>Our implementation is open source and available at https:\/\/github.com\/dlcgold\/muPBWT. The binary is available at https:\/\/bioconda.github.io\/recipes\/mupbwt\/README.html.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btad552","type":"journal-article","created":{"date-parts":[[2023,9,9]],"date-time":"2023-09-09T13:21:50Z","timestamp":1694265710000},"source":"Crossref","is-referenced-by-count":4,"title":["\u03bc- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data"],"prefix":"10.1093","volume":"39","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2439-0608","authenticated-orcid":false,"given":"Davide","family":"Cozzi","sequence":"first","affiliation":[{"name":"Department of Informatics, Systems and Communication, University of Milano-Bicocca , Milan 20126, Italy"}]},{"given":"Massimiliano","family":"Rossi","sequence":"additional","affiliation":[{"name":"Department of Computer & Information Science & Engineering, Herbert-Wertheim College of Engineering, University of Florida , Gainesville, Florida 32611, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7725-4813","authenticated-orcid":false,"given":"Simone","family":"Rubinacci","sequence":"additional","affiliation":[{"name":"Department of Computational Biology, University of Lausanne , Lausanne 1015, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3689-327X","authenticated-orcid":false,"given":"Travis","family":"Gagie","sequence":"additional","affiliation":[{"name":"Faculty of Computer Science, Dalhousie University , Halifax B3H 4R2, Canada"}]},{"given":"Dominik","family":"K\u00f6ppl","sequence":"additional","affiliation":[{"name":"M&D Data Science Center, Tokyo Medical and Dental University , Tokyo 113-8510, Japan"},{"name":"Department of Computer Science, University of Muenster , Muenster 48149, Germany"}]},{"given":"Christina","family":"Boucher","sequence":"additional","affiliation":[{"name":"Department of Computer & Information Science & Engineering, Herbert-Wertheim College of Engineering, University of Florida , Gainesville, Florida 32611, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7289-4988","authenticated-orcid":false,"given":"Paola","family":"Bonizzoni","sequence":"additional","affiliation":[{"name":"Department of Informatics, Systems and Communication, University of Milano-Bicocca , Milan 20126, Italy"}]}],"member":"286","published-online":{"date-parts":[[2023,9,9]]},"reference":[{"key":"2023091504322447500_btad552-B1","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1186\/s13015-020-0163-6","article-title":"Finding all maximal perfect haplotype blocks in linear time","volume":"15","author":"Alanko","year":"2020","journal-title":"Algorithms Mol Biol"},{"key":"2023091504322447500_btad552-B2","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1007\/s11047-022-09882-6","article-title":"Computational graph pangenomics: a tutorial on data structures and their applications","volume":"21","author":"Baaijens","year":"2022","journal-title":"Nat Comput"},{"key":"2023091504322447500_btad552-B3","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1016\/j.tcs.2019.08.005","article-title":"Refining the r-index","volume":"812","author":"Bannai","year":"2020","journal-title":"Theor Comput Sci"},{"key":"2023091504322447500_btad552-B4","doi-asserted-by":"crossref","first-page":"2047","DOI":"10.1093\/gigascience\/giab007","article-title":"HTSlib: C library for reading\/writing high-throughput sequencing data","volume":"10","author":"Bonfield","year":"2021","journal-title":"Gigascience"},{"key":"2023091504322447500_btad552-B5","author":"Bonizzoni","year":"2022"},{"key":"2023091504322447500_btad552-B6","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1038\/s41586-018-0579-z","article-title":"The UK Biobank resource with deep phenotyping and genomic data","volume":"562","author":"Bycroft","year":"2018","journal-title":"Nature"},{"key":"2023091504322447500_btad552-B7","doi-asserted-by":"publisher","first-page":"giab008","DOI":"10.1093\/gigascience\/giab008","article-title":"Twelve years of SAMtools and BCFtools","volume":"10","author":"Danecek","year":"2021","journal-title":"Gigascience"},{"key":"2023091504322447500_btad552-B8","doi-asserted-by":"crossref","first-page":"1266","DOI":"10.1093\/bioinformatics\/btu014","article-title":"Efficient haplotype matching and storage using the Positional Burrows\u2013Wheeler Transform (PBWT)","volume":"30","author":"Durbin","year":"2014","journal-title":"Bioinformatics"},{"key":"2023091504322447500_btad552-B9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3375890","article-title":"Fully functional suffix trees and optimal text searching in BWT-runs bounded space","volume":"67","author":"Gagie","year":"2020","journal-title":"J ACM"},{"key":"2023091504322447500_btad552-B10","first-page":"326","author":"Gog","year":"2014"},{"key":"2023091504322447500_btad552-B11","doi-asserted-by":"crossref","first-page":"732","DOI":"10.1038\/s41586-022-04965-x","article-title":"The sequences of 150,119 genomes in the UK Biobank","volume":"607","author":"Halldorsson","year":"2022","journal-title":"Nature"},{"key":"2023091504322447500_btad552-B12","doi-asserted-by":"crossref","first-page":"1243","DOI":"10.1038\/s41588-023-01415-w","article-title":"Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK biobank","volume":"55","author":"Hofmeister","year":"2023","journal-title":"Nat Genet"},{"key":"2023091504322447500_btad552-B13","doi-asserted-by":"crossref","first-page":"181","DOI":"10.1007\/978-3-642-02441-2_17","volume-title":"Combinatorial Pattern Matching","author":"K\u00e4rkk\u00e4inen","year":"2009"},{"key":"2023091504322447500_btad552-B14","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1093\/bioinformatics\/btv613","article-title":"BGT: efficient and flexible genotype query across many samples","volume":"32","author":"Li","year":"2016","journal-title":"Bioinformatics"},{"key":"2023091504322447500_btad552-B15","first-page":"17","author":"M\u00e4kinen","year":"2004"},{"key":"2023091504322447500_btad552-B16","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1016\/j.ipl.2019.02.003","article-title":"Applying the Positional Burrows-Wheeler Transform to all-pairs Hamming distance","volume":"146","author":"M\u00e4kinen","year":"2019","journal-title":"Inf Process Lett"},{"key":"2023091504322447500_btad552-B17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13015-017-0109-9","article-title":"A graph extension of the Positional Burrows\u2013Wheeler Transform and its applications","volume":"12","author":"Novak","year":"2017","journal-title":"Algorithms Mol Biol"},{"key":"2023091504322447500_btad552-B18","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1016\/j.fsigen.2012.05.002","article-title":"Assessing paternities with inconclusive str results: the suitability of bi-allelic markers","volume":"7","author":"Pinto","year":"2013","journal-title":"Forensic Sci Int Genet"},{"key":"2023091504322447500_btad552-B19","doi-asserted-by":"crossref","first-page":"169","DOI":"10.1089\/cmb.2021.0290","article-title":"Moni: a pangenomic index for finding maximal exact matches","volume":"29","author":"Rossi","year":"2022","journal-title":"J Comput Biol"},{"key":"2023091504322447500_btad552-B20","doi-asserted-by":"crossref","first-page":"e1009049","DOI":"10.1371\/journal.pgen.1009049","article-title":"Genotype imputation using the Positional Burrows Wheeler Transform","volume":"16","author":"Rubinacci","year":"2020","journal-title":"PLoS Genet"},{"key":"2023091504322447500_btad552-B21","doi-asserted-by":"crossref","first-page":"2390","DOI":"10.1093\/bioinformatics\/btab117","article-title":"d-PBWT: dynamic Positional Burrows\u2013Wheeler Transform","volume":"37","author":"Sanaullah","year":"2021","journal-title":"Bioinformatics"},{"key":"2023091504322447500_btad552-B22","doi-asserted-by":"crossref","first-page":"1652","DOI":"10.1093\/bioinformatics\/btw050","article-title":"Efficient privacy-preserving string search and an application in genomics","volume":"32","author":"Shimizu","year":"2016","journal-title":"Bioinformatics"},{"key":"2023091504322447500_btad552-B23","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1093\/bioinformatics\/btz575","article-title":"Haplotype-aware graph indexes","volume":"36","author":"Sir\u00e9n","year":"2020","journal-title":"Bioinformatics"},{"key":"2023091504322447500_btad552-B24","doi-asserted-by":"crossref","first-page":"290","DOI":"10.1038\/s41586-021-03205-y","article-title":"Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program","volume":"590","author":"Taliun","year":"2021","journal-title":"Nature"},{"key":"2023091504322447500_btad552-B25","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","author":"The 1000 Genomes Project Consortium","year":"2015","journal-title":"Nature"},{"key":"2023091504322447500_btad552-B26","doi-asserted-by":"crossref","first-page":"btac734","DOI":"10.1093\/bioinformatics\/btac734","article-title":"Syllable-PBWT for space-efficient haplotype long-match query","volume":"39","author":"Wang","year":"2023","journal-title":"Bioinformatics"},{"key":"2023091504322447500_btad552-B27","doi-asserted-by":"crossref","first-page":"e1005932","DOI":"10.1371\/journal.pcbi.1005932","article-title":"Bayesian inference of phylogenetic networks from bi-allelic genetic markers","volume":"14","author":"Zhu","year":"2018","journal-title":"PLoS Comput Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btad552\/51472749\/btad552.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/9\/btad552\/51556136\/btad552.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/9\/btad552\/51556136\/btad552.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,15]],"date-time":"2023-09-15T04:44:39Z","timestamp":1694753079000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btad552\/7265394"}},"subtitle":[],"editor":[{"given":"Peter","family":"Robinson","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2023,9,1]]},"references-count":27,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2023,9,2]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btad552","relation":{},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023,9,1]]},"published":{"date-parts":[[2023,9,1]]},"article-number":"btad552"}}