{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:17Z","timestamp":1772138057331,"version":"3.50.1"},"reference-count":23,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2022,11,28]],"date-time":"2022-11-28T00:00:00Z","timestamp":1669593600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01HG010086"],"award-info":[{"award-number":["R01HG010086"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R56HG011509"],"award-info":[{"award-number":["R56HG011509"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["OT2OD002751"],"award-info":[{"award-number":["OT2OD002751"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>The positional Burrows\u2013Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>Compared to the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>https:\/\/github.com\/ZhiGroup\/Syllable-PBWT<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac734","type":"journal-article","created":{"date-parts":[[2022,11,28]],"date-time":"2022-11-28T08:30:43Z","timestamp":1669624243000},"source":"Crossref","is-referenced-by-count":10,"title":["Syllable-PBWT for space-efficient haplotype long-match query"],"prefix":"10.1093","volume":"39","author":[{"given":"Victor","family":"Wang","sequence":"first","affiliation":[{"name":"School of Biomedical Informatics, University of Texas Health Science Center at Houston , Houston, TX 77030, USA"}]},{"given":"Ardalan","family":"Naseri","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, University of Texas Health Science Center at Houston , Houston, TX 77030, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4051-5549","authenticated-orcid":false,"given":"Shaojie","family":"Zhang","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Central Florida , Orlando, FL 32816, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7754-1890","authenticated-orcid":false,"given":"Degui","family":"Zhi","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics, University of Texas Health Science Center at Houston , Houston, TX 77030, USA"}]}],"member":"286","published-online":{"date-parts":[[2022,11,28]]},"reference":[{"key":"2023010107513317400_btac734-B1","author":"23andMe","year":"2022"},{"key":"2023010107513317400_btac734-B2","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1515\/jmc.2010.005","article-title":"The power of primes: security of authentication based on a universal hash-function family","volume":"4","author":"Alomair","year":"2010","journal-title":"J. Math. Cryptol"},{"key":"2023010107513317400_btac734-B3","doi-asserted-by":"crossref","first-page":"855","DOI":"10.1111\/1755-0998.12357","article-title":"Genotyping-in-Thousands by sequencing (GT-seq): a cost effective SNP genotyping method based on custom amplicon sequencing","volume":"15","author":"Campbell","year":"2015","journal-title":"Mol. Ecol. Resour"},{"key":"2023010107513317400_btac734-B4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3445967","article-title":"Data structures to represent a set of k-long DNA sequences","volume":"54","author":"Chikhi","year":"2021","journal-title":"ACM Comput. Surv"},{"key":"2023010107513317400_btac734-B5","doi-asserted-by":"crossref","first-page":"5436","DOI":"10.1038\/s41467-019-13225-y","article-title":"Accurate, scalable and integrative haplotype estimation","volume":"10","author":"Delaneau","year":"2019","journal-title":"Nat. Commun"},{"key":"2023010107513317400_btac734-B6","doi-asserted-by":"crossref","first-page":"1266","DOI":"10.1093\/bioinformatics\/btu014","article-title":"Efficient haplotype matching and storage using the positional burrows\u2013wheeler transform (PBWT)","volume":"30","author":"Durbin","year":"2014","journal-title":"Bioinformatics"},{"key":"2023010107513317400_btac734-B7","doi-asserted-by":"crossref","first-page":"958","DOI":"10.1016\/j.cels.2021.08.009","article-title":"Minimizer-space de Bruijn graphs: whole-genome assembly of long reads in minutes on a personal computer","volume":"12","author":"Ekim","year":"2021","journal-title":"Cell Syst"},{"key":"2023010107513317400_btac734-B8","doi-asserted-by":"crossref","first-page":"2131","DOI":"10.1093\/molbev\/msaa328","article-title":"Fast and robust identity-by-descent inference with the templated positional burrows\u2013wheeler transform","volume":"38","author":"Freyman","year":"2020","journal-title":"Mol. Biol. Evol"},{"key":"2023010107513317400_btac734-B11","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1147\/rd.312.0249","article-title":"Efficient randomized pattern-matching algorithms","volume":"31","author":"Karp","year":"1987","journal-title":"IBM J. Res. Dev"},{"key":"2023010107513317400_btac734-B12","doi-asserted-by":"crossref","first-page":"120","DOI":"10.1186\/s13059-018-1506-1","article-title":"Consumer genomics will change your life, whether you get tested or not","volume":"19","author":"Khan","year":"2018","journal-title":"Genome Biol"},{"key":"2023010107513317400_btac734-B14","doi-asserted-by":"crossref","first-page":"1443","DOI":"10.1038\/ng.3679","article-title":"Reference-based phasing using the haplotype reference consortium panel","volume":"48","author":"Loh","year":"2016","journal-title":"Nat. Genet"},{"key":"2023010107513317400_btac734-B15","doi-asserted-by":"crossref","first-page":"i233","DOI":"10.1093\/bioinformatics\/btz347","article-title":"Efficient haplotype matching between a query and a panel for genealogical search","volume":"35","author":"Naseri","year":"2019","journal-title":"Bioinformatics"},{"key":"2023010107513317400_btac734-B16","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1186\/s12859-019-2821-6","article-title":"Multi-allelic positional Burrows-Wheeler transform","volume":"20","author":"Naseri","year":"2019","journal-title":"BMC Bioinformatics"},{"key":"2023010107513317400_btac734-B17","doi-asserted-by":"crossref","first-page":"143","DOI":"10.1186\/s13059-019-1754-8","article-title":"RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts","volume":"20","author":"Naseri","year":"2019","journal-title":"Genome Biol"},{"key":"2023010107513317400_btac734-B18","first-page":"19:1","volume-title":"21st International Workshop on Algorithms in Bioinformatics (WABI 2021), Volume 201 of Leibniz International Proceedings in Informatics (LIPIcs)","author":"Naseri","year":"2021"},{"key":"2023010107513317400_btac734-B19","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1038\/nrg2986","article-title":"Genotype and SNP calling from next-generation sequencing data","volume":"12","author":"Nielsen","year":"2011","journal-title":"Nat. Rev. Genet"},{"key":"2023010107513317400_btac734-B20","doi-asserted-by":"crossref","first-page":"327","DOI":"10.1007\/s10897-011-9356-y","article-title":"Self diagnosis of lynch syndrome using direct to consumer genetic testing: a case study","volume":"20","author":"Roberts","year":"2011","journal-title":"J. Genet. Couns"},{"key":"2023010107513317400_btac734-B21","doi-asserted-by":"crossref","first-page":"e1009049","DOI":"10.1371\/journal.pgen.1009049","article-title":"Genotype imputation using the positional burrows wheeler transform","volume":"16","author":"Rubinacci","year":"2020","journal-title":"PLoS Genet"},{"key":"2023010107513317400_btac734-B22","doi-asserted-by":"crossref","first-page":"2390","DOI":"10.1093\/bioinformatics\/btab117","article-title":"d-PBWT: dynamic positional burrows\u2013wheeler transform","volume":"37","author":"Sanaullah","year":"2021","journal-title":"Bioinformatics. btab117"},{"key":"2023010107513317400_btac734-B24","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1093\/bioinformatics\/btz575","article-title":"Haplotype-aware graph indexes","volume":"36","author":"Sir\u00e9n","year":"2019","journal-title":"Bioinformatics"},{"key":"2023010107513317400_btac734-B25","doi-asserted-by":"crossref","first-page":"477","DOI":"10.1038\/nrg2361","article-title":"Linkage disequilibrium\u2014understanding the evolutionary past and mapping the medical future","volume":"9","author":"Slatkin","year":"2008","journal-title":"Nat. Rev. Genet"},{"key":"2023010107513317400_btac734-B26","doi-asserted-by":"crossref","first-page":"301","DOI":"10.1534\/genetics.112.148825","article-title":"Identity by descent: variation in meiosis, across genomes, and in populations","volume":"194","author":"Thompson","year":"2013","journal-title":"Genetics"},{"key":"2023010107513317400_btac734-B27","doi-asserted-by":"crossref","first-page":"426","DOI":"10.1016\/j.ajhg.2020.02.010","article-title":"A fast and simple method for detecting identity-by-descent segments in large-scale data","volume":"106","author":"Zhou","year":"2020","journal-title":"Am. J. Hum. Genet"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac734\/47762316\/btac734.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/1\/btac734\/48448727\/btac734.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/1\/btac734\/48448727\/btac734.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,3,14]],"date-time":"2023-03-14T05:14:19Z","timestamp":1678770859000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btac734\/6849513"}},"subtitle":[],"editor":[{"given":"Christina","family":"Kendziorski","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,11,28]]},"references-count":23,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2022,11,28]]},"published-print":{"date-parts":[[2023,1,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac734","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2022.01.31.478234","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023,1,1]]},"published":{"date-parts":[[2022,11,28]]},"article-number":"btac734"}}