{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:10Z","timestamp":1772138050491,"version":"3.50.1"},"reference-count":43,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2022,2,4]],"date-time":"2022-02-04T00:00:00Z","timestamp":1643932800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,3,28]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state-of-the-art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2\u2009bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04\u2009bits per character in some variants. And we provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Our C\/C++ code is available at https:\/\/github.com\/odenas\/indexed_ms under GPL-3.0. The data underlying this article are available in NCBI Genome at https:\/\/www.ncbi.nlm.nih.gov\/genome and in the International Genome Sample Resource (IGSR) at https:\/\/www.internationalgenome.org.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac064","type":"journal-article","created":{"date-parts":[[2022,1,31]],"date-time":"2022-01-31T15:13:34Z","timestamp":1643642014000},"page":"1838-1845","source":"Crossref","is-referenced-by-count":3,"title":["Fast and compact matching statistics analytics"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0282-5738","authenticated-orcid":false,"given":"Fabio","family":"Cunial","sequence":"first","affiliation":[{"name":"Max Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG and CSBD) , Dresden 01307, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Olgert","family":"Denas","sequence":"additional","affiliation":[{"name":"Blue River Technology , Sunnyvale, CA 94086, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Djamal","family":"Belazzougui","sequence":"additional","affiliation":[{"name":"CAPA, DTISI, Centre de Recherche sur l\u2019Information Scientifique et Techique , Algiers, Algeria"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2022,2,4]]},"reference":[{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"102696","DOI":"10.1016\/j.isci.2021.102696","article-title":"Pan-genomic matching statistics for targeted Nanopore sequencing","volume":"24","author":"Ahmed","year":"2021","journal-title":"iScience"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1016\/j.tcs.2016.01.023","article-title":"Sequence similarity measures based on bounded Hamming distance","volume":"638","author":"Apostolico","year":"2016","journal-title":"Theor. Comput. Sci"},{"key":"2023030120002862400_","first-page":"179","volume-title":"International Symposium on String Processing and Information Retrieval","author":"Belazzougui","year":"2014"},{"key":"2023030120002862400_","volume-title":"Proceedings of the 17th International Symposium on Experimental Algorithms (SEA 2018)","author":"Belazzougui","year":"2018"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"46","DOI":"10.1137\/1.9781611976472.4","volume-title":"2021 Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX)","author":"Boffa","year":"2021"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1109\/DCC50243.2021.00027","volume-title":"2021 Data Compression Conference (DCC)","author":"Boucher","year":"2021"},{"key":"2023030120002862400_","first-page":"97","article-title":"Some investigations on similarity measures based on absent words","volume":"171","author":"Castiglione","year":"2019","journal-title":"Fund. Inform"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"945","DOI":"10.1089\/cmb.2012.0122","article-title":"Detecting phylogenetic signals in eukaryotic whole genome sequences","volume":"19","author":"Cohen","year":"2012","journal-title":"J. Comput. Biol"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"4607","DOI":"10.1093\/bioinformatics\/btz268","article-title":"A framework for space-efficient variable-order Markov models","volume":"35","author":"Cunial","year":"2019","journal-title":"Bioinformatics"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"3221","DOI":"10.1093\/bioinformatics\/btp590","article-title":"Efficient estimation of pairwise distances between genomes","volume":"25","author":"Domazet-Loso","year":"2009","journal-title":"Bioinformatics"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"230","DOI":"10.4161\/mge.1.3.18065","article-title":"Alignment-free detection of horizontal gene transfer between closely related bacterial genomes","volume":"1","author":"Domazet-Lo\u0161o","year":"2011","journal-title":"Mobile Genet. Elem"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"1466","DOI":"10.1093\/bioinformatics\/btr176","article-title":"Alignment-free detection of local similarity among viral and bacterial genomes","volume":"27","author":"Domazet-Lo\u0161o","year":"2011","journal-title":"Bioinformatics"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1016\/0166-218X(88)90076-5","article-title":"A new distance metric on strings computable in linear time","volume":"20","author":"Ehrenfeucht","year":"1988","journal-title":"Discrete Appl. Math"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"252","DOI":"10.1038\/s41586-020-2873-9","article-title":"Dense sampling of bird diversity increases power of comparative genomics","volume":"587","author":"Feng","year":"2020","journal-title":"Nature"},{"key":"2023030120002862400_","first-page":"26:1","volume-title":"27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)","author":"Fischer","year":"2016"},{"key":"2023030120002862400_","author":"Formenti","year":"2020"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"153","DOI":"10.1007\/978-3-030-00479-8_13","article-title":"The colored longest common prefix array computed via sequential scans","volume-title":"International Symposium on String Processing and Information Retrieval","author":"Garofalo","year":"2018"},{"key":"2023030120002862400_","first-page":"326","article-title":"From theory to practice: plug and play with succinct data structures","volume-title":"13th International Symposium on Experimental Algorithms (SEA 2014)","author":"Gog","year":"2014"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"883","DOI":"10.1534\/g3.112.002527","article-title":"Alignment-free population genomics: an efficient estimator of sequence diversity","volume":"2","author":"Haubold","year":"2012","journal-title":"G3"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"541","DOI":"10.1186\/1471-2105-7-541","article-title":"How repetitive are genomes?","volume":"7","author":"Haubold","year":"2006","journal-title":"BMC Bioinform"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1186\/1471-2105-6-123","article-title":"Genome comparison without alignment using shortest unique substrings","volume":"6","author":"Haubold","year":"2005","journal-title":"BMC Bioinform"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"1487","DOI":"10.1089\/cmb.2009.0106","article-title":"Estimating mutation distances from unaligned genomes","volume":"16","author":"Haubold","year":"2009","journal-title":"J. Comput. Biol"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"449","DOI":"10.1093\/bioinformatics\/btq689","article-title":"Alignment-free estimation of nucleotide diversity","volume":"27","author":"Haubold","year":"2011","journal-title":"Bioinformatics"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"3121","DOI":"10.1093\/bioinformatics\/btt550","article-title":"An alignment-free test for recombination","volume":"29","author":"Haubold","year":"2013","journal-title":"Bioinformatics"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"giz159","DOI":"10.1093\/gigascience\/giz159","article-title":"A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers","volume":"9","author":"Hecker","year":"2020","journal-title":"GigaScience"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"578","DOI":"10.1038\/s41586-020-2486-3","article-title":"Six reference-quality genomes reveal evolution of bat adaptations","volume":"583","author":"Jebb","year":"2020","journal-title":"Nature"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"2000","DOI":"10.1093\/bioinformatics\/btu331","article-title":"Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison","volume":"30","author":"Leimeister","year":"2014","journal-title":"Bioinformatics"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2601073","article-title":"Fully functional static and dynamic succinct trees","volume":"10","author":"Navarro","year":"2014","journal-title":"ACM Trans. Algor"},{"key":"2023030120002862400_","first-page":"347","article-title":"Computing matching statistics and maximal exact matches on compressed full-text indexes","volume-title":"SPIRE","author":"Ohlebusch","year":"2010"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1186\/s13015-016-0072-x","article-title":"MissMax: alignment-free sequence comparison with mismatches through filtering and heuristics","volume":"11","author":"Pizzi","year":"2016","journal-title":"Algor. Mol. Biol"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"901","DOI":"10.1101\/gr.228718.117","article-title":"Detection and analysis of ancient segmental duplications in mammalian genomes","volume":"28","author":"Pu","year":"2018","journal-title":"Genome Res"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1145\/1290672.1290680","article-title":"Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets","volume":"3","author":"Raman","year":"2007","journal-title":"ACM Trans. Algor"},{"key":"2023030120002862400_","author":"Rhie","year":"2020"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"589","DOI":"10.1007\/s00224-006-1198-x","article-title":"Compressed suffix trees with full functionality","volume":"41","author":"Sadakane","year":"2007","journal-title":"Theory Comput. Syst"},{"key":"2023030120002862400_","first-page":"134","article-title":"Fully-functional succinct trees","volume-title":"Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms","author":"Sadakane","year":"2010"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"240","DOI":"10.1038\/s41586-020-2876-6","article-title":"A comparative genomics multitool for scientific discovery and conservation","volume":"587","author":"Serres Armero","year":"2020","journal-title":"Nature"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1007\/978-3-642-03784-9_7","article-title":"Compressed suffix arrays for massive data","volume-title":"International Symposium on String Processing and Information Retrieval","author":"Sir\u00e9n","year":"2009"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1146\/annurev-animal-022516-022811","article-title":"Bat biology, genomes, and the Bat1K project: to generate chromosome-level genomes for all living bat species","volume":"6","author":"Teeling","year":"2018","journal-title":"Annu. Rev. Anim. Biosci"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"472","DOI":"10.1089\/cmb.2015.0235","article-title":"A provably efficient algorithm for the k-mismatch average common substring problem","volume":"23","author":"Thankachan","year":"2016","journal-title":"J. Comput. Biol"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"238","DOI":"10.1186\/s12859-017-1658-0","article-title":"A greedy alignment-free distance estimator for phylogenetic inference","volume":"18","author":"Thankachan","year":"2017","journal-title":"BMC Bioinform"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1016\/0304-3975(92)90143-4","article-title":"Approximate string-matching with q-grams and maximal matches","volume":"92","author":"Ukkonen","year":"1992","journal-title":"Theor. Comput. Sci"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"336","DOI":"10.1089\/cmb.2006.13.336","article-title":"The average common substring approach to phylogenomic reconstruction","volume":"13","author":"Ulitsky","year":"2006","journal-title":"J. Comput. Biol"},{"key":"2023030120002862400_","doi-asserted-by":"crossref","first-page":"1311","DOI":"10.1126\/science.1251385","article-title":"Comparative genomics reveals insights into avian genome evolution and adaptation","volume":"346","author":"Zhang","year":"2014","journal-title":"Science"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac064\/43015240\/btac064.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/7\/1838\/49384054\/btac064.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/7\/1838\/49384054\/btac064.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,16]],"date-time":"2023-11-16T11:28:32Z","timestamp":1700134112000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/7\/1838\/6522115"}},"subtitle":[],"editor":[{"given":"Yann","family":"Ponty","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2022,2,4]]},"references-count":43,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2022,3,28]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac064","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2021.10.05.463202","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,4,1]]},"published":{"date-parts":[[2022,2,4]]}}}