{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T14:19:14Z","timestamp":1754144354428,"version":"3.41.2"},"reference-count":18,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T00:00:00Z","timestamp":1752537600000},"content-version":"vor","delay-in-days":14,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"French Agence Nationale de la Recherche"},{"name":"INSSANE","award":["ANR21-CE45-0034"],"award-info":[{"award-number":["ANR21-CE45-0034"]}]},{"DOI":"10.13039\/100016842","name":"Institut Fran\u00e7ais de Bioinformatique","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100016842","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Programme d\u2019Investissements d\u2019Avenir","award":["ANR-11-INBS-0013","ANR-21-ESRE-0048"],"award-info":[{"award-number":["ANR-11-INBS-0013","ANR-21-ESRE-0048"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Recent viral outbreaks motivate the systematic collection of pathogenic genomes in order to accelerate their study and monitor the apparition\/spread of variants. Due to their limited length and temporal proximity of their sequencing, viral genomes are usually organized, and analyzed as oversized Multiple Sequence Alignments (MSAs). Such MSAs are largely ungapped, and mostly homogeneous on a column-wise level but not at a sequential level due to local variations, hindering the performances of sequential compression algorithms.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>In order to enable an efficient handling of MSAs, including subsequent statistical analyses, we introduce CREMSA (Column-wise Run-length Encoding for MSAs), a new index that builds on sparse bitvector representations to compress an existing or streamed MSA, all the while allowing for an expressive set of accelerated requests to query the alignment without prior decompression. Using CREMSA, a 65 GB MSA consisting of 1.9M SARS-CoV 2 genomes could be compressed into 22 MB using less than half a gigabyte of main memory, while executing access requests in the order of 100\u2009ns. Such a speed up enables a comprehensive analysis of covariation over this very large MSA. We further assess the impact of the sequence ordering on the compressibility of MSAs and propose a resorting strategy that, despite the proven NP-hardness of an optimal sort, induces greatly increased compression ratios at a marginal computational cost.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>CREMSA is freely accessible at https:\/\/gitlab.univ-lille.fr\/cremsa\/cremsa. The Snakemake workflow for the benchmarks is available at: https:\/\/gitlab.univ-lille.fr\/cremsa\/bench. The data used in the paper is on Zenodo at https:\/\/zenodo.org\/records\/14698859 and https:\/\/zenodo.org\/records\/15100011.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf211","type":"journal-article","created":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T13:02:22Z","timestamp":1752584542000},"page":"i246-i254","source":"Crossref","is-referenced-by-count":0,"title":["<tt>CREMSA<\/tt>: compressed indexing of (ultra) large multiple sequence alignments"],"prefix":"10.1093","volume":"41","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1166-1720","authenticated-orcid":false,"given":"Mika\u00ebl","family":"Salson","sequence":"first","affiliation":[{"name":"Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL , F-59000 Lille,","place":["France"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Arthur","family":"Boddaert","sequence":"additional","affiliation":[{"name":"D\u00e9partement d\u2019Informatique, Lille University , F-59000 Lille,","place":["France"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Awa Bousso","family":"Gueye","sequence":"additional","affiliation":[{"name":"D\u00e9partement d\u2019Informatique, Lille University , F-59000 Lille,","place":["France"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Laurent","family":"Bulteau","sequence":"additional","affiliation":[{"name":"CNRS UMR 8049 LIGM, Gustave Eiffel University , F-77420 Marne-la-Vall\u00e9e,","place":["France"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yohan","family":"Hernandez--Courbevoie","sequence":"additional","affiliation":[{"name":"D\u00e9partement d\u2019Informatique, Lille University , F-59000 Lille,","place":["France"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Camille","family":"Marchet","sequence":"additional","affiliation":[{"name":"Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL , F-59000 Lille,","place":["France"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nan","family":"Pan","sequence":"additional","affiliation":[{"name":"CNRS UMR 7161 LIX, Ecole Polytechnique, Institut Polytechnique de Paris , F-91120 Palaiseau,","place":["France"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2376-9205","authenticated-orcid":false,"given":"Sebastian","family":"Will","sequence":"additional","affiliation":[{"name":"CNRS UMR 7161 LIX, Ecole Polytechnique, Institut Polytechnique de Paris , F-91120 Palaiseau,","place":["France"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7615-3930","authenticated-orcid":false,"given":"Yann","family":"Ponty","sequence":"additional","affiliation":[{"name":"CNRS UMR 7161 LIX, Ecole Polytechnique, Institut Polytechnique de Paris , F-91120 Palaiseau,","place":["France"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2025,7,15]]},"reference":[{"key":"2025071509021559800_btaf211-B1","doi-asserted-by":"publisher","first-page":"474","DOI":"10.1186\/1471-2105-9-474","article-title":"RNAalifold: improved consensus structure prediction for RNA alignments","volume":"9","author":"Bernhart","year":"2008","journal-title":"BMC Bioinformatics"},{"volume-title":"Nat Methods","year":"2025","author":"B\u0159inda","key":"2025071509021559800_btaf211-B2"},{"key":"2025071509021559800_btaf211-B3","doi-asserted-by":"publisher","first-page":"227","DOI":"10.1093\/bioinformatics\/bty619","article-title":"CoMSA: compression of protein multiple sequence alignment files","volume":"35","author":"Deorowicz","year":"2019","journal-title":"Bioinformatics"},{"key":"2025071509021559800_btaf211-B4","doi-asserted-by":"crossref","first-page":"1266","DOI":"10.1093\/bioinformatics\/btu014","article-title":"Efficient haplotype matching and storage using the positional burrows\u2013wheeler transform (pbwt)","volume":"30","author":"Durbin","year":"2014","journal-title":"Bioinformatics"},{"key":"2025071509021559800_btaf211-B5","doi-asserted-by":"crossref","first-page":"1575","DOI":"10.1093\/bioinformatics\/btp117","article-title":"Textual data compression in computational biology: a synopsis","volume":"25","author":"Giancarlo","year":"2009","journal-title":"Bioinformatics"},{"key":"2025071509021559800_btaf211-B6","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1128\/mr.58.1.10-26.1994","article-title":"Lessons from an evolving rRNA: 16S and 23S rRNA structures from a comparative perspective","volume":"58","author":"Gutell","year":"1994","journal-title":"Microbiol Rev"},{"year":"2024","author":"Hunt","key":"2025071509021559800_btaf211-B7","doi-asserted-by":"publisher","DOI":"10.1101\/2024.04.29.591666"},{"year":"2025","author":"Los Alamos National Laboratory","key":"2025071509021559800_btaf211-B8"},{"key":"2025071509021559800_btaf211-B9","doi-asserted-by":"publisher","first-page":"12436","DOI":"10.1093\/nar\/gkaa1053","article-title":"Genome-wide mapping of sars-cov-2 RNA structures identifies therapeutically-relevant elements","volume":"48","author":"Manfredonia","year":"2020","journal-title":"Nucleic Acids Res"},{"key":"2025071509021559800_btaf211-B10","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1145\/1216370.1216372","article-title":"Compressed full-text indexes","volume":"39","author":"Navarro","year":"2007","journal-title":"ACM Comput Surv"},{"year":"2025","author":"Pfam","key":"2025071509021559800_btaf211-B11"},{"year":"2017","author":"Prezza","key":"2025071509021559800_btaf211-B12"},{"key":"2025071509021559800_btaf211-B13","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1145\/1290672.1290680","article-title":"Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets","volume":"3","author":"Raman","year":"2007","journal-title":"ACM Trans Algorithms"},{"key":"2025071509021559800_btaf211-B14","doi-asserted-by":"crossref","first-page":"1403","DOI":"10.1038\/s41564-020-0770-5","article-title":"A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology","volume":"5","author":"Rambaut","year":"2020","journal-title":"Nat Microbiol"},{"key":"2025071509021559800_btaf211-B15","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1038\/nmeth.4066","article-title":"A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs","volume":"14","author":"Rivas","year":"2017","journal-title":"Nat Methods"},{"key":"2025071509021559800_btaf211-B16","doi-asserted-by":"crossref","first-page":"msac166","DOI":"10.1093\/molbev\/msac166","article-title":"Halign 3: fast multiple alignment of ultra-large numbers of similar dna\/rna sequences","volume":"39","author":"Tang","year":"2022","journal-title":"Mol Biol Evol"},{"key":"2025071509021559800_btaf211-B17","doi-asserted-by":"publisher","first-page":"15145","DOI":"10.1038\/s41598-024-62897-0","article-title":"Comprehensive survey of conserved RNA secondary structures in full-genome alignment of hepatitis c virus","volume":"14","author":"Triebel","year":"2024","journal-title":"Sci Rep"},{"key":"2025071509021559800_btaf211-B18","doi-asserted-by":"publisher","first-page":"397","DOI":"10.1111\/j.1469-1809.1972.tb00293.x","article-title":"The log likelihood ratio test (the G-test)","volume":"21","author":"Woolf","year":"1957","journal-title":"Ann. Hum. Genet"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/Supplement_1\/i246\/63745245\/btaf211.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/Supplement_1\/i246\/63745245\/btaf211.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,15]],"date-time":"2025-07-15T13:02:23Z","timestamp":1752584543000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/41\/Supplement_1\/i246\/8199345"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,1]]},"references-count":18,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2025,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf211","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2025,7]]},"published":{"date-parts":[[2025,7,1]]}}}