{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T05:50:42Z","timestamp":1778133042511,"version":"3.51.4"},"reference-count":32,"publisher":"Oxford University Press (OUP)","issue":"15","license":[{"start":{"date-parts":[[2022,6,24]],"date-time":"2022-06-24T00:00:00Z","timestamp":1656028800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"School of Management and Engineering Vaud","award":["SNSF-PP00P3_176977"],"award-info":[{"award-number":["SNSF-PP00P3_176977"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,8,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Generation of genotype data has been growing exponentially over the last decade. With the large size of recent datasets comes a storage and computational burden with ever increasing costs. 
To reduce this burden, we propose XSI, a file format with reduced storage footprint that also allows computation on the compressed data and we show how this can improve future analyses.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>We show that xSqueezeIt (XSI) allows for a file size reduction of 4-20\u00d7 compared with compressed BCF and demonstrate its potential for \u2018compressive genomics\u2019 on the UK Biobank whole-genome sequencing genotypes with 8\u00d7 faster loading times, 5\u00d7 faster run of homozygosity computation, 30\u00d7 faster dot products computation and 280\u00d7 faster allele counts.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>The XSI file format specifications, API and command line tool are released under open-source (MIT) license and are available at https:\/\/github.com\/rwk-unil\/xSqueezeIt<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac413","type":"journal-article","created":{"date-parts":[[2022,6,24]],"date-time":"2022-06-24T13:30:32Z","timestamp":1656077432000},"page":"3778-3784","source":"Crossref","is-referenced-by-count":22,"title":["XSI\u2014a genotype compression tool for compressive genomics in large biobanks"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4114-6615","authenticated-orcid":false,"given":"Rick","family":"Wertenbroek","sequence":"first","affiliation":[{"name":"School of Management and Engineering Vaud (HEIG-VD), HES-SO University of Applied Sciences and Arts Western Switzerland , Yverdon-les-Bains 1401, Switzerland"},{"name":"Department of Computational Biology, 
University of Lausanne , Lausanne 1015, Switzerland"}]},{"given":"Ioannis","family":"Xenarios","sequence":"additional","affiliation":[{"name":"Department of Computational Biology, University of Lausanne , Lausanne 1015, Switzerland"}]},{"given":"Yann","family":"Thoma","sequence":"additional","affiliation":[{"name":"School of Management and Engineering Vaud (HEIG-VD), HES-SO University of Applied Sciences and Arts Western Switzerland , Yverdon-les-Bains 1401, Switzerland"}]},{"given":"Olivier","family":"Delaneau","sequence":"additional","affiliation":[{"name":"Department of Computational Biology, University of Lausanne , Lausanne 1015, Switzerland"}]}],"member":"286","published-online":{"date-parts":[[2022,6,24]]},"reference":[{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"72","DOI":"10.1145\/2957324","article-title":"Computational biology in the 21st century: Scaling with compressive algorithms","volume":"59","author":"Berger","year":"2016","journal-title":"Commun. 
ACM"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1038\/s41586-018-0579-z","article-title":"The UK biobank resource with deep phenotyping and genomic data","volume":"562","author":"Bycroft","year":"2018","journal-title":"Nature"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"s13742","DOI":"10.1186\/s13742-015-0047-8","article-title":"Second-generation PLINK: rising to the challenge of larger and richer datasets","volume":"4","author":"Chang","year":"2015","journal-title":"Gigascience"},{"key":"2023041405345524600_","author":"Collet","year":"2018"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"2156","DOI":"10.1093\/bioinformatics\/btr330","article-title":"The variant call format and VCFtools","volume":"27","author":"Danecek","year":"2011","journal-title":"Bioinformatics"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"1834","DOI":"10.1093\/bioinformatics\/bty023","article-title":"GTC: How to maintain huge genotype collections in a compressed form","volume":"34","author":"Danek","year":"2018","journal-title":"Bioinformatics"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41467-019-13225-y","article-title":"Accurate, scalable and integrative haplotype estimation","volume":"10","author":"Delaneau","year":"2019","journal-title":"Nat. 
Commun"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"4791","DOI":"10.1093\/bioinformatics\/btz508","article-title":"GTShark: Genotype compression in large projects","volume":"35","author":"Deorowicz","year":"2019","journal-title":"Bioinformatics"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"2572","DOI":"10.1093\/bioinformatics\/btt460","article-title":"Genome compression: A novel approach for large collections","volume":"29","author":"Deorowicz","year":"2013","journal-title":"Bioinformatics"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"1266","DOI":"10.1093\/bioinformatics\/btu014","article-title":"Efficient haplotype matching and storage using the positional burrows\u2013wheeler transform (PBWT)","volume":"30","author":"Durbin","year":"2014","journal-title":"Bioinformatics"},{"key":"2023041405345524600_","article-title":"Variant interpretation using population databases: Lessons from gnomAD","author":"Gudmundsson","year":"2021","journal-title":"Hum Mutat"},{"key":"2023041405345524600_","author":"Halldorsson","year":"2021"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"434","DOI":"10.1038\/s41586-020-2308-7","article-title":"The mutational constraint spectrum quantified from variation in 141,456 humans","volume":"581","author":"Karczewski","year":"2020","journal-title":"Nature"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1007\/978-1-0716-0199-0_9","volume-title":"Statistical Population Genomics","author":"Kelleher","year":"2020"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1038\/nmeth.3654","article-title":"Efficient genotype compression and analysis of large genetic-variation data sets","volume":"13","author":"Layer","year":"2016","journal-title":"Nat. 
Methods"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"4248","DOI":"10.1093\/bioinformatics\/btab378","article-title":"Sparse allele vectors and the savvy software suite","volume":"37","author":"LeFaive","year":"2021","journal-title":"Bioinformatics"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1093\/bioinformatics\/btv613","article-title":"BGT: Efficient and flexible genotype query across many samples","volume":"32","author":"Li","year":"2016","journal-title":"Bioinformatics"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"627","DOI":"10.1038\/nbt.2241","article-title":"Compressive genomics","volume":"30","author":"Loh","year":"2012","journal-title":"Nat. Biotechnol"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1002\/9781119487845.ch3","article-title":"Haplotype estimation and genotype imputation","author":"Marchini","year":"2019","journal-title":"Handbook of Statistical Genomics: Two Volume Set"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"1097","DOI":"10.1038\/s41588-021-00870-7","article-title":"Computationally efficient whole-genome regression for quantitative and binary traits","volume":"53","author":"Mbatchou","year":"2021","journal-title":"Nat. Genet"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"1279","DOI":"10.1038\/ng.3643","article-title":"A reference panel of 64,976 haplotypes for genotype imputation","volume":"48","author":"McCarthy","year":"2016","journal-title":"Nat. 
Genet"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"597","DOI":"10.1002\/9781119487845.ch21","article-title":"Genome-wide association studies","author":"Morris","year":"2019","journal-title":"Handbook of Statistical Genomics: Two Volume Set"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41467-020-19588-x","article-title":"Identity-by-descent detection across 487,409 British samples reveals fine scale population structure and ultra-rare variant associations","volume":"11","author":"Nait Saada","year":"2020","journal-title":"Nat. Commun"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"1749","DOI":"10.1093\/bioinformatics\/btw044","article-title":"BCFtools\/RoH: A hidden Markov model approach for detecting autozygosity from next-generation sequencing data","volume":"32","author":"Narasimhan","year":"2016","journal-title":"Bioinformatics"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"e1001779","DOI":"10.1371\/journal.pmed.1001779","article-title":"UK biobank: An open access resource for identifying the causes of a wide range of complex diseases of Middle and old age","volume":"12","author":"Sudlow","year":"2015","journal-title":"PLoS Med"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"290","DOI":"10.1038\/s41586-021-03205-y","article-title":"Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program","volume":"590","author":"Taliun","year":"2021","journal-title":"Nature"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"i479","DOI":"10.1093\/bioinformatics\/btw437","article-title":"GTRAC: Fast retrieval from compressed collections of genomic variants","volume":"32","author":"Tatwawadi","year":"2016","journal-title":"Bioinformatics"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human 
genetic variation","volume":"526","year":"2015","journal-title":"Nature"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1016\/j.ajhg.2017.06.005","article-title":"10 years of GWAS discovery: Biology, function, and translation","volume":"101","author":"Visscher","year":"2017","journal-title":"Am. J. Hum. Genet"},{"key":"2023041405345524600_","author":"Wu","year":"2001"},{"key":"2023041405345524600_","author":"Wu","year":"2001"},{"key":"2023041405345524600_","doi-asserted-by":"crossref","first-page":"337","DOI":"10.1109\/TIT.1977.1055714","article-title":"A universal algorithm for sequential data compression","volume":"23","author":"Ziv","year":"1977","journal-title":"IEEE Trans. Inform. Theory"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac413\/44372472\/btac413.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/15\/3778\/49883953\/btac413.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/15\/3778\/49883953\/btac413.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,23]],"date-time":"2023-11-23T18:14:25Z","timestamp":1700763265000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/15\/3778\/6617346"}},"subtitle":[],"editor":[{"given":"Russell","family":"Schwartz","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,6,24]]},"references-count":32,"journal-issue":{"issue":"15","published-print":{"date-parts":[[2022,8,2]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac413","re
lation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,8,1]]},"published":{"date-parts":[[2022,6,24]]}}}