{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T03:22:29Z","timestamp":1774408949226,"version":"3.50.1"},"reference-count":34,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T00:00:00Z","timestamp":1763424000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026,1,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>The first step when working with DNA data of human-derived microbiomes is to remove human contamination for two reasons. First, many countries have strict privacy and data protection guidelines for human sequence data, so microbiome data containing partly human data cannot be easily further processed or published. Second, human contamination may cause problems in downstream analysis, such as metagenomic binning or genome assembly. For large-scale metagenomics projects, fast and accurate removal of human contamination is therefore critical.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We introduce Cleanifier, a fast and memory frugal alignment-free tool for detecting and removing human contamination based on gapped k-mers, or spaced seeds. Cleanifier uses a pangenome index of known human gapped k-mers, and the creation and use of alternative references is also possible. Reads are classified and filtered according to their gapped k-mer content. Cleanifier supports two filtering modes: one that queries all gapped k-mers and one that queries only a sample of them. A comparison of Cleanifier with other state-of-the-art tools shows that the sampling mode makes Cleanifier the fastest method with comparable accuracy. When using a probabilistic Cuckoo filter to store the complete k-mer set, Cleanifier has similar memory requirements to methods that use a sampled minimizer index. At the same time, Cleanifier is more flexible, because it can use different sampling methods on the same index.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Cleanifier is available via gitlab (https:\/\/gitlab.com\/rahmannlab\/cleanifier), PyPi (https:\/\/pypi.org\/project\/cleanifier\/), and Bioconda (https:\/\/anaconda.org\/bioconda\/cleanifier). The pre-computed human pangenome index is available at Zenodo (https:\/\/doi.org\/10.5281\/zenodo.15639519).<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf632","type":"journal-article","created":{"date-parts":[[2025,11,16]],"date-time":"2025-11-16T12:57:49Z","timestamp":1763297869000},"source":"Crossref","is-referenced-by-count":3,"title":["Cleanifier: contamination removal from microbial sequences using spaced seeds of a human pangenome index"],"prefix":"10.1093","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9444-2755","authenticated-orcid":false,"given":"Jens","family":"Zentgraf","sequence":"first","affiliation":[{"name":"Algorithmic Bioinformatics, Saarland University , Saarbr\u00fccken 66123,","place":["Germany"]},{"name":"Saarbr\u00fccken Graduate School of Computer Science, Saarland Informatics Campus , Saarbr\u00fccken 66123,","place":["Germany"]},{"name":"Center for Bioinformatics Saar, Saarland Informatics Campus , Saarbr\u00fccken 66123,","place":["Germany"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-6377-2561","authenticated-orcid":false,"given":"Johanna Elena","family":"Schmitz","sequence":"additional","affiliation":[{"name":"Algorithmic Bioinformatics, Saarland University , Saarbr\u00fccken 66123,","place":["Germany"]},{"name":"Saarbr\u00fccken Graduate School of Computer Science, Saarland Informatics Campus , Saarbr\u00fccken 66123,","place":["Germany"]},{"name":"Center for Bioinformatics Saar, Saarland Informatics Campus , Saarbr\u00fccken 66123,","place":["Germany"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8536-6065","authenticated-orcid":false,"given":"Sven","family":"Rahmann","sequence":"additional","affiliation":[{"name":"Algorithmic Bioinformatics, Saarland University , Saarbr\u00fccken 66123,","place":["Germany"]},{"name":"Center for Bioinformatics Saar, Saarland Informatics Campus , Saarbr\u00fccken 66123,","place":["Germany"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2025,11,18]]},"reference":[{"key":"2026010213513282900_btaf632-B1","doi-asserted-by":"publisher","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","author":"Auton","year":"2015","journal-title":"Nature"},{"key":"2026010213513282900_btaf632-B2","doi-asserted-by":"publisher","first-page":"D1053","DOI":"10.1093\/nar\/gkac1011","article-title":"The IPD-IMGT\/HLA database","volume":"51","author":"Barker","year":"2023","journal-title":"Nucleic Acids Res"},{"key":"2026010213513282900_btaf632-B3","doi-asserted-by":"publisher","first-page":"198","DOI":"10.1186\/s13059-018-1568-0","article-title":"KrakenUniq: confident and fast metagenomics classification using unique k-mer counts","volume":"19","author":"Breitwieser","year":"2018","journal-title":"Genome Biol"},{"key":"2026010213513282900_btaf632-B4","doi-asserted-by":"publisher","first-page":"3584","DOI":"10.1093\/bioinformatics\/btv419","article-title":"Spaced seeds improve k-mer-based metagenomic classification","volume":"31","author":"B\u0159inda","year":"2015","journal-title":"Bioinformatics"},{"key":"2026010213513282900_btaf632-B5","doi-asserted-by":"publisher","first-page":"btad728","DOI":"10.1093\/bioinformatics\/btad728","article-title":"Hostile: accurate decontamination of microbial host sequences","volume":"39","author":"Constantinides","year":"2023","journal-title":"Bioinformatics"},{"key":"2026010213513282900_btaf632-B6","doi-asserted-by":"publisher","author":"Constantinides","year":"2025","DOI":"10.1101\/2025.06.09.658732"},{"key":"2026010213513282900_btaf632-B7","doi-asserted-by":"publisher","first-page":"D948","DOI":"10.1093\/nar\/gkae1071","article-title":"Ensembl 2025","volume":"53","author":"Dyer","year":"2025","journal-title":"Nucleic Acids Res"},{"key":"2026010213513282900_btaf632-B8","author":"Fritz","year":"2021"},{"key":"2026010213513282900_btaf632-B9","doi-asserted-by":"publisher","first-page":"giaf004","DOI":"10.1093\/gigascience\/giaf004","article-title":"Benchmarking short-read metagenomics tools for removing host contamination","volume":"14","author":"Gao","year":"2025","journal-title":"GigaScience"},{"key":"2026010213513282900_btaf632-B10","doi-asserted-by":"publisher","first-page":"1","DOI":"10.4230\/LIPICS.SEA.2025.20","author":"Groot Koerkamp","year":"2025"},{"key":"2026010213513282900_btaf632-B11","doi-asserted-by":"publisher","first-page":"825","DOI":"10.1038\/s41467-025-56077-5","article-title":"Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data","volume":"16","author":"Guccione","year":"2025","journal-title":"Nat Commun"},{"key":"2026010213513282900_btaf632-B12","doi-asserted-by":"publisher","first-page":"giae010","DOI":"10.1093\/gigascience\/giae010","article-title":"Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data","volume":"13","author":"Hall","year":"2024","journal-title":"GigaScience"},{"key":"2026010213513282900_btaf632-B13","doi-asserted-by":"publisher","first-page":"207","DOI":"10.1038\/nature11234","article-title":"Structure, function and diversity of the healthy human microbiome","volume":"486","author":"Huttenhower","year":"2012","journal-title":"Nature"},{"key":"2026010213513282900_btaf632-B14","doi-asserted-by":"publisher","first-page":"270","DOI":"10.1186\/s13059-021-02490-0","article-title":"STAT: a fast, scalable, MinHash-based k-mer tool to assess sequence read archive next-generation sequence submissions","volume":"22","author":"Katz","year":"2021","journal-title":"Genome Biol"},{"key":"2026010213513282900_btaf632-B15","doi-asserted-by":"publisher","first-page":"D54","DOI":"10.1093\/nar\/gkr854","article-title":"The sequence read archive: explosive growth of sequencing data","volume":"40","author":"Kodama","year":"2012","journal-title":"Nucleic Acids Res"},{"key":"2026010213513282900_btaf632-B16","doi-asserted-by":"publisher","first-page":"7:1","DOI":"10.1145\/2833157.2833162","author":"Lam","year":"2015"},{"key":"2026010213513282900_btaf632-B17","doi-asserted-by":"publisher","first-page":"357","DOI":"10.1038\/nmeth.1923","article-title":"Fast gapped-read alignment with Bowtie 2","volume":"9","author":"Langmead","year":"2012","journal-title":"Nat Methods"},{"key":"2026010213513282900_btaf632-B18","doi-asserted-by":"publisher","first-page":"3094","DOI":"10.1093\/bioinformatics\/bty191","article-title":"Minimap2: pairwise alignment for nucleotide sequences","volume":"34","author":"Li","year":"2018","journal-title":"Bioinformatics"},{"key":"2026010213513282900_btaf632-B19","doi-asserted-by":"publisher","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with Burrows\u2013Wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2026010213513282900_btaf632-B20","doi-asserted-by":"publisher","first-page":"312","DOI":"10.1038\/s41586-023-05896-x","article-title":"A draft human pangenome reference","volume":"617","author":"Liao","year":"2023","journal-title":"Nature"},{"key":"2026010213513282900_btaf632-B21","doi-asserted-by":"publisher","first-page":"2815","DOI":"10.1038\/s41596-022-00738-y","article-title":"Metagenome analysis using the Kraken software suite","volume":"17","author":"Lu","year":"2022","journal-title":"Nat Protoc"},{"key":"2026010213513282900_btaf632-B22","doi-asserted-by":"publisher","first-page":"429","DOI":"10.1038\/s41592-022-01431-4","article-title":"Critical assessment of metagenome interpretation: the second round of challenges","volume":"19","author":"Meyer","year":"2022","journal-title":"Nat Methods"},{"key":"2026010213513282900_btaf632-B23","doi-asserted-by":"publisher","first-page":"W102","DOI":"10.1093\/nar\/gkaf369","article-title":"CAMI benchmarking portal: online evaluation and ranking of metagenomic software","volume":"53","author":"Meyer","year":"2025","journal-title":"Nucleic Acids Res"},{"key":"2026010213513282900_btaf632-B24","doi-asserted-by":"publisher","first-page":"33","DOI":"10.12688\/f1000research.29032.2","article-title":"Sustainable data analysis with Snakemake [version 2; peer review: 2 approved]","volume":"10","author":"M\u00f6lder","year":"2021","journal-title":"F1000Res"},{"key":"2026010213513282900_btaf632-B25","doi-asserted-by":"publisher","first-page":"44","DOI":"10.1126\/science.abj6987","article-title":"The complete sequence of a human genome","volume":"376","author":"Nurk","year":"2022","journal-title":"Science"},{"key":"2026010213513282900_btaf632-B26","doi-asserted-by":"publisher","first-page":"1277","DOI":"10.3389\/fmicb.2019.01277","article-title":"Impact of host DNA and sequencing depth on the taxonomic resolution of whole metagenome sequencing for microbiome analysis","volume":"10","author":"Pereira-Marques","year":"2019","journal-title":"Front Microbiol"},{"key":"2026010213513282900_btaf632-B27","doi-asserted-by":"publisher","first-page":"641","DOI":"10.1038\/s41586-019-1238-8","article-title":"The integrative human microbiome project","volume":"569","author":"Proctor","year":"2019","journal-title":"Nature"},{"key":"2026010213513282900_btaf632-B28","doi-asserted-by":"publisher","author":"Schmitz","year":"2025","DOI":"10.48550\/arXiv.2505.05847"},{"key":"2026010213513282900_btaf632-B29","doi-asserted-by":"publisher","first-page":"1079","DOI":"10.1038\/s41564-023-01381-3","article-title":"Reconstruction of the personal information from human genome reads in gut metagenome sequencing data","volume":"8","author":"Tomofuji","year":"2023","journal-title":"Nat Microbiol"},{"key":"2026010213513282900_btaf632-B30","doi-asserted-by":"publisher","first-page":"24:1","DOI":"10.1145\/3589558","article-title":"Load thresholds for Cuckoo hashing with overlapping blocks","volume":"19","author":"Walzer","year":"2023","journal-title":"ACM Trans Algorithms"},{"key":"2026010213513282900_btaf632-B31","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1186\/s13059-019-1891-0","article-title":"Improved metagenomic analysis with Kraken 2","volume":"20","author":"Wood","year":"2019","journal-title":"Genome Biol"},{"key":"2026010213513282900_btaf632-B32","doi-asserted-by":"publisher","first-page":"12:1","DOI":"10.4230\/LIPIcs.WABI.2022.12","author":"Zentgraf","year":"2022"},{"key":"2026010213513282900_btaf632-B33","doi-asserted-by":"publisher","first-page":"22:1","DOI":"10.4230\/LIPIcs.WABI.2025.22","volume-title":"25th International Conference on Algorithms for Bioinformatics (WABI 2025), Volume 344 of Leibniz International Proceedings in Informatics (LIPIcs)","author":"Zentgraf","year":"2025"},{"key":"2026010213513282900_btaf632-B34","doi-asserted-by":"publisher","first-page":"160025","DOI":"10.1038\/sdata.2016.25","article-title":"Extensive sequencing of seven human genomes to characterize benchmark reference materials","volume":"3","author":"Zook","year":"2016","journal-title":"Sci Data"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaf632\/65372192\/btaf632.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/1\/btaf632\/65372192\/btaf632.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/1\/btaf632\/65372192\/btaf632.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,2]],"date-time":"2026-01-02T18:51:40Z","timestamp":1767379900000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btaf632\/8326858"}},"subtitle":[],"editor":[{"given":"Inanc","family":"Birol","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2025,11,18]]},"references-count":34,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,2]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf632","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2026,1]]},"published":{"date-parts":[[2025,11,18]]},"article-number":"btaf632"}}