{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,13]],"date-time":"2025-09-13T16:23:09Z","timestamp":1757780589261,"version":"3.37.3"},"reference-count":41,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2020,10,29]],"date-time":"2020-10-29T00:00:00Z","timestamp":1603929600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000002","name":"US National Institutes of Health","doi-asserted-by":"publisher","award":["R01GM120624","1R01GM131407"],"award-info":[{"award-number":["R01GM120624","1R01GM131407"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,5,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into understanding the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses, plasmids, etc. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all the k-mers as features for prediction without selecting relevant k-mers for the different groups of sequences, i.e. 
unique nucleotide patterns containing biological significance.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>To select k-mers for distinguishing different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework based on model-X Knockoffs regarded as the state-of-the-art statistical method for FDR control, for sequence motif discovery with arbitrary target FDR level, such that reproducibility can be theoretically guaranteed. KIMI is shown through simulation studies to be effective in simultaneously controlling FDR and yielding high power, outperforming the broadly used Benjamini\u2013Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and implement KIMI on a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on relevant k-mers selected by KIMI.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>Our implementation of KIMI is available at https:\/\/github.com\/xinbaiusc\/KIMI.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa912","type":"journal-article","created":{"date-parts":[[2020,10,14]],"date-time":"2020-10-14T19:26:30Z","timestamp":1602703590000},"page":"759-766","source":"Crossref","is-referenced-by-count":3,"title":["KIMI: Knockoff Inference for Motif Identification from molecular sequences with controlled false discovery 
rate"],"prefix":"10.1093","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3755-0730","authenticated-orcid":false,"given":"Xin","family":"Bai","sequence":"first","affiliation":[{"name":"Quantitative and Computational Biology Program, Department of Biological Sciences , Los Angeles, CA 90089, USA"}]},{"given":"Jie","family":"Ren","sequence":"additional","affiliation":[{"name":"Quantitative and Computational Biology Program, Department of Biological Sciences , Los Angeles, CA 90089, USA"}]},{"given":"Yingying","family":"Fan","sequence":"additional","affiliation":[{"name":"Data Sciences and Operations Department, Marshall School of Business, University of Southern California , Los Angeles, CA 90089, USA"}]},{"given":"Fengzhu","family":"Sun","sequence":"additional","affiliation":[{"name":"Quantitative and Computational Biology Program, Department of Biological Sciences , Los Angeles, CA 90089, USA"}]}],"member":"286","published-online":{"date-parts":[[2020,10,29]]},"reference":[{"key":"2023051705202530400_btaa912-B1","doi-asserted-by":"crossref","first-page":"e126","DOI":"10.1093\/nar\/gks406","article-title":"PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity-and composition-based strategies","volume":"40","author":"Akhter","year":"2012","journal-title":"Nucleic Acids Res"},{"key":"2023051705202530400_btaa912-B2","doi-asserted-by":"crossref","first-page":"633","DOI":"10.1016\/0022-5193(83)90251-5","article-title":"A Markov analysis of DNA sequences","volume":"104","author":"Almagor","year":"1983","journal-title":"J. Theor. 
Biol"},{"key":"2023051705202530400_btaa912-B3","doi-asserted-by":"crossref","first-page":"1047","DOI":"10.1126\/science.1157358","article-title":"Virus population dynamics and acquired virus resistance in natural microbial communities","volume":"320","author":"Andersson","year":"2008","journal-title":"Science"},{"key":"2023051705202530400_btaa912-B4","doi-asserted-by":"crossref","first-page":"7145","DOI":"10.1093\/nar\/16.14.7145","article-title":"Mono-through hexanucleotide composition of the sense strand of yeast DNA: a Markov chain analysis","volume":"16","author":"Arnold","year":"1988","journal-title":"Nucleic Acids Res"},{"key":"2023051705202530400_btaa912-B5","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/BF02101152","article-title":"The analysis of intron data and their use in the detection of short signals","volume":"26","author":"Avery","year":"1987","journal-title":"J. Mol. Evolu"},{"key":"2023051705202530400_btaa912-B6","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1111\/1467-9876.00139","article-title":"Fitting Markov chain models to discrete state series such as DNA sequences","volume":"48","author":"Avery","year":"1999","journal-title":"J. R. Stat. Soc. Ser. C (Appl. Stat.)"},{"key":"2023051705202530400_btaa912-B7","doi-asserted-by":"crossref","first-page":"1409","DOI":"10.1214\/19-AOS1852","article-title":"Robust inference with knockoffs","volume":"48","author":"Barber","year":"2020","journal-title":"Ann. Stat"},{"key":"2023051705202530400_btaa912-B8","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","article-title":"Controlling the false discovery rate: a practical and powerful approach to multiple testing","volume":"57","author":"Benjamini","year":"1995","journal-title":"J. R. Stat. Soc. Ser. 
B (Methodological)"},{"key":"2023051705202530400_btaa912-B9","doi-asserted-by":"crossref","first-page":"802","DOI":"10.1214\/12-AOS1077","article-title":"Valid post-selection inference","volume":"41","author":"Berk","year":"2013","journal-title":"Ann. Stat"},{"key":"2023051705202530400_btaa912-B10","doi-asserted-by":"crossref","first-page":"278","DOI":"10.1007\/BF02102360","article-title":"Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding","volume":"21","author":"Blaisdell","year":"1985","journal-title":"J. Mol. Evol"},{"key":"2023051705202530400_btaa912-B11","doi-asserted-by":"crossref","first-page":"5155","DOI":"10.1073\/pnas.83.14.5155","article-title":"A measure of the similarity of sets of sequences not requiring sequence alignment","volume":"83","author":"Blaisdell","year":"1986","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023051705202530400_btaa912-B12","doi-asserted-by":"crossref","first-page":"551","DOI":"10.1111\/rssb.12265","article-title":"Panning for gold: \u2018model-x\u2019 knockoffs for high dimensional controlled variable selection","volume":"80","author":"Candes","year":"2018","journal-title":"J. R. Stat. Soc. Ser. B (Stat. Methodol.)"},{"key":"2023051705202530400_btaa912-B13","doi-asserted-by":"crossref","first-page":"811","DOI":"10.1038\/nature06245","article-title":"An ecological and evolutionary perspective on human\u2013microbe mutualism and disease","volume":"449","author":"Dethlefsen","year":"2007","journal-title":"Nature"},{"key":"2023051705202530400_btaa912-B14","first-page":"1","article-title":"IPAD: stable interpretable forecasting with knockoffs inference","author":"Fan","year":"2019","journal-title":"J. Am. Stat. 
Assoc"},{"key":"2023051705202530400_btaa912-B15","doi-asserted-by":"crossref","first-page":"362","DOI":"10.1080\/01621459.2018.1546589","article-title":"Rank: large-scale inference with graphical nonlinear knockoffs","volume":"115","author":"Fan","year":"2019","journal-title":"J. Am. Stat. Assoc"},{"key":"2023051705202530400_btaa912-B16","doi-asserted-by":"crossref","first-page":"giz066","DOI":"10.1093\/gigascience\/giz066","article-title":"PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning","volume":"8","author":"Fang","year":"2019","journal-title":"GigaScience"},{"key":"2023051705202530400_btaa912-B17","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1016\/0022-2836(85)90262-1","article-title":"Rigorous pattern-recognition methods for DNA sequences: analysis of promoter sequences from Escherichia coli","volume":"186","author":"Galas","year":"1985","journal-title":"J. Mol. Biol"},{"key":"2023051705202530400_btaa912-B18","doi-asserted-by":"crossref","first-page":"619","DOI":"10.1214\/07-AOS586","article-title":"An adaptive step-down procedure with proven FDR control under independence","volume":"37","author":"Gavrilov","year":"2009","journal-title":"Ann. Stat"},{"key":"2023051705202530400_btaa912-B19","doi-asserted-by":"crossref","first-page":"1703","DOI":"10.1016\/j.soilbio.2007.01.018","article-title":"Relationships between microbial community structure and soil environmental conditions in a recently burned system","volume":"39","author":"Hamman","year":"2007","journal-title":"Soil Biol. 
Biochem"},{"key":"2023051705202530400_btaa912-B20","doi-asserted-by":"crossref","first-page":"3097","DOI":"10.1093\/bioinformatics\/bti456","article-title":"Sample size for FDR-control in microarray data analysis","volume":"21","author":"Jung","year":"2005","journal-title":"Bioinformatics"},{"key":"2023051705202530400_btaa912-B21","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1002\/prot.340070105","article-title":"An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences","volume":"7","author":"Lawrence","year":"1990","journal-title":"Proteins Struct. Funct. Bioinf"},{"key":"2023051705202530400_btaa912-B22","doi-asserted-by":"crossref","first-page":"208","DOI":"10.1126\/science.8211139","article-title":"Detecting subtle sequence signals: a gibbs sampling strategy for multiple alignment","volume":"262","author":"Lawrence","year":"1993","journal-title":"Science"},{"key":"2023051705202530400_btaa912-B23","doi-asserted-by":"crossref","first-page":"785","DOI":"10.1093\/biomet\/asu031","article-title":"Variable selection in regression with compositional covariates","volume":"101","author":"Lin","year":"2014","journal-title":"Biometrika"},{"first-page":"1","year":"2005","author":"Lones","key":"2023051705202530400_btaa912-B24"},{"key":"2023051705202530400_btaa912-B25","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1214\/aoms\/1177730491","article-title":"On a test of whether one of two random variables is stochastically larger than the other","volume":"18","author":"Mann","year":"1947","journal-title":"Ann. Math. Stat"},{"key":"2023051705202530400_btaa912-B26","doi-asserted-by":"crossref","first-page":"345","DOI":"10.1089\/106652700750050826","article-title":"Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification","volume":"7","author":"Marsan","year":"2000","journal-title":"J. Comput. 
Biol"},{"key":"2023051705202530400_btaa912-B27","doi-asserted-by":"crossref","first-page":"223","DOI":"10.1093\/bioinformatics\/3.3.223","article-title":"Recognition of characteristic patterns in sets of functionally equivalent DNA sequences","volume":"3","author":"Mengeritsky","year":"1987","journal-title":"Bioinformatics"},{"key":"2023051705202530400_btaa912-B28","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1038\/nature13568","article-title":"Alterations of the human gut microbiome in liver cirrhosis","volume":"513","author":"Qin","year":"2014","journal-title":"Nature"},{"key":"2023051705202530400_btaa912-B29","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1002\/bimj.200510313","article-title":"Fdr control by the BH procedure for two-sided correlated tests with implications to gene expression data analysis","volume":"49","author":"Reiner-Benaim","year":"2007","journal-title":"Biometrical J"},{"key":"2023051705202530400_btaa912-B30","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1089\/10665270050081360","article-title":"Probabilistic and statistical properties of words: an overview","volume":"7","author":"Reinert","year":"2000","journal-title":"J. Comput. Biol"},{"key":"2023051705202530400_btaa912-B31","doi-asserted-by":"crossref","first-page":"69","DOI":"10.1186\/s40168-017-0283-5","article-title":"Virfinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data","volume":"5","author":"Ren","year":"2017","journal-title":"Microbiome"},{"key":"2023051705202530400_btaa912-B32","doi-asserted-by":"crossref","first-page":"64","DOI":"10.1007\/s40484-019-0187-4","article-title":"Identifying viruses from metagenomic data using deep learning","volume":"8","author":"Ren","year":"2020","journal-title":"Quant. 
Biol"},{"key":"2023051705202530400_btaa912-B33","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1258\/000456307780945633","article-title":"Within-subject biological variation in disease: collated data and clinical consequences","volume":"44","author":"Ric\u00f3s","year":"2007","journal-title":"Ann. Clin. Biochem"},{"key":"2023051705202530400_btaa912-B34","doi-asserted-by":"crossref","first-page":"e985","DOI":"10.7717\/peerj.985","article-title":"VirSorter: mining viral signal from microbial genomic data","volume":"3","author":"Roux","year":"2015","journal-title":"PeerJ"},{"key":"2023051705202530400_btaa912-B35","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/biomet\/asy033","article-title":"Gene hunting with hidden Markov model knockoffs","volume":"106","author":"Sesia","year":"2019","journal-title":"Biometrika"},{"key":"2023051705202530400_btaa912-B36","doi-asserted-by":"crossref","first-page":"1077","DOI":"10.1038\/nbt.3981","article-title":"Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium","volume":"35","author":"Sinha","year":"2017","journal-title":"Nat. Biotechnol"},{"key":"2023051705202530400_btaa912-B37","doi-asserted-by":"crossref","first-page":"12115","DOI":"10.1073\/pnas.0605127103","article-title":"Microbial diversity in the deep sea and the underexplored \u201crare biosphere\u201d","volume":"103","author":"Sogin","year":"2006","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023051705202530400_btaa912-B38","doi-asserted-by":"crossref","first-page":"479","DOI":"10.1111\/1467-9868.00346","article-title":"A direct approach to false discovery rates","volume":"64","author":"Storey","year":"2002","journal-title":"J. R. Stat. Soc. Ser. B (Stat. 
Methodol.)"},{"key":"2023051705202530400_btaa912-B39","doi-asserted-by":"crossref","first-page":"2013","DOI":"10.1214\/aos\/1074290335","article-title":"The positive false discovery rate: a bayesian interpretation and the q-value","volume":"31","author":"Storey","year":"2003","journal-title":"Ann. Stat"},{"key":"2023051705202530400_btaa912-B40","doi-asserted-by":"crossref","first-page":"600","DOI":"10.1080\/01621459.2015.1108848","article-title":"Exact post-selection inference for sequential regression procedures","volume":"111","author":"Tibshirani","year":"2016","journal-title":"J. Am. Stat. Assoc"},{"key":"2023051705202530400_btaa912-B41","doi-asserted-by":"crossref","DOI":"10.1007\/978-1-4899-6846-3","volume-title":"Introduction to Computational Biology: Maps, Sequences and Genomes","author":"Waterman","year":"1995"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaa912\/35064760\/btaa912.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/6\/759\/50357651\/btaa912.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/6\/759\/50357651\/btaa912.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,15]],"date-time":"2024-08-15T22:36:32Z","timestamp":1723761392000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/6\/759\/5942973"}},"subtitle":[],"editor":[{"given":"Alfonso","family":"Valencia","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2020,10,29]]},"references-count":41,"journal-issue":{"issue":"6","published-print":{"date-part
s":[[2021,5,5]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa912","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2021,3,15]]},"published":{"date-parts":[[2020,10,29]]}}}