{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,22]],"date-time":"2025-02-22T00:45:26Z","timestamp":1740185126086,"version":"3.37.3"},"reference-count":20,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2022,2,3]],"date-time":"2022-02-03T00:00:00Z","timestamp":1643846400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000057","name":"National Institute of General Medical Sciences","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000057","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"NIH","doi-asserted-by":"publisher","award":["R01-GM135341"],"award-info":[{"award-number":["R01-GM135341"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100004917","name":"Cancer Prevention Research Institute of Texas","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100004917","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100004917","name":"CPRIT","doi-asserted-by":"publisher","award":["RR170068"],"award-info":[{"award-number":["RR170068"]}],"id":[{"id":"10.13039\/100004917","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"NIH","doi-asserted-by":"publisher","award":["5U24DK110814-05"],"award-info":[{"award-number":["5U24DK110814-05"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100004917","name":"Cancer Prevention and Research Institute of 
Texas","doi-asserted-by":"publisher","award":["RP150596"],"award-info":[{"award-number":["RP150596"]}],"id":[{"id":"10.13039\/100004917","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,3,28]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>With the vast improvements in sequencing technologies and increased number of protocols, sequencing is being used to answer complex biological problems. Subsequently, analysis pipelines have become more time consuming and complicated, usually requiring highly extensive prevalidation steps. Here, we present SeqWho, a program designed to assess heuristically the quality of sequencing files and reliably classify the organism and protocol type by using Random Forest classifiers trained on biases native in k-mer frequencies and repeat sequence identities.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Using one of our primary models, we show that our method accurately and rapidly classifies human and mouse sequences from nine different sequencing libraries by species, library and both together, 98.32%, 97.86% and 96.38% of the time, respectively. 
Ultimately, we demonstrate that SeqWho is a powerful method for reliably validating the quality and identity of the sequencing files used in any pipeline.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>https:\/\/github.com\/DaehwanKimLab\/seqwho.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac050","type":"journal-article","created":{"date-parts":[[2022,1,27]],"date-time":"2022-01-27T04:13:44Z","timestamp":1643256824000},"page":"1830-1837","source":"Crossref","is-referenced-by-count":0,"title":["SeqWho: reliable, rapid determination of sequence file identity using <i>k<\/i>-mer frequencies in Random Forest classifiers"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3329-2567","authenticated-orcid":false,"given":"Christopher","family":"Bennett","sequence":"first","affiliation":[{"name":"Lyda Hill Department of Bioinformatics, University of Texas Southwestern , Dallas, TX 75390, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9093-045X","authenticated-orcid":false,"given":"Micah","family":"Thornton","sequence":"additional","affiliation":[{"name":"Lyda Hill Department of Bioinformatics, University of Texas Southwestern , Dallas, TX 75390, USA"}]},{"given":"Chanhee","family":"Park","sequence":"additional","affiliation":[{"name":"Lyda Hill Department of Bioinformatics, University of Texas Southwestern , Dallas, TX 75390, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7772-9578","authenticated-orcid":false,"given":"Gervaise","family":"Henry","sequence":"additional","affiliation":[{"name":"Department of Urology, University of Texas Southwestern , Dallas, TX 75390, USA"}]},{"given":"Yun","family":"Zhang","sequence":"additional","affiliation":[{"name":"Lyda Hill Department of Bioinformatics, University of Texas Southwestern , 
Dallas, TX 75390, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0144-0564","authenticated-orcid":false,"given":"Venkat","family":"Malladi","sequence":"additional","affiliation":[{"name":"Lyda Hill Department of Bioinformatics, University of Texas Southwestern , Dallas, TX 75390, USA"}]},{"given":"Daehwan","family":"Kim","sequence":"additional","affiliation":[{"name":"Lyda Hill Department of Bioinformatics, University of Texas Southwestern , Dallas, TX 75390, USA"}]}],"member":"286","published-online":{"date-parts":[[2022,2,3]]},"reference":[{"article-title":"Babraham Bioinformatics\u2014FastQC a Quality Control Tool for High Throughput Sequence Data","year":"2010","author":"Andrews","key":"2023020109005908400_btac050-B1"},{"key":"2023020109005908400_btac050-B2","doi-asserted-by":"crossref","first-page":"229","DOI":"10.1016\/j.jbi.2017.06.015","article-title":"Automated detection of records in biological sequence databases that are inconsistent with the literature","volume":"71","author":"Bouadjenek","year":"2017","journal-title":"J. Biomed. Inform"},{"key":"2023020109005908400_btac050-B3","doi-asserted-by":"crossref","first-page":"525","DOI":"10.1038\/nbt.3519","article-title":"Near-optimal probabilistic RNA-seq quantification","volume":"34","author":"Bray","year":"2016","journal-title":"Nat. 
Biotechnol"},{"key":"2023020109005908400_btac050-B4","doi-asserted-by":"crossref","first-page":"198","DOI":"10.1186\/s13059-018-1568-0","article-title":"KrakenUniq: confident and fast metagenomics classification using unique k-mer counts","volume":"19","author":"Breitwieser","year":"2018","journal-title":"Genome Biol"},{"key":"2023020109005908400_btac050-B5","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1016\/S0168-9525(99)01706-0","article-title":"Errors in genome annotation","volume":"15","author":"Brenner","year":"1999","journal-title":"Trends Genet"},{"key":"2023020109005908400_btac050-B6","doi-asserted-by":"crossref","first-page":"D794","DOI":"10.1093\/nar\/gkx1081","article-title":"The Encyclopedia of DNA elements (ENCODE): data portal update","volume":"46","author":"Davis","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2023020109005908400_btac050-B7","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1186\/s12859-015-0875-7","article-title":"Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis","volume":"17","author":"Dubinkina","year":"2016","journal-title":"BMC Bioinform"},{"key":"2023020109005908400_btac050-B8","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/nature11247","article-title":"An integrated encyclopedia of DNA elements in the human genome","volume":"489","author":"Dunham","year":"2012","journal-title":"Nature"},{"key":"2023020109005908400_btac050-B9","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1101\/gr.210500.116","article-title":"A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree","volume":"27","author":"Eberle","year":"2017","journal-title":"Genome Res"},{"key":"2023020109005908400_btac050-B10","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1038\/550179a","article-title":"The future of DNA 
sequencing","volume":"550","author":"Green","year":"2017","journal-title":"Nature"},{"key":"2023020109005908400_btac050-B11","first-page":"1","article-title":"Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype","volume":"2019","author":"Kim","year":"2019","journal-title":"Nat. Biotechnol"},{"key":"2023020109005908400_btac050-B12","doi-asserted-by":"crossref","first-page":"2011","DOI":"10.1093\/nar\/gkr854","article-title":"The Sequence Read Archive: explosive growth of sequencing data","volume":"40","author":"Kodama","year":"2012","journal-title":"Nucleic Acids Res"},{"key":"2023020109005908400_btac050-B13","doi-asserted-by":"crossref","first-page":"3296","DOI":"10.1016\/j.celrep.2020.02.048","article-title":"Genomic repeats categorize genes with distinct functions for orchestrated regulation","volume":"30","author":"Lu","year":"2020","journal-title":"Cell Rep"},{"key":"2023020109005908400_btac050-B14","doi-asserted-by":"crossref","first-page":"574","DOI":"10.1093\/bioinformatics\/btw663","article-title":"KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies","volume":"33","author":"Mapleson","year":"2017","journal-title":"Bioinformatics"},{"key":"2023020109005908400_btac050-B15","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1186\/s13059-016-0917-0","article-title":"The real cost of sequencing: scaling computation to keep pace with data generation","volume":"17","author":"Muir","year":"2016","journal-title":"Genome Biol"},{"key":"2023020109005908400_btac050-B16","doi-asserted-by":"crossref","first-page":"D649","DOI":"10.1093\/nar\/gky977","article-title":"Genomes OnLine database (GOLD) v.7: updates and new features","volume":"47","author":"Mukherjee","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2023020109005908400_btac050-B17","doi-asserted-by":"crossref","first-page":"e77910","DOI":"10.1371\/journal.pone.0077910","article-title":"Experimental design-based functional mining and 
characterization of high-throughput sequencing data in the Sequence Read Archive","volume":"8","author":"Nakazato","year":"2013","journal-title":"PLoS One"},{"key":"2023020109005908400_btac050-B18","doi-asserted-by":"crossref","first-page":"183","DOI":"10.1186\/s12864-015-1406-7","article-title":"Quality control of microbiota metagenomics by k-mer analysis","volume":"16","author":"Plaza Onate","year":"2015","journal-title":"BMC Genomics"},{"key":"2023020109005908400_btac050-B19","doi-asserted-by":"crossref","first-page":"108","DOI":"10.1186\/s40793-015-0101-2","article-title":"Annotation inconsistencies beyond sequence similarity-based function prediction\u2014phylogeny and genome structure","volume":"10","author":"Promponas","year":"2015","journal-title":"Stand. Genomic Sci"},{"key":"2023020109005908400_btac050-B20","doi-asserted-by":"crossref","first-page":"23243","DOI":"10.1073\/pnas.1912175116","article-title":"Human-specific tandem repeat expansion and differential gene expression during primate evolution","volume":"116","author":"Sulovari","year":"2019","journal-title":"Proc. Natl. Acad. Sci. 
USA"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac050\/42516088\/btac050.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/7\/1830\/49009203\/btac050.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/7\/1830\/49009203\/btac050.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,16]],"date-time":"2023-11-16T09:32:58Z","timestamp":1700127178000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/7\/1830\/6520802"}},"subtitle":[],"editor":[{"given":"Can","family":"Alkan","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,2,3]]},"references-count":20,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2022,3,28]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac050","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2022,4,1]]},"published":{"date-parts":[[2022,2,3]]}}}