{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,5]],"date-time":"2025-11-05T21:10:39Z","timestamp":1762377039828,"version":"3.37.3"},"reference-count":21,"publisher":"Oxford University Press (OUP)","issue":"4","license":[{"start":{"date-parts":[[2021,10,30]],"date-time":"2021-10-30T00:00:00Z","timestamp":1635552000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"INdAM - GNCS Project 2019"},{"name":"MIUR-PRIN project \u2018Multicriteria Data Structures"},{"name":"Italian Association of Cancer Research","award":["IG21837"],"award-info":[{"award-number":["IG21837"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,1,27]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Alignment-free (AF) distance\/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a State of the Art is methodologically problematic, since information regarding a key feature such as power is either missing or limited.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>By concentrating on a representative set of word-frequency-based AF functions, we perform the first coherent and uniform evaluation of the power, involving also Type I error for completeness. Two alternative models of important genomic features (CIS Regulatory Modules and Horizontal Gene Transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the 15 functions that we have considered, only four stand out, with small differences between small and short sequence length scenarios. Finally, to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>The software is available at: https:\/\/github.com\/pipp8\/power_statistics.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab747","type":"journal-article","created":{"date-parts":[[2021,10,26]],"date-time":"2021-10-26T19:56:47Z","timestamp":1635278207000},"page":"925-932","source":"Crossref","is-referenced-by-count":7,"title":["The power of word-frequency-based alignment-free functions: a comprehensive large-scale experimental analysis"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6983-4818","authenticated-orcid":false,"given":"Giuseppe","family":"Cattaneo","sequence":"first","affiliation":[{"name":"Dipartimento di Informatica, Universit\u00e0 di Salerno , Fisciano, SA 84084, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4308-5126","authenticated-orcid":false,"given":"Umberto","family":"Ferraro Petrillo","sequence":"additional","affiliation":[{"name":"Dipartimento di Scienze Statistiche, Universit\u00e0 di Roma\u2014La Sapienza , 00185 Rome, Italy"}]},{"given":"Raffaele","family":"Giancarlo","sequence":"additional","affiliation":[{"name":"Dipartimento di Matematica ed Informatica, Universit\u00e0 di Palermo , 90133 Palermo, Italy"}]},{"given":"Francesco","family":"Palini","sequence":"additional","affiliation":[{"name":"Dipartimento di Scienze Statistiche, Universit\u00e0 di Roma\u2014La Sapienza , 00185 Rome, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4792-9047","authenticated-orcid":false,"given":"Chiara","family":"Romualdi","sequence":"additional","affiliation":[{"name":"Dipartimento di Biologia, Universit\u00e0 di Padova , 35131 Padova, Italy"}]}],"member":"286","published-online":{"date-parts":[[2021,10,30]]},"reference":[{"key":"2023020108530217900_btab747-B1","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J. Mol. Biol"},{"key":"2023020108530217900_btab747-B2","doi-asserted-by":"crossref","first-page":"e94","DOI":"10.7717\/peerj-cs.94","article-title":"Multiple comparative metagenomics using multiset k-mer counting","volume":"2","author":"Benoit","year":"2016","journal-title":"PeerJ. Comput. Sci"},{"key":"2023020108530217900_btab747-B3","doi-asserted-by":"crossref","first-page":"28970","DOI":"10.1038\/srep28970","article-title":"Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer","volume":"6","author":"Bernard","year":"2016","journal-title":"Sci. Rep"},{"key":"2023020108530217900_btab747-B4","article-title":"Alignment-free genomic analysis via a big data spark platform","volume":"38","author":"Ferraro Petrillo","year":"2021","journal-title":"Bioinformatics"},{"key":"2023020108530217900_btab747-B5","doi-asserted-by":"crossref","first-page":"2939","DOI":"10.1093\/bioinformatics\/btv295","article-title":"Epigenomic k-mer dictionaries: shedding light on how sequence composition influences nucleosome positioning in vivo","volume":"31","author":"Giancarlo","year":"2015","journal-title":"Bioinformatics"},{"key":"2023020108530217900_btab747-B6","doi-asserted-by":"crossref","first-page":"3454","DOI":"10.1093\/bioinformatics\/bty799","article-title":"In vitro versus in vivo compositional landscapes of histone sequence preferences in eucaryotic genomes","volume":"34","author":"Giancarlo","year":"2018","journal-title":"Bioinformatics"},{"key":"2023020108530217900_btab747-B7","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511574931","volume-title":"Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology","author":"Gusfield","year":"1997"},{"key":"2023020108530217900_btab747-B8","doi-asserted-by":"crossref","first-page":"150","DOI":"10.1016\/j.synbio.2019.08.001","article-title":"The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer","volume":"4","author":"Huang","year":"2019","journal-title":"Synth. Syst. Biotechnol"},{"volume-title":"Algorithms for Clustering Data","year":"1988","author":"Jain","key":"2023020108530217900_btab747-B9"},{"key":"2023020108530217900_btab747-B10","doi-asserted-by":"crossref","first-page":"971","DOI":"10.1093\/bioinformatics\/btw776","article-title":"Fast and accurate phylogeny reconstruction using filtered spaced-word matches","volume":"33","author":"Leimeister","year":"2017","journal-title":"Bioinformatics"},{"key":"2023020108530217900_btab747-B11","doi-asserted-by":"crossref","first-page":"106","DOI":"10.1016\/j.jtbi.2011.06.020","article-title":"New powerful statistics for alignment-free sequence comparison under a pattern transfer model","volume":"284","author":"Liu","year":"2011","journal-title":"J. Theor. Biol"},{"key":"2023020108530217900_btab747-B12","doi-asserted-by":"crossref","first-page":"W554","DOI":"10.1093\/nar\/gkx351","article-title":"CAFE: aCcelerated Alignment-FrEe sequence analysis","volume":"45","author":"Lu","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2023020108530217900_btab747-B13","doi-asserted-by":"crossref","first-page":"1222","DOI":"10.1093\/bib\/bbx161","article-title":"A survey and evaluations of histogram-based statistics in alignment-free sequence comparison","volume":"20","author":"Luczak","year":"2019","journal-title":"Brief. Bioinf"},{"key":"2023020108530217900_btab747-B14","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1186\/s13059-016-0997-x","article-title":"Mash: fast genome and metagenome distance estimation using minhash","volume":"17","author":"Ondov","year":"2016","journal-title":"Genome Biol"},{"key":"2023020108530217900_btab747-B15","doi-asserted-by":"crossref","first-page":"1615","DOI":"10.1089\/cmb.2009.0198","article-title":"Alignment-free sequence comparison (I): statistics and power","volume":"16","author":"Reinert","year":"2009","journal-title":"J. Comput. Biol"},{"key":"2023020108530217900_btab747-B16","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","article-title":"Identification of common molecular subsequences","volume":"147","author":"Smith","year":"1981","journal-title":"J. Mol. Biol"},{"key":"2023020108530217900_btab747-B17","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1093\/bib\/bbt067","article-title":"New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing","volume":"15","author":"Song","year":"2013","journal-title":"Brief. Bioinf"},{"key":"2023020108530217900_btab747-B18","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1007\/978-3-030-14160-8_3","volume-title":"Computational Intelligence Methods for Bioinformatics and Biostatistics","author":"Utro","year":"2019"},{"key":"2023020108530217900_btab747-B19","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1093\/bioinformatics\/btg005","article-title":"Alignment-free sequence comparison \u2013 a review","volume":"19","author":"Vinga","year":"2003","journal-title":"Bioinformatics"},{"key":"2023020108530217900_btab747-B20","doi-asserted-by":"crossref","first-page":"1467","DOI":"10.1089\/cmb.2010.0056","article-title":"Alignment-free sequence comparison (II): theoretical power of comparison statistics","volume":"17","author":"Wan","year":"2010","journal-title":"J. Comput. Biol"},{"key":"2023020108530217900_btab747-B21","doi-asserted-by":"crossref","first-page":"144","DOI":"10.1186\/s13059-019-1755-7","article-title":"Benchmarking of alignment-free sequence comparison methods","volume":"20","author":"Zielezinski","year":"2019","journal-title":"Genome Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab747\/41298700\/btab747.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/4\/925\/49008672\/btab747.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/4\/925\/49008672\/btab747.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,11]],"date-time":"2023-11-11T19:44:37Z","timestamp":1699731877000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/4\/925\/6414612"}},"subtitle":[],"editor":[{"given":"Pier Luigi","family":"Martelli","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,10,30]]},"references-count":21,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2022,1,27]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab747","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2022,2,15]]},"published":{"date-parts":[[2021,10,30]]}}}