{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T09:15:55Z","timestamp":1762506955696},"reference-count":58,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2017,1,10]],"date-time":"2017-01-10T00:00:00Z","timestamp":1484006400000},"content-version":"vor","delay-in-days":6,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"name":"International Max Planck Research School Molecular Biology, G\u00f6ttingen."}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2017,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Word-based or \u2018alignment-free\u2019 algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don\u2019t-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don\u2019t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don\u2019t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and Implementation<\/jats:title>\n                  <jats:p>The program source code for FSWM including a documentation, as well as the software that we used to generate artificial genome sequences are freely available at http:\/\/fswm.gobics.de\/<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btw776","type":"journal-article","created":{"date-parts":[[2016,12,2]],"date-time":"2016-12-02T12:05:39Z","timestamp":1480680339000},"page":"971-979","source":"Crossref","is-referenced-by-count":58,"title":["Fast and accurate phylogeny reconstruction using filtered spaced-word matches"],"prefix":"10.1093","volume":"33","author":[{"given":"Chris-Andr\u00e9","family":"Leimeister","sequence":"first","affiliation":[{"name":"Department of Bioinformatics, University of G\u00f6ttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077\u2009G\u00f6ttingen, Germany"}]},{"given":"Salma","family":"Sohrabi-Jahromi","sequence":"additional","affiliation":[{"name":"Department of Bioinformatics, University of G\u00f6ttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077\u2009G\u00f6ttingen, Germany"}]},{"given":"Burkhard","family":"Morgenstern","sequence":"additional","affiliation":[{"name":"Department of Bioinformatics, University of G\u00f6ttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077\u2009G\u00f6ttingen, Germany"},{"name":"University of G\u00f6ttingen, Center for Computational Sciences, Goldschmidtstr. 1, 37077\u2009G\u00f6ttingen, Germany"}]}],"member":"286","published-online":{"date-parts":[[2017,1,4]]},"reference":[{"key":"2023020205011818400_btw776-B1","doi-asserted-by":"crossref","first-page":"e41.","DOI":"10.1093\/nar\/gkr1246","article-title":"Hobbes: optimized gram-based methods for efficient read alignment","volume":"40","author":"Ahmadi","year":"2011","journal-title":"Nucleic Acids Res"},{"key":"2023020205011818400_btw776-B2","doi-asserted-by":"crossref","first-page":"28970","DOI":"10.1038\/srep28970","article-title":"Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer","volume":"6","author":"Bernard","year":"2016","journal-title":"Scientific Reports"},{"key":"2023020205011818400_btw776-B3","doi-asserted-by":"crossref","first-page":"e1004985.","DOI":"10.1371\/journal.pcbi.1004985","article-title":"Phylogeny reconstruction with alignment-free method that corrects for horizontal gene transfer","volume":"12","author":"Bromberg","year":"2016","journal-title":"PLOS Comput. Biol"},{"key":"2023020205011818400_btw776-B4","first-page":"17","author":"Chatterji","year":"2008"},{"key":"2023020205011818400_btw776-B5","first-page":"115","volume-title":"Pacific Symposium on Biocomputing","author":"Chiaromonte","year":"2002"},{"key":"2023020205011818400_btw776-B6","doi-asserted-by":"crossref","first-page":"R108.","DOI":"10.1186\/gb-2009-10-10-r108","article-title":"Genomic dna k-mer spectra: models and modalities","volume":"10","author":"Chor","year":"2009","journal-title":"Genome Biol"},{"key":"2023020205011818400_btw776-B7","doi-asserted-by":"crossref","first-page":"1819","DOI":"10.1089\/cmb.2010.0171","article-title":"The irredundant class method for remote homology detection of protein sequences","volume":"18","author":"Comin","year":"2011","journal-title":"J. Comput. Biol"},{"key":"2023020205011818400_btw776-B8","doi-asserted-by":"crossref","first-page":"34.","DOI":"10.1186\/1748-7188-7-34","article-title":"Alignment-free phylogeny of whole genomes using underlying subwords","volume":"7","author":"Comin","year":"2012","journal-title":"Algorithms Mol. Biol"},{"key":"2023020205011818400_btw776-B9","doi-asserted-by":"crossref","first-page":"1115","DOI":"10.1093\/molbev\/msr268","article-title":"Alf-a simulation framework for genome evolution","volume":"29","author":"Dalquen","year":"2012","journal-title":"Mol. Biol. Evol"},{"key":"2023020205011818400_btw776-B10","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1089\/cmb.2011.0070","article-title":"Separating significant matches from spurious matches in DNA sequences","volume":"19","author":"Devillers","year":"2012","journal-title":"J. Comput. Biol"},{"key":"2023020205011818400_btw776-B11","doi-asserted-by":"crossref","first-page":"1.","DOI":"10.1186\/1471-2105-8-1","article-title":"Comparing sequences without using alignments: application to HIV\/SIV subtyping","volume":"8","author":"Didier","year":"2007","journal-title":"BMC Bioinformatics"},{"key":"2023020205011818400_btw776-B12","doi-asserted-by":"crossref","first-page":"3221","DOI":"10.1093\/bioinformatics\/btp590","article-title":"Efficient estimation of pairwise distances between genomes","volume":"25","author":"Domazet-Loso","year":"2009","journal-title":"Bioinformatics"},{"key":"2023020205011818400_btw776-B13","doi-asserted-by":"crossref","first-page":"754.","DOI":"10.1186\/s12864-016-2889-6","article-title":"Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons","volume":"17","author":"Drouin","year":"2016","journal-title":"BMC Genomics"},{"key":"2023020205011818400_btw776-B14","doi-asserted-by":"crossref","first-page":"20.","DOI":"10.1186\/1748-7188-7-20","article-title":"Direct vs 2-stage approaches to structured motif finding","volume":"7","author":"Federico","year":"2012","journal-title":"Algorithms Mol. Biol"},{"key":"2023020205011818400_btw776-B15","author":"Felsenstein","year":"1993"},{"key":"2023020205011818400_btw776-B16","doi-asserted-by":"crossref","first-page":"2864","DOI":"10.1128\/JB.01581-08","article-title":"Whole-genome-based phylogeny and divergence of the genus brucella","volume":"191","author":"Foster","year":"2009","journal-title":"J. Bacteriol"},{"key":"2023020205011818400_btw776-B17","doi-asserted-by":"crossref","first-page":"e1005107.","DOI":"10.1371\/journal.pcbi.1005107","article-title":"rasbhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison","volume":"12","author":"Hahn","year":"2016","journal-title":"PLOS Comput. Biol"},{"key":"2023020205011818400_btw776-B18","doi-asserted-by":"crossref","first-page":"192.","DOI":"10.3389\/fpls.2012.00192","article-title":"A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method","volume":"3","author":"Hatje","year":"2012","journal-title":"Front. Plant Sci"},{"key":"2023020205011818400_btw776-B19","doi-asserted-by":"crossref","first-page":"407","DOI":"10.1093\/bib\/bbt083","article-title":"Alignment-free phylogenetics and population genetics","volume":"15","author":"Haubold","year":"2014","journal-title":"Brief. Bioinf"},{"key":"2023020205011818400_btw776-B20","doi-asserted-by":"crossref","first-page":"1169","DOI":"10.1093\/bioinformatics\/btu815","article-title":"andi: fast and accurate estimation of evolutionary distances between closely related genomes","volume":"31","author":"Haubold","year":"2015","journal-title":"Bioinformatics"},{"key":"2023020205011818400_btw776-B21","doi-asserted-by":"crossref","first-page":"1487","DOI":"10.1089\/cmb.2009.0106","article-title":"Estimating mutation distances from unaligned genomes","volume":"16","author":"Haubold","year":"2009","journal-title":"J. Comput. Biol"},{"key":"2023020205011818400_btw776-B22","doi-asserted-by":"crossref","first-page":"123.","DOI":"10.1186\/1471-2105-6-123","article-title":"Genome comparison without alignment using shortest unique substrings","volume":"6","author":"Haubold","year":"2005","journal-title":"BMC Bioinf"},{"key":"2023020205011818400_btw776-B23","doi-asserted-by":"crossref","first-page":"W7","DOI":"10.1093\/nar\/gku398","article-title":"Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches","volume":"42","author":"Horwege","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2023020205011818400_btw776-B24","doi-asserted-by":"crossref","first-page":"D286","DOI":"10.1093\/nar\/gkv1248","article-title":"eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences","volume":"44","author":"Huerta-Cepas","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023020205011818400_btw776-B25","doi-asserted-by":"crossref","DOI":"10.1016\/B978-1-4832-3211-9.50009-7","volume-title":"Evolution of Protein Molecules","author":"Jukes","year":"1969"},{"key":"2023020205011818400_btw776-B26","doi-asserted-by":"crossref","first-page":"i249","DOI":"10.1093\/bioinformatics\/btm211","article-title":"A statistical method for alignment-free comparison of regulatory sequences","volume":"23","author":"Kantorovitz","year":"2007","journal-title":"Bioinformatics"},{"key":"2023020205011818400_btw776-B27","doi-asserted-by":"crossref","first-page":"R12.","DOI":"10.1186\/gb-2004-5-2-r12","article-title":"Versatile and open software for comparing large genomes","volume":"5","author":"Kurtz","year":"2004","journal-title":"Genome Biol"},{"key":"2023020205011818400_btw776-B28","doi-asserted-by":"crossref","first-page":"R25","DOI":"10.1186\/gb-2009-10-3-r25","article-title":"Ultrafast and memory-efficient alignment of short DNA sequences to the human genome","volume":"10","author":"Langmead","year":"2009","journal-title":"Genome Biol"},{"key":"2023020205011818400_btw776-B29","doi-asserted-by":"crossref","first-page":"1991","DOI":"10.1093\/bioinformatics\/btu177","article-title":"Fast alignment-free sequence comparison using spaced-word frequencies","volume":"30","author":"Leimeister","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020205011818400_btw776-B30","doi-asserted-by":"crossref","first-page":"2000","DOI":"10.1093\/bioinformatics\/btu331","article-title":"kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison","volume":"30","author":"Leimeister","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020205011818400_btw776-B31","author":"Leslie","year":"2002"},{"key":"2023020205011818400_btw776-B32","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1371\/journal.pone.0006901","article-title":"Identifying cis-regulatory sequences by word profile similarity","volume":"4","author":"Leung","year":"2009","journal-title":"PLOS One"},{"key":"2023020205011818400_btw776-B33","doi-asserted-by":"crossref","first-page":"1489","DOI":"10.1093\/bioinformatics\/btr186","article-title":"A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio","volume":"27","author":"Leung","year":"2011","journal-title":"Bioinformatics"},{"key":"2023020205011818400_btw776-B34","doi-asserted-by":"crossref","first-page":"713","DOI":"10.1093\/bioinformatics\/btn025","article-title":"Soap: short oligonucleotide alignment program","volume":"24","author":"Li","year":"2008","journal-title":"Bioinformatics"},{"key":"2023020205011818400_btw776-B35","doi-asserted-by":"crossref","first-page":"2224","DOI":"10.1093\/bioinformatics\/btl376","article-title":"Remote homology detection based on oligomer distances","volume":"22","author":"Lingner","year":"2006","journal-title":"Bioinformatics"},{"key":"2023020205011818400_btw776-B36","doi-asserted-by":"crossref","first-page":"259.","DOI":"10.1186\/1471-2105-9-259","article-title":"Word correlation matrices for protein sequence analysis and remote homology detection","volume":"9","author":"Lingner","year":"2008","journal-title":"BMC Bioinformatics"},{"key":"2023020205011818400_btw776-B37","doi-asserted-by":"crossref","first-page":"1382","DOI":"10.1093\/bioinformatics\/btu843","article-title":"UProC: tools for ultra-fast protein domain classification","volume":"31","author":"Meinicke","year":"2015","journal-title":"Bioinformatics"},{"key":"2023020205011818400_btw776-B38","doi-asserted-by":"crossref","first-page":"5.","DOI":"10.1186\/s13015-015-0032-x","article-title":"Estimating evolutionary distances between genomic sequences from spaced-word matches","volume":"10","author":"Morgenstern","year":"2015","journal-title":"Algorithms Mol. Biol"},{"key":"2023020205011818400_btw776-B39","doi-asserted-by":"crossref","first-page":"462","DOI":"10.1038\/nbt.2862","article-title":"Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms","volume":"32","author":"Patro","year":"2014","journal-title":"Nat. Biotechnol"},{"key":"2023020205011818400_btw776-B40","doi-asserted-by":"crossref","first-page":"1615","DOI":"10.1089\/cmb.2009.0198","article-title":"Alignment-free sequence comparison (I): statistics and power","volume":"16","author":"Reinert","year":"2009","journal-title":"J. Comput. Biol"},{"key":"2023020205011818400_btw776-B41","doi-asserted-by":"crossref","first-page":"131","DOI":"10.1016\/0025-5564(81)90043-2","article-title":"Comparison of phylogenetic trees","volume":"53","author":"Robinson","year":"1981","journal-title":"Math. Biosci"},{"key":"2023020205011818400_btw776-B42","first-page":"406","article-title":"The neighbor-joining method: a new method for reconstructing phylogenetic trees","volume":"4","author":"Saitou","year":"1987","journal-title":"Mol. Biol. Evol"},{"key":"2023020205011818400_btw776-B43","doi-asserted-by":"crossref","first-page":"219.","DOI":"10.1186\/1471-2105-10-219","article-title":"Orthoselect: a protocol for selecting orthologous groups in phylogenomics","volume":"10","author":"Schreiber","year":"2009","journal-title":"BMC Bioinf"},{"key":"2023020205011818400_btw776-B44","doi-asserted-by":"crossref","first-page":"2677","DOI":"10.1073\/pnas.0813249106","article-title":"Alignment-free genome comparison with feature frequency profiles (ffp) and optimal resolutions","volume":"106","author":"Sims","year":"2009","journal-title":"Proc. Natl. Acad. Sci"},{"key":"2023020205011818400_btw776-B45","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1093\/bib\/bbt067","article-title":"New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing","volume":"15","author":"Song","year":"2014","journal-title":"Brief. Bioinf"},{"key":"2023020205011818400_btw776-B46","doi-asserted-by":"crossref","first-page":"27.","DOI":"10.1186\/1748-7188-7-27","article-title":"Separating metagenomic short reads into genomes via clustering","volume":"7","author":"Tanaseichuk","year":"2012","journal-title":"Algorithms Mol. Biol"},{"key":"2023020205011818400_btw776-B47","doi-asserted-by":"crossref","first-page":"163.","DOI":"10.1186\/1471-2105-5-163","article-title":"TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences","volume":"5","author":"Teeling","year":"2004","journal-title":"BMC Bioinformatics"},{"key":"2023020205011818400_btw776-B48","doi-asserted-by":"crossref","first-page":"452","DOI":"10.1089\/cmb.2015.0217","article-title":"ALFRED: a practical method for alignment-free distance computation","volume":"23","author":"Thankachan","year":"2016","journal-title":"J. Comput. Biol"},{"key":"2023020205011818400_btw776-B49","doi-asserted-by":"crossref","first-page":"336","DOI":"10.1089\/cmb.2006.13.336","article-title":"The average common substring approach to phylogenomic reconstruction","volume":"13","author":"Ulitsky","year":"2006","journal-title":"J. Comput. Biol"},{"key":"2023020205011818400_btw776-B50","doi-asserted-by":"crossref","first-page":"341","DOI":"10.1093\/bib\/bbu005","article-title":"Editorial: alignment-free methods in computational biology","volume":"15","author":"Vinga","year":"2014","journal-title":"Brief. Bioinf"},{"key":"2023020205011818400_btw776-B51","doi-asserted-by":"crossref","first-page":"10.","DOI":"10.1186\/1748-7188-7-10","article-title":"Pattern matching through chaos game representation: bridging numerical and discrete data structures for biological sequence analysis","volume":"7","author":"Vinga","year":"2012","journal-title":"Algorithms Mol. Biol"},{"key":"2023020205011818400_btw776-B52","doi-asserted-by":"crossref","first-page":"1467","DOI":"10.1089\/cmb.2010.0056","article-title":"Alignment-free sequence comparison (II): theoretical power of comparison statistics","volume":"17","author":"Wan","year":"2010","journal-title":"J. Comput. Biol"},{"key":"2023020205011818400_btw776-B53","doi-asserted-by":"crossref","first-page":"i356","DOI":"10.1093\/bioinformatics\/bts397","article-title":"MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample","volume":"28","author":"Wang","year":"2012","journal-title":"Bioinformatics"},{"key":"2023020205011818400_btw776-B54","doi-asserted-by":"crossref","first-page":"D358","DOI":"10.1093\/nar\/gks1116","article-title":"Orthodb: a hierarchical catalog of animal, fungal and bacterial orthologs","volume":"41","author":"Waterhouse","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2023020205011818400_btw776-B55","doi-asserted-by":"crossref","first-page":"523","DOI":"10.1089\/cmb.2010.0245","article-title":"A novel abundance-based algorithm for binning metagenomic sequences using l-tuples","volume":"18","author":"Wu","year":"2011","journal-title":"J. Comput. Biol"},{"key":"2023020205011818400_btw776-B56","doi-asserted-by":"crossref","first-page":"e75.","DOI":"10.1093\/nar\/gkt003","article-title":"Co-phylog: an assembly-free phylogenomic approach for closely related organisms","volume":"41","author":"Yi","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2023020205011818400_btw776-B57","doi-asserted-by":"crossref","first-page":"821","DOI":"10.1101\/gr.074492.107","article-title":"Velvet: Algorithms for de novo short read assembly using de Bruijn graphs","volume":"18","author":"Zerbino","year":"2008","journal-title":"Genome Res"},{"key":"2023020205011818400_btw776-B58","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1016\/j.gpb.2015.08.004","article-title":"CVTree3 web server for whole-genome-based and alignment-free prokaryotic phylogeny and taxonomy","volume":"13","author":"Zuo","year":"2015","journal-title":"Genomics Proteomics Bioinf"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/33\/7\/971\/49038594\/bioinformatics_33_7_971.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/33\/7\/971\/49038594\/bioinformatics_33_7_971.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,2]],"date-time":"2023-02-02T05:04:38Z","timestamp":1675314278000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/33\/7\/971\/2883388"}},"subtitle":[],"editor":[{"given":"Alfonso","family":"Valencia","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2017,1,4]]},"references-count":58,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2017,4,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btw776","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2017,4,1]]},"published":{"date-parts":[[2017,1,4]]}}}