{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T01:13:10Z","timestamp":1773277990046,"version":"3.50.1"},"reference-count":30,"publisher":"Oxford University Press (OUP)","issue":"5","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":1725,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/3.0"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,3,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: The identity of cells and tissues is to a large degree governed by transcriptional regulation. A major part is accomplished by the combinatorial binding of transcription factors at regulatory sequences, such as enhancers. Even though binding of transcription factors is sequence-specific, estimating the sequence similarity of two functionally similar enhancers is very difficult. However, a similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification of genome-wide datasets.<\/jats:p>\n               <jats:p>Results: We present the standardized alignment-free sequence similarity measure N2, a flexible framework that is defined for word neighbourhoods. We explore the usefulness of adding reverse complement words as well as words including mismatches into the neighbourhood. On simulated enhancer sequences as well as functional enhancers in mouse development, N2 is shown to outperform previous alignment-free measures. N2 is flexible, faster than competing methods and less susceptible to single sequence noise and the occurrence of repetitive sequences. Experiments on the mouse enhancers reveal that enhancers active in different tissues can be separated by pairwise comparison using N2.<\/jats:p>\n               <jats:p>Conclusion: \u00a0N2 represents an improvement over previous alignment-free similarity measures without compromising speed, which makes it a good candidate for large-scale sequence comparison of regulatory sequences.<\/jats:p>\n               <jats:p>Availability: The software is part of the open-source C++ library SeqAn (www.seqan.de) and a compiled version can be downloaded at http:\/\/www.seqan.de\/projects\/alf.html<\/jats:p>\n               <jats:p>Contact: \u00a0goeke@molgen.mpg.de; vingron@molgen.mpg.de<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/bts028","type":"journal-article","created":{"date-parts":[[2012,1,14]],"date-time":"2012-01-14T01:50:47Z","timestamp":1326505847000},"page":"656-663","source":"Crossref","is-referenced-by-count":39,"title":["Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts"],"prefix":"10.1093","volume":"28","author":[{"given":"Jonathan","family":"G\u00f6ke","sequence":"first","affiliation":[{"name":"1 Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany and 2Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]},{"given":"Marcel H.","family":"Schulz","sequence":"additional","affiliation":[{"name":"1 Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany and 2Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]},{"given":"Julia","family":"Lasserre","sequence":"additional","affiliation":[{"name":"1 Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany and 2Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]},{"given":"Martin","family":"Vingron","sequence":"additional","affiliation":[{"name":"1 Department for Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany and 2Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]}],"member":"286","published-online":{"date-parts":[[2012,1,12]]},"reference":[{"key":"2023012512194932000_B1","doi-asserted-by":"crossref","first-page":"573","DOI":"10.1093\/nar\/27.2.573","article-title":"Tandem repeats finder: a program to analyze dna sequences","volume":"27","author":"Benson","year":"1999","journal-title":"Nucleic Acids Res."},{"key":"2023012512194932000_B2","doi-asserted-by":"crossref","first-page":"5155","DOI":"10.1073\/pnas.83.14.5155","article-title":"A measure of the similarity of sets of sequences not requiring sequence alignment","volume":"83","author":"Blaisdell","year":"1986","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012512194932000_B3","doi-asserted-by":"crossref","first-page":"806","DOI":"10.1038\/ng.650","article-title":"Chip-seq identification of weakly conserved heart enhancers","volume":"42","author":"Blow","year":"2010","journal-title":"Nat. Genet."},{"key":"2023012512194932000_B4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1214\/07-AAP452","article-title":"Approximate word matches between two random sequences","volume":"18","author":"Burden","year":"2008","journal-title":"Ann. Appl. Probab."},{"key":"2023012512194932000_B5","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1002\/jcc.10025","article-title":"Assessment of the parallelization approach of d2-cluster for high-performance sequence clustering","volume":"23","author":"Carpenter","year":"2002","journal-title":"J. Comput. Chem."},{"key":"2023012512194932000_B6","doi-asserted-by":"crossref","first-page":"2296","DOI":"10.1093\/bioinformatics\/btn436","article-title":"Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison","volume":"24","author":"Dai","year":"2008","journal-title":"Bioinformatics"},{"key":"2023012512194932000_B7","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1186\/1471-2105-9-11","article-title":"SeqAn an efficient, generic C++ library for sequence analysis","volume":"9","author":"Doering","year":"2008","journal-title":"BMC Bioinformatics"},{"issue":"Suppl. 5","key":"2023012512194932000_B8","doi-asserted-by":"crossref","first-page":"S21","DOI":"10.1186\/1471-2105-7-S5-S21","article-title":"Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences","volume":"7","author":"For\u00eat,S.","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2023012512194932000_B9","doi-asserted-by":"crossref","first-page":"261","DOI":"10.1016\/0022-2836(87)90689-9","article-title":"CpG islands in vertebrate genomes","volume":"196","author":"Gardiner-Garden","year":"1987","journal-title":"J. Mol. Biol."},{"key":"2023012512194932000_B10","doi-asserted-by":"crossref","first-page":"e90","DOI":"10.1093\/nar\/gkp1166","article-title":"Finding regulatory dna motifs using alignment-free evolutionary conservation information","volume":"38","author":"Gord\u00e2n,R.","year":"2010","journal-title":"Nucleic Acids Res."},{"key":"2023012512194932000_B11","doi-asserted-by":"crossref","first-page":"413","DOI":"10.1016\/0092-8674(89)90916-1","article-title":"Early and late periodic patterns of even skipped expression are controlled by distinct regulatory elements that respond to different spatial cues","volume":"57","author":"Goto","year":"1989","journal-title":"Cell"},{"key":"2023012512194932000_B12","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1089\/cmb.1994.1.199","article-title":"Biological evaluation of d2, an algorithm for high-performance sequence comparison","volume":"1","author":"Hide","year":"1994","journal-title":"J. Comput. Biol."},{"key":"2023012512194932000_B13","doi-asserted-by":"crossref","first-page":"i249","DOI":"10.1093\/bioinformatics\/btm211","article-title":"A statistical method for alignment-free comparison of regulatory sequences","volume":"23","author":"Kantorovitz","year":"2007","journal-title":"Bioinformatics"},{"key":"2023012512194932000_B14","doi-asserted-by":"crossref","first-page":"568","DOI":"10.1016\/j.devcel.2009.09.002","article-title":"Motif-blind, genome-wide discovery of cis-regulatory modules in drosophila and mouse","volume":"17","author":"Kantorovitz","year":"2009","journal-title":"Dev. Cell"},{"key":"2023012512194932000_B15","doi-asserted-by":"crossref","first-page":"631","DOI":"10.1038\/ng.600","article-title":"Transposable elements have rewired the core regulatory network of human embryonic stem cells","volume":"42","author":"Kunarso","year":"2010","journal-title":"Nat. Genet."},{"key":"2023012512194932000_B16","doi-asserted-by":"crossref","DOI":"10.1101\/gr.121905.111","article-title":"Discriminative prediction of mammalian enhancers from DNA sequence","author":"Lee","year":"2011","journal-title":"Genome Res."},{"key":"2023012512194932000_B17","doi-asserted-by":"crossref","first-page":"13980","DOI":"10.1073\/pnas.202468099","article-title":"Distributional regimes for the number of k-word matches between two random sequences","volume":"99","author":"Lippert","year":"2002","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012512194932000_B18","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","article-title":"A general method applicable to the search for similarities in the amino acid sequence of two proteins","volume":"48","author":"Needleman","year":"1970","journal-title":"J. Mol. Biol."},{"key":"2023012512194932000_B19","doi-asserted-by":"crossref","DOI":"10.1089\/cmb.2009.0198","article-title":"Alignment-free sequence comparison (i): Statistics and power","author":"Reinert","year":"2009","journal-title":"J. Comput. Biol."},{"key":"2023012512194932000_B20","volume-title":"DNA, Words and Models.","author":"Robin","year":"2005"},{"key":"2023012512194932000_B21","doi-asserted-by":"crossref","first-page":"827","DOI":"10.1101\/gad.5.5.827","article-title":"Transcriptional regulation of a pair-rule stripe in drosophila","volume":"5","author":"Small","year":"1991","journal-title":"Genes Dev."},{"key":"2023012512194932000_B22","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","article-title":"Identification of common molecular subsequences","volume":"147","author":"Smith","year":"1981","journal-title":"J. Mol. Biol."},{"key":"2023012512194932000_B23","doi-asserted-by":"crossref","first-page":"W86","DOI":"10.1093\/nar\/gkr377","article-title":"RSAT 2011: regulatory sequence analysis tools","volume":"39","author":"Thomas-Chollier","year":"2011","journal-title":"Nucleic Acids Res."},{"key":"2023012512194932000_B24","doi-asserted-by":"crossref","first-page":"399","DOI":"10.1093\/bioinformatics\/btg425","article-title":"Metrics for comparing regulatory sequences on the basis of pattern counts","volume":"20","author":"van Helden","year":"2004","journal-title":"Bioinformatics"},{"key":"2023012512194932000_B25","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1093\/bioinformatics\/btg005","article-title":"Alignment-free sequence comparison-a review","volume":"19","author":"Vinga","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012512194932000_B26","doi-asserted-by":"crossref","first-page":"854","DOI":"10.1038\/nature07730","article-title":"Chip-seq accurately predicts tissue-specific activity of enhancers","volume":"457","author":"Visel","year":"2009","journal-title":"Nature"},{"key":"2023012512194932000_B27","doi-asserted-by":"crossref","first-page":"434","DOI":"10.1126\/science.1160930","article-title":"Species-specific transcription in mice carrying human chromosome 21","volume":"322","author":"Wilson","year":"2008","journal-title":"Science"},{"key":"2023012512194932000_B28","doi-asserted-by":"crossref","first-page":"12826","DOI":"10.1073\/pnas.0905115106","article-title":"Whole-proteome phylogeny of large dsdna virus families by an alignment-free method","volume":"106","author":"Wu","year":"2009","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012512194932000_B29","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1016\/j.tig.2008.11.005","article-title":"Methylation and deamination of cpgs generate p53-binding sites on a genomic scale","volume":"25","author":"Zemojtel","year":"2009","journal-title":"Trends Genet."},{"key":"2023012512194932000_B30","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1038\/nature08531","article-title":"Combinatorial binding predicts spatio-temporal cis-regulatory activity","volume":"462","author":"Zinzen","year":"2009","journal-title":"Nature"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/5\/656\/48880946\/bioinformatics_28_5_656.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/5\/656\/48880946\/bioinformatics_28_5_656.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,25]],"date-time":"2023-01-25T15:36:58Z","timestamp":1674661018000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/28\/5\/656\/248629"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,1,12]]},"references-count":30,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2012,3,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bts028","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2012,3,1]]},"published":{"date-parts":[[2012,1,12]]}}}