{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,4,7]],"date-time":"2024-04-07T19:40:55Z","timestamp":1712518855177},"reference-count":31,"publisher":"Oxford University Press (OUP)","issue":"15","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2011,8,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: Pattern discovery algorithms are widely used for the analysis of DNA and protein sequences. Most algorithms have been designed to find overrepresented motifs in sparse datasets of long sequences, and ignore most positional information. We introduce an algorithm optimized to exploit spatial information in sparse-but-populous datasets.<\/jats:p><jats:p>Results: Our algorithm Tree-based Weighted-Position Pattern Discovery and Classification (T-WPPDC) supports both unsupervised pattern discovery and supervised sequence classification. It identifies positionally enriched patterns using the Kullback\u2013Leibler distance between foreground and background sequences at each position. This spatial information is used to discover positionally important patterns. T-WPPDC then uses a scoring function to discriminate different biological classes. We validated T-WPPDC on an important biological problem: prediction of single nucleotide polymorphisms (SNPs) from flanking sequence. We evaluated 672 separate experiments on 120 datasets derived from multiple species. T-WPPDC outperformed other pattern discovery methods and was comparable to the supervised machine learning algorithms. The algorithm is computationally efficient and largely insensitive to dataset size. It allows arbitrary parameterization and is embarrassingly parallelizable.<\/jats:p><jats:p>Conclusions: T-WPPDC is a minimally parameterized algorithm for both pattern discovery and sequence classification that directly incorporates positional information. We use it to confirm the predictability of SNPs from flanking sequence, and show that positional information is a key to this biological problem.<\/jats:p><jats:p>Contacts: \u00a0ruiyan@cs.toronto.edu; paul.boutros@oicr.on.ca; juris@ai.toronto.edu<\/jats:p><jats:p>Availability: The algorithm, code and data are available at: http:\/\/www.cs.utoronto.ca\/~juris\/data\/TWPPDC<\/jats:p><jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btr353","type":"journal-article","created":{"date-parts":[[2011,6,18]],"date-time":"2011-06-18T04:15:07Z","timestamp":1308370507000},"page":"2054-2061","source":"Crossref","is-referenced-by-count":3,"title":["A tree-based approach for motif discovery and sequence classification"],"prefix":"10.1093","volume":"27","author":[{"given":"Rui","family":"Yan","sequence":"first","affiliation":[{"name":"1 Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4, 2Ontario Cancer Institute and the Campbell Family Institute for Cancer Research, Princess Margaret Hospital\/University Health Network, Toronto, Canada M5G 2L7, 3Ontario Institute for Cancer Research, Toronto, Canada M5S 0A3 and 4Department of Medical Biophysics, University of Toronto, Toronto, Canada M5S 1A8"},{"name":"1 Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4, 2Ontario Cancer Institute and the Campbell Family Institute for Cancer Research, Princess Margaret Hospital\/University Health Network, Toronto, Canada M5G 2L7, 3Ontario Institute for Cancer Research, Toronto, Canada M5S 0A3 and 4Department of Medical Biophysics, University of Toronto, Toronto, Canada M5S 1A8"}]},{"given":"Paul C.","family":"Boutros","sequence":"additional","affiliation":[{"name":"1 Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4, 2Ontario Cancer Institute and the Campbell Family Institute for Cancer Research, Princess Margaret Hospital\/University Health Network, Toronto, Canada M5G 2L7, 3Ontario Institute for Cancer Research, Toronto, Canada M5S 0A3 and 4Department of Medical Biophysics, University of Toronto, Toronto, Canada M5S 1A8"}]},{"given":"Igor","family":"Jurisica","sequence":"additional","affiliation":[{"name":"1 Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4, 2Ontario Cancer Institute and the Campbell Family Institute for Cancer Research, Princess Margaret Hospital\/University Health Network, Toronto, Canada M5G 2L7, 3Ontario Institute for Cancer Research, Toronto, Canada M5S 0A3 and 4Department of Medical Biophysics, University of Toronto, Toronto, Canada M5S 1A8"},{"name":"1 Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4, 2Ontario Cancer Institute and the Campbell Family Institute for Cancer Research, Princess Margaret Hospital\/University Health Network, Toronto, Canada M5G 2L7, 3Ontario Institute for Cancer Research, Toronto, Canada M5S 0A3 and 4Department of Medical Biophysics, University of Toronto, Toronto, Canada M5S 1A8"},{"name":"1 Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4, 2Ontario Cancer Institute and the Campbell Family Institute for Cancer Research, Princess Margaret Hospital\/University Health Network, Toronto, Canada M5G 2L7, 3Ontario Institute for Cancer Research, Toronto, Canada M5S 0A3 and 4Department of Medical Biophysics, University of Toronto, Toronto, Canada M5S 1A8"}]}],"member":"286","published-online":{"date-parts":[[2011,6,17]]},"reference":[{"key":"2023012511531456100_B1","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1186\/1471-2105-11-179","article-title":"The value of position-specific priors in motif discovery using MEME","volume":"11","author":"Bailey","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023012511531456100_B2","first-page":"21","article-title":"The value of prior knowledge in discovering motifs with MEME","volume-title":"Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology","author":"Bailey","year":"1995"},{"key":"2023012511531456100_B3","doi-asserted-by":"crossref","first-page":"799","DOI":"10.1038\/nature05874","article-title":"Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project","volume":"447","author":"Birney","year":"2007","journal-title":"Nature"},{"key":"2023012511531456100_B4","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn."},{"key":"2023012511531456100_B5","doi-asserted-by":"crossref","first-page":"225","DOI":"10.1089\/10665270252935430","article-title":"Finding motifs using random projections","volume":"9","author":"Buhler","year":"2002","journal-title":"J. Comput. Biol."},{"key":"2023012511531456100_B6","volume-title":"Pattern Classification","author":"Duda","year":"2001","edition":"2nd"},{"key":"2023012511531456100_B7","first-page":"41","article-title":"MOPAC: motif binding by preprocessing and agglomerative clustering from microarrays","volume":"8","author":"Ganesh","year":"2003","journal-title":"Pac. Symp. Biocomput."},{"key":"2023012511531456100_B8","doi-asserted-by":"crossref","first-page":"1426","DOI":"10.1038\/ng.262","article-title":"Meta-analysis of genome-wide association data identifies four new susceptibility loci for colorectal cancer","volume":"40","author":"Houlston","year":"2008","journal-title":"Nat. Genet."},{"key":"2023012511531456100_B9","doi-asserted-by":"crossref","first-page":"993","DOI":"10.1038\/nature08987","article-title":"International network of cancer genome projects","volume":"464","author":"Hudson","year":"2010","journal-title":"Nature"},{"key":"2023012511531456100_B10","doi-asserted-by":"crossref","first-page":"51","DOI":"10.1093\/nar\/gkg129","article-title":"The UCSC genome browser database","volume":"31","author":"Karolchik","year":"2003","journal-title":"Nucleic Acids Res."},{"key":"2023012511531456100_B11","first-page":"340","article-title":"Letter to the editor: the Kullback-Leibler distance","volume":"41","author":"Kullback","year":"1987","journal-title":"Am. Stat."},{"key":"2023012511531456100_B12","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1214\/aoms\/1177729694","article-title":"On information and sufficiency","volume":"22","author":"Kullback","year":"1951","journal-title":"Ann. Math. Stat."},{"key":"2023012511531456100_B13","doi-asserted-by":"crossref","first-page":"208","DOI":"10.1126\/science.8211139","article-title":"Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment","volume":"262","author":"Lawrence","year":"1993","journal-title":"Science"},{"key":"2023012511531456100_B14","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1007\/s00702-009-0334-6","article-title":"Association between the RAGE G82S polymorphism and Alzheimer's disease","volume":"117","author":"Li","year":"2010","journal-title":"J. Neural Transm."},{"key":"2023012511531456100_B15","doi-asserted-by":"crossref","first-page":"1180","DOI":"10.1101\/gr.076117.108","article-title":"Transcription factor and microRNA motif discovery: the Amadeus platform and a compendium of metazoan target sets","volume":"18","author":"Linhart","year":"2008","journal-title":"Genome Res."},{"key":"2023012511531456100_B16","first-page":"127","article-title":"BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes","author":"Liu","year":"2001","journal-title":"Pac. Symp. Biocomput."},{"key":"2023012511531456100_B17","doi-asserted-by":"crossref","first-page":"1152","DOI":"10.1093\/bioinformatics\/btq106","article-title":"Localized motif discovery in gene regulatory sequences","volume":"26","author":"Narang","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012511531456100_B18","doi-asserted-by":"crossref","first-page":"W199","DOI":"10.1093\/nar\/gkh465","article-title":"Weeder WEB: discovery of transcription factor binding sites in a set of sequences from co-regulated genes","volume":"32","author":"Pevesi","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"2023012511531456100_B19","doi-asserted-by":"crossref","first-page":"669","DOI":"10.1007\/s00439-005-0094-9","article-title":"Evaluating HapMap SNP data transferability in a large-scale genotyping project involving 175 cancer-associated genes","volume":"118","author":"Ribas","year":"2006","journal-title":"Hum. Genet."},{"key":"2023012511531456100_B20","doi-asserted-by":"crossref","first-page":"928","DOI":"10.1038\/35057149","article-title":"A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms","volume":"409","author":"Sachidanandam","year":"2001","journal-title":"Nature"},{"key":"2023012511531456100_B21","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1038\/nbt0198-33","article-title":"DNA variation and the future of human genetics","volume":"16","author":"Schafer","year":"1997","journal-title":"Nat. Biotechnol."},{"key":"2023012511531456100_B22","doi-asserted-by":"crossref","first-page":"3586","DOI":"10.1093\/nar\/gkg618","article-title":"YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation","volume":"31","author":"Sinha","year":"2003","journal-title":"Nucleic Acids Res."},{"key":"2023012511531456100_B23","doi-asserted-by":"crossref","first-page":"553","DOI":"10.1038\/ng.375","article-title":"The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line","volume":"41","author":"Suzuki","year":"2009","journal-title":"Nat. Genet."},{"key":"2023012511531456100_B24","doi-asserted-by":"crossref","first-page":"447","DOI":"10.1089\/10665270252935566","article-title":"A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes","volume":"9","author":"Thijs","year":"2002","journal-title":"J. Comput. Biol."},{"key":"2023012511531456100_B25","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1038\/nbt1053","article-title":"Assessing computational tools for the discovery of transcription factor binding sites","volume":"23","author":"Tompa","year":"2005","journal-title":"Nat. Biotechnol."},{"key":"2023012511531456100_B26","doi-asserted-by":"crossref","first-page":"1808","DOI":"10.1093\/nar\/28.8.1808","article-title":"Discovering regulatory elements in non-coding sequences by analysis of spaced dyads","volume":"28","author":"Van Helden","year":"2000","journal-title":"Nucleic Acids Res."},{"key":"2023012511531456100_B27","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1038\/ng.285","article-title":"Common variants in the NLRP3 region contribute to Crohn's disease susceptibility","volume":"41","author":"Vilani","year":"2009","journal-title":"Nat. Genet."},{"key":"2023012511531456100_B28","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1006\/jmbi.1998.1700","article-title":"Identification of regulatory regions which confer muscle-specific gene expression","volume":"278","author":"Wasserman","year":"1998","journal-title":"J. Mol. Biol."},{"key":"2023012511531456100_B29","doi-asserted-by":"crossref","first-page":"452","DOI":"10.1109\/GrC.2007.72","article-title":"Comparison of machine learning and pattern discovery algorithms for the prediction of human single nucleotide polymorphisms","volume-title":"IEEE International Conference on Granular Computing (GRC 2007)","author":"Yan","year":"2007"},{"key":"2023012511531456100_B30","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1016\/S0378-1119(03)00670-X","article-title":"Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution","volume":"312","author":"Zhao","year":"2003","journal-title":"Gene"},{"key":"2023012511531456100_B31","doi-asserted-by":"crossref","first-page":"785","DOI":"10.1016\/j.ygeno.2004.06.015","article-title":"The influence of neighboring-nucleotide composition on single nucleotide polymorphisms (SNPs) in the mouse genome and its comparison with human SNPs","volume":"84","author":"Zhang","year":"2004","journal-title":"Genomics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/27\/15\/2054\/48864412\/bioinformatics_27_15_2054.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/27\/15\/2054\/48864412\/bioinformatics_27_15_2054.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,4,7]],"date-time":"2024-04-07T18:57:49Z","timestamp":1712516269000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/27\/15\/2054\/404480"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2011,6,17]]},"references-count":31,"journal-issue":{"issue":"15","published-print":{"date-parts":[[2011,8,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btr353","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2011,8,1]]},"published":{"date-parts":[[2011,6,17]]}}}