{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T07:46:34Z","timestamp":1770968794605,"version":"3.50.1"},"reference-count":57,"publisher":"Oxford University Press (OUP)","issue":"12","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":2685,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/2.0\/uk\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2009,6,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate \u2018grammatical organization\u2019 of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence\/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. 
Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features.<\/jats:p>\n               <jats:p>Results: This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score.<\/jats:p>\n               <jats:p>Availability and Implementation: The code is publicly available at http:\/\/www.sailing.cs.cmu.edu\/discover.html.<\/jats:p>\n               <jats:p>Contact: \u00a0epxing@cs.cmu.edu<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btp230","type":"journal-article","created":{"date-parts":[[2009,5,28]],"date-time":"2009-05-28T15:48:54Z","timestamp":1243525734000},"page":"i321-i329","source":"Crossref","is-referenced-by-count":10,"title":["DISCOVER: a feature-based discriminative method for motif search in complex genomes"],"prefix":"10.1093","volume":"25","author":[{"given":"Wenjie","family":"Fu","sequence":"first","affiliation":[{"name":"School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA"}]},{"given":"Pradipta","family":"Ray","sequence":"additional","affiliation":[{"name":"School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA"}]},{"given":"Eric P.","family":"Xing","sequence":"additional","affiliation":[{"name":"School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, 
USA"}]}],"member":"286","published-online":{"date-parts":[[2009,5,27]]},"reference":[{"key":"2023013112013432600_B1","doi-asserted-by":"crossref","first-page":"W195","DOI":"10.1093\/nar\/gkh387","article-title":"Mscan: identification of functional clusters of transcription factor binding sites","volume":"32","author":"Alkema","year":"2004","journal-title":"Nucleic Acids Res"},{"key":"2023013112013432600_B2","volume-title":"Nonlinear Programming: Analysis and Methods.","author":"Avriel","year":"2003"},{"key":"2023013112013432600_B3","doi-asserted-by":"crossref","first-page":"757","DOI":"10.1073\/pnas.231608898","article-title":"Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome","volume":"99","author":"Berman","year":"2002","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023013112013432600_B4","doi-asserted-by":"crossref","first-page":"708","DOI":"10.1101\/gr.1933104","article-title":"Aligning multiple genomic sequences with the threaded blockset aligner","volume":"14","author":"Blanchette","year":"2004","journal-title":"Genome Res."},{"key":"2023013112013432600_B5","first-page":"193","article-title":"Markov networks for detecting overlapping elements in sequence data","volume":"17","author":"Bockhorst","year":"2005","journal-title":"Proc. Adv. Neural Inform. Process. Syst."},{"key":"2023013112013432600_B6","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511804441","volume-title":"Convex Optimization.","author":"Boyd","year":"2004"},{"key":"2023013112013432600_B7","doi-asserted-by":"crossref","first-page":"5992","DOI":"10.1073\/pnas.91.13.5992","article-title":"Evolutionary selection against change in many Alu repeat sequences interspersed through primate genomes","volume":"91","author":"Britten","year":"1994","journal-title":"Proc. Natl Acad. Sci. 
USA"},{"key":"2023013112013432600_B8","doi-asserted-by":"crossref","first-page":"1255","DOI":"10.1093\/nar\/30.5.1255","article-title":"Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors","volume":"30","author":"Bulyk","year":"2002","journal-title":"Nucleic Acids Res."},{"key":"2023013112013432600_B9","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1016\/j.cell.2005.05.008","article-title":"Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the forkhead protein FoxA1","volume":"122","author":"Carroll","year":"2005","journal-title":"Cell"},{"key":"2023013112013432600_B10","doi-asserted-by":"crossref","first-page":"1264","DOI":"10.1093\/bioinformatics\/btn112","article-title":"Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection","volume":"24","author":"Damoulas","year":"2008","journal-title":"Bioinformatics"},{"key":"2023013112013432600_B11","volume-title":"Genomic Regulatory Systems.","author":"Davidson","year":"2001"},{"key":"2023013112013432600_B12","doi-asserted-by":"crossref","first-page":"1389","DOI":"10.1101\/gr.6558107","article-title":"Conrad: gene prediction using conditional random fields","volume":"17","author":"DeCaprio","year":"2007","journal-title":"Genome Res."},{"key":"2023013112013432600_B13","doi-asserted-by":"crossref","first-page":"396","DOI":"10.1186\/1471-2105-7-396","article-title":"Predicting transcription factor binding sites using local over-representation and comparative genomics","volume":"7","author":"Defrance","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2023013112013432600_B14","doi-asserted-by":"crossref","first-page":"3058","DOI":"10.1093\/bioinformatics\/bti461","article-title":"Tfbscluster: a resource for the characterization of transcriptional regulatory 
networks","volume":"21","author":"Donaldson","year":"2005","journal-title":"Bioinformatics"},{"key":"2023013112013432600_B15","doi-asserted-by":"crossref","first-page":"1455","DOI":"10.1101\/gr.4140006","article-title":"Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques","volume":"16","author":"Elnitski","year":"2006","journal-title":"Genome Res."},{"key":"2023013112013432600_B16","article-title":"Computational Methods for Analyzing and Modeling Gene Regulation Dynamics","volume-title":"PhD dissertation.","author":"Ernst","year":"2008"},{"key":"2023013112013432600_B17","doi-asserted-by":"crossref","first-page":"3214","DOI":"10.1093\/nar\/gkf438","article-title":"Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences","volume":"30","author":"Frith","year":"2002","journal-title":"Nucleic Acids Res."},{"key":"2023013112013432600_B18","doi-asserted-by":"crossref","first-page":"3666","DOI":"10.1093\/nar\/gkg540","article-title":"Cluster-buster: finding dense clusters of motifs in dna sequences","volume":"31","author":"Frith","year":"2003","journal-title":"Nucleic Acids Res."},{"key":"2023013112013432600_B19","doi-asserted-by":"crossref","first-page":"381","DOI":"10.1093\/bioinformatics\/bti794","article-title":"Redfly: a regulatory element database for drosophila","volume":"22","author":"Gallo","year":"2006","journal-title":"Bioinformatics"},{"key":"2023013112013432600_B20","doi-asserted-by":"crossref","first-page":"R269","DOI":"10.1186\/gb-2007-8-12-r269","article-title":"CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction","volume":"8","author":"Gros","year":"2007","journal-title":"Genome Biol."},{"issue":"Suppl. 
1","key":"2023013112013432600_B21","doi-asserted-by":"crossref","first-page":"i169","DOI":"10.1093\/bioinformatics\/btg1021","article-title":"Identification of functional clusters of transcription factor binding motifs in genome sequences: the mscan algorithm","volume":"19","author":"Johansson","year":"2003","journal-title":"Bioinformatics"},{"key":"2023013112013432600_B22","doi-asserted-by":"crossref","first-page":"462","DOI":"10.1159\/000084979","article-title":"Repbase Update, a database of eukaryotic repetitive elements","volume":"110","author":"Jurka","year":"2005","journal-title":"Cytogenet. Genome Res."},{"key":"2023013112013432600_B23","doi-asserted-by":"crossref","first-page":"2740","DOI":"10.1073\/pnas.0511238103","article-title":"A large family of ancient repeat elements in the human genome is under strong selection","volume":"103","author":"Kamal","year":"2006","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023013112013432600_B24","doi-asserted-by":"crossref","first-page":"262","DOI":"10.1186\/1471-2105-9-262","article-title":"Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites","volume":"9","author":"Kim","year":"2008","journal-title":"BMC Bioinformatics"},{"key":"2023013112013432600_B25","article-title":"Conditional random fields: probabilistic models for segmenting and labeling sequence data","volume-title":"Proceedings of the 18th International Conference on Machine Learning (ICML 2001).","author":"Lafferty","year":"2001"},{"key":"2023013112013432600_B26","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-540-78839-3_7","article-title":"Baycis: a bayesian hierarchical hmm for cis-regulatory module decoding in metazoan genomes","volume-title":"Proceedings of RECOMB 2008.","author":"Lin","year":"2008"},{"key":"2023013112013432600_B27","doi-asserted-by":"crossref","first-page":"832","DOI":"10.1101\/gr.225502","article-title":"rVista for comparative sequence-based 
discovery of functional transcription factor binding sites","volume":"12","author":"Loots","year":"2002","journal-title":"Genome Res."},{"key":"2023013112013432600_B28","doi-asserted-by":"crossref","first-page":"2507","DOI":"10.1101\/gr.1602203","article-title":"Identification & characterization of multi-species conserved sequences","volume":"13","author":"Margulies","year":"2003","journal-title":"Genome Res."},{"key":"2023013112013432600_B29","doi-asserted-by":"crossref","first-page":"546","DOI":"10.1073\/pnas.032685999","article-title":"Deciphering genetic regulatory codes: a challenge for functional genomics","volume":"99","author":"Michelson","year":"2002","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023013112013432600_B30","first-page":"324","article-title":"Phylogenetic motif detection by expectation-maximization on evolutionary mixtures","volume-title":"Proceedings of Pac. Symp. Biocomput. 2004.","author":"Moses","year":"2004"},{"key":"2023013112013432600_B31","article-title":"Computational annotation of transcription factor binding sites in D. melanogaster developmental genes","volume-title":"Proceedings of The 17th International Conference on Genome Informatics.","author":"Narang","year":"2006"},{"key":"2023013112013432600_B32","doi-asserted-by":"crossref","first-page":"e215","DOI":"10.1371\/journal.pcbi.0030215","article-title":"A nucleosome-guided map of transcription factor binding sites in yeast","volume":"3","author":"Narlikar","year":"2007","journal-title":"PLoS Comput. 
Biol."},{"key":"2023013112013432600_B33","doi-asserted-by":"crossref","first-page":"5730","DOI":"10.1093\/nar\/gkl585","article-title":"A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites","volume":"34","author":"Naughton","year":"2006","journal-title":"Nucleic Acids Res."},{"key":"2023013112013432600_B34","doi-asserted-by":"crossref","first-page":"e156","DOI":"10.1093\/bioinformatics\/btl319","article-title":"Learning probabilistic models of cis-regulatory modules that represent logical and spatial aspects","volume":"23","author":"Noto","year":"2007","journal-title":"Bioinformatics"},{"key":"2023013112013432600_B35","doi-asserted-by":"crossref","first-page":"244","DOI":"10.1038\/nbt1279","article-title":"High-throughput mapping of the chromatin structure of human promoters","volume":"25","author":"Ozsolak","year":"2007","journal-title":"Nat. Biotechnol."},{"key":"2023013112013432600_B36","volume-title":"Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.","author":"Pearl","year":"1988"},{"key":"2023013112013432600_B37","doi-asserted-by":"crossref","first-page":"654","DOI":"10.1093\/bioinformatics\/15.7.654","article-title":"Conformational and physicochemical DNA features specific for transcription factor binding sites","volume":"15","author":"Ponomarenko","year":"1999","journal-title":"Bioinformatics"},{"key":"2023013112013432600_B38","first-page":"43","article-title":"Feature based representation and detection of transcription factor binding sites","volume-title":"Proceedings of the German Conference on Bioinformatics 2004.","author":"Pudimat","year":"2004"},{"key":"2023013112013432600_B39","doi-asserted-by":"crossref","first-page":"30","DOI":"10.1186\/1471-2105-3-30","article-title":"Computational detection of genomic cis-regulatory modules applied to body patterning in the early drosophila embryo","volume":"3","author":"Rajewsky","year":"2002","journal-title":"BMC 
bioinformatics"},{"key":"2023013112013432600_B40","doi-asserted-by":"crossref","first-page":"e1000090","DOI":"10.1371\/journal.pcbi.1000090","article-title":"Csmet: comparative genomic motif detection via multi-resolution phylogenetic shadowing","volume":"4","author":"Ray","year":"2008","journal-title":"PLoS Comput. Biol."},{"key":"2023013112013432600_B41","doi-asserted-by":"crossref","first-page":"9888","DOI":"10.1073\/pnas.152320899","article-title":"Score: a computational approach to the identification of cis-regulatory modules and target genes in whole-genome sequence data. site clustering over random expectation","volume":"99","author":"Rebeiz","year":"2002","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023013112013432600_B42","doi-asserted-by":"crossref","DOI":"10.1186\/1745-6150-1-11","article-title":"A survey of motif discovery methods in an integrated framework","volume":"1","author":"Sandve","year":"2006","journal-title":"Biol. Direct"},{"key":"2023013112013432600_B43","doi-asserted-by":"crossref","first-page":"772","DOI":"10.1038\/nature04979","article-title":"A genomic code for nucleosome positioning","volume":"442","author":"Segal","year":"2006","journal-title":"Nature"},{"key":"2023013112013432600_B44","first-page":"134","article-title":"Shallow parsing with conditional random fields","volume":"1","author":"Sha","year":"2003","journal-title":"Proc. Hum. Lang. Tech.-NAACL"},{"issue":"Suppl. 
1","key":"2023013112013432600_B45","doi-asserted-by":"crossref","first-page":"i283","DOI":"10.1093\/bioinformatics\/btg1039","article-title":"Creme: a framework for identifying cis-regulatory modules in human-mouse conserved segments","volume":"19","author":"Sharan","year":"2003","journal-title":"Bioinformatics"},{"key":"2023013112013432600_B46","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1007\/978-3-540-71681-5_6","article-title":"A feature-based approach to modeling protein-dna interactions","volume":"4453","author":"Sharon","year":"2007","journal-title":"Lect. Notes Comput. Sci."},{"key":"2023013112013432600_B47","first-page":"30","article-title":"Phylogibbs: a gibbs sampler incorporating phylogenetic information","volume-title":"Regulatory Genomics.","author":"Siddharthan","year":"2004"},{"key":"2023013112013432600_B48","doi-asserted-by":"crossref","first-page":"e216","DOI":"10.1371\/journal.pcbi.0030216","article-title":"MORPH: probabilistic alignment combined with hidden Markov models of cis-regulatory modules","volume":"3","author":"Sinha","year":"2007","journal-title":"PLoS Comput. 
Biol."},{"key":"2023013112013432600_B49","doi-asserted-by":"crossref","first-page":"170","DOI":"10.1186\/1471-2105-5-170","article-title":"PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences","volume":"5","author":"Sinha","year":"2004","journal-title":"BMC Bioinformatics"},{"key":"2023013112013432600_B50","doi-asserted-by":"crossref","first-page":"W555","DOI":"10.1093\/nar\/gkl224","article-title":"Stubb: a program for discovery and analysis of cis-regulatory modules","volume":"34","author":"Sinha","year":"2006","journal-title":"Nucleic Acids Res."},{"key":"2023013112013432600_B51","doi-asserted-by":"crossref","first-page":"477","DOI":"10.1101\/gr.6828808","article-title":"Systematic functional characterization of cis-regulatory motifs in human core promoters","volume":"18","author":"Sinha","year":"2008","journal-title":"Genome Res."},{"key":"2023013112013432600_B52","doi-asserted-by":"crossref","first-page":"505","DOI":"10.1093\/nar\/12.1Part2.505","article-title":"Computer methods to locate signals in nucleic acid sequences","volume":"12","author":"Staden","year":"1984","journal-title":"Nucleic Acids Res."},{"issue":"Suppl. 
1","key":"2023013112013432600_B53","doi-asserted-by":"crossref","first-page":"i440","DOI":"10.1093\/bioinformatics\/bti1028","article-title":"Alignments anchored on genomic landmarks can aid in the identification of regulatory elements","volume":"21","author":"Tharakaraman","year":"2005","journal-title":"Bioinformatics"},{"key":"2023013112013432600_B54","doi-asserted-by":"crossref","first-page":"1967","DOI":"10.1101\/gr.2589004","article-title":"Decoding human regulatory circuits","volume":"14","author":"Thompson","year":"2004","journal-title":"Genome Res."},{"key":"2023013112013432600_B55","doi-asserted-by":"crossref","first-page":"i165","DOI":"10.1093\/bioinformatics\/btn154","article-title":"Predicting functional transcription factor binding through alignment-free and affinity-based analysis of orthologous promoter sequences","volume":"24","author":"Ward","year":"2008","journal-title":"Bioinformatics"},{"key":"2023013112013432600_B56","doi-asserted-by":"crossref","first-page":"316","DOI":"10.1093\/nar\/28.1.316","article-title":"TRANSFAC: an integrated system for gene expression regulation","volume":"28","author":"Wingender","year":"2000","journal-title":"Nucleic Acids Res."},{"issue":"Suppl. 
6","key":"2023013112013432600_B57","doi-asserted-by":"crossref","first-page":"S3","DOI":"10.1186\/1471-2105-8-S6-S3","article-title":"Computational analyses of eukaryotic promoters","volume":"8","author":"Zhang","year":"2007","journal-title":"BMC Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/25\/12\/i321\/48992603\/bioinformatics_25_12_i321.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/25\/12\/i321\/48992603\/bioinformatics_25_12_i321.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,31]],"date-time":"2023-01-31T21:09:43Z","timestamp":1675199383000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/25\/12\/i321\/192868"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,5,27]]},"references-count":57,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2009,6,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btp230","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2009,6,15]]},"published":{"date-parts":[[2009,5,27]]}}}