{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T08:25:38Z","timestamp":1759134338056},"reference-count":49,"publisher":"Oxford University Press (OUP)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2006,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: Motif discovery in sequential data is a problem of great interest and with many applications. However, previous methods have been unable to combine exhaustive search with complex motif representations and are each typically only applicable to a certain class of problems.<\/jats:p>\n               <jats:p>Results: Here we present a generic motif discovery algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. As we show, Gemoda deterministically discovers motifs that are maximal in composition and length. As well, the algorithm allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices or any number of other models for any type of sequential data. 
We demonstrate a number of applications of the algorithm, including the discovery of motifs in amino acid sequences, a new solution to the (l,d)-motif problem in DNA sequences and the discovery of conserved protein substructures.<\/jats:p>\n               <jats:p>Availability: Gemoda is freely available at<\/jats:p>\n               <jats:p>Contact: gregstep@mit.edu<\/jats:p>\n               <jats:p>Supplementary Information: Available at<\/jats:p>","DOI":"10.1093\/bioinformatics\/bti745","type":"journal-article","created":{"date-parts":[[2005,10,29]],"date-time":"2005-10-29T00:13:06Z","timestamp":1130544786000},"page":"21-28","source":"Crossref","is-referenced-by-count":48,"title":["A generic motif discovery algorithm for sequential data"],"prefix":"10.1093","volume":"22","author":[{"given":"Kyle L.","family":"Jensen","sequence":"first","affiliation":[{"name":"Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA"}]},{"given":"Mark P.","family":"Styczynski","sequence":"additional","affiliation":[{"name":"Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA"}]},{"given":"Isidore","family":"Rigoutsos","sequence":"additional","affiliation":[{"name":"Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA"},{"name":"IBM Research Division, Thomas J. 
Watson Research Center, Yorktown Heights, NY 10598, USA"}]},{"given":"Gregory N.","family":"Stephanopoulos","sequence":"additional","affiliation":[{"name":"Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA"}]}],"member":"286","published-online":{"date-parts":[[2005,10,27]]},"reference":[{"key":"2023012408301769500_b1","doi-asserted-by":"crossref","first-page":"727","DOI":"10.1093\/protein\/9.9.727","article-title":"SARFing the PDB","volume":"9","author":"Alexandrov","year":"1996","journal-title":"Protein Eng."},{"key":"2023012408301769500_b2","doi-asserted-by":"crossref","first-page":"354","DOI":"10.1002\/(SICI)1097-0134(199607)25:3<354::AID-PROT7>3.0.CO;2-F","article-title":"Analysis of topological and nontopological structural similarities in the PDB: new examples with old structures","volume":"25","author":"Alexandrov","year":"1996","journal-title":"Proteins"},{"key":"2023012408301769500_b3","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res."},{"key":"2023012408301769500_b4","doi-asserted-by":"crossref","first-page":"469","DOI":"10.1016\/S0968-0004(98)01293-6","article-title":"The HD domain defines a new superfamily of metal-dependent phosphohydrolases","volume":"23","author":"Aravind","year":"1998","journal-title":"Trends Biochem Sci."},{"key":"2023012408301769500_b5","doi-asserted-by":"crossref","first-page":"698","DOI":"10.1109\/TPAMI.1987.4767965","article-title":"Least-squares fitting of two 3-d point sets","volume":"9","author":"Arun","year":"1987","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"2023012408301769500_b6","first-page":"28","article-title":"Fitting a mixture model by expectation maximization to discover motifs in biopolymers","volume":"2","author":"Bailey","year":"1994","journal-title":"Proc. Int. Conf. Intell. Syst. Mol. Biol."},{"key":"2023012408301769500_b7","doi-asserted-by":"crossref","first-page":"304","DOI":"10.1093\/nar\/28.1.304","article-title":"The ENZYME database in 2000","volume":"28","author":"Bairoch","year":"2000","journal-title":"Nucleic Acids Res."},{"key":"2023012408301769500_b8","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1093\/nar\/28.1.45","article-title":"The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000","volume":"28","author":"Bairoch","year":"2000","journal-title":"Nucleic Acids Res."},{"issue":"Database issue","key":"2023012408301769500_b9","doi-asserted-by":"crossref","first-page":"D138","DOI":"10.1093\/nar\/gkh121","article-title":"The Pfam protein families database","volume":"32","author":"Bateman","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"2023012408301769500_b10","first-page":"69","article-title":"Finding motifs using random projections","author":"Buhler","year":"2001"},{"key":"2023012408301769500_b11","doi-asserted-by":"crossref","first-page":"1188","DOI":"10.1101\/gr.849004","article-title":"WebLogo: A sequence logo generator","volume":"14","author":"Crooks","year":"2004","journal-title":"Genome Res."},{"key":"2023012408301769500_b12","doi-asserted-by":"crossref","first-page":"953","DOI":"10.1038\/nsb1101-953","article-title":"Identification of homology in protein structure classification","volume":"8","author":"Dietmann","year":"2001","journal-title":"Nat. Struct. 
Biol."},{"key":"2023012408301769500_b13","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1093\/bioinformatics\/14.9.755","article-title":"Profile hidden Markov models","volume":"14","author":"Eddy","year":"1998","journal-title":"Bioinformatics"},{"key":"2023012408301769500_b14","doi-asserted-by":"crossref","first-page":"685","DOI":"10.1089\/106652701446152","article-title":"Structure comparison and structure patterns","volume":"7","author":"Eidhammer","year":"2000","journal-title":"J. Comput. Biol."},{"key":"2023012408301769500_b15","doi-asserted-by":"crossref","first-page":"354","DOI":"10.1093\/bioinformatics\/18.suppl_1.S354","article-title":"Finding composite regulatory patterns in DNA sequences","volume":"18","author":"Eskin","year":"2002","journal-title":"Bioinformatics"},{"key":"2023012408301769500_b16","volume-title":"Computers and Intractability: A Guide to the Theory of NP-Completeness","author":"Garey","year":"1979"},{"key":"2023012408301769500_b17","doi-asserted-by":"crossref","first-page":"10915","DOI":"10.1073\/pnas.89.22.10915","article-title":"Amino acid substitution matrices from protein blocks","volume":"89","author":"Henikoff","year":"1992","journal-title":"Proc. Natl Acad. Sci. 
USA"},{"key":"2023012408301769500_b18","doi-asserted-by":"crossref","first-page":"GC17","DOI":"10.1016\/0378-1119(95)00486-P","article-title":"Automated construction and graphical presentation of protein blocks from unaligned sequences","volume":"163","author":"Henikoff","year":"1995","journal-title":"Gene"},{"key":"2023012408301769500_b19","doi-asserted-by":"crossref","first-page":"563","DOI":"10.1093\/bioinformatics\/15.7.563","article-title":"Identifying DNA and protein patterns with statistically significant alignments of multiple sequences","volume":"15","author":"Hertz","year":"1999","journal-title":"Bioinformatics"},{"key":"2023012408301769500_b20","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1093\/nar\/27.1.215","article-title":"The PROSITE database, its status in 1999","volume":"27","author":"Hofmann","year":"1999","journal-title":"Nucleic Acids Res."},{"key":"2023012408301769500_b21","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1006\/jmbi.1993.1489","article-title":"Protein structure comparison by alignment of distance matrices","volume":"233","author":"Holm","year":"1993","journal-title":"J. Mol. Biol."},{"key":"2023012408301769500_b22","doi-asserted-by":"crossref","first-page":"116","DOI":"10.1016\/S0968-0004(97)01021-9","article-title":"Enzyme HIT","volume":"22","author":"Holm","year":"1997","journal-title":"Trends Biochem Sci."},{"key":"2023012408301769500_b23","doi-asserted-by":"crossref","first-page":"1691","DOI":"10.1002\/pro.5560011217","article-title":"A database of protein structure families with common folding motifs","volume":"1","author":"Holm","year":"1992","journal-title":"Protein Sci."},{"key":"2023012408301769500_b24","doi-asserted-by":"crossref","first-page":"629","DOI":"10.1364\/JOSAA.4.000629","article-title":"Closed-form solution of absolute orientation using unit quaternions","volume":"4","author":"Horn","year":"1987","journal-title":"J. Optical Soc. 
America A"},{"key":"2023012408301769500_b25","doi-asserted-by":"crossref","first-page":"580","DOI":"10.1002\/prot.10309","article-title":"Protein fragment clustering and canonical local shapes","volume":"50","author":"Hunter","year":"2003","journal-title":"Proteins"},{"key":"2023012408301769500_b26","doi-asserted-by":"crossref","first-page":"1587","DOI":"10.1002\/pro.5560040817","article-title":"Finding flexible patterns in unaligned protein sequences","volume":"4","author":"Jonassen","year":"1995","journal-title":"Protein Sci"},{"key":"2023012408301769500_b27","doi-asserted-by":"crossref","first-page":"362","DOI":"10.1093\/bioinformatics\/18.2.362","article-title":"Structure motif discovery and mining the PDB","volume":"18","author":"Jonassen","year":"2002","journal-title":"Bioinformatics"},{"key":"2023012408301769500_b28","doi-asserted-by":"crossref","first-page":"1374","DOI":"10.1093\/bioinformatics\/18.10.1374","article-title":"Finding motifs in the twilight zone","volume":"18","author":"Keich","year":"2002","journal-title":"Bioinformatics"},{"key":"2023012408301769500_b29","doi-asserted-by":"crossref","first-page":"1173","DOI":"10.1016\/j.jmb.2004.12.032","article-title":"Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures","volume":"346","author":"Kolodny","year":"2005","journal-title":"J. Mol. 
Biol."},{"key":"2023012408301769500_b30","doi-asserted-by":"crossref","first-page":"208","DOI":"10.1126\/science.8211139","article-title":"Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment","volume":"262","author":"Lawrence","year":"1993","journal-title":"Science"},{"key":"2023012408301769500_b31","doi-asserted-by":"crossref","first-page":"763","DOI":"10.1016\/S0969-2126(97)00231-1","article-title":"MAD analysis of FHIT, a putative human tumor suppressor from the HIT protein family","volume":"5","author":"Lima","year":"1997","journal-title":"Structure"},{"key":"2023012408301769500_b32","doi-asserted-by":"crossref","first-page":"356","DOI":"10.1002\/prot.340230309","article-title":"Threading a database of protein cores","volume":"23","author":"Madej","year":"1995","journal-title":"Proteins"},{"key":"2023012408301769500_b33","first-page":"124","article-title":"Pattern discovery allowing wild-cards, substitution matrices, and multiple score functions","author":"Mancheron","year":"2003"},{"key":"2023012408301769500_b34","doi-asserted-by":"crossref","first-page":"383","DOI":"10.1093\/nar\/gkg087","article-title":"CDD: a curated Entrez database of conserved domain alignments","volume":"31","author":"Marchler-Bauer","year":"2003","journal-title":"Nucleic Acids Res."},{"key":"2023012408301769500_b35","doi-asserted-by":"crossref","first-page":"502","DOI":"10.1093\/nar\/gkg012","article-title":"RNABase: an annotated database of RNA structures","volume":"31","author":"Murthy","year":"2003","journal-title":"Nucleic Acids Res."},{"key":"2023012408301769500_b36","doi-asserted-by":"crossref","first-page":"617","DOI":"10.1016\/S0076-6879(96)66038-8","article-title":"SSAP: sequential structure alignment program for protein structure comparison","volume":"266","author":"Orengo","year":"1996","journal-title":"Methods 
Enzymol"},{"key":"2023012408301769500_b37","doi-asserted-by":"crossref","first-page":"2606","DOI":"10.1110\/ps.0215902","article-title":"MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison","volume":"11","author":"Ortiz","year":"2002","journal-title":"Protein Sci."},{"key":"2023012408301769500_b38","first-page":"269","article-title":"Combinatorial Approaches to finding subtle signals in DNA sequences","author":"Pevzner","year":"2000"},{"key":"2023012408301769500_b39","author":"Pevzner","year":"2001"},{"key":"2023012408301769500_b40","doi-asserted-by":"crossref","first-page":"II149","DOI":"10.1093\/bioinformatics\/btg1072","article-title":"Finding subtle motifs by branching from sample strings","volume":"19","author":"Price","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012408301769500_b41","doi-asserted-by":"crossref","first-page":"264","DOI":"10.1002\/(SICI)1097-0134(19991101)37:2<264::AID-PROT11>3.0.CO;2-C","article-title":"Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins","volume":"37","author":"Rigoutsos","year":"1999","journal-title":"Proteins"},{"key":"2023012408301769500_b42","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1093\/bioinformatics\/14.1.55","article-title":"Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm","volume":"14","author":"Rigoutsos","year":"1998","journal-title":"Bioinformatics"},{"issue":"Database issue","key":"2023012408301769500_b43","doi-asserted-by":"crossref","first-page":"D303","DOI":"10.1093\/nar\/gkh140","article-title":"RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12","volume":"32","author":"Salgado","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"2023012408301769500_b44","first-page":"63","article-title":"An extension and novel solution to the motif challenge 
problem","volume":"15","author":"Styczynski","year":"2004","journal-title":"Genome Informatics"},{"key":"2023012408301769500_b45","first-page":"91","article-title":"An Optimal Algorithm for finding all the cliques","volume":"12","author":"Tomita","year":"1989","journal-title":"SIG Algorithms"},{"key":"2023012408301769500_b46","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1038\/nbt1053","article-title":"Assessing computational tools for the discovery of transcription factor binding sites","volume":"23","author":"Tompa","year":"2005","journal-title":"Nat. Biotechnol."},{"key":"2023012408301769500_b47","doi-asserted-by":"crossref","first-page":"11560","DOI":"10.1021\/bi9612677","article-title":"The structure of nucleotidylated histidine-166 of galactose-1-phosphate uridylyltransferase provides insight into phosphoryl group transfer","volume":"35","author":"Wedekind","year":"1996","journal-title":"Biochemistry"},{"key":"2023012408301769500_b48","first-page":"7:1","article-title":"Theoretical foundations of association rules","author":"Zaki","year":"1998"},{"key":"2023012408301769500_b49","doi-asserted-by":"crossref","first-page":"372","DOI":"10.1109\/69.846291","article-title":"Scalable algorithms for association mining","volume":"12","author":"Zaki","year":"2000","journal-title":"Knowledge Data 
Eng."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/22\/1\/21\/48837906\/bioinformatics_22_1_21.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/22\/1\/21\/48837906\/bioinformatics_22_1_21.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,24]],"date-time":"2023-01-24T08:33:26Z","timestamp":1674549206000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/22\/1\/21\/218799"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2005,10,27]]},"references-count":49,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2006,1,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bti745","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2006,1,1]]},"published":{"date-parts":[[2005,10,27]]}}}