{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,30]],"date-time":"2025-12-30T17:51:39Z","timestamp":1767117099723},"reference-count":23,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2004,9,2]],"date-time":"2004-09-02T00:00:00Z","timestamp":1094083200000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/2.0"},{"start":{"date-parts":[[2004,9,2]],"date-time":"2004-09-02T00:00:00Z","timestamp":1094083200000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/2.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                        <jats:title>Background<\/jats:title>\n                        <jats:p>Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist for identifying candidate protein motifs at the whole genome level. However, a much needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary and these annotations serve well as a knowledge base.<\/jats:p>\n                     <\/jats:sec><jats:sec>\n                        <jats:title>Results<\/jats:title>\n                        <jats:p>This paper presents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. 
The task of assigning GO terms to protein motifs is viewed as both a binary classification and information retrieval problem, where PROSITE motifs are used as samples for model training and functional prediction. The mutual information of a motif and a GO term association is found to be a very useful feature. We take advantage of the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correct association. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria.<\/jats:p>\n                     <\/jats:sec><jats:sec>\n                        <jats:title>Conclusions<\/jats:title>\n                        <jats:p>In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical results demonstrate that the methods have great potential for detecting and augmenting information about the functions of newly discovered candidate protein motifs.<\/jats:p>\n                     <\/jats:sec>","DOI":"10.1186\/1471-2105-5-122","type":"journal-article","created":{"date-parts":[[2004,9,7]],"date-time":"2004-09-07T06:23:56Z","timestamp":1094538236000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":28,"title":["Automatic annotation of protein motif function with Gene Ontology terms"],"prefix":"10.1186","volume":"5","author":[{"given":"Xinghua","family":"Lu","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chengxiang","family":"Zhai","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vanathi","family":"Gopalakrishnan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bruce 
G","family":"Buchanan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2004,9,2]]},"reference":[{"issue":"2","key":"238_CR1","doi-asserted-by":"publisher","first-page":"279","DOI":"10.1089\/cmb.1998.5.279","volume":"5","author":"A Brazma","year":"1998","unstructured":"Brazma A, Jonassen IDG: Approaches to the automatic discovery of patterns in biosequences.\n                           J Comput Biol 1998, 5(2):279.","journal-title":"J Comput Biol"},{"key":"238_CR2","volume-title":"In: Technical Report CS-2000-22, University of Waterloo","author":"B Brejova","year":"2000","unstructured":"Brejova B, DiMarco C, Vinar T, Hidalgo SR, Holguin G, Patten D: Finding Patterns in Biological Sequences.\n                           In: Technical Report CS-2000\u201322, University of Waterloo 2000."},{"key":"238_CR3","doi-asserted-by":"publisher","first-page":"235","DOI":"10.1093\/nar\/30.1.235","volume":"30","author":"L Falquet","year":"2002","unstructured":"Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K, Bairoch A: The PROSITE database, its status in 2002.\n                           Nucleic Acids Res 2002, 30: 235\u2013238. 10.1093\/nar\/30.1.235","journal-title":"Nucleic Acids Res"},{"key":"238_CR4","doi-asserted-by":"publisher","first-page":"228","DOI":"10.1093\/nar\/28.1.228","volume":"28","author":"JG Henikoff","year":"2000","unstructured":"Henikoff JG, Greene EA, Pietrokovski S, Henikoff S: Increased coverage of protein families with the blocks database servers.\n                           Nucl Acids Res 2000, 28: 228\u2013230. 
10.1093\/nar\/28.1.228","journal-title":"Nucl Acids Res"},{"issue":"1","key":"238_CR5","doi-asserted-by":"publisher","first-page":"276","DOI":"10.1093\/nar\/30.1.276","volume":"30","author":"A Bateman","year":"2002","unstructured":"Bateman A, Birney E, Cerruti L, Durbin RLE, Eddy SR, Griffiths-Jones S, Howe KL, Marshall MELS: The Pfam protein families database.\n                           Nucleic Acids Res 2002, 30(1):276\u2013280. 10.1093\/nar\/30.1.276","journal-title":"Nucleic Acids Res"},{"issue":"11","key":"238_CR6","doi-asserted-by":"publisher","first-page":"5857","DOI":"10.1073\/pnas.95.11.5857","volume":"95","author":"J Schultz","year":"1998","unstructured":"Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains.\n                           Proc Natl Acad Sci USA 1998, 95(11):5857\u20135864. 10.1073\/pnas.95.11.5857","journal-title":"Proc Natl Acad Sci USA"},{"issue":"5131","key":"238_CR7","doi-asserted-by":"publisher","first-page":"208","DOI":"10.1126\/science.8211139","volume":"262","author":"CE Lawrence","year":"1993","unstructured":"Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.\n                           Science 1993, 262(5131):208\u2013214.","journal-title":"Science"},{"issue":"4","key":"238_CR8","doi-asserted-by":"publisher","first-page":"341","DOI":"10.1093\/bioinformatics\/16.4.341","volume":"16","author":"A Califano","year":"2000","unstructured":"Califano A: SPLASH: structural pattern localization analysis by sequential histograms.\n                           Bioinformatics 2000, 16(4):341\u2013357. 
10.1093\/bioinformatics\/16.4.341","journal-title":"Bioinformatics"},{"issue":"2","key":"238_CR9","first-page":"229","volume":"14","author":"I Rigoutsos","year":"1998","unstructured":"Rigoutsos I, Floratos A: Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm.\n                           Bioinformatics 1998, 14(2):229.","journal-title":"Bioinformatics"},{"key":"238_CR10","doi-asserted-by":"crossref","unstructured":"Consortium TGO: Creating the gene ontology resource: design and implementation.\n                           Genome Res 2001, (11):1425\u20131433. 10.1101\/gr.180801","DOI":"10.1101\/gr.180801"},{"issue":"1","key":"238_CR11","doi-asserted-by":"publisher","first-page":"203","DOI":"10.1101\/gr.199701","volume":"12","author":"S Raychaudhuri","year":"2002","unstructured":"Raychaudhuri S, Chang JT, Sutphin PD, Altman RB: Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature.\n                           Genome Res 2002, 12(1):203\u2013214. 10.1101\/gr.199701","journal-title":"Genome Res"},{"issue":"5","key":"238_CR12","doi-asserted-by":"publisher","first-page":"785","DOI":"10.1101\/gr.86902","volume":"12","author":"H Xie","year":"2002","unstructured":"Xie H, Wasserman A, Levine Z, Novik A, Grebinskiy V, Shoshan A, Mintz L: Large-scale protein annotation through gene ontology.\n                           Genome Res 2002, 12(5):785\u2013794. 10.1101\/gr.86902","journal-title":"Genome Res"},{"issue":"4","key":"238_CR13","doi-asserted-by":"publisher","first-page":"648","DOI":"10.1101\/gr.222902","volume":"12","author":"J Schug","year":"2002","unstructured":"Schug J, Diskin S, Mazzarelli J, Brunk BP, Stoeckert CJJ: Predicting gene ontology functions from ProDom and CDD protein domains.\n                           Genome Res 2002, 12(4):648\u2013655. 
10.1101\/gr.222902","journal-title":"Genome Res"},{"issue":"1","key":"238_CR14","doi-asserted-by":"publisher","first-page":"383","DOI":"10.1093\/nar\/gkg087","volume":"31","author":"A Marchler-Bauer","year":"2003","unstructured":"Marchler-Bauer A, Anderson J, DeWeese-Scott C, Fedorova N, Geer LHeS, Hurwitz D, Jackson J, Jacobs A, Lanczycki C, et al.: CDD: a curated Entrez database of conserved domain alignments.\n                           Nucleic Acids Res 2003, 31(1):383\u2013387. 10.1093\/nar\/gkg087","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"238_CR15","doi-asserted-by":"publisher","first-page":"267","DOI":"10.1093\/nar\/28.1.267","volume":"28","author":"F Corpet","year":"2000","unstructured":"Corpet F, Servant F, Gouzy J, Kahn D: ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons.\n                           Nucleic Acids Res 2000, 28(1):267\u2013269. 10.1093\/nar\/28.1.267","journal-title":"Nucleic Acids Res"},{"key":"238_CR16","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","volume":"215","author":"SF Altschul","year":"1990","unstructured":"Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool.\n                           J Mol Biol 1990, 215: 403\u2013410. 10.1006\/jmbi.1990.9999","journal-title":"J Mol Biol"},{"key":"238_CR17","unstructured":"Gene Ontology Consortium download site[http:\/\/www.godatabase.org\/dev\/database\/archive\/]"},{"key":"238_CR18","doi-asserted-by":"publisher","DOI":"10.1002\/0471200611","volume-title":"Elements of information Theory","author":"T Cover","year":"1991","unstructured":"Cover T, Thomas J: Elements of information Theory. John Wiley & Sons, Inc. 
1991."},{"key":"238_CR19","unstructured":"Yang Y: An Evaluation of Statistical Approaches to Text Categorization.\n                           J of Information Retrieval 1999., 1(1\/2):"},{"issue":"7","key":"238_CR20","doi-asserted-by":"publisher","first-page":"1145","DOI":"10.1016\/S0031-3203(96)00142-2","volume":"30","author":"AP Bradley","year":"1997","unstructured":"Bradley AP: The use of the area under the ROC curve in the evaluation of machine learning algorithms.\n                           Pattern Recognition 1997, 30(7):1145\u20131159. 10.1016\/S0031-3203(96)00142-2","journal-title":"Pattern Recognition"},{"key":"238_CR21","volume-title":"Applied logistic regression","author":"DWJ Hosmer","year":"1989","unstructured":"Hosmer DWJ, Lemeshow S: Applied logistic regression. John Wiley & Sons, Inc. 1989."},{"issue":"17","key":"238_CR22","doi-asserted-by":"publisher","first-page":"3901","DOI":"10.1093\/nar\/gkf464","volume":"30","author":"I Rigoutsos","year":"2002","unstructured":"Rigoutsos I, Huynh T, Floratos A, Parida L, Platt D: Dictionary-driven protein annotation.\n                           Nucleic Acids Res 2002, 30(17):3901\u20133916. 
10.1093\/nar\/gkf464","journal-title":"Nucleic Acids Res"},{"key":"238_CR23","volume-title":"Springer","author":"T Hastie","year":"2001","unstructured":"Hastie T, Tibshirani R, Friedman J: The elements of statistical learning.\n                           Springer 2001."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-5-122.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/1471-2105-5-122\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-5-122.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,7]],"date-time":"2024-10-07T12:15:22Z","timestamp":1728303322000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-5-122"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2004,9,2]]},"references-count":23,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2004,12]]}},"alternative-id":["238"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-5-122","relation":{},"ISSN":["1471-2105"],"issn-type":[{"type":"electronic","value":"1471-2105"}],"subject":[],"published":{"date-parts":[[2004,9,2]]},"assertion":[{"value":"16 October 2003","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 September 2004","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 September 2004","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"122"}}