{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,7]],"date-time":"2026-02-07T15:47:30Z","timestamp":1770479250938,"version":"3.49.0"},"reference-count":15,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2009,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Predicting the function of a protein from its sequence is a long-standing challenge of bioinformatic research, typically addressed using either sequence-similarity or sequence-motifs. We employ the novel motif method that consists of Specific Peptides (SPs) that are unique to specific branches of the Enzyme Commission (EC) functional classification. We devise the Data Mining of Enzymes (DME) methodology that allows for searching SPs on arbitrary proteins, determining from its sequence whether a protein is an enzyme and what the enzyme's EC classification is.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>We extract novel SP sets from Swiss-Prot enzyme data. Using a training set of July 2006, and test sets of July 2008, we find that the predictive power of SPs, both for true-positives (enzymes) and true-negatives (non-enzymes), depends on the coverage length of all SP matches (the number of amino-acids matched on the protein sequence). DME is quite different from BLAST. Comparing the two on an enzyme test set of July 2008, we find that DME has lower recall. On the other hand, DME can provide predictions for proteins regarded by BLAST as having low homologies with known enzymes, thus supplying complementary information. We test our method on a set of proteins belonging to 10 bacteria, dated July 2008, establishing the usefulness of the coverage-length cutoff to determine true-negatives. Moreover, sifting through our predictions we find that some of them have been substantiated by Swiss-Prot annotations by July 2009. Finally we extract, for production purposes, a novel SP set trained on all Swiss-Prot enzymes as of July 2009. This new set increases considerably the recall of DME. The new SP set is being applied to three metagenomes: Sargasso Sea with over 1,000,000 proteins, producing predictions of over 220,000 enzymes, and two human gut metagenomes. The outcome of these analyses can be characterized by the enzymatic profile of the metagenomes, describing the relative numbers of enzymes observed for different EC categories.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusions<\/jats:title>\n            <jats:p>Employing SPs for predicting enzymatic activity of proteins works well once one utilizes coverage-length criteria. In our analysis, L \u2265 7 has led to highly accurate results.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-10-446","type":"journal-article","created":{"date-parts":[[2009,12,24]],"date-time":"2009-12-24T19:14:38Z","timestamp":1261682078000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Data mining of enzymes using specific peptides"],"prefix":"10.1186","volume":"10","author":[{"given":"Uri","family":"Weingart","sequence":"first","affiliation":[]},{"given":"Yair","family":"Lavi","sequence":"additional","affiliation":[]},{"given":"David","family":"Horn","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2009,12,24]]},"reference":[{"issue":"11","key":"3176_CR1","doi-asserted-by":"publisher","first-page":"e368","DOI":"10.1371\/journal.pbio.0040368","volume":"4","author":"FE Angly","year":"2006","unstructured":"Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F: The Marine Viromes of Four Oceanic Regions. PLoS Biol 2006, 4(11):e368. 10.1371\/journal.pbio.0040368","journal-title":"PLoS Biol"},{"issue":"3","key":"3176_CR2","doi-asserted-by":"publisher","first-page":"e82","DOI":"10.1371\/journal.pbio.0050082","volume":"5","author":"JA Eisen","year":"2007","unstructured":"Eisen JA: Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biol 2007, 5(3):e82. 10.1371\/journal.pbio.0050082","journal-title":"PLoS Biol"},{"key":"3176_CR3","doi-asserted-by":"publisher","first-page":"863","DOI":"10.1016\/j.jmb.2003.08.057","volume":"333","author":"W Tian","year":"2003","unstructured":"Tian W, Skolnick J: How well is enzyme function conserved as a function of pairwise sequence identity? J Mol Biol 2003, 333: 863\u2013882. 10.1016\/j.jmb.2003.08.057","journal-title":"J Mol Biol"},{"key":"3176_CR4","doi-asserted-by":"publisher","first-page":"366","DOI":"10.1016\/S0959-440X(96)80057-1","volume":"6","author":"P Bork","year":"1996","unstructured":"Bork P, Koonin EV: Protein sequence motifs. Curr Op Structural Biology 1996, 6: 366\u2013376. 10.1016\/S0959-440X(96)80057-1","journal-title":"Curr Op Structural Biology"},{"key":"3176_CR5","doi-asserted-by":"publisher","first-page":"217","DOI":"10.1093\/nar\/25.1.217","volume":"25","author":"A Bairoch","year":"1997","unstructured":"Bairoch A, Bucher P, Hofmann K: Prosite. Nuc Acids Res 1997, 25: 217\u2013221. 10.1093\/nar\/25.1.217","journal-title":"Nuc Acids Res"},{"issue":"8","key":"3176_CR6","doi-asserted-by":"publisher","first-page":"e167","DOI":"10.1371\/journal.pcbi.0030167","volume":"3","author":"V Kunik","year":"2007","unstructured":"Kunik V, Meroz Y, Solan Z, Sandbank B, Weingart U, Ruppin E, Horn D: Functional representation of enzymes by specific peptides. PLOS Comp Biol 2007, 3(8):e167. 10.1371\/journal.pcbi.0030167","journal-title":"PLOS Comp Biol"},{"key":"3176_CR7","doi-asserted-by":"publisher","first-page":"11629","DOI":"10.1073\/pnas.0409746102","volume":"102","author":"Z Solan","year":"2005","unstructured":"Solan Z, Horn D, Ruppin E, Edelman S: Unsupervised learning of natural languages. Proc Natl Acad Sci USA 2005, 102: 11629\u201311634. 10.1073\/pnas.0409746102","journal-title":"Proc Natl Acad Sci USA"},{"issue":"2","key":"3176_CR8","doi-asserted-by":"publisher","first-page":"606","DOI":"10.1002\/prot.21951","volume":"72","author":"Y Meroz","year":"2008","unstructured":"Meroz Y, Horn D: Biological Roles of Specific Peptides in Enzymes. Proteins: Structure, Function, and Bioinformatics 2008, 72(2):606\u2013612. 10.1002\/prot.21951","journal-title":"Proteins: Structure, Function, and Bioinformatics"},{"key":"3176_CR9","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","volume":"215","author":"SF Altschul","year":"1990","unstructured":"Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403\u2013410.","journal-title":"J Mol Biol"},{"issue":"8","key":"3176_CR10","doi-asserted-by":"publisher","first-page":"1035","DOI":"10.1038\/nbt0804-1035","volume":"22","author":"SR Eddy","year":"2004","unstructured":"Eddy SR: Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol 2004, 22(8):1035\u20136. 10.1038\/nbt0804-1035","journal-title":"Nat Biotechnol"},{"key":"3176_CR11","doi-asserted-by":"publisher","first-page":"66","DOI":"10.1126\/science.1093857","volume":"304","author":"JC Venter","year":"2004","unstructured":"Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu DY, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO: Environmental Genome Shotgun Sequencing of the Sargasso Sea. Science 2004, 304: 66\u201374. 10.1126\/science.1093857","journal-title":"Science"},{"issue":"1","key":"3176_CR12","doi-asserted-by":"publisher","first-page":"344","DOI":"10.1093\/nar\/29.1.344","volume":"29","author":"K Watanabe","year":"2001","unstructured":"Watanabe K, Nelson J, Harayama S, Kasai H: ICB database: the gyrB database for identification and classification of bacteria. Nucleic Acids Res 2001, 29(1):344\u20135. 10.1093\/nar\/29.1.344","journal-title":"Nucleic Acids Res"},{"key":"3176_CR13","doi-asserted-by":"publisher","first-page":"1355","DOI":"10.1126\/science.1124234","volume":"312","author":"SR Gill","year":"2006","unstructured":"Gill SR, Pop M, DeBoy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE: Metagenomic Analysis of the Human Distal Gut Microbiome. Science 2006, 312: 1355\u20131359. 10.1126\/science.1124234","journal-title":"Science"},{"key":"3176_CR14","doi-asserted-by":"publisher","first-page":"554","DOI":"10.1126\/science.1107851","volume":"308","author":"SG Tringe","year":"2005","unstructured":"Tringe SG, Von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative Metagenomics of Microbial Communities. Science 2005, 308: 554\u2013557. 10.1126\/science.1107851","journal-title":"Science"},{"key":"3176_CR15","doi-asserted-by":"publisher","first-page":"1126","DOI":"10.1126\/science.1133420","volume":"315","author":"C von Mering","year":"2007","unstructured":"von Mering C, Hugenholtz P, Raes J, Tringe SG, Doerks T, Jensen LJ, Ward N, Bork P: Quantitative Phylogenetic Assessment of Microbial Communities in Diverse Environments. Science 2007, 315: 1126\u20131130. 10.1126\/science.1133420","journal-title":"Science"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-10-446.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,8,31]],"date-time":"2021-08-31T21:37:03Z","timestamp":1630445823000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-10-446"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,12]]},"references-count":15,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2009,12]]}},"alternative-id":["3176"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-10-446","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2009,12]]},"assertion":[{"value":"4 June 2009","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 December 2009","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 December 2009","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"446"}}