{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T16:37:45Z","timestamp":1761237465446},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2007,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be <jats:italic>inconsistent<\/jats:italic> with the UniProt functions assigned to their human counterparts. 
In contrast, 97% of the predicted annotations generated using a machine learning approach were <jats:italic>consistent<\/jats:italic> with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusion<\/jats:title>\n            <jats:p>We conjecture that most of our predicted annotations are, therefore, correct and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects.<\/jats:p>\n            <jats:p>Editor's Note: Authors from the original publication (Okazaki et al.: <jats:italic>Nature<\/jats:italic> 2002, <jats:bold>420<\/jats:bold>:563\u201373) have provided their response to Andorf et al., directly following the correspondence.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-8-284","type":"journal-article","created":{"date-parts":[[2007,8,4]],"date-time":"2007-08-04T06:13:22Z","timestamp":1186208002000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":28,"title":["Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach"],"prefix":"10.1186","volume":"8","author":[{"given":"Carson","family":"Andorf","sequence":"first","affiliation":[]},{"given":"Drena","family":"Dobbs","sequence":"additional","affiliation":[]},{"given":"Vasant","family":"Honavar","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2007,8,3]]},"reference":[{"key":"1656_CR1","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1038\/75556","volume":"25","author":"The Gene Ontology Consortium","year":"2000","unstructured":"The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. 
Nature Genet 2000, 25: 25\u201329. 10.1038\/75556","journal-title":"Nature Genet"},{"key":"1656_CR2","doi-asserted-by":"publisher","first-page":"248","DOI":"10.1016\/S0168-9525(98)01486-3","volume":"14","author":"T Doerks","year":"1998","unstructured":"Doerks T, Bairoch A, Bork P: Protein annotation : detective work for function prediction. Trends Genet 1998, 14: 248\u2013250. 10.1016\/S0168-9525(98)01486-3","journal-title":"Trends Genet"},{"issue":"4","key":"1656_CR3","doi-asserted-by":"publisher","first-page":"313","DOI":"10.1038\/ng0498-313","volume":"18","author":"P Bork","year":"1998","unstructured":"Bork P, Koonin EV: Predicting functions from protein sequences \u2013 where are the bottlenecks? Nat Genet 1998, 18(4):313\u2013318. 10.1038\/ng0498-313","journal-title":"Nat Genet"},{"issue":"2","key":"1656_CR4","doi-asserted-by":"publisher","first-page":"223","DOI":"10.1016\/j.mbs.2004.08.001","volume":"193","author":"WR Gilks","year":"2005","unstructured":"Gilks WR, Audit B, de Angelis D, Tsoka S, Ouzounis CA: Percolation of annotation errors through hierarchically structured protein sequence databases. Math Biosci 2005, 193(2):223\u2013234. 10.1016\/j.mbs.2004.08.001","journal-title":"Math Biosci"},{"key":"1656_CR5","doi-asserted-by":"publisher","first-page":"1641","DOI":"10.1093\/bioinformatics\/18.12.1641","volume":"18","author":"WR Gilks","year":"2002","unstructured":"Gilks WR, Audit B, De Angelis D, Tsoka S, Ouzounis CA: Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics 2002, 18: 1641\u20131649. 10.1093\/bioinformatics\/18.12.1641","journal-title":"Bioinformatics"},{"key":"1656_CR6","doi-asserted-by":"publisher","first-page":"52","DOI":"10.1186\/1471-2164-5-52","volume":"5","author":"DG Naumoff","year":"2004","unstructured":"Naumoff DG, Xu Y, Glansdorff N, Labedan B: Retrieving sequences of enzymes experimentally characterized but erroneously annotated : the case of the putrescine carbamoyltransferase. 
BMC Genomics 2004, 5: 52. 10.1186\/1471-2164-5-52","journal-title":"BMC Genomics"},{"key":"1656_CR7","doi-asserted-by":"publisher","first-page":"4035","DOI":"10.1093\/nar\/gki711","volume":"33","author":"ML Green","year":"2005","unstructured":"Green ML, Karp PD: Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers. Nucleic Acids Res 2005, 33: 4035\u20134039. 10.1093\/nar\/gki711","journal-title":"Nucleic Acids Res"},{"key":"1656_CR8","doi-asserted-by":"publisher","first-page":"136","DOI":"10.1093\/bioinformatics\/bti1019","volume":"21","author":"ME Dolan","year":"2005","unstructured":"Dolan ME, Ni L, Camon E, Blake JA: A procedure for assessing GO annotation consistency. Bioinformatics 2005, 21: 136\u2013143. 10.1093\/bioinformatics\/bti1019","journal-title":"Bioinformatics"},{"key":"1656_CR9","doi-asserted-by":"publisher","first-page":"829","DOI":"10.1093\/bioinformatics\/bti106","volume":"21","author":"YR Park","year":"2005","unstructured":"Park YR, Park CH, Kim JH: GOChase: correcting errors from gene ontology-based annotations for gene products. Bioinformatics 2005, 21: 829\u2013831. 10.1093\/bioinformatics\/bti106","journal-title":"Bioinformatics"},{"issue":"1","key":"1656_CR10","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1002\/1097-0134(20001001)41:1<98::AID-PROT120>3.0.CO;2-S","volume":"41","author":"D Devos","year":"2000","unstructured":"Devos D, Valencia A: Practical limits of function prediction. Proteins 2000, 41(1):98\u2013107. 10.1002\/1097-0134(20001001)41:1<98::AID-PROT120>3.0.CO;2-S","journal-title":"Proteins"},{"key":"1656_CR11","doi-asserted-by":"publisher","first-page":"302","DOI":"10.1186\/1471-2105-6-302","volume":"6","author":"ED Levy","year":"2005","unstructured":"Levy ED, Ouzounis CA, Gilks WR, Audit B: Probabilistic annotation of protein sequences based on functional classifications. BMC Bioinformatics 2005, 6: 302. 
10.1186\/1471-2105-6-302","journal-title":"BMC Bioinformatics"},{"key":"1656_CR12","first-page":"256","volume-title":"Fifth Int Conf Knowledge Based Computer Systems, India","author":"C Andorf","year":"2004","unstructured":"Andorf C, Silvescu A, Dobbs D, Honavar V: Learning classifiers for assigning protein sequences to gene ontology functional families. Fifth Int Conf Knowledge Based Computer Systems, India 2004, 256\u2013265. [http:\/\/www.cs.iastate.edu\/~honavar\/Papers\/nbk.pdf]"},{"key":"1656_CR13","volume-title":"Learning classifiers for assigning protein sequences to Gene Ontology functional families: combining of function annotation using sequence homology with that based on amino acid k-gram composition yields more accurate classifiers than either of the individ","author":"C Andorf","year":"2004","unstructured":"Andorf C, Silvescu A, Dobbs D, Honavar V: Learning classifiers for assigning protein sequences to Gene Ontology functional families: combining of function annotation using sequence homology with that based on amino acid k-gram composition yields more accurate classifiers than either of the individual approaches.Department of Computer Science, Iowa State University; 2004. [http:\/\/www.cs.iastate.edu\/~andorfc\/hdtree\/HDtree2006.pdf]"},{"key":"1656_CR14","doi-asserted-by":"publisher","first-page":"i26","DOI":"10.1093\/bioinformatics\/btg1002","volume":"19","author":"A Ben-Hur","year":"2003","unstructured":"Ben-Hur A, Brutlag D: Remote homology detection : a motif based approach. Bioinformatics 2003, 19: i26-i33. 10.1093\/bioinformatics\/btg1002","journal-title":"Bioinformatics"},{"key":"1656_CR15","first-page":"127","volume-title":"Pac Symp Biocomput","author":"B Hayete","year":"2005","unstructured":"Hayete B, Bienkowska JR: Gotrees : predicting go associations from protein domain composition using decision trees. 
Pac Symp Biocomput 2005, 127\u2013138."},{"key":"1656_CR16","doi-asserted-by":"publisher","first-page":"178","DOI":"10.1186\/1471-2105-5-178","volume":"5","author":"DM Martin","year":"2004","unstructured":"Martin DM, Berriman M, Barton GJ: GOtcha : a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 2004, 5: 178. 10.1186\/1471-2105-5-178","journal-title":"BMC Bioinformatics"},{"key":"1656_CR17","doi-asserted-by":"publisher","first-page":"1410","DOI":"10.1101\/gr.168701","volume":"11","author":"J Murvai","year":"2001","unstructured":"Murvai J, Vlahovicek K, Szepesvari C, Pongor S: Prediction of protein functional domains from sequences using artificial neural networks. Genome Research 2001, 11: 1410\u20131417. 10.1101\/gr.168701","journal-title":"Genome Research"},{"key":"1656_CR18","doi-asserted-by":"publisher","first-page":"161","DOI":"10.1186\/1471-2105-7-161","volume":"7","author":"A Vinayagam","year":"2006","unstructured":"Vinayagam A, del Val C, Schubert F, Eils R, Glatting KH, Suhai S, Konig R: GOPET : a tool for automated predictions of Gene Ontology terms. BMC Bioinformatics 2006, 7: 161. 10.1186\/1471-2105-7-161","journal-title":"BMC Bioinformatics"},{"issue":"1\u20132","key":"1656_CR19","doi-asserted-by":"publisher","first-page":"113","DOI":"10.1016\/j.gene.2006.12.008","volume":"391","author":"M Zhu","year":"2007","unstructured":"Zhu M, Gao L, Guo Z, Li Y, Wang D, Wang J, Wang C: Globally predicting protein functions based on co-expressed protein-protein interaction networks and ontology taxonomy similarities. Gene 2007, 391(1\u20132):113\u2013119. 10.1016\/j.gene.2006.12.008","journal-title":"Gene"},{"key":"1656_CR20","doi-asserted-by":"publisher","first-page":"197","DOI":"10.1016\/j.ceb.2005.01.002","volume":"17","author":"M Gallego","year":"2005","unstructured":"Gallego M, Virshup DM: Protein serine\/threonine phosphatases: life, death, and sleeping. 
Curr Opin Cell Biol 2005, 17: 197\u2013202. 10.1016\/j.ceb.2005.01.002","journal-title":"Curr Opin Cell Biol"},{"key":"1656_CR21","doi-asserted-by":"publisher","first-page":"203","DOI":"10.1016\/j.ceb.2005.02.001","volume":"17","author":"A Bourdeau","year":"2005","unstructured":"Bourdeau A, Dube N, Tremblay ML: Cytoplasmic protein tyrosine phosphatases, regulation and function: the roles of PTP1B and TC-PTP. Curr Opin Cell Biol 2005, 17: 203\u2013209. 10.1016\/j.ceb.2005.02.001","journal-title":"Curr Opin Cell Biol"},{"issue":"Database issue","key":"1656_CR22","doi-asserted-by":"publisher","first-page":"D322","DOI":"10.1093\/nar\/gkj021","volume":"34","author":"Gene Ontology Consortium","year":"2006","unstructured":"Gene Ontology Consortium: The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006, 34(Database issue):D322\u20136. 10.1093\/nar\/gkj021","journal-title":"Nucleic Acids Res"},{"key":"1656_CR23","doi-asserted-by":"publisher","first-page":"86","DOI":"10.1093\/bib\/bbk007","volume":"7","author":"P Larranaga","year":"2006","unstructured":"Larranaga P, Calvo B, Santana R: Machine learning in bioinformatics. Brief Bioinform 2006, 7: 86\u2013112. 10.1093\/bib\/bbk007","journal-title":"Brief Bioinform"},{"key":"1656_CR24","doi-asserted-by":"publisher","first-page":"471","DOI":"10.1093\/nar\/gki113","volume":"33","author":"JT Eppig","year":"2005","unstructured":"Eppig JT, Bult CJ, Kadin JA: The Mouse Genome Database (MGD): from genes to mice \u2013 a community resource for mouse biology. Nucleic Acids Res 2005, 33: 471\u2013475. 10.1093\/nar\/gki113","journal-title":"Nucleic Acids Res"},{"key":"1656_CR25","doi-asserted-by":"publisher","first-page":"563","DOI":"10.1038\/nature01266","volume":"420","author":"Y Okazaki","year":"2002","unstructured":"Okazaki Y, Furuno M: Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 2002, 420: 563\u2013573. 
10.1038\/nature01266","journal-title":"Nature"},{"key":"1656_CR26","doi-asserted-by":"publisher","first-page":"154","DOI":"10.1093\/nar\/gki070","volume":"33","author":"A Bairoch","year":"2005","unstructured":"Bairoch A, Apweiler R, Wu CH: The Universal Protein Resource (UniProt). Nucleic Acids Res 2005, 33: 154\u2013159. 10.1093\/nar\/gki070","journal-title":"Nucleic Acids Res"},{"key":"1656_CR27","volume-title":"C4.5: Programs for Machine Learning","author":"JR Quinlan","year":"1993","unstructured":"Quinlan JR: C4.5: Programs for Machine Learning. Morgan Kaufmann; 1993."},{"key":"1656_CR28","doi-asserted-by":"publisher","first-page":"11707","DOI":"10.1073\/pnas.0306880101","volume":"101","author":"S Caenepeel","year":"2004","unstructured":"Caenepeel S, Charydczak G, Sudarsanam S, Hunter T, Manning G: The mouse kinome: discovery and comparative genomics of all mouse protein kinases. PNAS 2004, 101: 11707\u201311712. 10.1073\/pnas.0306880101","journal-title":"PNAS"},{"issue":"1","key":"1656_CR29","doi-asserted-by":"publisher","first-page":"170","DOI":"10.1186\/1471-2105-8-170","volume":"8","author":"CE Jones","year":"2007","unstructured":"Jones CE, Brown AL, Baumann U: Estimating the annotation error rate of curated GO database sequence annotations. BMC Bioinformatics 2007, 8(1):170. 10.1186\/1471-2105-8-170","journal-title":"BMC Bioinformatics"},{"issue":"3","key":"1656_CR30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.4018\/jdwm.2007070101","volume":"3","author":"G Tsoumakas","year":"2007","unstructured":"Tsoumakas G, Katakis I: Multi-label classification: An overview. 
Int J Data Warehousing and Mining 2007, 3(3):1\u201313.","journal-title":"Int J Data Warehousing and Mining"},{"issue":"7","key":"1656_CR31","doi-asserted-by":"publisher","first-page":"830","DOI":"10.1093\/bioinformatics\/btk048","volume":"22","author":"Z Barutcuoglu","year":"2006","unstructured":"Barutcuoglu Z, Schapire RE, Troyanskaya OG: Hierarchical multi-label prediction of gene function. Bioinformatics 2006, 22(7):830\u2013836. 10.1093\/bioinformatics\/btk048","journal-title":"Bioinformatics"},{"key":"1656_CR32","first-page":"1601","volume":"7","author":"J Rousu","year":"2006","unstructured":"Rousu J, Saunders C, Szedmak S, Shawe-Taylor J: Kernel-Based Learning of Hierarchical Multilabel Classification Models. J Mach Learn Res 2006, 7: 1601\u20131626.","journal-title":"J Mach Learn Res"},{"key":"1656_CR33","first-page":"18","volume-title":"Proceedings of 10th European Conference on Principles and Practice of Knowledge Discovery in Databases","author":"H Blockeel","year":"2006","unstructured":"Blockeel H, Schietgat L, Struyf J, Dzeroski S, Clare A: Decision Trees for Hierarchical Multilabel Classification : A Case Study in Functional Genomics. In Proceedings of 10th European Conference on Principles and Practice of Knowledge Discovery in Databases. Volume 4213. Berlin: Springer, Lecture Notes in Computer Science; 2006:18\u201329."},{"key":"1656_CR34","doi-asserted-by":"publisher","first-page":"83","DOI":"10.1038\/47048","volume":"402","author":"EM Marcotte","year":"1999","unstructured":"Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402: 83\u201386. 
10.1038\/47048","journal-title":"Nature"},{"key":"1656_CR35","doi-asserted-by":"publisher","first-page":"4285","DOI":"10.1073\/pnas.96.8.4285","volume":"96","author":"M Pellegrini","year":"1999","unstructured":"Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis : protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96: 4285\u20134288. 10.1073\/pnas.96.8.4285","journal-title":"Proc Natl Acad Sci USA"},{"key":"1656_CR36","doi-asserted-by":"publisher","first-page":"14863","DOI":"10.1073\/pnas.95.25.14863","volume":"95","author":"MB Eisen","year":"1998","unstructured":"Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863\u201314868. 10.1073\/pnas.95.25.14863","journal-title":"Proc Natl Acad Sci USA"},{"key":"1656_CR37","doi-asserted-by":"publisher","first-page":"2888","DOI":"10.1073\/pnas.0307326101","volume":"101","author":"U Karaoz","year":"2004","unstructured":"Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding C, Cantor C, Kasif S: Whole-genome annotation by using evidence integration in functional-linkage networks. Proc Natl Acad Sci USA 2004, 101: 2888\u20132893. 10.1073\/pnas.0307326101","journal-title":"Proc Natl Acad Sci USA"},{"key":"1656_CR38","doi-asserted-by":"publisher","first-page":"e337","DOI":"10.1371\/journal.pone.0000337","volume":"2","author":"N Nariai","year":"2007","unstructured":"Nariai N, Kolaczyk ED, Kasif S: Probabilistic protein function prediction from heterogeneous genome-wide data. PLoS ONE 2007, 2: e337. 
10.1371\/journal.pone.0000337","journal-title":"PLoS ONE"},{"key":"1656_CR39","doi-asserted-by":"publisher","first-page":"268","DOI":"10.1186\/1471-2105-7-268","volume":"7","author":"J Xiong","year":"2006","unstructured":"Xiong J, Rayner S, Luo K, Li Y, Chen S: Genome wide prediction of protein function via a generic knowledge discovery approach based on evidence integration. BMC Bioinformatics 2006, 7: 268. 10.1186\/1471-2105-7-268","journal-title":"BMC Bioinformatics"},{"key":"1656_CR40","volume-title":"Data Mining: Practical machine learning tools and techniques","author":"I Witten","year":"2005","unstructured":"Witten I, Frank E: Data mining in bioinformatics using Weka. In Data Mining: Practical machine learning tools and techniques. 2nd edition. San Francisco: Morgan Kaufmann; 2005.","edition":"2"},{"key":"1656_CR41","volume-title":"Inter-element dependency models for sequence classification Technical report","author":"A Silvescu","year":"2004","unstructured":"Silvescu A, Andorf C, Dobbs D, Honavar V: Inter-element dependency models for sequence classification Technical report. Department of Computer Science, Iowa State University; 2004. [http:\/\/www.cs.iastate.edu\/~silvescu\/papers\/nbktr\/nbktr.ps]"},{"key":"1656_CR42","volume-title":"Probabilistic Networks and Expert Systems","author":"R Cowell","year":"1999","unstructured":"Cowell R, Dawid A, Lauritzen S, Spiegelhalter D: Probabilistic Networks and Expert Systems. Springer; 1999."},{"key":"1656_CR43","volume-title":"Machine learning","author":"T Mitchell","year":"1997","unstructured":"Mitchell T: Machine learning. New York, USA: McGraw Hill; 1997."},{"key":"1656_CR44","doi-asserted-by":"publisher","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","volume":"25","author":"S Altschul","year":"1997","unstructured":"Altschul S, Madden T, Schaffer A, Zhang J, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389\u20133402. 
10.1093\/nar\/25.17.3389","journal-title":"Nucleic Acids Res"},{"key":"1656_CR45","volume-title":"Bioinformatics: The Machine Learning Approach","author":"P Baldi","year":"1998","unstructured":"Baldi P, Brunak S: Bioinformatics: The Machine Learning Approach. Cambridge, MA: MIT Press; 1998."},{"key":"1656_CR46","unstructured":"Fantom[http:\/\/fantom2.gsc.riken.jp]"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-8-284.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T01:49:38Z","timestamp":1630460978000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-8-284"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,8,3]]},"references-count":46,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2007,12]]}},"alternative-id":["1656"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-8-284","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2007,8,3]]},"assertion":[{"value":"14 December 2006","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 August 2007","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 August 2007","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"284"}}