{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,21]],"date-time":"2026-03-21T08:15:29Z","timestamp":1774080929126,"version":"3.50.1"},"reference-count":17,"publisher":"Springer Science and Business Media LLC","issue":"S3","license":[{"start":{"date-parts":[[2013,2,1]],"date-time":"2013-02-01T00:00:00Z","timestamp":1359676800000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/2.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2013,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. A cost-effective alternative is to use the known information about sequence, structure, and functional properties of genes and proteins to predict functions using statistical methods. In this paper, we describe the Multi-Source <jats:italic>k<\/jats:italic>-Nearest Neighbor (MS-<jats:italic>k<\/jats:italic> NN) algorithm for function prediction, which finds <jats:italic>k<\/jats:italic>-nearest neighbors of a query protein based on different types of similarity measures and predicts its function by weighted averaging of its neighbors' functions. Specifically, we used 3 data sources to calculate the similarity scores: sequence similarity, protein-protein interactions, and gene expressions.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>We report the results in the context of 2011 Critical Assessment of Function Annotation (CAFA). Prior to CAFA submission deadline, we evaluated our algorithm on 1,302 human test proteins that were represented in all 3 data sources. Using only the sequence similarity information, MS-<jats:italic>k<\/jats:italic> NN had term-based Area Under the Curve (AUC) accuracy of Gene Ontology (GO) molecular function predictions of 0.728 when 7,412 human training proteins were used, and 0.819 when 35,622 training proteins from multiple eukaryotic and prokaryotic organisms were used. By aggregating predictions from all three sources, the AUC was further improved to 0.848. Similar result was observed on prediction of GO biological processes. Testing on 595 proteins that were annotated after the CAFA submission deadline showed that overall MS-<jats:italic>k<\/jats:italic> NN accuracy was higher than that of baseline algorithms Gotcha and BLAST, which were based solely on sequence similarity information. Since only 10 of the 595 proteins were represented by all 3 data sources, and 66 by two data sources, the difference between 3-source and one-source MS-<jats:italic>k<\/jats:italic> NN was rather small.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusions<\/jats:title>\n            <jats:p>Based on our results, we have several useful insights: (1) the <jats:italic>k<\/jats:italic>-nearest neighbor algorithm is an efficient and effective model for protein function prediction; (2) it is beneficial to transfer functions across a wide range of organisms; (3) it is helpful to integrate multiple sources of protein information.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-14-s3-s8","type":"journal-article","created":{"date-parts":[[2019,12,11]],"date-time":"2019-12-11T02:00:27Z","timestamp":1576029627000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":68,"title":["MS-k NN: protein function prediction by integrating multiple data sources"],"prefix":"10.1186","volume":"14","author":[{"given":"Liang","family":"Lan","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nemanja","family":"Djuric","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuhong","family":"Guo","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Slobodan","family":"Vucetic","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2013,2,28]]},"reference":[{"issue":"3","key":"5697_CR1","doi-asserted-by":"publisher","first-page":"225","DOI":"10.1093\/bib\/bbl004","volume":"7","author":"I Friedberg","year":"2006","unstructured":"Friedberg I: Automated protein function prediction--the genomic challenge. Briefings in bioinformatics. 2006, 7 (3): 225-242. 10.1093\/bib\/bbl004.","journal-title":"Briefings in bioinformatics"},{"issue":"17","key":"5697_CR2","doi-asserted-by":"publisher","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","volume":"25","author":"SF Altschul","year":"1997","unstructured":"Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research. 1997, 25 (17): 3389-3402. 10.1093\/nar\/25.17.3389.","journal-title":"Nucleic acids research"},{"key":"5697_CR3","doi-asserted-by":"publisher","first-page":"178","DOI":"10.1186\/1471-2105-5-178","volume":"5","author":"DM Martin","year":"2004","unstructured":"Martin DM, Berriman M, Barton GJ: GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC bioinformatics. 2004, 5: 178-10.1186\/1471-2105-5-178.","journal-title":"BMC bioinformatics"},{"issue":"12","key":"5697_CR4","doi-asserted-by":"publisher","first-page":"1257","DOI":"10.1038\/82360","volume":"18","author":"B Schwikowski","year":"2000","unstructured":"Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nature biotechnology. 2000, 18 (12): 1257-1261. 10.1038\/82360.","journal-title":"Nature biotechnology"},{"issue":"6","key":"5697_CR5","doi-asserted-by":"publisher","first-page":"523","DOI":"10.1002\/yea.706","volume":"18","author":"H Hishigaki","year":"2001","unstructured":"Hishigaki H, Nakai K, Ono T, Tanigami A, Takagi T: Assessment of prediction accuracy of protein function from protein--protein interaction data. Yeast. 2001, 18 (6): 523-531. 10.1002\/yea.706.","journal-title":"Yeast"},{"issue":"20","key":"5697_CR6","doi-asserted-by":"publisher","first-page":"12783","DOI":"10.1073\/pnas.192159399","volume":"99","author":"X Zhou","year":"2002","unstructured":"Zhou X, Kao MC, Wong WH: Transitive functional annotation by shortest-path analysis of gene expression data. Proceedings of the National Academy of Sciences of the United States of America. 2002, 99 (20): 12783-12788. 10.1073\/pnas.192159399.","journal-title":"Proceedings of the National Academy of Sciences of the United States of America"},{"issue":"25","key":"5697_CR7","doi-asserted-by":"publisher","first-page":"14863","DOI":"10.1073\/pnas.95.25.14863","volume":"95","author":"MB Eisen","year":"1998","unstructured":"Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America. 1998, 95 (25): 14863-14868. 10.1073\/pnas.95.25.14863.","journal-title":"Proceedings of the National Academy of Sciences of the United States of America"},{"key":"5697_CR8","first-page":"249","volume-title":"Proceedings of the fifth annual international conference on Computational biology; Montreal, Quebec, Canada","author":"P Pavlidis","year":"2001","unstructured":"Pavlidis P, Weston J, Cai J, Grundy WN: Gene functional classification from heterogeneous data. Proceedings of the fifth annual international conference on Computational biology; Montreal, Quebec, Canada. 2001, ACM, 369228: 249-255."},{"issue":"1","key":"5697_CR9","doi-asserted-by":"publisher","first-page":"262","DOI":"10.1073\/pnas.97.1.262","volume":"97","author":"MP Brown","year":"2000","unstructured":"Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences of the United States of America. 2000, 97 (1): 262-267. 10.1073\/pnas.97.1.262.","journal-title":"Proceedings of the National Academy of Sciences of the United States of America"},{"key":"5697_CR10","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4757-2440-0","volume-title":"The nature of statistical learning theory","author":"VN Vapnik","year":"1995","unstructured":"Vapnik VN: The nature of statistical learning theory. 1995, Springer-Verlag New York, Inc"},{"issue":"14","key":"5697_CR11","doi-asserted-by":"publisher","first-page":"8348","DOI":"10.1073\/pnas.0832373100","volume":"100","author":"OG Troyanskaya","year":"2003","unstructured":"Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D: A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proceedings of the National Academy of Sciences of the United States of America. 2003, 100 (14): 8348-8353. 10.1073\/pnas.0832373100.","journal-title":"Proceedings of the National Academy of Sciences of the United States of America"},{"issue":"7","key":"5697_CR12","doi-asserted-by":"publisher","first-page":"830","DOI":"10.1093\/bioinformatics\/btk048","volume":"22","author":"Z Barutcuoglu","year":"2006","unstructured":"Barutcuoglu Z, Schapire RE, Troyanskaya OG: Hierarchical multi-label prediction of gene function. Bioinformatics. 2006, 22 (7): 830-836. 10.1093\/bioinformatics\/btk048.","journal-title":"Bioinformatics"},{"issue":"14","key":"5697_CR13","doi-asserted-by":"publisher","first-page":"1759","DOI":"10.1093\/bioinformatics\/btq262","volume":"26","author":"S Mostafavi","year":"2010","unstructured":"Mostafavi S, Morris Q: Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics. 2010, 26 (14): 1759-1765. 10.1093\/bioinformatics\/btq262.","journal-title":"Bioinformatics"},{"issue":"1","key":"5697_CR14","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1038\/75556","volume":"25","author":"M Ashburner","year":"2000","unstructured":"Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics. 2000, 25 (1): 25-29. 10.1038\/75556.","journal-title":"Nature genetics"},{"key":"5697_CR15","doi-asserted-by":"publisher","first-page":"142","DOI":"10.1186\/1471-2105-10-142","volume":"10","author":"G Pandey","year":"2009","unstructured":"Pandey G, Myers CL, Kumar V: Incorporating functional interrelationships into protein function prediction algorithms. BMC bioinformatics. 2009, 10: 142-10.1186\/1471-2105-10-142.","journal-title":"BMC bioinformatics"},{"issue":"12","key":"5697_CR16","doi-asserted-by":"publisher","first-page":"1499","DOI":"10.1038\/nbt1205-1499","volume":"23","author":"P D'Haeseleer","year":"2005","unstructured":"D'Haeseleer P: How does gene expression clustering work?. Nature biotechnology. 2005, 23 (12): 1499-1501. 10.1038\/nbt1205-1499.","journal-title":"Nature biotechnology"},{"key":"5697_CR17","first-page":"296","volume-title":"Proceedings of the Fifteenth International Conference on Machine Learning","author":"D Lin","year":"1998","unstructured":"Lin D: An Information-Theoretic Definition of Similarity. Proceedings of the Fifteenth International Conference on Machine Learning. 1998, Morgan Kaufmann Publishers Inc, 657297: 296-304."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-14-S3-S8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/1471-2105-14-S3-S8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-14-S3-S8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T22:26:22Z","timestamp":1630535182000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-14-S3-S8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,2]]},"references-count":17,"journal-issue":{"issue":"S3","published-print":{"date-parts":[[2013,2]]}},"alternative-id":["5697"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-14-s3-s8","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2013,2]]},"assertion":[{"value":"28 February 2013","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"S8"}}