{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T15:14:15Z","timestamp":1764688455035},"reference-count":29,"publisher":"Oxford University Press (OUP)","issue":"15","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2005,8,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: Building an accurate protein classification system depends critically upon choosing a good representation of the input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data\u2014examples with known 3D structures, organized into structural classes\u2014whereas in practice, unlabeled data are far more plentiful.<\/jats:p><jats:p>Results: In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while at the same time achieving far greater computational efficiency.<\/jats:p><jats:p>Availability: Source code is available at www.kyb.tuebingen.mpg.de\/bs\/people\/weston\/semiprot. 
The Spider matlab package is available at www.kyb.tuebingen.mpg.de\/bs\/people\/spider<\/jats:p><jats:p>Contact: \u00a0jasonw@nec-labs.com<\/jats:p><jats:p>Supplementary information: \u00a0www.kyb.tuebingen.mpg.de\/bs\/people\/weston\/semiprot<\/jats:p>","DOI":"10.1093\/bioinformatics\/bti497","type":"journal-article","created":{"date-parts":[[2005,5,20]],"date-time":"2005-05-20T00:24:12Z","timestamp":1116548652000},"page":"3241-3247","source":"Crossref","is-referenced-by-count":145,"title":["Semi-supervised protein classification using cluster kernels"],"prefix":"10.1093","volume":"21","author":[{"given":"Jason","family":"Weston","sequence":"first","affiliation":[]},{"given":"Christina","family":"Leslie","sequence":"additional","affiliation":[]},{"given":"Eugene","family":"Ie","sequence":"additional","affiliation":[]},{"given":"Dengyong","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Andre","family":"Elisseeff","sequence":"additional","affiliation":[]},{"given":"William Stafford","family":"Noble","sequence":"additional","affiliation":[]}],"member":"286","published-online":{"date-parts":[[2005,5,19]]},"reference":[{"key":"2023051612004206600_B1","doi-asserted-by":"crossref","unstructured":"Altschul, S.F., et al. 1990A basic local alignment search tool. J. Mol. Biol. \u00a0215 \u00a0403\u2013410","DOI":"10.1016\/S0022-2836(05)80360-2"},{"key":"2023051612004206600_B2","doi-asserted-by":"crossref","unstructured":"Altschul, S.F., et al. 1997Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. \u00a025 \u00a03389\u20133402","DOI":"10.1093\/nar\/25.17.3389"},{"key":"2023051612004206600_B3","doi-asserted-by":"crossref","unstructured":"Ben-Hur, A. and Brutlag, D. 2003Remote homology detection: a motif based approach. Bioinformatics \u00a019 \u00a0i26\u2013i33","DOI":"10.1093\/bioinformatics\/btg1002"},{"key":"2023051612004206600_B4","unstructured":"Chapelle, O., et al. 
2002Cluster kernels for semi-supervised learning. Adv. Neural Inf. Process. Syst. \u00a015 \u00a0601\u2013608"},{"key":"2023051612004206600_B5","unstructured":"Cortes, C., et al. 2002Rational kernels. Adv. Neural Inf. Process. Syst. \u00a015 \u00a0617\u2013624"},{"key":"2023051612004206600_B6","doi-asserted-by":"crossref","unstructured":"Gribskov, M. and Robinson, N.L. 1996Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. \u00a020 \u00a025\u201333","DOI":"10.1016\/S0097-8485(96)80004-0"},{"key":"2023051612004206600_B7","doi-asserted-by":"crossref","unstructured":"Hanley, J.A. and McNeil, B.J. 1982The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology \u00a0143 \u00a029\u201336","DOI":"10.1148\/radiology.143.1.7063747"},{"key":"2023051612004206600_B8","unstructured":"Haussler, D. 1999Convolution kernels on discrete structures. , Santa Cruz Technical report UCSC-CRL-99;10 University of California"},{"key":"2023051612004206600_B9","doi-asserted-by":"crossref","unstructured":"Jaakkola, T., et al. 2000A discriminative framework for detecting remote protein homologies. J. Comput. Biol. \u00a07 \u00a095\u2013114","DOI":"10.1089\/10665270050081405"},{"key":"2023051612004206600_B10","unstructured":"Jebara, T., et al. 2004Probability product kernels. J. Mach. Learn. \u00a05 \u00a0819\u2013844"},{"key":"2023051612004206600_B11","unstructured":"Joachims, T. 1999Transductive inference for text classification using support vector machines. Proceedings of the Sixteenth International Conference on Machine LearningBled, Slovenia , pp. 200\u2013209"},{"key":"2023051612004206600_B12","doi-asserted-by":"crossref","unstructured":"Krogh, A., et al. 1994Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 
\u00a0235 \u00a01501\u20131531","DOI":"10.1006\/jmbi.1994.1104"},{"key":"2023051612004206600_B13","unstructured":"Kuang, R., Ie, E., Wang, K., Wang, K., Siddiqi, M., Freund, Y., Leslie, C. 2004Profile-based string kernels for remote homology detection and motif extraction. 3rd International IEEE Computer Society Computational Systems Bioinformatics Conference , Stanford, CA IEEE Computer Society, pp. 152\u2013160"},{"key":"2023051612004206600_B14","unstructured":"Kuang, R., et al. 2005Profile kernels for detecting remote protein homologs and discriminative motifs. J. Bioinform. Comput. Biol. \u00a03 \u00a01\u201323"},{"key":"2023051612004206600_B15","doi-asserted-by":"crossref","unstructured":"Leslie, C. and Kuang, R. 2003Fast kernels for inexact string matching. Proceedings of the Sixteenth Annual Conference on Learning Theory and Seventh Kernel WorkshopWashington, DC , pp. 114\u2013128","DOI":"10.1007\/978-3-540-45167-9_10"},{"key":"2023051612004206600_B16","unstructured":"Leslie, C., et al. 2002Mismatch string kernels for SVM protein classification. Adv. Neural Inf. Process. Syst. \u00a015 \u00a01441\u20131448"},{"key":"2023051612004206600_B17","doi-asserted-by":"crossref","unstructured":"Liao, C. and Noble, W.S. 2002Combining pairwise sequence similarity and support vector machines for remote protein homology detection. Proceedings of the Sixth Annual International Conference on Research in Computational Molecular BiologyWashington, DC , pp. 225\u2013232","DOI":"10.1145\/565196.565225"},{"key":"2023051612004206600_B18","unstructured":"Lodhi, H., et al. 2000Text classification using string kernels. Adv. Neural Inf. Process. Syst. \u00a013 \u00a0563\u2013569"},{"key":"2023051612004206600_B19","doi-asserted-by":"crossref","unstructured":"Murzin, A.G., et al. 1995SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 
\u00a0247 \u00a0536\u2013540","DOI":"10.1016\/S0022-2836(05)80134-2"},{"key":"2023051612004206600_B20","unstructured":"Ng, A., et al. 2001On spectral clustering: analysis and an algorithm. Adv. Neural Process. Inform. Syst. \u00a014 \u00a0849\u2013856"},{"key":"2023051612004206600_B21","doi-asserted-by":"crossref","unstructured":"Park, J., et al. 1998Sequence comparisons using multiple sequences detect twice as many remote homologues as pairwise methods. J. Mol. Biol. \u00a0284 \u00a01201\u20131210","DOI":"10.1006\/jmbi.1998.2221"},{"key":"2023051612004206600_B22","doi-asserted-by":"crossref","unstructured":"Saigo, H., et al. 2004Protein homology detection using string alignment kernels. Bioinformatics \u00a020 \u00a01682\u20131689","DOI":"10.1093\/bioinformatics\/bth141"},{"key":"2023051612004206600_B23","unstructured":"Technical report Seeger, M. 2001Learning with labeled and unlabeled data. , UK University of Edinburgh"},{"key":"2023051612004206600_B24","doi-asserted-by":"crossref","unstructured":"Smith, T. and Waterman, M. 1981Identification of common molecular subsequences. J. Mol. Biol. \u00a0147 \u00a0195\u2013197","DOI":"10.1016\/0022-2836(81)90087-5"},{"key":"2023051612004206600_B25","unstructured":"Szummer, M. and Jaakkola, T. 2001Partially labeled classification with Markov random walks. Adv. Neural Inf. Process. Syst. \u00a014 \u00a0945\u2013952"},{"key":"2023051612004206600_B26","unstructured":"Vishwanathan, S.V.N. and Smola, A. 2002Fast kernels for string and tree matching. Adv. Neural Inf. Process. Syst. \u00a015 \u00a0585\u2013592"},{"key":"2023051612004206600_B27","doi-asserted-by":"crossref","unstructured":"Technical report Watkins, C. 1999Dynamic alignment kernels. Advances in Large Margin Classifiers. , Royal Holloway, UK University of London39\u201350","DOI":"10.7551\/mitpress\/1113.003.0006"},{"key":"2023051612004206600_B28","unstructured":"Weston, J., et al. 2003Cluster kernels for semi-supervised protein classification. Adv. Neural Inf. 
Process. Syst. \u00a016 \u00a0595\u2013602"},{"key":"2023051612004206600_B29","unstructured":"Zhu, X. and Ghahramani, Z. 2002Learning from labeled and unlabeled data with label propagation. , Pittsburgh, PA Technical report CMUCALD-02-107 Carnegie Mellon University"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/21\/15\/3241\/50340678\/bioinformatics_21_15_3241.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/21\/15\/3241\/50340678\/bioinformatics_21_15_3241.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,1,26]],"date-time":"2024-01-26T01:35:50Z","timestamp":1706232950000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/21\/15\/3241\/195405"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2005,5,19]]},"references-count":29,"journal-issue":{"issue":"15","published-print":{"date-parts":[[2005,8,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bti497","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2005,8]]},"published":{"date-parts":[[2005,5,19]]}}}