{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,4]],"date-time":"2024-08-04T12:37:33Z","timestamp":1722775053832},"reference-count":24,"publisher":"Springer Science and Business Media LLC","issue":"S4","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2009,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags\u2013the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusion<\/jats:title>\n            <jats:p>The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-10-s4-s2","type":"journal-article","created":{"date-parts":[[2012,5,1]],"date-time":"2012-05-01T09:53:17Z","timestamp":1335865997000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["Efficient use of unlabeled data for protein sequence classification: a comparative study"],"prefix":"10.1186","volume":"10","author":[{"given":"Pavel","family":"Kuksa","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pai-Hsi","family":"Huang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vladimir","family":"Pavlovic","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2009,4,29]]},"reference":[{"issue":"suppl-1","key":"3281_CR1","first-page":"D34","volume":"33","author":"DA Benson","year":"2005","unstructured":"Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucl Acids Res 2005, 33(suppl-1):D34\u201338. [http:\/\/nar.oxfordjournals.org\/cgi\/content\/abstract\/33\/suppl_1\/D34]","journal-title":"Nucl Acids Res"},{"issue":"suppl-1","key":"3281_CR2","first-page":"D154","volume":"33","author":"A Bairoch","year":"2005","unstructured":"Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LSL: The Universal Protein Resource (UniProt). Nucl Acids Res 2005, 33(suppl-1):D154\u2013159. [http:\/\/nar.oxfordjournals.org\/cgi\/content\/full\/35\/suppl_1\/D193]","journal-title":"Nucl Acids Res"},{"key":"3281_CR3","volume-title":"Statistical Learning Theory","author":"VN Vapnik","year":"1998","unstructured":"Vapnik VN:Statistical Learning Theory. Wiley-Interscience; 1998. [http:\/\/www.wiley.com\/WileyCDA\/WileyTitle\/productCd-0471030031.html]"},{"key":"3281_CR4","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1089\/10665270050081405","volume":"7","author":"T Jaakkola","year":"2000","unstructured":"Jaakkola T, Diekhans M, Haussler D: A Discriminative Framework for Detecting Remote Protein Homologies. Journal of Computational Biology 2000, 7: 95\u2013114. 10.1089\/10665270050081405","journal-title":"Journal of Computational Biology"},{"key":"3281_CR5","first-page":"1417","volume-title":"NIPS","author":"CS Leslie","year":"2002","unstructured":"Leslie CS, Eskin E, Weston J, Noble WS: Mismatch String Kernels for SVM Protein Classification. NIPS 2002, 1417\u20131424."},{"issue":"9","key":"3281_CR6","doi-asserted-by":"publisher","first-page":"755","DOI":"10.1093\/bioinformatics\/14.9.755","volume":"14","author":"S Eddy","year":"1998","unstructured":"Eddy S: Profile hidden Markov models. Bioinformatics 1998, 14(9):755\u2013763. [http:\/\/bioinformatics.oxfordjournals.org\/cgi\/content\/abstract\/14\/9\/755] 10.1093\/bioinformatics\/14.9.755","journal-title":"Bioinformatics"},{"issue":"3","key":"3281_CR7","doi-asserted-by":"publisher","first-page":"527","DOI":"10.1142\/S021972000500120X","volume":"3","author":"R Kuang","year":"2005","unstructured":"Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C: Profile-based string kernels for remote homology detection and motif extraction. J Bioinform Comput Biol 2005, 3(3):527\u2013550. [http:\/\/www-users.cs.umn.edu\/~kuang\/paper\/jbcb-profile-kernel.pdf] 10.1142\/S021972000500120X","journal-title":"J Bioinform Comput Biol"},{"issue":"15","key":"3281_CR8","doi-asserted-by":"publisher","first-page":"3241","DOI":"10.1093\/bioinformatics\/bti497","volume":"21","author":"J Weston","year":"2005","unstructured":"Weston J, Leslie C, Ie E, Zhou D, Elisseeff A, Noble WS: Semi-supervised protein classification using cluster kernels. Bioinformatics 2005, 21(15):3241\u20133247. [http:\/\/bioinformatics.oxfordjournals.org\/cgi\/content\/abstract\/21\/15\/3241] 10.1093\/bioinformatics\/bti497","journal-title":"Bioinformatics"},{"key":"3281_CR9","first-page":"1435","volume":"5","author":"C Leslie","year":"2004","unstructured":"Leslie C, Kuang R: Fast String Kernels using Inexact Matching for Protein Sequences. J Mach Learn Res 2004, 5: 1435\u20131455. [http:\/\/jmlr.csail.mit.edu\/papers\/volume5\/leslie04a\/leslie04a.pdf]","journal-title":"J Mach Learn Res"},{"key":"3281_CR10","first-page":"152","volume-title":"CSB","author":"R Kuang","year":"2004","unstructured":"Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie CS: Profile-Based String Kernels for Remote Homology Detection and Motif Extraction. CSB 2004, 152\u2013160."},{"key":"3281_CR11","doi-asserted-by":"publisher","first-page":"4355","DOI":"10.1073\/pnas.84.13.4355","volume":"84","author":"M Gribskov","year":"1987","unstructured":"Gribskov M, McLachlan A, Eisenberg D: Profile analysis: detection of distantly related proteins. PNAS 1987, 84: 4355\u20134358. 10.1073\/pnas.84.13.4355","journal-title":"PNAS"},{"key":"3281_CR12","first-page":"1557","volume":"8","author":"I Melvin","year":"2007","unstructured":"Melvin I, Ie E, Weston J, Noble WS, Leslie C: Multi-class Protein Classification Using Adaptive Codes. J Mach Learn Res 2007, 8: 1557\u20131581.","journal-title":"J Mach Learn Res"},{"key":"3281_CR13","volume-title":"Proceedings of the Nineteenth International Conference on Pattern Recognition (ICPR 2008)","author":"P Kuksa","year":"2008","unstructured":"Kuksa P, Huang PH, Pavlovic V: Fast Protein Homology and Fold Detection with Sparse Spatial Sample Kernels. Proceedings of the Nineteenth International Conference on Pattern Recognition (ICPR 2008) 2008."},{"key":"3281_CR14","first-page":"403","volume-title":"Journal of Molecular Biology","author":"S Altschul","year":"1990","unstructured":"Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic Local Alignment Search Tool. Journal of Molecular Biology 1990, 403\u2013410."},{"key":"3281_CR15","doi-asserted-by":"publisher","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","volume":"25","author":"S Altschul","year":"1997","unstructured":"Altschul S, et al.: Gapped Blast and PSI-Blast: A New Generation of Protein Database Search Programs. NAR 1997, 25: 3389\u20133402. 10.1093\/nar\/25.17.3389","journal-title":"NAR"},{"issue":"13","key":"3281_CR16","doi-asserted-by":"publisher","first-page":"1658","DOI":"10.1093\/bioinformatics\/btl158","volume":"22","author":"W Li","year":"2006","unstructured":"Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22(13):1658\u20131659. [http:\/\/cd-hit.org] 10.1093\/bioinformatics\/btl158","journal-title":"Bioinformatics"},{"key":"3281_CR17","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1093\/nar\/28.1.257","volume":"28","author":"L Lo Conte","year":"2000","unstructured":"Lo Conte L, Ailey B, Hubbard T, Brenner S, Murzin A, Chothia C: a structural classification of proteins database. Nucleic Acids Res 2000, 28: 257\u2013259. 10.1093\/nar\/28.1.257","journal-title":"Nucleic Acids Res"},{"key":"3281_CR18","doi-asserted-by":"publisher","first-page":"235","DOI":"10.1093\/nar\/28.1.235","volume":"28","author":"HM Berman","year":"2000","unstructured":"Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Research 2000, 28: 235\u2013242. 10.1093\/nar\/28.1.235","journal-title":"Nucleic Acids Research"},{"key":"3281_CR19","doi-asserted-by":"publisher","first-page":"365","DOI":"10.1093\/nar\/gkg095","volume":"31","author":"B Boeckmann","year":"2003","unstructured":"Boeckmann B, Bairoch A, Apweiler R, Blatter M, Estreicher A, Gasteiger E, Martin M, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31: 365\u2013370. 10.1093\/nar\/gkg095","journal-title":"Nucleic Acids Res"},{"key":"3281_CR20","unstructured":"[http:\/\/www.kyb.tuebingen.mpg.de\/bs\/people\/spider]"},{"key":"3281_CR21","volume-title":"NIPS","author":"P Kuksa","year":"2008","unstructured":"Kuksa P, Huang PH, Pavlovic V: Scalable Algorithms for String Kernels with Inexact Matching. NIPS 2008."},{"key":"3281_CR22","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1016\/S0097-8485(96)80004-0","volume":"20","author":"M Gribskov","year":"1996","unstructured":"Gribskov M, Robinson NL: Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching. Computers & Chemistry 1996, 20: 25\u201333. 10.1016\/S0097-8485(96)80004-0","journal-title":"Computers & Chemistry"},{"key":"3281_CR23","unstructured":"[http:\/\/seqam.rutgers.edu\/projects\/bioinfo\/region-semiprot]"},{"key":"3281_CR24","doi-asserted-by":"publisher","first-page":"133","DOI":"10.1142\/9781848162648_0012","volume-title":"Computational Systems Bioinformatics: Proceedings of the CSB2008 Conference","author":"P Kuksa","year":"2008","unstructured":"Kuksa P, Huang PH, Pavlovic V: Fast and Accurate Multi-class Protein Fold Recognition with Spatial Sample Kernels. Computational Systems Bioinformatics: Proceedings of the CSB2008 Conference 2008, 133\u2013143."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-10-S4-S2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T07:56:53Z","timestamp":1630483013000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-10-S4-S2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,4]]},"references-count":24,"journal-issue":{"issue":"S4","published-print":{"date-parts":[[2009,4]]}},"alternative-id":["3281"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-10-s4-s2","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2009,4]]},"assertion":[{"value":"29 April 2009","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"S2"}}