{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,11]],"date-time":"2025-11-11T12:51:04Z","timestamp":1762865464537},"reference-count":22,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2008,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Establishment of peptide binding to Major Histocompatibility Complex class I (MHCI) is a crucial step in the development of subunit vaccines and prediction of such binding could greatly reduce costs and accelerate the experimental process of identifying immunogenic peptides. Many methods have been applied to the prediction of peptide-MHCI binding, with some achieving outstanding performance. Because of the experimental methods used to measure binding or affinity between peptides and MHCI molecules, however, available datasets are enriched for nonbinders, and thus highly unbalanced. Although there is no consensus on the ideal class distribution for training sets, extremely unbalanced datasets can be detrimental to the performance of prediction algorithms.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>We have developed a decision-theoretic framework to construct cost-sensitive trees to predict peptide-MHCI binding and have used them to 1) Assess the impact of the training data's class distribution on classifier accuracy, and 2) Compare resampling and cost-sensitive methods as approaches to compensate for training data imbalance. Our results confirm that highly unbalanced training sets can reduce the accuracy of classifier predictions and show that, in the peptide-MHCI binding context, resampling methods do not improve the classifier performance. In contrast, cost-sensitive methods significantly improve accuracy of decision trees. Finally, we propose the use of a training scheme that, when the training set is enriched for nonbinders, consistently improves the overall classifier accuracy compared to cost-insensitive classifiers and, in particular, increases the sensitivity of the classifiers. This method minimizes the expected classification cost for large datasets.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>Our method consistently improves the performance of decision trees in predicting peptide-MHC class I binding by using cost-balancing techniques to compensate for the imbalance in the training dataset.<\/jats:p><\/jats:sec>","DOI":"10.1186\/1471-2105-9-385","type":"journal-article","created":{"date-parts":[[2008,9,19]],"date-time":"2008-09-19T18:14:32Z","timestamp":1221848072000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["Improving peptide-MHC class I binding prediction for unbalanced datasets"],"prefix":"10.1186","volume":"9","author":[{"given":"Ana Paula","family":"Sales","sequence":"first","affiliation":[]},{"given":"Georgia D","family":"Tomaras","sequence":"additional","affiliation":[]},{"given":"Thomas B","family":"Kepler","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2008,9,19]]},"reference":[{"issue":"5","key":"2370_CR1","doi-asserted-by":"publisher","first-page":"929","DOI":"10.1006\/jmbi.1998.1982","volume":"281","author":"C Zhang","year":"1998","unstructured":"Zhang C, Anderson A, DeLisi C: Structural principles that govern the peptide-binding motifs of class I MHC molecules. J Mol Biol 1998, 281(5):929\u201347. 10.1006\/jmbi.1998.1982","journal-title":"J Mol Biol"},{"issue":"6","key":"2370_CR2","doi-asserted-by":"publisher","first-page":"e65","DOI":"10.1371\/journal.pcbi.0020065","volume":"2","author":"B Peters","year":"2006","unstructured":"Peters B, Bui HH, Frankild S, Nielson M, Lundegaard C, Kostem E, Basch D, Lamberth K, Harndahl M, Fleri W, Wilson SS, Sidney J, Lund O, Buus S, Sette A: A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput Biol 2006, 2(6):e65. 10.1371\/journal.pcbi.0020065","journal-title":"PLoS Comput Biol"},{"issue":"3\u20134","key":"2370_CR3","doi-asserted-by":"publisher","first-page":"213","DOI":"10.1007\/s002510050595","volume":"50","author":"H Rammensee","year":"1999","unstructured":"Rammensee H, Bachmann J, Emmerich NP, Bachor OA, Stevanovic S: SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 1999, 50(3\u20134):213\u20139. 10.1007\/s002510050595","journal-title":"Immunogenetics"},{"key":"2370_CR4","doi-asserted-by":"crossref","first-page":"163","DOI":"10.4049\/jimmunol.152.1.163","volume":"152","author":"KC Parker","year":"1994","unstructured":"Parker KC, Bednarek MA, Coligan JE: Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J Immunol 1994, 152: 163\u201375.","journal-title":"J Immunol"},{"issue":"5","key":"2370_CR5","doi-asserted-by":"publisher","first-page":"1258","DOI":"10.1006\/jmbi.1997.0937","volume":"267","author":"K Gulukota","year":"1997","unstructured":"Gulukota K, Sidney J, Sette A, DeLisi C: Two complementary methods for predicting peptides binding major histocompatibility complex molecules. J Mol Biol 1997, 267(5):1258\u201367. 10.1006\/jmbi.1997.0937","journal-title":"J Mol Biol"},{"key":"2370_CR6","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1186\/1471-2105-3-25","volume":"3","author":"P Donnes","year":"2002","unstructured":"Donnes P, Elofsson A: Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics 2002, 3: 25. 10.1186\/1471-2105-3-25","journal-title":"BMC Bioinformatics"},{"issue":"3","key":"2370_CR7","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1007\/BF03402006","volume":"8","author":"K Yu","year":"2002","unstructured":"Yu K, Petrovsky N, Schonbach C, Koh JY, Brusic V: Methods for prediction of peptide binding to MHC molecules: a comparative study. Mol Med 2002, 8(3):137\u201348.","journal-title":"Mol Med"},{"issue":"9","key":"2370_CR8","doi-asserted-by":"publisher","first-page":"1388","DOI":"10.1093\/bioinformatics\/bth100","volume":"20","author":"M Nielsen","year":"2004","unstructured":"Nielsen M, Lundegaard C, Worning P, Hvid CS, Lamberth K, Buus S, Brunak S, Lund O: Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics 2004, 20(9):1388\u201397. 10.1093\/bioinformatics\/bth100","journal-title":"Bioinformatics"},{"issue":"2","key":"2370_CR9","doi-asserted-by":"publisher","first-page":"632","DOI":"10.1111\/j.0006-341X.2001.00632.x","volume":"57","author":"MR Segal","year":"2001","unstructured":"Segal MR, Cummings MP, Hubbard AE: Relating amino acid sequence to phenotype: analysis of peptide-binding data. Biometrics 2001, 57(2):632\u201342. 10.1111\/j.0006-341X.2001.00632.x","journal-title":"Biometrics"},{"issue":"13","key":"2370_CR10","doi-asserted-by":"publisher","first-page":"1648","DOI":"10.1093\/bioinformatics\/btl141","volume":"22","author":"S Zhu","year":"2006","unstructured":"Zhu S, Udaka K, Sidney J, Sette A, Aoki-Kinoshita KF, Mamitsuka H: Improving MHC binding peptide prediction by incorporating binding data of auxiliary MHC molecules. Bioinformatics 2006, 22(13):1648\u201355. 10.1093\/bioinformatics\/btl141","journal-title":"Bioinformatics"},{"issue":"14","key":"2370_CR11","doi-asserted-by":"publisher","first-page":"1765","DOI":"10.1093\/bioinformatics\/btg247","volume":"19","author":"B Peters","year":"2003","unstructured":"Peters B, Tong W, Sidney J, Sette A, Weng Z: Examining the independent binding assumption for binding of peptide epitopes to MHC-I molecules. Bioinformatics 2003, 19(14):1765\u201372. 10.1093\/bioinformatics\/btg247","journal-title":"Bioinformatics"},{"key":"2370_CR12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/1007730.1007733","volume":"6","author":"NV Chawla","year":"2004","unstructured":"Chawla NV, Japkowicz N, Kotcz A: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 2004, 6: 1\u20136. 10.1145\/1007730.1007733","journal-title":"SIGKDD Explor Newsl"},{"key":"2370_CR13","first-page":"313","volume":"6","author":"V Brusic","year":"1999","unstructured":"Brusic V, Zeleznikow J: Computational binding assays of antigenic peptides. Letters in Peptide Science 1999, 6: 313\u2013324.","journal-title":"Letters in Peptide Science"},{"key":"2370_CR14","first-page":"973","volume-title":"IJCAI","author":"C Elkan","year":"2001","unstructured":"Elkan C: The Foundations of Cost-Sensitive Learning. IJCAI 2001, 973\u2013978."},{"key":"2370_CR15","doi-asserted-by":"crossref","first-page":"315","DOI":"10.1613\/jair.1199","volume":"19","author":"GM Weiss","year":"2003","unstructured":"Weiss GM, Provost FJ: Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. J Artif Intell Res (JAIR) 2003, 19: 315\u2013354.","journal-title":"J Artif Intell Res (JAIR)"},{"key":"2370_CR16","volume-title":"Classification and regression trees. Wadsworth statistics\/probability series","author":"L Breiman","year":"1993","unstructured":"Breiman L: Classification and regression trees. Wadsworth statistics\/probability series. New York, N.Y.: Chapman and Hall; 1993."},{"key":"2370_CR17","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1186\/1745-7580-3-9","volume":"3","author":"S Ray","year":"2007","unstructured":"Ray S, Kepler T: Amino acid biophysical properties in the statistical prediction of peptide-MHC class I binding. Immunome Research 2007, 3: 9. [http:\/\/www.immunome-research.com\/content\/3\/1\/9] 10.1186\/1745-7580-3-9","journal-title":"Immunome Research"},{"key":"2370_CR18","volume-title":"Proceedings of the International Conference on Machine Learning (ICML 2003) Workshop on Learning from Imbalanced Data Sets II","author":"C Drummond","year":"2003","unstructured":"Drummond C, Holte R: C4.5, Class Imbalance, and Cost-Sensitivity: Why Under-Sampling beats Over-Sampling. Proceedings of the International Conference on Machine Learning (ICML 2003) Workshop on Learning from Imbalanced Data Sets II 2003."},{"key":"2370_CR19","doi-asserted-by":"crossref","first-page":"429","DOI":"10.3233\/IDA-2002-6504","volume":"6","author":"N Japkowicz","year":"2002","unstructured":"Japkowicz N, Stephen S: The class imbalance problem: A systematic study. Intelligent Data Analysis 2002, 6: 429\u2013449.","journal-title":"Intelligent Data Analysis"},{"key":"2370_CR20","first-page":"445","volume-title":"European Conference on Artificial Intelligence","author":"M Kukar","year":"1998","unstructured":"Kukar M, Kononenko I: Cost-Sensitive Learning with Neural Networks. European Conference on Artificial Intelligence 1998, 445\u2013449. [http:\/\/citeseer.ist.psu.edu\/kukar98costsensitive.html]"},{"key":"2370_CR21","doi-asserted-by":"publisher","first-page":"63","DOI":"10.1109\/TKDE.2006.17","volume":"18","author":"ZH Zhou","year":"2006","unstructured":"Zhou ZH, Liu XY: Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem. IEEE Transactions on Knowledge and Data Engineering 2006, 18: 63\u201377. 10.1109\/TKDE.2006.17","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"2370_CR22","first-page":"23","volume-title":"Lecture Notes in Computer Science","author":"U Brefeld","year":"2003","unstructured":"Brefeld U, Geibel P, Wysotzki F: Support Vector Machines with Examples Dependent Costs. Lecture Notes in Computer Science 2003, 23\u201334."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-9-385.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,20]],"date-time":"2023-05-20T00:54:51Z","timestamp":1684544091000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-9-385"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,9,19]]},"references-count":22,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2008,12]]}},"alternative-id":["2370"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-9-385","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2008,9,19]]},"assertion":[{"value":"15 April 2008","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 September 2008","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 September 2008","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"385"}}