{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T15:21:29Z","timestamp":1764688889711},"reference-count":33,"publisher":"Oxford University Press (OUP)","issue":"18","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":1490,"URL":"http:\/\/creativecommons.org\/licenses\/by\/3.0"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,9,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: Approaches that use supervised machine learning techniques for protein\u2013protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host\u2013pathogen PPI datasets have a large fraction, in the range of 58\u201385% of missing values, which makes it challenging to apply machine learning algorithms.<\/jats:p>\n               <jats:p>Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross species information in combination with machine learning techniques like Group lasso with \u21131\/\u21132 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example of Salmonella\u2013human PPI prediction, we are able to obtain high prediction accuracies with 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 in F1 score over the next best technique. 
We also apply our method to Yersinia\u2013human PPI prediction successfully, demonstrating the generality of our approach.<\/jats:p>\n               <jats:p>Availability: Predicted interactions, datasets, features are available at: http:\/\/www.cs.cmu.edu\/~mkshirsa\/eccb2012_paper46.html.<\/jats:p>\n               <jats:p>Contact: \u00a0judithks@cs.cmu.edu<\/jats:p>\n               <jats:p>Supplementary Information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/bts375","type":"journal-article","created":{"date-parts":[[2012,9,7]],"date-time":"2012-09-07T20:35:22Z","timestamp":1347050122000},"page":"i466-i472","source":"Crossref","is-referenced-by-count":37,"title":["Techniques to cope with missing data in host\u2013pathogen protein interaction prediction"],"prefix":"10.1093","volume":"28","author":[{"given":"Meghana","family":"Kshirsagar","sequence":"first","affiliation":[{"name":"1 School of Computer Science, Carnegie Mellon University 15213"}]},{"given":"Jaime","family":"Carbonell","sequence":"additional","affiliation":[{"name":"1 School of Computer Science, Carnegie Mellon University 15213"}]},{"given":"Judith","family":"Klein-Seetharaman","sequence":"additional","affiliation":[{"name":"1 School of Computer Science, Carnegie Mellon University 15213"},{"name":"2 Department of Structural Biology, University of Pittsburgh, School of Medicine, Pittsburgh 15261, USA"},{"name":"3 Forschungszentrum J\u00fclich, Institute of Complex Systems (ICS-5), J\u00fclich 52425, Germany"}]}],"member":"286","published-online":{"date-parts":[[2012,9,3]]},"reference":[{"issue":"17","key":"2023012513041978200_B1","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped blast and psi-blast: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucl. 
Acids Res."},{"issue":"1","key":"2023012513041978200_B2","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene ontology: tool for the unification of biology","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat. Genet."},{"key":"2023012513041978200_B3","doi-asserted-by":"crossref","first-page":"D1005","DOI":"10.1093\/nar\/gkq1184","article-title":"Ncbi geo: archive for functional genomics data sets 10 years on","volume":"39","author":"Barrett","year":"2011","journal-title":"Nucl. Acids Res."},{"key":"2023012513041978200_B4","doi-asserted-by":"crossref","DOI":"10.1145\/130385.130401","article-title":"A training algorithm for optimal margin classifiers","volume-title":"COLT","author":"Boser","year":"1992"},{"issue":"1","key":"2023012513041978200_B5","doi-asserted-by":"crossref","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Machine Learning Journal"},{"key":"2023012513041978200_B6","article-title":"Random forests","author":"Breiman","year":"2004"},{"key":"2023012513041978200_B7","doi-asserted-by":"crossref","DOI":"10.1145\/1143844.1143874","article-title":"The relationship between precision-recall and roc curves","volume-title":"ICML","author":"Davis","year":"2006"},{"issue":"8","key":"2023012513041978200_B8","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0012089","article-title":"The human-bacterial pathogen protein interaction networks of bacillus anthracis, francisella tularensis, and yersinia pestis","volume":"5","author":"Dyer","year":"2010","journal-title":"PLOS One"},{"key":"2023012513041978200_B9","doi-asserted-by":"crossref","first-page":"917","DOI":"10.1016\/j.meegid.2011.02.022","article-title":"Supervised learning and prediction of physical interactions between human and hiv proteins","volume":"11","author":"Dyer","year":"2011","journal-title":"Infect., Genetics and 
Evol."},{"key":"2023012513041978200_B10","doi-asserted-by":"crossref","DOI":"10.1201\/9780429246593","volume-title":"An Introduction to the Bootstrap","author":"Efron","year":"1994"},{"key":"2023012513041978200_B11","article-title":"Liblinear: A library for large linear classification","volume":"9","author":"Fan","year":"2008","journal-title":"JMLR"},{"issue":"3","key":"2023012513041978200_B12","doi-asserted-by":"crossref","first-page":"410","DOI":"10.1093\/bioinformatics\/bti011","article-title":"ipfam: visualization of protein\u2013protein interactions in pdb at domain and amino acid resolutions","volume":"21","author":"Finn","year":"2005","journal-title":"Bioinf."},{"key":"2023012513041978200_B13","doi-asserted-by":"crossref","first-page":"D211","DOI":"10.1093\/nar\/gkp985","article-title":"The pfam protein families database","volume":"38","author":"Finn","year":"2010","journal-title":"Nucl. Acids Res."},{"issue":"5","key":"2023012513041978200_B14","doi-asserted-by":"crossref","first-page":"409","DOI":"10.1097\/QCO.0b013e32833dd25d","article-title":"Nontyphoidal salmonellosis in Africa","volume":"23","author":"Graham","year":"2010","journal-title":"Curr. Opin. Infect. Dis."},{"key":"2023012513041978200_B15","doi-asserted-by":"crossref","DOI":"10.1007\/978-0-387-84858-7","volume-title":"The Elements of Statistical Learning: Data Mining, Inference, and Prediction","author":"Hastie","year":"2009"},{"issue":"5","key":"2023012513041978200_B16","doi-asserted-by":"crossref","first-page":"498","DOI":"10.1093\/bib\/bbq080","article-title":"Missing value imputation for gene expression data: computational techniques to recover missing data from available information","volume":"12","author":"Liew","year":"2010","journal-title":"Brief. 
in Bioinf."},{"key":"2023012513041978200_B17","volume-title":"Statistical analysis with missing data","author":"Little","year":"1987"},{"key":"2023012513041978200_B18","doi-asserted-by":"crossref","first-page":"474","DOI":"10.1038\/msb.2011.7","article-title":"Rnai screen of salmonella invasion shows role of copi in membrane targeting of cholesterol and cdc42","volume":"7","author":"Misselwitz","year":"2011","journal-title":"Mol Syst Biol."},{"key":"2023012513041978200_B19","doi-asserted-by":"crossref","DOI":"10.1186\/1471-2105-11-S1-S57","article-title":"Active learning for human protein-protein interaction prediction","volume":"11","author":"Mohamed","year":"2010","journal-title":"BMC Bioinf."},{"key":"2023012513041978200_B20","article-title":"Union support recovery in high-dimensional multivariate regression","author":"Obozinski","year":"2008","journal-title":"NIPS"},{"issue":"12","key":"2023012513041978200_B21","article-title":"Multi-population gwa mapping via multi-task regularized regression","volume":"26","author":"Puniyani","year":"2010","journal-title":"Bioinf."},{"key":"2023012513041978200_B22","first-page":"531","article-title":"Random forest similarity for protein-protein interaction prediction from multiple sources","volume":"10","author":"Qi","year":"2005","journal-title":"PSB"},{"issue":"Suppl 10","key":"2023012513041978200_B23","doi-asserted-by":"crossref","first-page":"S6","DOI":"10.1186\/1471-2105-8-S10-S6","article-title":"A mixture of feature experts approach for protein-protein interaction prediction","volume":"8","author":"Qi","year":"2007","journal-title":"BMC Bioinf."},{"issue":"3","key":"2023012513041978200_B24","doi-asserted-by":"crossref","first-page":"297","DOI":"10.1023\/A:1007614523901","article-title":"Improved boosting algorithms using confidence-rated predictions","volume":"37","author":"Schapire","year":"1999","journal-title":"Machine 
Learning"},{"issue":"1-2","key":"2023012513041978200_B25","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1002\/prca.201100083","article-title":"The current salmonella-host interactome","volume":"6","author":"Schleker","year":"2012","journal-title":"Proteomics Clin Appl."},{"key":"2023012513041978200_B26","doi-asserted-by":"crossref","first-page":"4337","DOI":"10.1073\/pnas.0607879104","article-title":"Predicting protein-protein interactions based only on sequences information","volume":"104","author":"Shen","year":"2007","journal-title":"PNAS"},{"key":"2023012513041978200_B27","article-title":"Struct2net: Integrating structure into PPI prediction","volume-title":"PSB","author":"Singh","year":"2006"},{"key":"2023012513041978200_B28","first-page":"132","article-title":"Handling missing features with boosting algorithms for protein\u2013protein interaction prediction","volume":"6254\/2010","author":"Smeraldi","year":"2010","journal-title":"LNCS"},{"key":"2023012513041978200_B29","doi-asserted-by":"crossref","first-page":"D718","DOI":"10.1093\/nar\/gkq962","article-title":"3did: identification & classification of domain-based interactions of known 3d structure","volume":"39","author":"Stein","year":"2011","journal-title":"Nuc. Acids Res."},{"issue":"14","key":"2023012513041978200_B30","first-page":"516","article-title":"Prediction of interactions between hiv-1 and human proteins by information integration","author":"Tastan","year":"2009","journal-title":"Pac. Symp. 
Biocomput."},{"issue":"12","key":"2023012513041978200_B31","doi-asserted-by":"crossref","first-page":"1067","DOI":"10.1109\/LSP.2009.2030111","article-title":"Dual augmented lagrangian method for efficient sparse reconstruction","volume":"16","author":"Tomioka","year":"2009","journal-title":"IEEE Signal Processing Letters"},{"key":"2023012513041978200_B32","doi-asserted-by":"crossref","first-page":"D214","DOI":"10.1093\/nar\/gkq1020","article-title":"Ongoing and future developments at the universal protein resource","volume":"39","author":"UniProt Consortium","year":"2011","journal-title":"Nucl. Acids Res."},{"key":"2023012513041978200_B33","doi-asserted-by":"crossref","DOI":"10.1186\/1471-2105-7-91","article-title":"Bias in error estimation when using cross-validation for model selection","author":"Varma","year":"2006","journal-title":"BMC Bioinf"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/18\/i466\/48885106\/bioinformatics_28_18_i466.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/18\/i466\/48885106\/bioinformatics_28_18_i466.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,25]],"date-time":"2023-01-25T18:55:12Z","timestamp":1674672912000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/28\/18\/i466\/244674"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,9,3]]},"references-count":33,"journal-issue":{"issue":"18","published-print":{"date-parts":[[2012,9,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bts375","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"d
ate-parts":[[2012,9,15]]},"published":{"date-parts":[[2012,9,3]]}}}