{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T20:35:51Z","timestamp":1777408551531,"version":"3.51.4"},"reference-count":33,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2022,12,15]],"date-time":"2022-12-15T00:00:00Z","timestamp":1671062400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Bioinform."],"abstract":"<jats:p>As practitioners of machine learning in the area of bioinformatics we know that the quality of the results crucially depends on the quality of our labeled data. While there is a tendency to focus on the quality of positive examples, the negative examples are equally as important. In this opinion paper we revisit the problem of choosing negative examples for the task of predicting protein-protein interactions, either among proteins of a given species or for host-pathogen interactions and describe important issues that are prevalent in the current literature. The challenge in creating datasets for this task is the noisy nature of the experimentally derived interactions and the lack of information on non-interacting proteins. A standard approach is to choose random pairs of non-interacting proteins as negative examples. Since the interactomes of all species are only partially known, this leads to a very small percentage of false negatives. This is especially true for host-pathogen interactions. To address this perceived issue, some researchers have chosen to select negative examples as pairs of proteins whose sequence similarity to the positive examples is sufficiently low. This clearly reduces the chance for false negatives, but also makes the problem much easier than it really is, leading to over-optimistic accuracy estimates. We demonstrate the effect of this form of bias using a selection of recent protein interaction prediction methods of varying complexity, and urge researchers to pay attention to the details of generating their datasets for potential biases like this.<\/jats:p>","DOI":"10.3389\/fbinf.2022.1083292","type":"journal-article","created":{"date-parts":[[2022,12,15]],"date-time":"2022-12-15T07:48:59Z","timestamp":1671090539000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["On the choice of negative examples for prediction of host-pathogen protein interactions"],"prefix":"10.3389","volume":"2","author":[{"given":"Don","family":"Neumann","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Soumyadip","family":"Roy","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fayyaz Ul Amir Afsar","family":"Minhas","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Asa","family":"Ben-Hur","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1965","published-online":{"date-parts":[[2022,12,15]]},"reference":[{"key":"B1","volume-title":"HPIDB 2.0: A curated database for host\u2013pathogen interactions","author":"Ammari","year":"2016"},{"key":"B2","doi-asserted-by":"publisher","first-page":"e0270275","DOI":"10.1371\/journal.pone.0270275","article-title":"LGCA-VHPPI: A local-global residue context aware viral-host protein-protein interaction predictor","volume":"17","author":"Asim","year":"2022","journal-title":"Plos one"},{"key":"B3","doi-asserted-by":"publisher","first-page":"1850014","DOI":"10.1142\/s0219720018500142","article-title":"Training host-pathogen protein\u2013protein interaction predictors","volume":"16","author":"Basit","year":"2018","journal-title":"J. Bioinform. Comput. Biol."},{"key":"B4","doi-asserted-by":"publisher","first-page":"S2","DOI":"10.1186\/1471-2105-7-s1-s2","article-title":"Choosing negative examples for the prediction of protein-protein interactions","volume":"7","author":"Ben-Hur","year":"2006","journal-title":"BMC Bioinforma."},{"key":"B5","doi-asserted-by":"publisher","first-page":"D396","DOI":"10.1093\/nar\/gkt1079","article-title":"Negatome 2.0: A database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis","volume":"42","author":"Blohm","year":"2014","journal-title":"Nucleic Acids Res."},{"key":"B6","doi-asserted-by":"publisher","first-page":"D588","DOI":"10.1093\/nar\/gku830","article-title":"VirusMentha: A new resource for virus-host protein interactions","volume":"43","author":"Calderone","year":"2015","journal-title":"Nucleic acids Res."},{"key":"B7","doi-asserted-by":"publisher","first-page":"i305","DOI":"10.1093\/bioinformatics\/btz328","article-title":"Multifaceted protein\u2013protein interaction prediction based on siamese residual RCNN","volume":"35","author":"Chen","year":"2019","journal-title":"Bioinformatics"},{"key":"B8","doi-asserted-by":"publisher","first-page":"555","DOI":"10.1186\/s12864-022-08772-6","article-title":"DCSE: Double-channel-siamese-ensemble model for protein protein interaction prediction","volume":"23","author":"Chen","year":"2022","journal-title":"BMC genomics"},{"key":"B9","doi-asserted-by":"publisher","first-page":"438","DOI":"10.1016\/j.bj.2020.08.003","article-title":"Machine learning techniques for sequence-based prediction of viral\u2013host interactions between SARS-CoV-2 and human proteins","volume":"43","author":"Dey","year":"2020","journal-title":"Biomed. J."},{"key":"B10","doi-asserted-by":"publisher","first-page":"41","DOI":"10.3390\/molecules27010041","article-title":"Benchmark evaluation of protein\u2013protein interaction prediction algorithms","volume":"27","author":"Dunham","year":"2021","journal-title":"Molecules"},{"key":"B11","doi-asserted-by":"publisher","first-page":"1144","DOI":"10.1093\/bioinformatics\/btv737","article-title":"DeNovo: Virus-host sequence-based protein\u2013protein interaction prediction","volume":"32","author":"Eid","year":"2016","journal-title":"Bioinformatics"},{"key":"B12","doi-asserted-by":"publisher","first-page":"1945","DOI":"10.1093\/bioinformatics\/btv077","article-title":"Evolutionary profiles improve protein\u2013protein interaction prediction from sequence","volume":"31","author":"Hamp","year":"2015","journal-title":"Bioinformatics"},{"key":"B13","doi-asserted-by":"publisher","first-page":"3223","DOI":"10.1016\/j.csbj.2022.06.025","article-title":"Deep learning frameworks for protein-protein interaction prediction","volume":"20","author":"Hu","year":"","journal-title":"Comput. Struct. Biotechnol. J."},{"key":"B14","doi-asserted-by":"publisher","first-page":"694","DOI":"10.1093\/bioinformatics\/btab737","article-title":"DeepTrio: A ternary prediction system for protein\u2013protein interaction using mask multiple parallel convolutional neural networks","volume":"38","author":"Hu","year":"","journal-title":"Bioinformatics"},{"key":"B15","first-page":"1","article-title":"Transfer learning for predicting virus-host protein interactions for novel virus sequences","author":"Lanchantin","year":"2021","journal-title":"Proc. 12th ACM Conf. Bioinforma. Comput. Biol. Health Inf."},{"key":"B16","doi-asserted-by":"publisher","first-page":"bbab029","DOI":"10.1093\/bib\/bbab029","article-title":"Current status and future perspectives of computational studies on human\u2013virus protein\u2013protein interactions","volume":"22","author":"Lian","year":"2021","journal-title":"Brief. Bioinform."},{"key":"B17","doi-asserted-by":"publisher","first-page":"2722","DOI":"10.1093\/bioinformatics\/btab147","article-title":"DeepViral: Prediction of novel virus\u2013host interactions from protein sequences and infectious disease phenotypes","volume":"37","author":"Liu-Wei","year":"2021","journal-title":"Bioinformatics"},{"key":"B18","doi-asserted-by":"publisher","first-page":"100551","DOI":"10.1016\/j.patter.2022.100551","article-title":"Accurate prediction of virus-host protein-protein interactions via a siamese neural network using deep protein sequence embeddings","volume":"3","author":"Madan","year":"2022"},{"key":"B19","doi-asserted-by":"publisher","first-page":"218","DOI":"10.1093\/bioinformatics\/bth483","article-title":"Predicting protein\u2013protein interactions using signature products","volume":"21","author":"Martin","year":"2005","journal-title":"Bioinformatics"},{"key":"B20","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","article-title":"A general method applicable to the search for similarities in the amino acid sequence of two proteins","volume":"48","author":"Needleman","year":"1970","journal-title":"J. Mol. Biol."},{"key":"B21","doi-asserted-by":"publisher","first-page":"1134","DOI":"10.1038\/nmeth.2259","article-title":"Flaws in evaluation schemes for pair-input computational predictions","volume":"9","author":"Park","year":"2012","journal-title":"Nat. Methods"},{"key":"B22","article-title":"PyTorch: An imperative style, high-performance deep learning library","volume":"32","author":"Paszke","year":"2019","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B23","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"B24","doi-asserted-by":"publisher","first-page":"e11770","DOI":"10.7717\/peerj.11770","article-title":"In silico predictions of protein interactions between zika virus and human host","volume":"9","author":"Pitta","year":"2021","journal-title":"PeerJ"},{"key":"B25","doi-asserted-by":"publisher","first-page":"4337","DOI":"10.1073\/pnas.0607879104","article-title":"Predicting protein\u2013protein interactions based only on sequences information","volume":"104","author":"Shen","year":"2007","journal-title":"Proc. Natl. Acad. Sci. U. S. A."},{"key":"B26","doi-asserted-by":"publisher","first-page":"277","DOI":"10.1186\/s12859-017-1700-2","article-title":"Sequence-based prediction of protein protein interaction using a deep-learning algorithm","volume":"18","author":"Sun","year":"2017","journal-title":"BMC Bioinforma."},{"key":"B27","doi-asserted-by":"publisher","first-page":"gkw937","DOI":"10.1093\/nar\/gkw937","article-title":"The STRING database in 2017: Quality-controlled protein\u2013protein association networks, made broadly accessible","author":"Szklarczyk","year":"2016","journal-title":"Nucleic acids Res."},{"key":"B28","doi-asserted-by":"publisher","first-page":"bbab228","DOI":"10.1093\/bib\/bbab228","article-title":"LSTM-PHV: Prediction of human-virus protein\u2013protein interactions by LSTM with word2vec","volume":"22","author":"Tsukiyama","year":"2021","journal-title":"Brief. Bioinform."},{"key":"B29","doi-asserted-by":"publisher","first-page":"2740","DOI":"10.1093\/bioinformatics\/bty179","article-title":"Deep learning improves antimicrobial peptide recognition","volume":"34","author":"Veltri","year":"2018","journal-title":"Bioinformatics"},{"key":"B30","doi-asserted-by":"publisher","first-page":"153","DOI":"10.1016\/j.csbj.2019.12.005","article-title":"Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method","volume":"18","author":"Yang","year":"2020","journal-title":"Comput. Struct. Biotechnol. J."},{"key":"B31","doi-asserted-by":"publisher","first-page":"4771","DOI":"10.1093\/bioinformatics\/btab533","article-title":"Transfer learning via multi-scale convolutional neural layers for human\u2013virus protein\u2013protein interaction prediction","volume":"37","author":"Yang","year":"2021","journal-title":"Bioinformatics"},{"key":"B32","doi-asserted-by":"publisher","first-page":"ii75","DOI":"10.1093\/bioinformatics\/btac496","article-title":"Insights into performance evaluation of compound\u2013protein interaction prediction methods","volume":"38","author":"Yaseen","year":"2022","journal-title":"Bioinformatics"},{"key":"B33","doi-asserted-by":"publisher","first-page":"568","DOI":"10.1186\/s12864-018-4924-2","article-title":"A generalized approach to predicting protein-protein interactions between virus and host","volume":"19","author":"Zhou","year":"2018","journal-title":"BMC genomics"}],"container-title":["Frontiers in Bioinformatics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2022.1083292\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,15]],"date-time":"2022-12-15T07:49:03Z","timestamp":1671090543000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2022.1083292\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,15]]},"references-count":33,"alternative-id":["10.3389\/fbinf.2022.1083292"],"URL":"https:\/\/doi.org\/10.3389\/fbinf.2022.1083292","relation":{},"ISSN":["2673-7647"],"issn-type":[{"value":"2673-7647","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,12,15]]},"article-number":"1083292"}}