{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T18:47:50Z","timestamp":1779216470521,"version":"3.51.4"},"update-to":[{"DOI":"10.1371\/journal.pcbi.1012597","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2024,12,3]],"date-time":"2024-12-03T00:00:00Z","timestamp":1733184000000}}],"reference-count":33,"publisher":"Public Library of Science (PLoS)","issue":"11","license":[{"start":{"date-parts":[[2024,11,19]],"date-time":"2024-11-19T00:00:00Z","timestamp":1731974400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100010665","name":"H2020 Marie Sk\u0142odowska-Curie Actions","doi-asserted-by":"publisher","award":["955974"],"award-info":[{"award-number":["955974"]}],"id":[{"id":"10.13039\/100010665","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000265","name":"Medical Research Council","doi-asserted-by":"publisher","award":["MC UU 12014\/12"],"award-info":[{"award-number":["MC UU 12014\/12"]}],"id":[{"id":"10.13039\/501100000265","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000265","name":"Medical Research Council","doi-asserted-by":"publisher","award":["MC UU 00034\/5"],"award-info":[{"award-number":["MC UU 00034\/5"]}],"id":[{"id":"10.13039\/501100000265","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000265","name":"Medical Research Council","doi-asserted-by":"publisher","award":["MR\/V01157X\/1"],"award-info":[{"award-number":["MR\/V01157X\/1"]}],"id":[{"id":"10.13039\/501100000265","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000265","name":"Medical Research 
Council","doi-asserted-by":"publisher","award":["MR\/N013166\/1"],"award-info":[{"award-number":["MR\/N013166\/1"]}],"id":[{"id":"10.13039\/501100000265","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000268","name":"Biotechnology and Biological Sciences Research Council","doi-asserted-by":"publisher","award":["BB\/V016067\/1"],"award-info":[{"award-number":["BB\/V016067\/1"]}],"id":[{"id":"10.13039\/501100000268","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000268","name":"Biotechnology and Biological Sciences Research Council","doi-asserted-by":"publisher","award":["BB\/V016067\/1"],"award-info":[{"award-number":["BB\/V016067\/1"]}],"id":[{"id":"10.13039\/501100000268","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000268","name":"Biotechnology and Biological Sciences Research Council","doi-asserted-by":"publisher","award":["BB\/V016067\/1"],"award-info":[{"award-number":["BB\/V016067\/1"]}],"id":[{"id":"10.13039\/501100000268","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","award":["EP\/R018634\/1"],"award-info":[{"award-number":["EP\/R018634\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["www.ploscompbiol.org"],"crossmark-restriction":false},"short-container-title":["PLoS Comput Biol"],"abstract":"<jats:p>Predicting virus-host associations is essential to determine the specific host species that viruses interact with, and discover if new viruses infect humans and animals. Currently, the host of the majority of viruses is unknown, particularly in microbiomes. To address this challenge, we introduce EvoMIL, a deep learning method that predicts the host species for viruses from viral sequences only. It also identifies important viral proteins that significantly contribute to host prediction. 
The method combines a pre-trained large protein language model (ESM) and attention-based multiple instance learning to allow protein-orientated predictions. Our results show that protein embeddings capture stronger predictive signals than sequence composition features, including amino acids, physiochemical properties, and DNA k-mers. In multi-host prediction tasks, EvoMIL achieves median F1 score improvements of 10.8%, 16.2%, and 4.9% in prokaryotic hosts, and 1.7%, 6.6% and 11.5% in eukaryotic hosts. EvoMIL binary classifiers achieve impressive AUC over 0.95 for all prokaryotic hosts and range from roughly 0.8 to 0.9 for eukaryotic hosts. Furthermore, EvoMIL identifies important proteins in the prediction task, capturing key functions involved in virus-host specificity.<\/jats:p>","DOI":"10.1371\/journal.pcbi.1012597","type":"journal-article","created":{"date-parts":[[2024,11,19]],"date-time":"2024-11-19T13:37:54Z","timestamp":1732023474000},"page":"e1012597","update-policy":"https:\/\/doi.org\/10.1371\/journal.pcbi.corrections_policy","source":"Crossref","is-referenced-by-count":16,"title":["Prediction of virus-host associations using protein language models and multiple instance learning"],"prefix":"10.1371","volume":"20","author":[{"given":"Dan","family":"Liu","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Francesca","family":"Young","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3011-5189","authenticated-orcid":true,"given":"Kieran D.","family":"Lamb","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"David 
L.","family":"Robertson","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2318-1460","authenticated-orcid":true,"given":"Ke","family":"Yuan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"340","published-online":{"date-parts":[[2024,11,19]]},"reference":[{"issue":"1","key":"pcbi.1012597.ref001","doi-asserted-by":"crossref","first-page":"29","DOI":"10.1038\/nbt.4306","article-title":"Minimum information about an uncultivated virus genome (MIUViG)","volume":"37","author":"S Roux","year":"2019","journal-title":"Nature biotechnology"},{"key":"pcbi.1012597.ref002","doi-asserted-by":"crossref","first-page":"e985","DOI":"10.7717\/peerj.985","article-title":"VirSorter: mining viral signal from microbial genomic data","volume":"3","author":"S Roux","year":"2015","journal-title":"PeerJ"},{"key":"pcbi.1012597.ref003","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1007\/978-3-642-34657-6_6","volume-title":"CRISPR-Cas systems","author":"RH Staals","year":"2013"},{"issue":"5962","key":"pcbi.1012597.ref004","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1126\/science.1179555","article-title":"CRISPR\/Cas, the immune system of bacteria and archaea","volume":"327","author":"P Horvath","year":"2010","journal-title":"Science"},{"issue":"6","key":"pcbi.1012597.ref005","doi-asserted-by":"crossref","first-page":"e1000079","DOI":"10.1371\/journal.ppat.1000079","article-title":"Patterns of evolution and host gene mimicry in influenza and other RNA viruses","volume":"4","author":"BD Greenbaum","year":"2008","journal-title":"PLoS pathogens"},{"key":"pcbi.1012597.ref006","article-title":"Attention is all you need","volume":"30","author":"A Vaswani","year":"2017","journal-title":"Advances in neural information processing 
systems"},{"issue":"15","key":"pcbi.1012597.ref007","doi-asserted-by":"crossref","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"A Rives","year":"2021","journal-title":"Proc Natl Acad Sci U S A"},{"key":"pcbi.1012597.ref008","article-title":"A framework for multiple-instance learning","volume":"10","author":"O Maron","year":"1997","journal-title":"Advances in neural information processing systems"},{"key":"pcbi.1012597.ref009","unstructured":"Ilse M, Tomczak J, Welling M. Attention-based deep multiple instance learning. In: International conference on machine learning. PMLR; 2018. p. 2127\u20132136."},{"key":"pcbi.1012597.ref010","doi-asserted-by":"crossref","unstructured":"Mihara T, Nishimura Y, Shimizu Y, Nishiyama H, Yoshikawa G, Uehara H, et al. Linking Virus Genomes with Host Taxonomy; 2016.","DOI":"10.3390\/v8030066"},{"issue":"1","key":"pcbi.1012597.ref011","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1016\/j.cels.2020.09.006","article-title":"A Sweep of Earth\u2019s Virome Reveals Host-Guided Viral Protein Structural Mimicry and Points to Determinants of Human Disease","volume":"12","author":"G Lasso","year":"2021","journal-title":"Cell Systems"},{"issue":"11","key":"pcbi.1012597.ref012","doi-asserted-by":"crossref","first-page":"1026","DOI":"10.1038\/nbt.3988","article-title":"MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets","volume":"35","author":"M Steinegger","year":"2017","journal-title":"Nature biotechnology"},{"issue":"5","key":"pcbi.1012597.ref013","doi-asserted-by":"crossref","first-page":"e1007894","DOI":"10.1371\/journal.pcbi.1007894","article-title":"Predicting host taxonomic information from viral genomes: A comparison of feature representations","volume":"16","author":"F Young","year":"2020","journal-title":"PLoS Comput 
Biol"},{"key":"pcbi.1012597.ref014","article-title":"iPHoP: an integrated machine-learning framework to maximize host prediction for metagenome-assembled virus genomes","author":"S Roux","year":"2022","journal-title":"bioRxiv"},{"issue":"19","key":"pcbi.1012597.ref015","doi-asserted-by":"crossref","first-page":"3113","DOI":"10.1093\/bioinformatics\/btx383","article-title":"WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs","volume":"33","author":"C Galiez","year":"2017","journal-title":"Bioinformatics"},{"issue":"1","key":"pcbi.1012597.ref016","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1093\/nar\/gkw1002","article-title":"Alignment-free d 2 * oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences","volume":"45","author":"NA Ahlgren","year":"2017","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"pcbi.1012597.ref017","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1186\/s12915-020-00938-6","article-title":"Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics","volume":"19","author":"C Lu","year":"2021","journal-title":"BMC Biol"},{"issue":"19","key":"pcbi.1012597.ref018","doi-asserted-by":"crossref","first-page":"3364","DOI":"10.1093\/bioinformatics\/btab222","article-title":"SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts","volume":"37","author":"R Zhang","year":"2021","journal-title":"Bioinformatics"},{"issue":"2","key":"pcbi.1012597.ref019","doi-asserted-by":"crossref","first-page":"lqaa044","DOI":"10.1093\/nargab\/lqaa044","article-title":"A network-based integrated framework for predicting virus\u2013prokaryote interactions","volume":"2","author":"W Wang","year":"2020","journal-title":"NAR genomics and bioinformatics"},{"key":"pcbi.1012597.ref020","doi-asserted-by":"crossref","unstructured":"Amgarten D, Iha BKV, Piroupo CM, da Silva AM, 
Setubal JC. vHULK, a new tool for bacteriophage host prediction based on annotated genomic features and deep neural networks; 2020.","DOI":"10.1101\/2020.12.06.413476"},{"issue":"9","key":"pcbi.1012597.ref021","doi-asserted-by":"crossref","first-page":"1236","DOI":"10.1093\/bioinformatics\/btu031","article-title":"InterProScan 5: genome-scale protein function classification","volume":"30","author":"P Jones","year":"2014","journal-title":"Bioinformatics"},{"key":"pcbi.1012597.ref022","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1007\/978-3-319-21903-5_8","article-title":"Hierarchical clustering","author":"F Nielsen","year":"2016","journal-title":"Introduction to HPC with MPI for Data Science"},{"issue":"7798","key":"pcbi.1012597.ref023","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1038\/s41586-020-2008-3","article-title":"A new coronavirus associated with human respiratory disease in China","volume":"579","author":"F Wu","year":"2020","journal-title":"Nature"},{"key":"pcbi.1012597.ref024","doi-asserted-by":"crossref","first-page":"466","DOI":"10.1016\/j.ijbiomac.2022.01.121","article-title":"Insights into the specificity for the interaction of the promiscuous SARS-CoV-2 nucleocapsid protein N-terminal domain with deoxyribonucleic acids","volume":"203","author":"IP Caruso","year":"2022","journal-title":"International journal of biological macromolecules"},{"issue":"D1","key":"pcbi.1012597.ref025","doi-asserted-by":"crossref","first-page":"D553","DOI":"10.1093\/nar\/gkt1274","article-title":"RefSeq microbial genomes database: new representation and annotation strategy","volume":"42","author":"T Tatusova","year":"2014","journal-title":"Nucleic acids research"},{"issue":"28","key":"pcbi.1012597.ref026","doi-asserted-by":"crossref","first-page":"E288","DOI":"10.1073\/pnas.1101595108","article-title":"Statistical structure of host\u2013phage interactions","volume":"108","author":"CO Flores","year":"2011","journal-title":"Proceedings of the 
National Academy of Sciences"},{"key":"pcbi.1012597.ref027","first-page":"1","volume-title":"BMC bioinformatics, vol. 7","author":"A Ben-Hur","year":"2006"},{"issue":"14","key":"pcbi.1012597.ref028","first-page":"151","article-title":"Computational prediction of inter-species relationships through omics data analysis and machine learning","volume":"19","author":"DMC Leite","year":"2018","journal-title":"BMC bioinformatics"},{"key":"pcbi.1012597.ref029","doi-asserted-by":"crossref","unstructured":"L\u00f3pez JF, Sotelo JAL, Leite D, Pe\u00f1a-Reyes C. Applying one-class learning algorithms to predict phage-bacteria interactions. In: 2019 IEEE Latin American Conference on Computational Intelligence (LA-CCI); 2019. p. 1\u20136.","DOI":"10.1109\/LA-CCI47412.2019.9037032"},{"issue":"11","key":"pcbi.1012597.ref030","doi-asserted-by":"crossref","first-page":"4337","DOI":"10.1073\/pnas.0607879104","article-title":"Predicting protein\u2013protein interactions based only on sequences information","volume":"104","author":"J Shen","year":"2007","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"pcbi.1012597.ref031","unstructured":"Salakhutdinov SRBPR, Zaheer AJSM, Kottur S. Deep Sets. Advances in Neural Information Processing (NIPS). 
2017;."},{"issue":"suppl_1","key":"pcbi.1012597.ref032","doi-asserted-by":"crossref","first-page":"D258","DOI":"10.1093\/nar\/gkh036","article-title":"The Gene Ontology (GO) database and informatics resource","volume":"32","author":"GO Consortium","year":"2004","journal-title":"Nucleic acids research"},{"issue":"1","key":"pcbi.1012597.ref033","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1111\/2041-210X.12628","article-title":"ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data","volume":"8","author":"G Yu","year":"2017","journal-title":"Methods in Ecology and Evolution"}],"updated-by":[{"DOI":"10.1371\/journal.pcbi.1012597","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2024,12,3]],"date-time":"2024-12-03T00:00:00Z","timestamp":1733184000000}}],"container-title":["PLOS Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1012597","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,3]],"date-time":"2024-12-03T13:41:57Z","timestamp":1733233317000},"score":1,"resource":{"primary":{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1012597"}},"subtitle":[],"editor":[{"given":"Fuhai","family":"Li","sequence":"first","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2024,11,19]]},"references-count":33,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2024,11,19]]}},"URL":"https:\/\/doi.org\/10.1371\/journal.pcbi.1012597","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2023.04.07.536023","asserted-by":"object"}]},"ISSN":["1553-7358"],"issn-type":[{"value":"1553-7358","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,19]]}}}