{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T18:55:34Z","timestamp":1775156134857,"version":"3.50.1"},"update-to":[{"DOI":"10.1371\/journal.pcbi.1014125","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T00:00:00Z","timestamp":1775088000000}}],"reference-count":58,"publisher":"Public Library of Science (PLoS)","issue":"3","license":[{"start":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T00:00:00Z","timestamp":1774569600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/publicdomain\/zero\/1.0\/"}],"funder":[{"DOI":"10.13039\/100008902","name":"Los Alamos National Laboratory","doi-asserted-by":"publisher","award":["20230044D"],"award-info":[{"award-number":["20230044D"]}],"id":[{"id":"10.13039\/100008902","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["www.ploscompbiol.org"],"crossmark-restriction":false},"short-container-title":["PLoS Comput Biol"],"abstract":"<jats:p>\n                    There have been several attempts to develop machine learning (ML) models to identify human infecting viruses from their genomic sequences, with varying degrees of success. Direct comparison between models is problematic, because these models are typically trained and evaluated on different datasets with alternative data splitting schemes, features, and model performance metrics. In this paper we present a standardized dataset of mammal infecting and non-infecting viral pathogens, refined from the previous work of Mollentze et al. to include the latest literature evidence, roughly doubling the number of curated host-virus records available to the community, and new host target labels, primate and mammal. The new host labels were included for several reasons, including previous reports that classification performance is better at broader taxonomic ranks and the idea that there may be more data for primate infection that might serve as a suitable proxy for zoonotic potential and avoidance of false positives for human infection due to absence of evidence. On this dataset, we report the performance of eight machine learning models for predicting mammal-infecting viruses from their genomic sequences. We find that randomly assigning cases in our improved dataset to training\/testing sets, when compared to the original assignments into training\/testing in Mollentze et al., increases the overall average ROC AUC of prediction of human infection from\n                    <jats:bold>0.663 \u00b1 0.070<\/jats:bold>\n                    to\n                    <jats:bold>0.784 \u00b1 0.013<\/jats:bold>\n                    , consistent with the reduction in phylogenetic distance between train and test sets (relative entropy change from 3.00 to 0.08). The broadest host category of mammal infection can be predicted most reliably at\n                    <jats:bold>0.850 \u00b1 0.020<\/jats:bold>\n                    . We share our improved dataset and code to enable standardized comparisons of machine learning methods to predict human host infections. Overall, we have presented preliminary evidence that classification of virus host infection is more tractable at higher taxonomic ranks, that unsurprisingly reducing the phylogenetic distance between training and test sets can improve predictive performance, that peptide kmer features appear to be harmful to out of sample model performance, and we are left with the question of whether models for virus host prediction can reasonably be expected to perform well in out of sample scenarios given the likelihood that viruses do not share a common ancestor. Consistent with this concern, when the data is resampled such that there is no overlap between viral families in training and test sets (relative entropy\u2009&gt;\u2009\n                    <jats:bold>24<\/jats:bold>\n                    ), models perform no better than random chance at prediction of human infection regardless of whether kmers are included (ROC AUC\n                    <jats:bold>0.50 \u00b1 0.08<\/jats:bold>\n                    ) or not (ROC AUC\n                    <jats:bold>0.50 \u00b1 0.04<\/jats:bold>\n                    ).\n                  <\/jats:p>","DOI":"10.1371\/journal.pcbi.1014125","type":"journal-article","created":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T17:49:32Z","timestamp":1774633772000},"page":"e1014125","update-policy":"https:\/\/doi.org\/10.1371\/journal.pcbi.corrections_policy","source":"Crossref","is-referenced-by-count":0,"title":["An improved dataset for predicting mammal infecting viruses from genetic sequence information"],"prefix":"10.1371","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2364-6157","authenticated-orcid":true,"given":"Tyler","family":"Reddy","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Austin","family":"Schneider","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Aaron R.","family":"Hall","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Adam","family":"Witmer","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nick","family":"Hengartner","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"340","published-online":{"date-parts":[[2026,3,27]]},"reference":[{"key":"pcbi.1014125.ref001","doi-asserted-by":"crossref","DOI":"10.3389\/fmicb.2020.631736","article-title":"Pandemics throughout history","volume":"11","author":"J Piret","year":"2021","journal-title":"Front Microbiol"},{"issue":"9","key":"pcbi.1014125.ref002","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pbio.3001390","article-title":"Identifying and prioritizing potential human-infecting viruses from their genome sequences","volume":"19","author":"N Mollentze","year":"2021","journal-title":"PLoS Biol"},{"issue":"15","key":"pcbi.1014125.ref003","doi-asserted-by":"crossref","DOI":"10.1073\/pnas.2002324118","article-title":"Ranking the risk of animal-to-human spillover for newly discovered viruses","volume":"118","author":"ZL Grange","year":"2021","journal-title":"Proc Natl Acad Sci U S A"},{"issue":"7660","key":"pcbi.1014125.ref004","doi-asserted-by":"crossref","first-page":"646","DOI":"10.1038\/nature22975","article-title":"Host and viral traits predict zoonotic spillover from mammals","volume":"546","author":"KJ Olival","year":"2017","journal-title":"Nature"},{"issue":"1","key":"pcbi.1014125.ref005","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1038\/s42003-025-07746-0","article-title":"Viral genomic features predict Orthopoxvirus reservoir hosts","volume":"8","author":"KK Tseng","year":"2025","journal-title":"Commun Biol"},{"issue":"29","key":"pcbi.1014125.ref006","doi-asserted-by":"crossref","first-page":"8722","DOI":"10.1021\/jp302103t","article-title":"A comparison of multiscale methods for the analysis of molecular dynamics simulations","volume":"116","author":"NC Benson","year":"2012","journal-title":"J Phys Chem B"},{"issue":"7","key":"pcbi.1014125.ref007","doi-asserted-by":"crossref","first-page":"1698","DOI":"10.1016\/j.cell.2016.05.040","article-title":"Breaking Cryo-EM Resolution Barriers to Facilitate Drug Discovery","volume":"165","author":"A Merk","year":"2016","journal-title":"Cell"},{"issue":"7873","key":"pcbi.1014125.ref008","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with AlphaFold","volume":"596","author":"J Jumper","year":"2021","journal-title":"Nature"},{"issue":"8","key":"pcbi.1014125.ref009","doi-asserted-by":"crossref","first-page":"2102","DOI":"10.1093\/bioinformatics\/btac020","article-title":"ProteinBERT: a universal deep-learning model of protein sequence and function","volume":"38","author":"N Brandes","year":"2022","journal-title":"Bioinformatics"},{"key":"pcbi.1014125.ref010","volume-title":"Introduction to modern virology","author":"NJ Dimmock","year":"2016"},{"key":"pcbi.1014125.ref011","article-title":"GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding","author":"A Wang","year":"2018","journal-title":"CoRR"},{"issue":"3","key":"pcbi.1014125.ref012","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"ImageNet Large Scale Visual Recognition Challenge","volume":"115","author":"O Russakovsky","year":"2015","journal-title":"Int J Comput Vis"},{"issue":"6","key":"pcbi.1014125.ref013","doi-asserted-by":"crossref","first-page":"2517","DOI":"10.1111\/tbed.13314","article-title":"Rapid identification of human-infecting viruses","volume":"66","author":"Z Zhang","year":"2019","journal-title":"Transbound Emerg Dis"},{"issue":"1","key":"pcbi.1014125.ref014","article-title":"Interpretable detection of novel human viruses from genome sequencing data","volume":"3","author":"JM Bartoszewicz","year":"2021","journal-title":"NAR Genom Bioinform"},{"issue":"1604","key":"pcbi.1014125.ref015","doi-asserted-by":"crossref","first-page":"2864","DOI":"10.1098\/rstb.2011.0354","article-title":"Human viruses: discovery and emergence","volume":"367","author":"M Woolhouse","year":"2012","journal-title":"Philos Trans R Soc Lond B Biol Sci"},{"issue":"12","key":"pcbi.1014125.ref016","doi-asserted-by":"crossref","first-page":"1232","DOI":"10.1016\/j.tim.2022.07.002","article-title":"Disease-causing human viruses: novelty and legacy","volume":"30","author":"D Forni","year":"2022","journal-title":"Trends Microbiol"},{"issue":"5","key":"pcbi.1014125.ref017","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1007894","article-title":"Predicting host taxonomic information from viral genomes: A comparison of feature representations","volume":"16","author":"F Young","year":"2020","journal-title":"PLoS Comput Biol"},{"issue":"8","key":"pcbi.1014125.ref018","doi-asserted-by":"crossref","first-page":"1444","DOI":"10.1038\/s41592-024-02362-y","article-title":"Guiding questions to avoid data leakage in biological machine learning applications","volume":"21","author":"J Bernett","year":"2024","journal-title":"Nat Methods"},{"issue":"10","key":"pcbi.1014125.ref019","doi-asserted-by":"crossref","first-page":"101046","DOI":"10.1016\/j.patter.2024.101046","article-title":"Avoiding common machine learning pitfalls","volume":"5","author":"MA Lones","year":"2024","journal-title":"Patterns (N Y)"},{"key":"pcbi.1014125.ref020","doi-asserted-by":"crossref","unstructured":"Japkowicz N. 8. In: Assessment Metrics for Imbalanced Learning. John Wiley & Sons, Ltd; 2013. p. 187\u2013206. Available from: https:\/\/onlinelibrary.wiley.com\/doi\/abs\/10.1002\/9781118646106.ch8","DOI":"10.1002\/9781118646106.ch8"},{"key":"pcbi.1014125.ref021","first-page":"1015","article-title":"Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation.","volume-title":"AI 2006: Advances in Artificial Intelligence","author":"M Sokolova","year":"2006"},{"key":"pcbi.1014125.ref022","doi-asserted-by":"crossref","first-page":"220","DOI":"10.1016\/j.eswa.2016.12.035","article-title":"Learning from class-imbalanced data: Review of methods and applications","volume":"73","author":"G Haixiang","year":"2017","journal-title":"Exp Syst Appl"},{"key":"pcbi.1014125.ref023","first-page":"110","article-title":"Machine learning driven design of experiments for predictive models in production systems","author":"S Maier","year":"2023"},{"issue":"1","key":"pcbi.1014125.ref024","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1093\/bioinformatics\/btz541","article-title":"DeePaC: predicting pathogenic potential of novel DNA with reverse-complement neural networks","volume":"36","author":"JM Bartoszewicz","year":"2020","journal-title":"Bioinformatics"},{"key":"pcbi.1014125.ref025","doi-asserted-by":"crossref","first-page":"180017","DOI":"10.1038\/sdata.2018.17","article-title":"Epidemiological characteristics of human-infective RNA viruses","volume":"5","author":"MEJ Woolhouse","year":"2018","journal-title":"Sci Data"},{"issue":"17","key":"pcbi.1014125.ref026","doi-asserted-by":"crossref","first-page":"9423","DOI":"10.1073\/pnas.1919176117","article-title":"Viral zoonotic risk is homogenous among taxonomic orders of mammalian and avian reservoir hosts","volume":"117","author":"N Mollentze","year":"2020","journal-title":"Proc Natl Acad Sci U S A"},{"key":"pcbi.1014125.ref027","article-title":"Scikit-learn: Machine Learning in Python","author":"F Pedregosa","year":"2012","journal-title":"CoRR"},{"key":"pcbi.1014125.ref028","article-title":"LightGBM: A Highly Efficient Gradient Boosting Decision Tree.","volume-title":"Advances in Neural Information Processing Systems","author":"G Ke","year":"2017"},{"key":"pcbi.1014125.ref029","doi-asserted-by":"crossref","unstructured":"Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD \u201916. New York (NY): Association for Computing Machinery; 2016. p. 785\u201394. Available from: https:\/\/doi.org\/10.1145\/2939672.2939785","DOI":"10.1145\/2939672.2939785"},{"issue":"1924","key":"pcbi.1014125.ref030","first-page":"20192736","article-title":"Global shifts in mammalian population trends reveal key predictors of virus spillover risk","volume":"287","author":"CK Johnson","year":"2020","journal-title":"Proc Biol Sci"},{"issue":"6","key":"pcbi.1014125.ref031","doi-asserted-by":"crossref","first-page":"1115","DOI":"10.1007\/s00018-014-1785-y","article-title":"Detecting the emergence of novel, zoonotic viruses pathogenic to humans","volume":"72","author":"R Rosenberg","year":"2015","journal-title":"Cell Mol Life Sci"},{"key":"pcbi.1014125.ref032","first-page":"489","article-title":"DART: Dropouts meet Multiple Additive Regression Trees.","volume-title":"Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. vol. 38 of Proceedings of Machine Learning Research","author":"R Korlakai Vinayak","year":"2015"},{"issue":"1","key":"pcbi.1014125.ref033","doi-asserted-by":"crossref","first-page":"913","DOI":"10.1186\/1471-2164-15-913","article-title":"Detection of atypical genes in virus families using a one-class SVM","volume":"15","author":"S Metzler","year":"2014","journal-title":"BMC Genomics"},{"issue":"5","key":"pcbi.1014125.ref034","doi-asserted-by":"crossref","first-page":"1842","DOI":"10.1002\/hbm.23140","article-title":"Classification based hypothesis testing in neuroscience: Below-chance level classification rates and overlooked statistical properties of linear parametric classifiers","volume":"37","author":"H Jamalabadi","year":"2016","journal-title":"Hum Brain Mapp"},{"key":"pcbi.1014125.ref035","doi-asserted-by":"crossref","first-page":"1885","DOI":"10.1099\/0022-1317-76-8-1885","article-title":"Ribosomal frameshifting viral RNAs","author":"I Brierley","year":"1995","journal-title":"J Gen Virol"},{"issue":"1","key":"pcbi.1014125.ref036","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1038\/s41579-025-01189-4","article-title":"Evolution, spread and impact of highly pathogenic H5 avian influenza A viruses","volume":"24","author":"B Bellido-Mart\u00edn","year":"2026","journal-title":"Nat Rev Microbiol"},{"issue":"4","key":"pcbi.1014125.ref037","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pbio.3002083","article-title":"iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria","volume":"21","author":"S Roux","year":"2023","journal-title":"PLoS Biol"},{"issue":"11","key":"pcbi.1014125.ref038","article-title":"Prediction of virus-host associations using protein language models and multiple instance learning","volume":"20","author":"D Liu","year":"2024","journal-title":"PLoS Comput Biol"},{"issue":"15","key":"pcbi.1014125.ref039","doi-asserted-by":"crossref","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"A Rives","year":"2021","journal-title":"Proc Natl Acad Sci U S A"},{"key":"pcbi.1014125.ref040","unstructured":"Reddi VJ, Cheng C, Kanter D, Mattson P, Schmuelling G, Wu CJ, et al. MLPerf Inference Benchmark. 2019."},{"key":"pcbi.1014125.ref041","doi-asserted-by":"crossref","first-page":"740","DOI":"10.1007\/978-3-319-10602-1_48","article-title":"Microsoft COCO: Common Objects in Context.","volume-title":"Computer Vision \u2013 ECCV 2014","author":"TY Lin","year":"2014"},{"key":"pcbi.1014125.ref042","unstructured":"On Taxonomy of Viruses (ICTV) IC; 2025. Available from: http:\/\/ictv.global\/taxonomy"},{"issue":"3","key":"pcbi.1014125.ref043","doi-asserted-by":"crossref","first-page":"66","DOI":"10.3390\/v8030066","article-title":"Linking Virus Genomes with Host Taxonomy","volume":"8","author":"T Mihara","year":"2016","journal-title":"Viruses"},{"issue":"1","key":"pcbi.1014125.ref044","doi-asserted-by":"crossref","first-page":"247","DOI":"10.1099\/0022-1317-43-1-247","article-title":"Regulation of the interferon system: evidence that Vero cells have a genetic defect in interferon production","volume":"43","author":"JM Emeny","year":"1979","journal-title":"J Gen Virol"},{"issue":"5","key":"pcbi.1014125.ref045","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0096934","article-title":"Detection of antibodies against Turkey astrovirus in humans","volume":"9","author":"VA Meliopoulos","year":"2014","journal-title":"PLoS One"},{"key":"pcbi.1014125.ref046","unstructured":"Bergstra J, Bardenet R, Bengio Y, K\u00e9gl B. Algorithms for hyper-parameter optimization. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. NIPS\u201911. Red Hook (NY): Curran Associates Inc.; 2011. p. 2546\u20132554."},{"key":"pcbi.1014125.ref047","doi-asserted-by":"crossref","unstructured":"Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-generation Hyperparameter Optimization Framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD \u201919. New York (NY): Association for Computing Machinery; 2019. p. 2623\u201331. Available from: https:\/\/doi.org\/10.1145\/3292500.3330701","DOI":"10.1145\/3292500.3330701"},{"key":"pcbi.1014125.ref048","article-title":"Tune: A Research Platform for Distributed Model Selection and Training","author":"R Liaw","year":"2018","journal-title":"arXiv preprint"},{"key":"pcbi.1014125.ref049","unstructured":"Amazon Web Services I. Tune an XGBoost Model - Amazon SageMaker AI. 2025. Available from: https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/xgboost-tuning.html"},{"key":"pcbi.1014125.ref050","unstructured":"Amazon Web Services I. Tune a LightGBM model - Amazon SageMaker AI. 2025. Available from: https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/lightgbm-tuning.html"},{"issue":"3","key":"pcbi.1014125.ref051","doi-asserted-by":"crossref","first-page":"261","DOI":"10.1038\/s41592-019-0686-2","article-title":"SciPy 1.0: fundamental algorithms for scientific computing in Python","volume":"17","author":"P Virtanen","year":"2020","journal-title":"Nat Methods"},{"issue":"7825","key":"pcbi.1014125.ref052","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/s41586-020-2649-2","article-title":"Array programming with NumPy","volume":"585","author":"CR Harris","year":"2020","journal-title":"Nature"},{"key":"pcbi.1014125.ref053","unstructured":"Vink R. Polars: Blazingly fast DataFrames in Rust, Python, Node.js, R, and SQL. https:\/\/github.com\/pola-rs\/polars"},{"key":"pcbi.1014125.ref054","unstructured":"McKinney W, Team P. Pandas-Powerful Python Data Analysis Toolkit. Pandas\u2014Powerful Python Data Analysis Toolkit. 2015;1625."},{"issue":"3","key":"pcbi.1014125.ref055","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1109\/MCSE.2007.55","article-title":"Matplotlib: A 2D Graphics Environment","volume":"9","author":"JD Hunter","year":"2007","journal-title":"Comput Sci Eng"},{"issue":"60","key":"pcbi.1014125.ref056","doi-asserted-by":"crossref","first-page":"3021","DOI":"10.21105\/joss.03021","article-title":"seaborn: statistical data visualization","volume":"6","author":"M Waskom","year":"2021","journal-title":"JOSS"},{"issue":"11","key":"pcbi.1014125.ref057","doi-asserted-by":"crossref","DOI":"10.1093\/nar\/gkz173","article-title":"MitoZ: a toolkit for animal mitochondrial genome assembly, annotation and visualization","volume":"47","author":"G Meng","year":"2019","journal-title":"Nucleic Acids Res"},{"issue":"6","key":"pcbi.1014125.ref058","doi-asserted-by":"crossref","first-page":"1635","DOI":"10.1093\/molbev\/msw046","article-title":"ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data","volume":"33","author":"J Huerta-Cepas","year":"2016","journal-title":"Mol Biol Evol"}],"updated-by":[{"DOI":"10.1371\/journal.pcbi.1014125","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T00:00:00Z","timestamp":1775088000000}}],"container-title":["PLOS Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1014125","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T17:52:52Z","timestamp":1775152372000},"score":1,"resource":{"primary":{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1014125"}},"subtitle":[],"editor":[{"given":"Peter M","family":"Kasson","sequence":"first","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2026,3,27]]},"references-count":58,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2026,3,27]]}},"URL":"https:\/\/doi.org\/10.1371\/journal.pcbi.1014125","relation":{},"ISSN":["1553-7358"],"issn-type":[{"value":"1553-7358","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,27]]}}}