{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T00:54:25Z","timestamp":1775264065781,"version":"3.50.1"},"reference-count":68,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2023,2,17]],"date-time":"2023-02-17T00:00:00Z","timestamp":1676592000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Research Foundation (NRF) of South Africa","award":["130187"],"award-info":[{"award-number":["130187"]}]},{"name":"National Research Foundation (NRF) of South Africa","award":["N\/A"],"award-info":[{"award-number":["N\/A"]}]},{"name":"College of Health Sciences (CHS) of the University of KwaZulu-Natal (UKZN) in Durban, KwaZulu-Natal, South Africa","award":["130187"],"award-info":[{"award-number":["130187"]}]},{"name":"College of Health Sciences (CHS) of the University of KwaZulu-Natal (UKZN) in Durban, KwaZulu-Natal, South Africa","award":["N\/A"],"award-info":[{"award-number":["N\/A"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Reverse vaccinology (RV) is a computer-aided approach for vaccine development that identifies a subset of pathogen proteins as protective antigens (PAgs) or potential vaccine candidates. Machine learning (ML)-based RV is promising, but requires a dataset of PAgs (positives) and non-protective protein sequences (negatives). This study aimed to create an ML dataset, VPAgs-Dataset4ML, to predict viral PAgs based on PAgs obtained from Protegen. We performed seven steps to identify PAgs from the Protegen website and non-protective protein sequences from Universal Protein Resource (UniProt). 
The seven steps included downloading viral PAgs from Protegen, performing quality checks on PAgs using the standard BLASTp identity check \u226430% via MMseqs2, and computational steps running on Google Colaboratory and the Ubuntu terminal to retrieve and perform quality checks (similar to the PAgs) on non-protective protein sequences as negatives from UniProt. VPAgs-Dataset4ML contains 2145 viral protein sequences, with 210 PAgs in positive.fasta and 1935 non-protective protein sequences in negative.fasta. This dataset can be used to train ML models to predict antigens for various viral pathogens with the aim of developing effective vaccines.<\/jats:p>","DOI":"10.3390\/data8020041","type":"journal-article","created":{"date-parts":[[2023,2,20]],"date-time":"2023-02-20T03:56:07Z","timestamp":1676865367000},"page":"41","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["VPAgs-Dataset4ML: A Dataset to Predict Viral Protective Antigens for Machine Learning-Based Reverse Vaccinology"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7774-1104","authenticated-orcid":false,"given":"Zakia","family":"Salod","sequence":"first","affiliation":[{"name":"Discipline of Public Health Medicine, University of KwaZulu-Natal, Durban 4051, South Africa"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8076-0453","authenticated-orcid":false,"given":"Ozayr","family":"Mahomed","sequence":"additional","affiliation":[{"name":"Discipline of Public Health Medicine, University of KwaZulu-Natal, Durban 4051, South Africa"},{"name":"Dasman Diabetes Institute, P.O. Box 1180, Dasman 15462, Kuwait City, Kuwait"}]}],"member":"1968","published-online":{"date-parts":[[2023,2,17]]},"reference":[{"key":"ref_1","unstructured":"Our World in Data (2023, January 10). Death Rate from Infectious Diseases, 1990 to 2019. 
Available online: https:\/\/ourworldindata.org\/grapher\/infectious-disease-death-rates."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1204","DOI":"10.1016\/S0140-6736(20)30925-9","article-title":"Global burden of 369 diseases and injuries in 204 countries and territories, 1990\u20132019: A systematic analysis for the Global Burden of Disease Study 2019","volume":"396","author":"Vos","year":"2020","journal-title":"Lancet"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"990","DOI":"10.1038\/nature06536","article-title":"Global trends in emerging infectious diseases","volume":"451","author":"Jones","year":"2008","journal-title":"Nature"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1842","DOI":"10.3201\/eid1112.050997","article-title":"Host range and emerging and reemerging pathogens","volume":"11","author":"Woolhouse","year":"2005","journal-title":"Emerg. Infect. Dis."},{"key":"ref_5","first-page":"69","article-title":"1918 Influenza: The mother of all pandemics","volume":"17","author":"Taubenberger","year":"2006","journal-title":"Rev. Biomed."},{"key":"ref_6","first-page":"584","article-title":"Statistics of Influenza Morbidity: With Special Reference to Certain Factors in Case Incidence and Case Fatality","volume":"35","author":"Frost","year":"1920","journal-title":"Public Heal. Rep. 1896\u20131970"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"105","DOI":"10.1353\/bhm.2002.0022","article-title":"Updating the Accounts: Global Mortality of the 1918\u20131920 \u201cSpanish\u201d Influenza Pandemic","volume":"76","author":"Johnson","year":"2002","journal-title":"Bull. Hist. Med."},{"key":"ref_8","unstructured":"World Health Organization (2022, October 22). Ebola Virus Disease. 
Available online: https:\/\/www.who.int\/news-room\/fact-sheets\/detail\/ebola-virus-disease."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"2414","DOI":"10.1056\/NEJMp068074","article-title":"The HIV\u2013AIDS pandemic at 25\u2014The global response","volume":"354","author":"Merson","year":"2006","journal-title":"N. Engl. J. Med."},{"key":"ref_10","unstructured":"World Health Organization (2022, November 10). HIV\/AIDS. Available online: https:\/\/www.who.int\/news-room\/fact-sheets\/detail\/hiv-aids."},{"key":"ref_11","unstructured":"Cherry, J.D., Demmler, G.J., and Kaplan, S. (2003). Severe Acute Respiratory Syndrome (SARS) In: Textbook of Paediatric Infectious Diseases, Feigin, R.D., Elsevier."},{"key":"ref_12","unstructured":"World Health Organization (2022, November 10). Summary of Probable SARS Cases with Onset of Illness from 1 November 2002 to 31 July 2003. Available online: https:\/\/www.who.int\/publications\/m\/item\/summary-of-probable-sars-cases-with-onset-of-illness-from-1-november-2002-to-31-july-2003."},{"key":"ref_13","unstructured":"World Health Organization (2003). Consensus Document on the Epidemiology of Severe Acute Respiratory syndrome (SARS), World Health Organization."},{"key":"ref_14","unstructured":"Worldometers (2022, November 10). COVID-19 Coronavirus Pandemic. Available online: https:\/\/www.worldometers.info\/coronavirus\/."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"596","DOI":"10.1016\/S0264-410X(02)00623-0","article-title":"The global value of vaccination","volume":"21","author":"Ehreth","year":"2003","journal-title":"Vaccine"},{"key":"ref_16","first-page":"1","article-title":"Modeling the impact of vaccination for the immunization agenda 2030: Deaths averted due to vaccination against 14 pathogens in 194 countries from 2021\u20132030","volume":"2030","author":"Carter","year":"2021","journal-title":"Ann Hutubessy Raymond CW Model. Impact Vaccin. Immun. 
Agenda"},{"key":"ref_17","unstructured":"Centers for Disease Control and Prevention (2022, November 05). Fast Facts on Global Immunization, Available online: https:\/\/www.cdc.gov\/globalhealth\/immunization\/data\/fast-facts.html#:~:text=Immunization%20Prevents%20Death%20Worldwide,save%20nearly%2019%20million%20lives."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1232910","DOI":"10.1126\/science.1232910","article-title":"Accelerating Next-Generation Vaccine Development for Global Disease Prevention","volume":"340","author":"Koff","year":"2013","journal-title":"Science"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"445","DOI":"10.1016\/S1369-5274(00)00119-3","article-title":"Reverse vaccinology","volume":"3","author":"Rappuoli","year":"2000","journal-title":"Curr. Opin. Microbiol."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"113","DOI":"10.3389\/fimmu.2019.00113","article-title":"Comparison of open-source reverse vaccinology programs for bacterial vaccine antigen discovery","volume":"10","author":"Dalsass","year":"2019","journal-title":"Front. Immunol."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1816","DOI":"10.1126\/science.287.5459.1816","article-title":"Identification of Vaccine Candidates Against Serogroup B Meningococcus by Whole-Genome Sequencing","volume":"287","author":"Pizza","year":"2000","journal-title":"Science"},{"key":"ref_22","first-page":"608","article-title":"Use of serogroup B meningococcal vaccines in persons aged \u226510 years at increased risk for serogroup B meningococcal disease: Recommendations of the Advisory Committee on Immunization Practices, 2015","volume":"64","author":"Folaranmi","year":"2015","journal-title":"MMWR. Morb. Mortal. Wkly. Rep."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"305","DOI":"10.1179\/2047773214Y.0000000162","article-title":"Bexsero\u00ae chronicle","volume":"108","author":"Vernikos","year":"2014","journal-title":"Pathog. Glob. 
Health"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"3185","DOI":"10.1093\/bioinformatics\/btaa119","article-title":"Vaxign-ML: Supervised machine learning reverse vaccinology model for improved prediction of bacterial protective antigens","volume":"36","author":"Ong","year":"2020","journal-title":"Bioinformatics"},{"key":"ref_25","first-page":"3","article-title":"Supervised machine learning: A review of classification techniques","volume":"160","author":"Kotsiantis","year":"2007","journal-title":"Emerg. Artif. Intell. Appl. Comput. Eng."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Vapnik, V.N. (1995). The Nature of Statistical Learning Theory, Springer Science & Business Media.","DOI":"10.1007\/978-1-4757-2440-0"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1023\/A:1009715923555","article-title":"A Tutorial on Support Vector Machines for Pattern Recognition","volume":"2","author":"Burges","year":"1998","journal-title":"Data Min. Knowl. Discov."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Doytchinova, I.A., and Flower, D.R. (2007). VaxiJen: A server for prediction of protective antigens, tumour antigens and subunit vaccines. 
BMC Bioinform., 8.","DOI":"10.1186\/1471-2105-8-4"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"2936","DOI":"10.1093\/bioinformatics\/btq551","article-title":"High-throughput prediction of protein antigenicity using protein microarray data","volume":"26","author":"Magnan","year":"2010","journal-title":"Bioinformatics"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"8156","DOI":"10.1016\/j.vaccine.2011.07.142","article-title":"Improving reverse vaccinology with a machine learning approach","volume":"29","author":"Bowman","year":"2011","journal-title":"Vaccine"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Heinson, A.I., Gunawardana, Y., Moesker, B., Hume, C.C.D., Vataga, E., Hall, Y., Stylianou, E., McShane, H., Williams, A., and Niranjan, M. (2017). Enhancing the Biological Relevance of Machine Learning Classifiers for Reverse Vaccinology. Int. J. Mol. Sci., 18.","DOI":"10.3390\/ijms18020312"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1016\/j.artmed.2018.12.010","article-title":"Antigenic: An improved prediction model of protective antigens","volume":"94","author":"Rahman","year":"2019","journal-title":"Artif. Intell. Med."},{"key":"ref_33","unstructured":"Kohavi, R. (1995, January 20\u201325). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. Proceedings of the 1995 International Joint Conference, Montreal, QC, Canada."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Vivona, S., Bernante, F., and Filippini, F. (2006). NERVE: New enhanced reverse vaccinology environment. BMC Biotechnol., 6.","DOI":"10.1186\/1472-6750-6-35"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"297505","DOI":"10.1155\/2010\/297505","article-title":"Vaxign: The First Web-Based Vaccine Design Program for Reverse Vaccinology and Applications for Vaccine Development","volume":"2010","author":"He","year":"2010","journal-title":"J. Biomed. 
Biotechnol."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Jaiswal, V., Chanumolu, S.K., Gupta, A., Chauhan, R.S., and Rout, C. (2013). Jenner-predict server: Prediction of protein vaccine candidates (PVCs) in bacteria based on host-pathogen interactions. BMC Bioinform., 14.","DOI":"10.1186\/1471-2105-14-211"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Rizwan, M., Naz, A., Ahmad, J., Naz, K., Obaid, A., Parveen, T., Ahsan, M., and Ali, A. (2017). VacSol: A high throughput in silico pipeline to predict potential therapeutic targets in prokaryotic pathogens using subtractive reverse vaccinology. BMC Bioinform., 18.","DOI":"10.1186\/s12859-017-1540-0"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"W671","DOI":"10.1093\/nar\/gkab279","article-title":"Vaxign2: The second generation of the first Web-based vaccine design program using reverse vaccinology and machine learning","volume":"49","author":"Ong","year":"2021","journal-title":"Nucleic Acids Res."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"D1073","DOI":"10.1093\/nar\/gkq944","article-title":"Protegen: A web-based protective antigen database and analysis system","volume":"39","author":"Yang","year":"2011","journal-title":"Nucleic Acids Res."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"UniProt Consortium (2007). The universal protein resource (UniProt). Nucleic Acids Res., 36, D190\u2013D195.","DOI":"10.1093\/nar\/gkm895"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J. Mol. Biol."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"3.1.1","DOI":"10.1002\/0471250953.bi0301s42","article-title":"An introduction to sequence similarity (\u201chomology\u201d) searching","volume":"42","author":"Pearson","year":"2013","journal-title":"Curr. Protoc. 
Bioinform."},{"key":"ref_43","unstructured":"Anaconda Software Distribution (2022, October 30). Conda. Available online: https:\/\/www.anaconda.com\/."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1026","DOI":"10.1038\/nbt.3988","article-title":"MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets","volume":"35","author":"Steinegger","year":"2017","journal-title":"Nat. Biotechnol."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"1323","DOI":"10.1093\/bioinformatics\/btw006","article-title":"MMseqs software suite for fast and deep clustering and searching of large protein sequence sets","volume":"32","author":"Hauser","year":"2016","journal-title":"Bioinformatics"},{"key":"ref_46","unstructured":"Reback, J., McKinney, W., Van Den Bossche, J., Augspurger, T., Cloud, P., Klein, A., Hawkins, S., Roeschke, M., Tratner, J., and She, C. (2020). pandas-dev\/pandas: Pandas 1.0. 5. Zenodo."},{"key":"ref_47","unstructured":"McKinney, W. (July, January 28). Data structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, Austin, TX, USA."},{"key":"ref_48","first-page":"1","article-title":"Pandas: A foundational Python library for data analysis and statistics","volume":"14","author":"McKinney","year":"2011","journal-title":"Python High Perform. Sci. Comput."},{"key":"ref_49","unstructured":"Richardson, L. (2022, October 30). Beautiful Soup Documentation. Available online: https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Bisong, E. (2019). Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners, Apress.","DOI":"10.1007\/978-1-4842-4470-8"},{"key":"ref_51","unstructured":"Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., and Magrane, M. (2022, March 11). UniProt. 
Available online: https:\/\/www.uniprot.org\/."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"1422","DOI":"10.1093\/bioinformatics\/btp163","article-title":"Biopython: Freely available Python tools for computational molecular biology and bioinformatics","volume":"25","author":"Cock","year":"2009","journal-title":"Bioinformatics"},{"key":"ref_53","unstructured":"Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., and Devin, M. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. arXiv."},{"key":"ref_54","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_55","unstructured":"Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch. NIPS 2017 Workshop Autodiff."},{"key":"ref_56","unstructured":"Frank, E., Hall, M.A., and Witten, I.H. (2016). Online Appendix for \u201cData Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann. [4th ed.]."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Preisach, C., Burkhardt, H., Schmidt-Thieme, L., and Decker, R. (2008). KNIME: The Konstanz Information Miner in Data Analysis, Machine Learning and Applications SE-38, Springer.","DOI":"10.1007\/978-3-540-78246-9"},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1145\/1656274.1656280","article-title":"KNIME-the Konstanz information miner: Version 2.0 and beyond","volume":"11","author":"Berthold","year":"2009","journal-title":"ACM SIGKDD Explor. Newsl."},{"key":"ref_59","first-page":"2349","article-title":"Orange: Data mining toolbox in Python","volume":"14","author":"Curk","year":"2013","journal-title":"J. Mach. Learn. 
Res."},{"key":"ref_60","first-page":"156","article-title":"dplyr: A grammar of data manipulation","volume":"3","author":"Wickham","year":"2015","journal-title":"R Package Version 0.4"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v028.i05","article-title":"Building predictive models in R using the caret package","volume":"28","author":"Kuhn","year":"2008","journal-title":"J. Stat. Softw."},{"key":"ref_62","unstructured":"R Core Team (2022, November 06). R: A Language and Environment for Statistical Computing. Available online: https:\/\/www.R-project.org\/."},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1613\/jair.953","article-title":"SMOTE: Synthetic minority over-sampling technique","volume":"16","author":"Chawla","year":"2002","journal-title":"J. Artif. Intell. Res."},{"key":"ref_64","unstructured":"He, H., Bai, Y., Garcia, E.A., and Li, S. (2008, January 1\u20138). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China."},{"key":"ref_65","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1016\/j.ins.2017.05.008","article-title":"Clustering-based undersampling in class-imbalanced data","volume":"409-410","author":"Lin","year":"2017","journal-title":"Inf. Sci."},{"key":"ref_66","first-page":"448","article-title":"An Experiment with The Edited Nearest-Neighbor Rule","volume":"6","author":"Tomek","year":"1976","journal-title":"IEEE Trans. Syst. Man Cybern."},{"key":"ref_67","first-page":"24","article-title":"Using random forest to learn imbalanced data","volume":"110","author":"Chen","year":"2004","journal-title":"Univ. Calif. 
Berkeley"},{"key":"ref_68","first-page":"546","article-title":"An empirical evaluation of bagging and boosting","volume":"1997","author":"Maclin","year":"1997","journal-title":"AAAI\/IAAI"}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/8\/2\/41\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:35:45Z","timestamp":1760121345000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/8\/2\/41"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,17]]},"references-count":68,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2023,2]]}},"alternative-id":["data8020041"],"URL":"https:\/\/doi.org\/10.3390\/data8020041","relation":{},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,17]]}}}