{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T22:12:23Z","timestamp":1767651143885,"version":"3.37.3"},"reference-count":59,"publisher":"Oxford University Press (OUP)","issue":"5","license":[{"start":{"date-parts":[[2019,10,28]],"date-time":"2019-10-28T00:00:00Z","timestamp":1572220800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"German Federal Ministry of Health","award":["IIA5-2512-FSB-725"],"award-info":[{"award-number":["IIA5-2512-FSB-725"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,9,25]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Bacterial proteins dubbed virulence factors (VFs) are a highly diverse group of sequences, whose only obvious commonality is the very property of being, more or less directly, involved in virulence. It is therefore tempting to speculate whether their prediction, based on direct sequence similarity (seqsim) to known VFs, could be enhanced or even replaced by using machine-learning methods. Specifically, when trained on a large and diverse set of VFs, such may be able to detect putative, non-trivial characteristics shared by otherwise unrelated VF families and therefore better predict novel VFs with insignificant similarity to each individual family. We therefore first reassess the performance of dimer-based Support Vector Machines, as used in the widely used MP3 method, in light of seqsim-only and seqsim\/dimer-hybrid classifiers. We then repeat the analysis with a novel, considerably more diverse data set, also addressing the important problem of negative data selection. Finally, we move on to the real-world use case of proteome-wide VF prediction, outlining different approaches to estimating specificity in this scenario. We find that direct seqsim is of unparalleled importance and therefore should always be exploited. Further, we observe strikingly low correlations between different feature and classifier types when ranking proteins by VF likeness. We therefore propose a \u2018best of each world\u2019 approach to prioritize proteins for experimental testing, focussing on the top predictions of each classifier. Further, classifiers for individual VF families should be developed.<\/jats:p>","DOI":"10.1093\/bib\/bbz076","type":"journal-article","created":{"date-parts":[[2019,7,4]],"date-time":"2019-07-04T19:09:15Z","timestamp":1562267355000},"page":"1596-1608","source":"Crossref","is-referenced-by-count":31,"title":["Predicting bacterial virulence factors \u2013 evaluation of machine learning and negative data strategies"],"prefix":"10.1093","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0932-8777","authenticated-orcid":false,"given":"Robert","family":"Rentzsch","sequence":"first","affiliation":[{"name":"Bioinformatics Unit (MF 1), Robert Koch Institute, Berlin"},{"name":"Institute for Innovation and Technology (IIT), Steinplatz 1, Berlin"}]},{"given":"Carlus","family":"Deneke","sequence":"additional","affiliation":[{"name":"Bioinformatics Unit (MF 1), Robert Koch Institute, Berlin"},{"name":"Molecular Microbiology and Genome Analysis Unit, German Federal Institute for Risk Assessment, Berlin"}]},{"given":"Andreas","family":"Nitsche","sequence":"additional","affiliation":[{"name":"Centre for Biological Threats and Special Pathogens: Highly Pathogenic Viruses (ZBS 1), Robert Koch Institute, Berlin"}]},{"given":"Bernhard Y","family":"Renard","sequence":"additional","affiliation":[{"name":"Bioinformatics Unit (MF 1), Robert Koch Institute, Berlin"}]}],"member":"286","published-online":{"date-parts":[[2019,10,28]]},"reference":[{"key":"2021031107441920200_ref1","first-page":"114","article-title":"Medical subject headings","volume":"51","author":"Rogers","year":"1963","journal-title":"Bull Med Libr Assoc"},{"key":"2021031107441920200_ref2","doi-asserted-by":"crossref","DOI":"10.1038\/s41598-017-13957-1","article-title":"Functional classification of protein toxins as a basis for bioinformatic screening","volume":"7","author":"Negi","year":"2017","journal-title":"Sci Rep"},{"key":"2021031107441920200_ref3","doi-asserted-by":"crossref","first-page":"455","DOI":"10.2217\/fmb.15.149","article-title":"Identification of virulence factors and antibiotic resistance markers using bacterial genomics","volume":"11","author":"Bakour","year":"2016","journal-title":"Future Microbiol"},{"issue":"Suppl 1","key":"2021031107441920200_ref4","doi-asserted-by":"crossref","first-page":"S2","DOI":"10.2166\/wh.2009.036","article-title":"Virulence factors and their mechanisms of action: the view from a damage-response framework","volume":"7","author":"Casadevall","year":"2009","journal-title":"J Water Health"},{"key":"2021031107441920200_ref5","doi-asserted-by":"crossref","first-page":"234","DOI":"10.1186\/cc7091","article-title":"Bench-to-bedside review: bacterial virulence and subversion of host defences","volume":"12","author":"Webb","year":"2008","journal-title":"Crit Care"},{"key":"2021031107441920200_ref6","doi-asserted-by":"crossref","first-page":"770","DOI":"10.1111\/j.1469-0691.2005.01210.x","article-title":"Virulence Searcher: a tool for searching raw genome sequences from bacterial genomes for putative virulence factors","volume":"11","author":"Underwood","year":"2005","journal-title":"Clin Microbiol Infect"},{"key":"2021031107441920200_ref7","doi-asserted-by":"crossref","first-page":"799","DOI":"10.1093\/bioinformatics\/15.10.799","article-title":"FingerPRINTScan: intelligent searching of the PRINTS motif database","volume":"15","author":"Scordis","year":"1999","journal-title":"Bioinformatics"},{"key":"2021031107441920200_ref8","first-page":"357","article-title":"Bacterial bioinformatics: pathogenesis and the genome","volume":"4","author":"Paine","year":"2002","journal-title":"J Mol Microbiol Biotechnol"},{"key":"2021031107441920200_ref9","doi-asserted-by":"crossref","first-page":"62","DOI":"10.1186\/1471-2105-9-62","article-title":"VirulentPred: a SVM based prediction method for virulent proteins in bacterial pathogens","volume":"9","author":"Garg","year":"2008","journal-title":"BMC Bioinformatics"},{"key":"2021031107441920200_ref10","doi-asserted-by":"crossref","first-page":"D325","DOI":"10.1093\/nar\/gki008","article-title":"VFDB: a reference database for bacterial virulence factors","volume":"33","author":"Chen","year":"2005","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref11","doi-asserted-by":"crossref","first-page":"467","DOI":"10.1109\/TCBB.2011.117","article-title":"Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou\u2019s pseudo amino acid composition and on evolutionary information","volume":"9","author":"Nanni","year":"2011","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"key":"2021031107441920200_ref12","doi-asserted-by":"crossref","first-page":"409","DOI":"10.1007\/s00726-008-0076-z","article-title":"Using ensemble of classifiers for predicting HIV protease cleavage sites in proteins","volume":"36","author":"Nanni","year":"2009","journal-title":"Amino Acids"},{"key":"2021031107441920200_ref13","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0093907","article-title":"MP3: a software tool for the prediction of pathogenic proteins in genomic and metagenomic data","volume":"9","author":"Gupta","year":"2014","journal-title":"PLoS One"},{"issue":"D1","key":"2021031107441920200_ref14","doi-asserted-by":"crossref","first-page":"D427","DOI":"10.1093\/nar\/gky995","article-title":"The Pfam protein families database in 2019","volume":"47","author":"El-Gebali","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref15","doi-asserted-by":"crossref","first-page":"D391","DOI":"10.1093\/nar\/gkl791","article-title":"MvirDB\u2014a microbial database of protein toxins, virulence factors and antibiotic resistance genes for bio-defence applications","volume":"35","author":"Zhou","year":"2007","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref16","doi-asserted-by":"crossref","first-page":"D574","DOI":"10.1093\/nar\/gkt1131","article-title":"DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements","volume":"42","author":"Luo","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref17","first-page":"242","article-title":"Virulent-GO: prediction of virulent proteins in bacterial pathogens utilizing Gene Ontology terms","volume":"29","author":"Tsai","year":"2009","journal-title":"International Journal of Biological, Biomolecular, Agricultural, Food and Biotechnological Engineering"},{"key":"2021031107441920200_ref18","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene ontology: tool for the unification of biology. The Gene Ontology Consortium","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat Genet"},{"key":"2021031107441920200_ref19","doi-asserted-by":"crossref","first-page":"D331","DOI":"10.1093\/nar\/gkw1108","article-title":"Expansion of the Gene Ontology knowledgebase and resources","volume":"45","author":"The Gene Ontology","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref20","doi-asserted-by":"crossref","first-page":"D262","DOI":"10.1093\/nar\/gkh021","article-title":"The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology","volume":"32","author":"Camon","year":"2004","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref21","doi-asserted-by":"crossref","first-page":"D694","DOI":"10.1093\/nar\/gkv1239","article-title":"VFDB 2016: hierarchical and refined dataset for big data analysis\u201410 years on","volume":"44","author":"Chen","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref22","doi-asserted-by":"crossref","first-page":"252","DOI":"10.1093\/bioinformatics\/btu631","article-title":"Curation, integration and visualization of bacterial virulence factors in PATRIC","volume":"31","author":"Mao","year":"2015","journal-title":"Bioinformatics"},{"key":"2021031107441920200_ref23","doi-asserted-by":"crossref","first-page":"D535","DOI":"10.1093\/nar\/gkw1017","article-title":"Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center","volume":"45","author":"Wattam","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref24","article-title":"A comparison of computational methods for identifying virulence factors","volume":"7","author":"Zheng","year":"2012","journal-title":"PLoS One"},{"key":"2021031107441920200_ref25","doi-asserted-by":"crossref","first-page":"1447","DOI":"10.1039\/c3mb70024k","article-title":"Computationally identifying virulence factors based on KEGG pathways","volume":"9","author":"Cui","year":"2013","journal-title":"Mol Biosyst"},{"key":"2021031107441920200_ref26","doi-asserted-by":"crossref","first-page":"391","DOI":"10.1078\/1438-4221-00283","article-title":"Housekeeping enzymes as virulence factors for pathogens","volume":"293","author":"Pancholi","year":"2003","journal-title":"Int J Med Microbiol"},{"key":"2021031107441920200_ref27","doi-asserted-by":"crossref","first-page":"12561","DOI":"10.1038\/srep12561","article-title":"Comparative analysis of essential genes in prokaryotic genomic islands","volume":"5","author":"Zhang","year":"2015","journal-title":"Sci Rep"},{"key":"2021031107441920200_ref28"},{"issue":"D1","key":"2021031107441920200_ref29","doi-asserted-by":"crossref","first-page":"D693","DOI":"10.1093\/nar\/gky999","article-title":"Victors: a web-based knowledge base of virulence factors in human and animal pathogens","volume":"47","author":"Sayers","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref30","doi-asserted-by":"crossref","first-page":"2699","DOI":"10.1093\/nar\/gky092","article-title":"UniProt: the universal protein knowledgebase","volume":"46","author":"UniProt Consortium T","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref31"},{"key":"2021031107441920200_ref32","doi-asserted-by":"crossref","first-page":"D191","DOI":"10.1093\/nar\/gkt1140","article-title":"Activities at the Universal Protein Resource (UniProt)","volume":"42","author":"UniProt C","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref33"},{"key":"2021031107441920200_ref34","doi-asserted-by":"crossref","first-page":"3236","DOI":"10.1093\/bioinformatics\/bth191","article-title":"UniProt archive","volume":"20","author":"Leinonen","year":"2004","journal-title":"Bioinformatics"},{"key":"2021031107441920200_ref35","doi-asserted-by":"crossref","first-page":"S14","DOI":"10.1186\/1471-2105-13-S4-S14","article-title":"Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms","volume":"13","author":"Falda","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2021031107441920200_ref36","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1186\/1471-2105-10-421","article-title":"BLAST+: architecture and applications","volume":"10","author":"Camacho","year":"2009","journal-title":"BMC Bioinformatics"},{"key":"2021031107441920200_ref37","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.1093\/bioinformatics\/btl158","article-title":"Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences","volume":"22","author":"Li","year":"2006","journal-title":"Bioinformatics"},{"key":"2021031107441920200_ref38","first-page":"169","volume-title":"Advances in Kernel Methods","author":"Joachims","year":"1999"},{"key":"2021031107441920200_ref39","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1109\/IRI.2008.4583028","volume-title":"Proceedings of the 2008 IEEE International Conference on Information Reuse and Integration","author":"Folleco","year":"2008"},{"key":"2021031107441920200_ref40","first-page":"3853","volume-title":"2008 IEEE Congress on Evolutionary Computation","author":"Folleco","year":"2008"},{"key":"2021031107441920200_ref41","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v077.i01","article-title":"ranger: a fast implementation of random forests for high dimensional data in C plus plus and R","volume":"77","author":"Wright","year":"2017","journal-title":"J Stat Softw"},{"key":"2021031107441920200_ref42","doi-asserted-by":"crossref","first-page":"2595","DOI":"10.1093\/bioinformatics\/btv153","article-title":"PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R","volume":"31","author":"Grau","year":"2015","journal-title":"Bioinformatics"},{"key":"2021031107441920200_ref43"},{"key":"2021031107441920200_ref44","doi-asserted-by":"crossref","first-page":"631","DOI":"10.1126\/science.278.5338.631","article-title":"A genomic perspective on protein families","volume":"278","author":"Tatusov","year":"1997","journal-title":"Science"},{"key":"2021031107441920200_ref45","doi-asserted-by":"crossref","first-page":"184","DOI":"10.1186\/s13059-016-1037-6","article-title":"An expanded evaluation of protein function prediction methods shows an improvement in accuracy","volume":"17","author":"Jiang","year":"2016","journal-title":"Genome Biol"},{"key":"2021031107441920200_ref46","doi-asserted-by":"crossref","first-page":"D202","DOI":"10.1093\/nar\/gkm998","article-title":"AAindex: amino acid index database, progress report 2008","volume":"36","author":"Kawashima","year":"2008","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref47","doi-asserted-by":"crossref","first-page":"2129","DOI":"10.1101\/gr.772403","article-title":"PANTHER: a library of protein families and subfamilies indexed by function","volume":"13","author":"Thomas","year":"2003","journal-title":"Genome Res"},{"key":"2021031107441920200_ref48","doi-asserted-by":"crossref","first-page":"aad6253","DOI":"10.1126\/science.aad6253","article-title":"Design and synthesis of a minimal bacterial genome","volume":"351","author":"Hutchison","year":"2016","journal-title":"Science"},{"key":"2021031107441920200_ref49","doi-asserted-by":"crossref","first-page":"4191","DOI":"10.1038\/srep04191","article-title":"Annotation enrichment analysis: an alternative method for evaluating the functional properties of gene sets","volume":"4","author":"Glass","year":"2014","journal-title":"Sci Rep"},{"key":"2021031107441920200_ref50","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0045012","article-title":"Towards the improved discovery and design of functional peptides: common features of diverse classes permit generalized prediction of bioactivity","volume":"7","author":"Mooney","year":"2012","journal-title":"PLoS One"},{"issue":"D1","key":"2021031107441920200_ref51","doi-asserted-by":"crossref","first-page":"D687","DOI":"10.1093\/nar\/gky1080","article-title":"VFDB 2019: a comparative pathogenomic platform with an interactive web interface","volume":"47","author":"Liu","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref52","doi-asserted-by":"crossref","first-page":"e1004557","DOI":"10.1371\/journal.pcbi.1004557","article-title":"High-specificity targeted functional profiling in microbial communities with ShortBRED","volume":"11","author":"Kaminski","year":"2015","journal-title":"PLoS Comput Biol"},{"key":"2021031107441920200_ref53","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1016\/j.tim.2009.04.002","article-title":"Controlled vocabularies for microbial virulence factors","volume":"17","author":"Korves","year":"2009","journal-title":"Trends Microbiol"},{"key":"2021031107441920200_ref54","doi-asserted-by":"crossref","first-page":"D412","DOI":"10.1093\/nar\/gkn760","article-title":"STRING 8\u2014a global view on proteins and their functional interactions in 630 organisms","volume":"37","author":"Jensen","year":"2009","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref55","doi-asserted-by":"crossref","first-page":"D353","DOI":"10.1093\/nar\/gkw1092","article-title":"KEGG: new perspectives on genomes, pathways, diseases and drugs","volume":"45","author":"Kanehisa","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2021031107441920200_ref56","doi-asserted-by":"crossref","first-page":"2115","DOI":"10.1093\/molbev\/msx148","article-title":"Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper","volume":"34","author":"Huerta-Cepas","year":"2017","journal-title":"Mol Biol Evol"},{"key":"2021031107441920200_ref57"},{"key":"2021031107441920200_ref58","doi-asserted-by":"crossref","first-page":"170","DOI":"10.1186\/1471-2105-8-170","article-title":"Estimating the annotation error rate of curated GO database sequence annotations","volume":"8","author":"Jones","year":"2007","journal-title":"BMC Bioinformatics"},{"key":"2021031107441920200_ref59","doi-asserted-by":"crossref","first-page":"e1000605","DOI":"10.1371\/journal.pcbi.1000605","article-title":"Annotation error in public databases: misannotation of molecular function in enzyme superfamilies","volume":"5","author":"Schnoes","year":"2009","journal-title":"PLoS Comput Biol"}],"container-title":["Briefings in Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/5\/1596\/36529407\/bbz076.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/5\/1596\/36529407\/bbz076.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,3,11]],"date-time":"2021-03-11T09:56:45Z","timestamp":1615456605000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/21\/5\/1596\/5574719"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,28]]},"references-count":59,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2019,10,28]]},"published-print":{"date-parts":[[2020,9,25]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbz076","relation":{},"ISSN":["1477-4054"],"issn-type":[{"type":"electronic","value":"1477-4054"}],"subject":[],"published-other":{"date-parts":[[2020,9]]},"published":{"date-parts":[[2019,10,28]]}}}