{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T10:40:49Z","timestamp":1771929649165,"version":"3.50.1"},"reference-count":37,"publisher":"Oxford University Press (OUP)","issue":"3","license":[{"start":{"date-parts":[[2020,8,17]],"date-time":"2020-08-17T00:00:00Z","timestamp":1597622400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"Israeli Ministry of Science and Technology","award":["3-14385"],"award-info":[{"award-number":["3-14385"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,4,20]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>High-resolution microbial strain typing is essential for various clinical purposes, including disease outbreak investigation, tracking of microbial transmission events and epidemiological surveillance of bacterial infections. The widely used approach for multilocus sequence typing (MLST) that is based on the core genome, cgMLST, has the advantage of a high level of typeability and maximal discriminatory power. Yet, the transition from a seven loci-based scheme to cgMLST involves several challenges, that include the need by some users to maintain backward compatibility, growing difficulties in the day-to-day communication within the microbiology community with respect to nomenclature and ontology, issues with typeability, especially if a more stringent approach to loci presence is used, and computational requirements concerning laboratory data management and sharing with end-users. 
Hence, methods for optimizing cgMLST schemes through careful reduction of the number of loci are expected to be beneficial for practical needs in different settings.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We present a new machine learning-based methodology, minMLST, for minimizing the number of genes in cgMLST schemes by identifying subsets of informative genes and analyzing the trade-off between gene reduction and typing performance. The results achieved with minMLST over eight bacterial species show that despite the reduction in the number of genes up to a factor of 10, the typing performance remains very high and significant with an Adjusted Rand Index that ranges between 0.4 and 0.93 in different species and a P-value &lt; 10<jats:sup>-3<\/jats:sup>. The identification of such optimized MLST schemes for bacterial strain typing is expected to improve the implementation of cgMLST by improving interlaboratory agreement and communication.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The Python package minMLST is available on PyPI at https:\/\/PyPi.org\/project\/minmlst\/ and is supported on Linux and Windows.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa724","type":"journal-article","created":{"date-parts":[[2020,8,10]],"date-time":"2020-08-10T11:10:06Z","timestamp":1597057806000},"page":"303-311","source":"Crossref","is-referenced-by-count":6,"title":["<i>minMLST<\/i>: machine learning for optimization of bacterial strain 
typing"],"prefix":"10.1093","volume":"37","author":[{"given":"Shani","family":"Cohen","sequence":"first","affiliation":[{"name":"Department of Software and Information Systems Engineering, Ben Gurion University of the Negev , Beer Sheva 8410501, Israel"}]},{"given":"Lior","family":"Rokach","sequence":"additional","affiliation":[{"name":"Department of Software and Information Systems Engineering, Ben Gurion University of the Negev , Beer Sheva 8410501, Israel"}]},{"given":"Yair","family":"Motro","sequence":"additional","affiliation":[{"name":"Department of Health Systems Management, Ben Gurion University of the Negev , Beer Sheva 8410501, Israel"}]},{"given":"Jacob","family":"Moran-Gilad","sequence":"additional","affiliation":[{"name":"Department of Health Systems Management, Ben Gurion University of the Negev , Beer Sheva 8410501, Israel"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4251-6158","authenticated-orcid":false,"given":"Isana","family":"Veksler-Lublinsky","sequence":"additional","affiliation":[{"name":"Department of Software and Information Systems Engineering, Ben Gurion University of the Negev , Beer Sheva 8410501, Israel"}]}],"member":"286","published-online":{"date-parts":[[2020,8,17]]},"reference":[{"key":"2023051604082355600_btaa724-B1","doi-asserted-by":"publisher","first-page":"e1007261","DOI":"10.1371\/journal.pgen.1007261","article-title":"A genomic overview of the population structure of Salmonella","volume":"14","author":"Alikhan","year":"2018","journal-title":"PLOS Genetics"},{"key":"2023051604082355600_btaa724-B2","doi-asserted-by":"crossref","first-page":"e0123298","DOI":"10.1371\/journal.pone.0123298","article-title":"Rapid high resolution genotyping of Francisella tularensis by whole genome sequence comparison of annotated genes (\u2018MLST+\u2019)","volume":"10","author":"Antwerpen","year":"2015","journal-title":"PLoS 
One"},{"key":"2023051604082355600_btaa724-B3","doi-asserted-by":"crossref","first-page":"983","DOI":"10.3390\/molecules21080983","article-title":"Bioactive molecule prediction using extreme gradient boosting","volume":"21","author":"Babajide Mustapha","year":"2016","journal-title":"Molecules"},{"key":"2023051604082355600_btaa724-B4","doi-asserted-by":"crossref","first-page":"3788","DOI":"10.1128\/JCM.01946-15","article-title":"Core genome multilocus sequence typing scheme for high-resolution typing of Enterococcus faecium","volume":"53","author":"de Been","year":"2015","journal-title":"J. Clin. Microbiol"},{"key":"2023051604082355600_btaa724-B5","first-page":"1","article-title":"Defining and evaluating a core genome multilocus sequence typing scheme for genome-wide typing of Clostridium difficile","volume-title":"J. Clin. Microbiol","author":"Bletz","year":"2018"},{"key":"2023051604082355600_btaa724-B6","doi-asserted-by":"crossref","first-page":"785","DOI":"10.1145\/2939672.2939785","volume-title":"Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD \u201916","author":"Chen","year":"2016"},{"key":"2023051604082355600_btaa724-B7","doi-asserted-by":"crossref","first-page":"2135","DOI":"10.1128\/JCM.00432-16","article-title":"Evaluation of an optimal epidemiologic typing scheme for Legionella pneumophila with whole genome sequence data using validation guidelines","volume":"54","author":"David","year":"2016","journal-title":"J. Clin. Microbiol"},{"key":"2023051604082355600_btaa724-B8","doi-asserted-by":"crossref","first-page":"2850","DOI":"10.1128\/JCM.01714-16","article-title":"Next-generation epidemiology: using real-time core genome multilocus sequence typing to support infection control policy","volume":"54","author":"Dekker","year":"2016","journal-title":"J. Clin. 
Microbiol"},{"key":"2023051604082355600_btaa724-B9","doi-asserted-by":"crossref","first-page":"102","DOI":"10.1016\/j.enconman.2018.02.087","article-title":"Comparison of support vector machine and extreme gradient boosting for predicting daily global solar radiation using temperature and precipitation in humid subtropical climates: a case study in China","volume":"164","author":"Fan","year":"2018","journal-title":"Energy Convers. Manag"},{"key":"2023051604082355600_btaa724-B10","doi-asserted-by":"crossref","first-page":"607","DOI":"10.1109\/LGRS.2018.2803259","article-title":"Very high resolution object-based land use-land cover urban classification using extreme gradient boosting","volume":"15","author":"Georganos","year":"2018","journal-title":"IEEE Geosci. Remote Sens. Lett"},{"key":"2023051604082355600_btaa724-B11","doi-asserted-by":"publisher","first-page":"e0179228","DOI":"10.1371\/journal.pone.0179228","article-title":"Development and evaluation of a core genome multilocus typing scheme for whole-genome sequence-based typing of Acinetobacter baumannii","volume":"12","author":"Higgins","year":"2017","journal-title":"PLOS ONE"},{"key":"2023051604082355600_btaa724-B12","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1007\/BF01908075","article-title":"Comparing partitions","volume":"2","author":"Hubert","year":"1985","journal-title":"J. 
Classif"},{"key":"2023051604082355600_btaa724-B13","first-page":"2465","article-title":"Numerical index of the discriminatory ability of typing systems: an application of Simpson\u2019s index of diversity","author":"Hunter","year":"1988"},{"key":"2023051604082355600_btaa724-B14","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12864-016-3284-z","article-title":"Genomic determination of minimum multi-locus sequence typing schemas to represent the genomic phylogeny of Mycoplasma hominis","volume":"17","author":"Jironkin","year":"2016","journal-title":"BMC Genomics"},{"key":"2023051604082355600_btaa724-B15","doi-asserted-by":"crossref","first-page":"1005","DOI":"10.1099\/mic.0.055459-0","article-title":"Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain","volume":"158","author":"Jolley","year":"2012","journal-title":"Microbiology"},{"key":"2023051604082355600_btaa724-B16","doi-asserted-by":"crossref","first-page":"2365","DOI":"10.1128\/JCM.00262-14","article-title":"Bacterial whole-genome sequencing revisited: portable, scalable, and standardized analysis for typing and detection of virulence and antibiotic resistance genes","volume":"52","author":"Leopold","year":"2014","journal-title":"J. Clin. Microbiol"},{"key":"2023051604082355600_btaa724-B17","first-page":"127","article-title":"Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation","volume-title":"Bioinformatics","author":"Letunic","year":"2007"},{"key":"2023051604082355600_btaa724-B18","doi-asserted-by":"crossref","first-page":"892","DOI":"10.1111\/j.1574-6976.2009.00182.x","article-title":"Bacterial strain typing in the genomic era","volume":"33","author":"Li","year":"2009","journal-title":"FEMS Microbiol. 
Rev"},{"key":"2023051604082355600_btaa724-B19","volume-title":"Curran Associates","author":"et"},{"key":"2023051604082355600_btaa724-B20","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1038\/s42256-019-0138-9","article-title":"From local explanations to global understanding with explainable AI for trees","volume":"2","author":"Lundberg","year":"2020","journal-title":"Nature Machine Intelligence"},{"key":"2023051604082355600_btaa724-B21","doi-asserted-by":"crossref","first-page":"728","DOI":"10.1038\/nrmicro3093","article-title":"MLST revisited: the gene-by-gene approach to bacterial genomics","volume":"11","author":"Maiden","year":"2013","journal-title":"Nat. Rev. Microbiol"},{"key":"2023051604082355600_btaa724-B22","doi-asserted-by":"crossref","first-page":"008","DOI":"10.1088\/1475-7516\/2016\/12\/008","article-title":"Photometric classification of type Ia supernovae in the SuperNova Legacy Survey with supervised learning","volume":"2016","author":"M\u00f6ller","year":"2016","journal-title":"J. Cosmol. Astropart. Phys"},{"key":"2023051604082355600_btaa724-B23","doi-asserted-by":"crossref","first-page":"1","DOI":"10.2807\/1560-7917.ES2015.20.28.21186","article-title":"Design and application of a core genome multilocus sequence typing scheme for investigation of Legionnaires\u2019 disease incidents","volume":"20","author":"Moran-Gilad","year":"2015","journal-title":"Eurosurveillance"},{"key":"2023051604082355600_btaa724-B24","doi-asserted-by":"crossref","first-page":"1","DOI":"10.3389\/fgene.2018.00751","article-title":"A novel protein subcellular localization method with CNN-XGBoost model for Alzheimer\u2019s disease","volume":"9","author":"Pang","year":"2019","journal-title":"Front. 
Genet"},{"key":"2023051604082355600_btaa724-B25","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.ijfoodmicro.2018.02.023","article-title":"Comparative analysis of core genome MLST and SNP typing within a European Salmonella serovar Enteritidis outbreak","volume":"274","author":"Pearce","year":"2018","journal-title":"Int. J. Food Microbiol"},{"key":"2023051604082355600_btaa724-B26","first-page":"787","article-title":"Identification of blaVIM-1 gene in ST307 and ST661 Klebsiella pneumoniae clones in Italy: old acquaintances for new combinations","volume-title":"Microb. Drug Resist.","author":"Piazza","year":"2019"},{"key":"2023051604082355600_btaa724-B27","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1016\/j.foodqual.2013.05.005","article-title":"Significance test of the adjusted Rand index. Application to the free sorting task","volume":"32","author":"Qannari","year":"2014","journal-title":"Food Qual. Prefer"},{"key":"2023051604082355600_btaa724-B28","first-page":"321","volume-title":"Data Mining and Knowledge Discovery Handbook"},{"key":"2023051604082355600_btaa724-B29","doi-asserted-by":"crossref","first-page":"2869","DOI":"10.1128\/JCM.01193-15","article-title":"Defining and evaluating a core genome multilocus sequence typing scheme for whole-genome sequence-based typing of listeria monocytogenes","volume":"53","author":"Ruppitsch","year":"2015","journal-title":"J. Clin. Microbiol"},{"key":"2023051604082355600_btaa724-B30","doi-asserted-by":"crossref","first-page":"350","DOI":"10.1016\/j.cmi.2017.12.016","article-title":"Whole genome sequencing options for bacterial strain typing and epidemiologic analysis based on single nucleotide polymorphism versus gene-by-gene-based approaches","volume":"24","author":"Sch\u00fcrch","year":"2018","journal-title":"Clin. Microbiol. 
Infect"},{"key":"2023051604082355600_btaa724-B31","doi-asserted-by":"crossref","first-page":"L22","DOI":"10.3847\/2041-8205\/832\/2\/L22","article-title":"A machine learns to predict the stability of tightly packed planetary systems","volume":"832","author":"Tamayo","year":"2016","journal-title":"Astrophys. J"},{"key":"2023051604082355600_btaa724-B32","first-page":"1","article-title":"IRESpy: an XGBoost model for prediction of internal ribosome entry sites","volume":"20","author":"Wang","year":"2019","journal-title":"BMC Bioinformatics"},{"key":"2023051604082355600_btaa724-B33","first-page":"1","article-title":"IS 26-mediated transfer of bla NDM-1 as the main route of resistance transmission during a polyclonal","volume":"10","author":"Weber","year":"2019","journal-title":"Multispecies Outbreak German Hosp"},{"key":"2023051604082355600_btaa724-B34","doi-asserted-by":"crossref","first-page":"2749","DOI":"10.1093\/bioinformatics\/bty1043","article-title":"Sequence analysis PredGly: predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization","volume":"35","author":"Yu","year":"2019","journal-title":"Bioinformatics"},{"key":"2023051604082355600_btaa724-B35","doi-asserted-by":"crossref","first-page":"1395","DOI":"10.1101\/gr.232397.117","article-title":"GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens","volume":"28","author":"Zhou","year":"2018","journal-title":"Genome Res"},{"key":"2023051604082355600_btaa724-B36","doi-asserted-by":"crossref","first-page":"e7","DOI":"10.1093\/nar\/gkw837","article-title":"MetaMLST: multi-locus strain-level bacterial typing from metagenomic samples","volume":"45","author":"Zolfo","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2023051604082355600_btaa724-B37","article-title":"BoostMe accurately predicts DNA methylation values in whole-genome bisulfite sequencing of multiple human tissues","volume":"19","author":"Zou,L","year":"2018","journal-title":"BMC 
Genomics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaa724\/34133874\/btaa724.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/3\/303\/50325830\/btaa724.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/3\/303\/50325830\/btaa724.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,16]],"date-time":"2023-05-16T04:09:05Z","timestamp":1684210145000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/3\/303\/5893546"}},"subtitle":[],"editor":[{"given":"Pier","family":"Luigi Martelli","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2020,8,17]]},"references-count":37,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,4,20]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa724","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,2,1]]},"published":{"date-parts":[[2020,8,17]]}}}