{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,25]],"date-time":"2026-02-25T02:35:10Z","timestamp":1771986910822,"version":"3.50.1"},"reference-count":42,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2025,6,23]],"date-time":"2025-06-23T00:00:00Z","timestamp":1750636800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>How can we identify causal genetic mechanisms governing bacterial traits? Initial efforts entrusting machine learning models to handle the task of predicting phenotype from genotype yield high accuracy scores. However, attempts to extract meaningful interpretations from the predictive models are found to be corrupted by falsely identified \u2018causal\u2019 features. Relying solely on pattern recognition and correlations is unreliable, significantly so in bacterial genomics settings where high-dimensionality and spurious associations are the norm. Though it is not yet clear whether we can overcome this hurdle, significant efforts are being made towards discovering potential high-risk bacterial genetic variants. 
In view of this, we set up open problems surrounding phenotype prediction from bacterial whole-genome datasets and extending those approaches to learning causal effects, and discuss challenges that impact the reliability of a machine\u2019s decision-making when faced with datasets of this nature.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We identify major sources of non-injectivity in the formulation of the genotype-to-phenotype mapping function\u2014linkage-disequilibrium, limited sampling, information loss in representations, unmeasured confounders and observational noise\u2014and analyse their implications for machine learning applications. Using a collection of 4,140 Staphylococcus aureus isolates, we illustrate challenges surrounding the defined open problems.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>Raw sequencing data are available from the European Nucleotide Archive (ENA) under project accessions ERP001012, PRJEB3174, PRJEB2655, PRJEB2756, and PRJEB2944. 
Assemblies and annotations were generated with the Sanger bacterial pipeline (https:\/\/github.com\/sanger-pathogens\/vr-codebase) and unitigs extracted using DBGWAS (https:\/\/gitlab.com\/leoisl\/dbgwas).<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf206","type":"journal-article","created":{"date-parts":[[2025,6,28]],"date-time":"2025-06-28T19:02:31Z","timestamp":1751137351000},"source":"Crossref","is-referenced-by-count":7,"title":["Whole-genome phenotype prediction with machine learning: open problems in bacterial genomics"],"prefix":"10.1093","volume":"41","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-3498-8567","authenticated-orcid":false,"given":"Tamsin","family":"James","sequence":"first","affiliation":[{"name":"University of Birmingham, School of Computer Science , University Road West, Edgbaston , Birmingham, B15 2TT,","place":["United Kingdom"]}]},{"given":"Ben","family":"Williamson","sequence":"additional","affiliation":[{"name":"University of Birmingham, School of Computer Science , University Road West, Edgbaston , Birmingham, B15 2TT,","place":["United Kingdom"]}]},{"given":"Peter","family":"Tino","sequence":"additional","affiliation":[{"name":"University of Birmingham, School of Computer Science , University Road West, Edgbaston , Birmingham, B15 2TT,","place":["United Kingdom"]}]},{"given":"Nicole","family":"Wheeler","sequence":"additional","affiliation":[{"name":"Advanced Research and Invention Agency , 210 Euston Road , London, NW1 2DA,","place":["United Kingdom"]}]}],"member":"286","published-online":{"date-parts":[[2025,6,23]]},"reference":[{"key":"2025070907500541600_btaf206-B1","author":"Aliee","year":"2023"},{"key":"2025070907500541600_btaf206-B2","doi-asserted-by":"crossref","first-page":"621","DOI":"10.1016\/j.tim.2020.12.002","article-title":"Forest and trees: exploring bacterial virulence with genomewide association studies and machine 
learning","volume":"29","author":"Allen","year":"2021","journal-title":"Trends Microbiol"},{"key":"2025070907500541600_btaf206-B3","first-page":"001222","article-title":"Optimising machine learning prediction of minimum inhibitory concentrations in klebsiella pneumoniae","volume":"10","author":"Batisti Biffignandi","year":"2024","journal-title":"Microb Genom"},{"key":"2025070907500541600_btaf206-B4","doi-asserted-by":"crossref","first-page":"262","DOI":"10.1186\/s13059-016-1132-8","article-title":"Rapid scoring of genes in microbial pan-genome-wide association studies with scoary","volume":"17","author":"Brynildsrud","year":"2016","journal-title":"Genome Biol"},{"key":"2025070907500541600_btaf206-B5","doi-asserted-by":"crossref","first-page":"809560","DOI":"10.3389\/fcimb.2021.809560","article-title":"Lessons learnt from using the machine learning random forest algorithm to predict virulence in streptococcus pyogenes","volume":"11","author":"Buckley","year":"2021","journal-title":"Front Cell Infect Microbiol"},{"key":"2025070907500541600_btaf206-B6","doi-asserted-by":"crossref","first-page":"e1010842","DOI":"10.1371\/journal.pgen.1010842","article-title":"The bacterial genetic determinants of escherichia coli capacity to cause bloodstream infections in humans","volume":"19","author":"Burgaya","year":"2023","journal-title":"PLoS Genetics"},{"key":"2025070907500541600_btaf206-B7","first-page":"001116","article-title":"The advantage of intergenic regions as genomic features for machine-learning-based host attribution of Salmonella typhimurium from the USA","volume":"9","author":"Chalka","year":"2023","journal-title":"Microb Genom"},{"key":"2025070907500541600_btaf206-B8","doi-asserted-by":"crossref","first-page":"eaak9745","DOI":"10.1126\/scitranslmed.aak9745","article-title":"Longitudinal genomic surveillance of mrsa in the uk reveals transmission patterns in hospitals and the community","volume":"9","author":"Coll","year":"2017","journal-title":"Science 
Translational Medicine"},{"key":"2025070907500541600_btaf206-B9","doi-asserted-by":"crossref","first-page":"e1005958","DOI":"10.1371\/journal.pcbi.1005958","article-title":"A phylogenetic method to perform genome-wide association studies in microbes that accounts for population structure and recombination","volume":"14","author":"Collins","year":"2018","journal-title":"PLoS Comput Biol"},{"key":"2025070907500541600_btaf206-B10","doi-asserted-by":"crossref","first-page":"922","DOI":"10.3389\/fgene.2019.00922","article-title":"Machine learning predicts accurately mycobacterium tuberculosis drug resistance from whole genome sequencing data","volume":"10","author":"Deelder","year":"2019","journal-title":"Front Genet"},{"key":"2025070907500541600_btaf206-B11","doi-asserted-by":"crossref","first-page":"16041","DOI":"10.1038\/nmicrobiol.2016.41","article-title":"Identifying lineage effects when controlling for population structure improves power in bacterial association studies","volume":"1","author":"Earle","year":"2016","journal-title":"Nat Microbiol"},{"key":"2025070907500541600_btaf206-B12","doi-asserted-by":"crossref","first-page":"2128","DOI":"10.1038\/s41467-019-10110-6","article-title":"Gwas for quantitative resistance phenotypes in mycobacterium tuberculosis reveals resistance genes and regulatory regions","volume":"10","author":"Farhat","year":"2019","journal-title":"Nat Commun"},{"key":"2025070907500541600_btaf206-B13","volume-title":"Lectures on Cauchy\u2019s Problem in Linear Partial Differential Equations","author":"Hadamard","year":"2014"},{"key":"2025070907500541600_btaf206-B14","doi-asserted-by":"crossref","first-page":"9786","DOI":"10.1073\/pnas.0402521101","article-title":"Complete genomes of two clinical staphylococcus aureus strains: evidence for the rapid evolution of virulence and drug resistance","volume":"101","author":"Holden","year":"2004","journal-title":"Proc Nat Acad 
Sci"},{"key":"2025070907500541600_btaf206-B15","first-page":"2024","author":"Hu","year":"2024"},{"key":"2025070907500541600_btaf206-B16","doi-asserted-by":"crossref","first-page":"e1007758","DOI":"10.1371\/journal.pgen.1007758","article-title":"A fast and agnostic method for bacterial genome-wide association studies: bridging the gap between k-mers and genetic events","volume":"14","author":"Jaillard","year":"2018","journal-title":"PLoS Genet"},{"key":"2025070907500541600_btaf206-B17","doi-asserted-by":"crossref","first-page":"110817","DOI":"10.1016\/j.foodres.2021.110817","article-title":"Exploring the predictive capability of advanced machine learning in identifying severe disease phenotype in Salmonella enterica","volume":"151","author":"Karanth","year":"2022","journal-title":"Food Res Int"},{"key":"2025070907500541600_btaf206-B18","doi-asserted-by":"crossref","first-page":"2580","DOI":"10.1038\/s41467-020-16310-9","article-title":"A biochemically-interpretable machine learning classifier for microbial gwas","volume":"11","author":"Kavvas","year":"2020","journal-title":"Nat Commun"},{"key":"2025070907500541600_btaf206-B19","doi-asserted-by":"crossref","first-page":"2276","DOI":"10.1093\/bioinformatics\/bty949","article-title":"Application of machine learning techniques to tuberculosis drug resistance analysis","volume":"35","author":"Kouchaki","year":"2019","journal-title":"Bioinformatics"},{"key":"2025070907500541600_btaf206-B20","doi-asserted-by":"crossref","first-page":"4310","DOI":"10.1093\/bioinformatics\/bty539","article-title":"Pyseer: a comprehensive tool for microbial pangenome-wide association studies","volume":"34","author":"Lees","year":"2018","journal-title":"Bioinformatics"},{"key":"2025070907500541600_btaf206-B21","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1128\/mBio.01344-20","article-title":"Improved prediction of bacterial genotype-phenotype associations using interpretable pangenome-spanning 
regressions","volume":"11","author":"Lees","year":"2020","journal-title":"MBio"},{"key":"2025070907500541600_btaf206-B22","doi-asserted-by":"crossref","first-page":"5374","DOI":"10.1038\/s41467-020-19250-6","article-title":"Increased power from conditional bacterial genome-wide association identifies macrolide resistance mutations in neisseria gonorrhoeae","volume":"11","author":"Ma","year":"2020","journal-title":"Nat Commun"},{"key":"2025070907500541600_btaf206-B23","first-page":"23","author":"Ma","year":"2020"},{"key":"2025070907500541600_btaf206-B24","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1128\/msystems.00656-19","article-title":"Predicting phenotypic polymyxin resistance in klebsiella pneumoniae through machine learning analysis of genomic data","volume":"5","author":"Macesic","year":"2020","journal-title":"Msystems"},{"key":"2025070907500541600_btaf206-B25","doi-asserted-by":"crossref","first-page":"2866","DOI":"10.3390\/microorganisms11122866","article-title":"Genome-wide association studies (gwas) approaches for the detection of genetic variants associated with antibiotic resistance: a systematic review","volume":"11","author":"Mosquera-Rend\u00f3n","year":"2023","journal-title":"Microorganisms"},{"key":"2025070907500541600_btaf206-B26","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1038\/s41598-017-18972-w","article-title":"Developing an in silico minimum inhibitory concentration panel test for klebsiella pneumoniae","volume":"8","author":"Nguyen","year":"2018","journal-title":"Sci Rep"},{"key":"2025070907500541600_btaf206-B27","doi-asserted-by":"crossref","first-page":"e00913","DOI":"10.1128\/msystems.00913-20","article-title":"Genome-scale metabolic models and machine learning reveal genetic determinants of antibiotic resistance in escherichia coli and unravel the underlying metabolic adaptation 
mechanisms","volume":"6","author":"Pearcy","year":"2021","journal-title":"Msystems"},{"key":"2025070907500541600_btaf206-B28","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1128\/mBio.01527-20","article-title":"A genome-based model to predict the virulence of pseudomonas aeruginosa isolates","volume":"11","author":"Pincus","year":"2020","journal-title":"MBio"},{"key":"2025070907500541600_btaf206-B29","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1128\/msystems.00123-17","article-title":"Machine learning leveraging genomes from metagenomes identifies influential antibiotic resistance genes in the infant gut microbiome","volume":"3","author":"Rahman","year":"2018","journal-title":"MSystems"},{"key":"2025070907500541600_btaf206-B30","doi-asserted-by":"crossref","first-page":"263","DOI":"10.1101\/gr.196709.115","article-title":"Building a genomic framework for prospective mrsa surveillance in the United Kingdom and the republic of Ireland","volume":"26","author":"Reuter","year":"2016","journal-title":"Genome Res"},{"key":"2025070907500541600_btaf206-B31","doi-asserted-by":"crossref","first-page":"761869","DOI":"10.3389\/fmicb.2021.761869","article-title":"A genomic perspective across earth\u2019s microbiomes reveals that genome size in archaea and bacteria is linked to ecosystem type and trophic strategy","volume":"12","author":"Rodr\u00edguez-Gij\u00f3n","year":"2022","journal-title":"Front Microbiol"},{"key":"2025070907500541600_btaf206-B32","article-title":"Benchmarking bacterial genome-wide association study methods using simulated genomes and phenotypes","volume":"6","author":"Saber","year":"2020","journal-title":"Microb Genom"},{"key":"2025070907500541600_btaf206-B33","doi-asserted-by":"crossref","first-page":"3119","DOI":"10.3389\/fmicb.2019.03119","article-title":"Current affairs of microbial genome-wide association studies: approaches, bottlenecks and analytical 
pitfalls","volume":"10","author":"San","year":"2020","journal-title":"Front Microbiol"},{"key":"2025070907500541600_btaf206-B34","doi-asserted-by":"publisher","first-page":"831","DOI":"10.1109\/TNN.2010.2042729","article-title":"Regularization in matrix relevance learning","volume":"21","author":"Schneider","year":"2010","journal-title":"IEEE Trans Neural Netw"},{"key":"2025070907500541600_btaf206-B35","doi-asserted-by":"crossref","first-page":"535","DOI":"10.1186\/s12859-019-3054-4","article-title":"Antimicrobial resistance genetic factor identification from whole-genome sequence data using deep feature selection","volume":"20","author":"Shi","year":"2019","journal-title":"BMC Bioinform"},{"key":"2025070907500541600_btaf206-B36","doi-asserted-by":"crossref","first-page":"841289","DOI":"10.3389\/fmicb.2022.841289","article-title":"A practical approach for predicting antimicrobial phenotype resistance in staphylococcus aureus through machine learning analysis of genome data","volume":"13","author":"Wang","year":"2022","journal-title":"Front Microbiol"},{"key":"2025070907500541600_btaf206-B37","first-page":"44","author":"Wassan","year":"2018"},{"key":"2025070907500541600_btaf206-B38","first-page":"758144","author":"Wheeler","year":"2019"},{"key":"2025070907500541600_btaf206-B39","first-page":"2024","author":"Wiatrak","year":"2024"},{"key":"2025070907500541600_btaf206-B40","first-page":"1625","article-title":"A new view of automatic relevance determination","volume":"20","author":"Wipf","year":"2007","journal-title":"Adv Neural Inf Process Syst"},{"key":"2025070907500541600_btaf206-B41","doi-asserted-by":"crossref","first-page":"bbab299","DOI":"10.1093\/bib\/bbab299","article-title":"An end-to-end heterogeneous graph attention network for mycobacterium tuberculosis drug-resistance prediction","volume":"22","author":"Yang","year":"2021","journal-title":"Brief 
Bioinform"},{"key":"2025070907500541600_btaf206-B42","doi-asserted-by":"crossref","first-page":"404","DOI":"10.1186\/s12866-023-03147-7","article-title":"Machine learning and phylogenetic analysis allow for predicting antibiotic resistance in m. tuberculosis","volume":"23","author":"Yurtseven","year":"2023","journal-title":"BMC Microbiol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaf206\/63545785\/btaf206.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/7\/btaf206\/63545785\/btaf206.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/7\/btaf206\/63545785\/btaf206.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,9]],"date-time":"2025-07-09T11:50:20Z","timestamp":1752061820000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btaf206\/8171528"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2025,6,23]]},"references-count":42,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2025,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf206","relation":{},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,7]]},"published":{"date-parts":[[2025,6,23]]},"article-number":"btaf206"}}