{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T11:08:19Z","timestamp":1772536099441,"version":"3.50.1"},"reference-count":38,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2022,1,10]],"date-time":"2022-01-10T00:00:00Z","timestamp":1641772800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"Office of Science Management and Operations"},{"DOI":"10.13039\/100000060","name":"National Institute of Allergy and Infectious Diseases","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000060","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000060","name":"NIAID","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000060","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["AI55765"],"award-info":[{"award-number":["AI55765"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["ES026835"],"award-info":[{"award-number":["ES026835"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,3,28]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Short-read whole-genome sequencing (WGS) is a vital tool for clinical applications and basic research. 
Genetic divergence from the reference genome, repetitive sequences and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. To benchmark short-read variant calling, we used 36 diverse clinical Mycobacterium tuberculosis (Mtb) isolates dually sequenced with Illumina short-reads and PacBio long-reads. We systematically studied the short-read variant calling accuracy and the influence of sequence uniqueness, reference bias and GC content.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Reference-based Illumina variant calling demonstrated a maximum recall of 89.0% and minimum precision of 98.5% across parameters evaluated. The approach that maximized variant recall while still maintaining high precision (&lt;99%) was tuning the mapping quality filtering threshold, i.e. confidence of the read mapping (recall\u2009=\u200985.8%, precision\u2009=\u200999.1%, MQ\u2009\u2265\u200940). Additional masking of repetitive sequence content is an alternative conservative approach to variant calling that increases precision at a cost to recall (recall\u2009=\u200970.2%, precision\u2009=\u200999.6%, MQ\u2009\u2265\u200940). Of the genomic positions typically excluded for Mtb, 68% are accurately called using Illumina WGS including 52\/168 PE\/PPE genes (34.5%). From these results, we present a refined list of low confidence regions across the Mtb genome, which we found to frequently overlap with regions with structural variation, low sequence uniqueness and low sequencing coverage.
Our benchmarking results have broad implications for the use of WGS in the study of Mtb biology, inference of transmission in public health surveillance systems and more generally for WGS applications in other organisms.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>All relevant code is available at https:\/\/github.com\/farhat-lab\/mtb-illumina-wgs-evaluation.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac023","type":"journal-article","created":{"date-parts":[[2022,1,8]],"date-time":"2022-01-08T04:55:42Z","timestamp":1641617742000},"page":"1781-1787","source":"Crossref","is-referenced-by-count":59,"title":["Benchmarking the empirical accuracy of short-read sequencing across the <i>M. tuberculosis<\/i> genome"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9108-3328","authenticated-orcid":false,"given":"Maximillian","family":"Marin","sequence":"first","affiliation":[{"name":"Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA"},{"name":"Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA"}]},{"suffix":"Jr","given":"Roger","family":"Vargas","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA"},{"name":"Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA"}]},{"given":"Michael","family":"Harris","sequence":"additional","affiliation":[{"name":"Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20894, USA"}]},{"given":"Brendan","family":"Jeffrey","sequence":"additional","affiliation":[{"name":"Office of Cyber Infrastructure and Computational Biology, National
Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20894, USA"}]},{"given":"L Elaine","family":"Epperson","sequence":"additional","affiliation":[{"name":"Center for Genes, Environment, and Health, National Jewish Health, Denver, CO 80206, USA"}]},{"given":"David","family":"Durbin","sequence":"additional","affiliation":[{"name":"Mycobacteriology Reference Laboratory, Advanced Diagnostic Laboratories, National Jewish Health, Denver, CO 80206, USA"}]},{"given":"Michael","family":"Strong","sequence":"additional","affiliation":[{"name":"Center for Genes, Environment, and Health, National Jewish Health, Denver, CO 80206, USA"}]},{"given":"Max","family":"Salfinger","sequence":"additional","affiliation":[{"name":"College of Public Health and Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA"}]},{"given":"Zamin","family":"Iqbal","sequence":"additional","affiliation":[{"name":"EMBL-EBI, Wellcome Genome Campus, Hinxton CB10 1SD, UK"}]},{"given":"Irada","family":"Akhundova","sequence":"additional","affiliation":[{"name":"Scientific Research Institute of Lung Diseases, Ministry of Health, Baku AZ1014, Azerbaijan"}]},{"given":"Sergo","family":"Vashakidze","sequence":"additional","affiliation":[{"name":"Department of Medicine, The University of Georgia, Tbilisi 0171, Georgia"},{"name":"National Center for Tuberculosis and Lung Diseases, Ministry of Health, Tbilisi 0171, Georgia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5059-8002","authenticated-orcid":false,"given":"Valeriu","family":"Crudu","sequence":"additional","affiliation":[{"name":"Phthisiopneumology Institute, Ministry of Health, Chisinau 2025, Republic of Moldova"}]},{"given":"Alex","family":"Rosenthal","sequence":"additional","affiliation":[{"name":"Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20894, USA"}]},{"given":"Maha
Reda","family":"Farhat","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA"},{"name":"Pulmonary and Critical Care Medicine, Massachusetts General Hospital, Boston, MA 02114, USA"}]}],"member":"286","published-online":{"date-parts":[[2022,1,10]]},"reference":[{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"R18","DOI":"10.1186\/gb-2011-12-2-r18","article-title":"Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries","volume":"12","author":"Aird","year":"2011","journal-title":"Genome Biol"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1111\/mmi.14409","article-title":"New insights into the mycobacterial PE and PPE proteins provide a framework for future research","volume":"113","author":"Ates","year":"2019","journal-title":"Mol. Microbiol"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"2057","DOI":"10.1038\/s41598-020-59026-y","article-title":"Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage","volume":"10","author":"Barbitoff","year":"2020","journal-title":"Sci. 
Rep"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"e72","DOI":"10.1093\/nar\/gks001","article-title":"Summarizing and correcting the GC content bias in high-throughput sequencing","volume":"40","author":"Benjamini","year":"2012","journal-title":"Nucleic Acids Res"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"e0214088","DOI":"10.1371\/journal.pone.0214088","article-title":"Reference set of Mycobacterium tuberculosis clinical strains: a tool for research and product development","volume":"14","author":"Borrell","year":"2019","journal-title":"PLoS ONE"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"3994","DOI":"10.1038\/s41467-019-11948-6","article-title":"Genome-wide mutational biases fuel transcriptional diversity in the Mycobacterium tuberculosis complex","volume":"10","author":"Chiner-Oms","year":"2019","journal-title":"Nat. Commun"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"538","DOI":"10.1016\/j.chom.2015.10.008","article-title":"M. tuberculosis T cell epitope analysis reveals paucity of antigenic variation and identifies rare variable TB antigens","volume":"18","author":"Coscolla","year":"2015","journal-title":"Cell Host Microbe"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"431","DOI":"10.1016\/j.smim.2014.09.012","article-title":"Consequences of genomic diversity in Mycobacterium tuberculosis","volume":"26","author":"Coscolla","year":"2014","journal-title":"Semin. 
Immunol"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"e11147","DOI":"10.1371\/journal.pone.0011147","article-title":"progressiveMauve: Multiple genome alignment with gene gain, loss and rearrangement","volume":"5","author":"Darling","year":"2010","journal-title":"PLoS ONE"},{"key":"2023030915370258800_","first-page":"e000294","article-title":"Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes","volume":"5","author":"De Maio","year":"2019","journal-title":"Microb. Genom"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"875","DOI":"10.1038\/nbt.4227","article-title":"Variation graph toolkit improves read mapping by representing genetic variation in the reference","volume":"36","author":"Garrison","year":"2018","journal-title":"Nat. Biotechnol"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1186\/s12915-020-0748-z","article-title":"Contaminant DNA in bacterial sequencing experiments is a major source of false genetic variability","volume":"18","author":"Goig","year":"2020","journal-title":"BMC Biol"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"1032","DOI":"10.1038\/s41564-018-0218-3","article-title":"Clinically prevalent mutations in Mycobacterium tuberculosis alter propionate metabolism and mediate multidrug tolerance","volume":"3","author":"Hicks","year":"2018","journal-title":"Nat. Microbiol"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"849","DOI":"10.1038\/s41588-018-0117-9","article-title":"Frequent transmission of the Mycobacterium tuberculosis Beijing lineage and positive selection for the EsxW Beijing variant in Vietnam","volume":"50","author":"Holt","year":"2018","journal-title":"Nat. 
Genet"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"1900130","DOI":"10.2807\/1560-7917.ES.2019.24.50.1900130","article-title":"Towards standardisation: Comparison of five whole genome sequencing (WGS) analysis pipelines for detection of epidemiologically linked tuberculosis cases","volume":"24","author":"Jajou","year":"2019","journal-title":"Euro Surveill"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"568","DOI":"10.1101\/gr.129684.111","article-title":"VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing","volume":"22","author":"Koboldt","year":"2012","journal-title":"Genome Res"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"2520","DOI":"10.1093\/bioinformatics\/bts480","article-title":"Snakemake\u2014a scalable bioinformatics workflow engine","volume":"28","author":"K\u00f6ster","year":"2012","journal-title":"Bioinformatics"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/nmeth.1923","article-title":"Fast gapped-read alignment with Bowtie 2","volume":"9","author":"Langmead","year":"2012","journal-title":"Nat. 
Methods"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"2987","DOI":"10.1093\/bioinformatics\/btr509","article-title":"A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data","volume":"27","author":"Li","year":"2011","journal-title":"Bioinformatics"},{"key":"2023030915370258800_","author":"Li","year":"2013"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"1851","DOI":"10.1101\/gr.078212.108","article-title":"Mapping short DNA sequencing reads and calling variants using mapping quality scores","volume":"18","author":"Li","year":"2008","journal-title":"Genome Res"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"3094","DOI":"10.1093\/bioinformatics\/bty191","article-title":"Minimap2: Pairwise alignment for nucleotide sequences","volume":"34","author":"Li","year":"2018","journal-title":"Bioinformatics"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"2843","DOI":"10.1093\/bioinformatics\/btu356","article-title":"Toward better understanding of artifacts in variant calling from high-coverage samples","volume":"30","author":"Li","year":"2014","journal-title":"Bioinformatics"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"533","DOI":"10.1038\/s41579-019-0214-5","article-title":"Whole genome sequencing of Mycobacterium tuberculosis: Current standards and open issues","volume":"17","author":"Meehan","year":"2019","journal-title":"Nat. Rev. Microbiol"},{"key":"2023030915370258800_","first-page":"mgen000465","article-title":"Exact mapping of Illumina blind spots in the Mycobacterium tuberculosis genome reveals platform-wide and workflow-specific biases","volume":"7","author":"Modlin","year":"2021","journal-title":"Microb. 
Genom"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"e90","DOI":"10.1093\/nar\/gkr344","article-title":"Sequence-specific error profile of Illumina sequencers","volume":"39","author":"Nakamura","year":"2011","journal-title":"Nucleic Acids Res"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"2917","DOI":"10.1038\/s41467-020-16626-6","article-title":"A sister lineage of the Mycobacterium tuberculosis complex discovered in the African Great Lakes region","volume":"11","author":"Ngabonziza","year":"2020","journal-title":"Nat. Commun"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"665","DOI":"10.1101\/gr.214155.116","article-title":"Genome graphs and the evolution of genome inference","volume":"27","author":"Paten","year":"2017","journal-title":"Genome Res"},{"key":"2023030915370258800_","author":"Poplin","year":"2018"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"278","DOI":"10.1016\/j.gpb.2015.08.002","article-title":"PacBio sequencing and its applications","volume":"13","author":"Rhoads","year":"2015","journal-title":"Genom. Proteom. Bioinform"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"R51","DOI":"10.1186\/gb-2013-14-5-r51","article-title":"Characterizing and measuring bias in sequence data","volume":"14","author":"Ross","year":"2013","journal-title":"Genome Biol"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"8953","DOI":"10.1093\/nar\/gky726","article-title":"Pushing the limits of de novo genome assembly for complex prokaryotic genomes harboring very long, near identical repeats","volume":"46","author":"Schmid","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"S238","DOI":"10.1016\/j.ijmyco.2016.09.071","article-title":"Deletion of region of difference 181 in Mycobacterium tuberculosis Beijing strains","volume":"5(Suppl. 
1","author":"Sharifipour","year":"2016","journal-title":"Int. J. Mycobacteriol"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"lqab019","DOI":"10.1093\/nargab\/lqab019","article-title":"Sequencing error profiles of Illumina sequencing instruments","volume":"3","author":"Stoler","year":"2021","journal-title":"NAR Genom. Bioinform"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"e27584","DOI":"10.1371\/journal.pone.0027584","article-title":"Modern and ancestral genotypes of Mycobacterium tuberculosis from Andhra Pradesh, India","volume":"6","author":"Thomas","year":"2011","journal-title":"PLoS ONE"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"e112963","DOI":"10.1371\/journal.pone.0112963","article-title":"Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement","volume":"9","author":"Walker","year":"2014","journal-title":"PLoS ONE"},{"key":"2023030915370258800_","first-page":"mgen000418","article-title":"Genomic variant-identification methods may alter Mycobacterium tuberculosis transmission inferences","volume":"6","author":"Walter","year":"2020","journal-title":"Microb Genom"},{"key":"2023030915370258800_","doi-asserted-by":"crossref","first-page":"1155","DOI":"10.1038\/s41587-019-0217-9","article-title":"Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome","volume":"37","author":"Wenger","year":"2019","journal-title":"Nat. 
Biotechnol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac023\/42257432\/btac023.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/7\/1781\/49480716\/btac023.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/7\/1781\/49480716\/btac023.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,15]],"date-time":"2023-11-15T14:35:33Z","timestamp":1700058933000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/7\/1781\/6502279"}},"subtitle":[],"editor":[{"given":"Can","family":"Alkan","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,1,10]]},"references-count":38,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2022,3,28]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac023","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,4,1]]},"published":{"date-parts":[[2022,1,10]]}}}