{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T12:49:04Z","timestamp":1767962944612,"version":"3.49.0"},"reference-count":18,"publisher":"Springer Science and Business Media LLC","issue":"S5","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2013,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Elevated sequencing error rates are the most predominant obstacle in single-nucleotide polymorphism (SNP) detection, which is a major goal in the bulk of current studies using next-generation sequencing (NGS). Beyond routinely handled generic sources of errors, certain base calling errors relate to specific sequence patterns. Statistically principled ways to associate sequence patterns with base calling errors have not been previously described. Extant approaches either incur decisive losses in power, due to relating errors with individual genomic positions rather than motifs, or do not properly distinguish between motif-induced and sequence-unspecific sources of errors.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Here, for the first time, we describe a statistically rigorous framework for the discovery of motifs that induce sequencing errors. We apply our method to several datasets from Illumina GA IIx, HiSeq 2000, and MiSeq sequencers. We confirm previously known error-causing sequence contexts and report new more specific ones.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusions<\/jats:title><jats:p>Checking for error-inducing motifs should be included into SNP calling pipelines to avoid false positives. To facilitate filtering of sets of putative SNPs, we provide tracks of error-prone genomic positions (in BED format).<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability<\/jats:title><jats:p><jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"http:\/\/discovering-cse.googlecode.com\" ext-link-type=\"uri\">http:\/\/discovering-cse.googlecode.com<\/jats:ext-link><\/jats:p><\/jats:sec>","DOI":"10.1186\/1471-2105-14-s5-s1","type":"journal-article","created":{"date-parts":[[2013,4,10]],"date-time":"2013-04-10T18:15:14Z","timestamp":1365617714000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":47,"title":["Discovering motifs that induce sequencing errors"],"prefix":"10.1186","volume":"14","author":[{"given":"Manuel","family":"Allhoff","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alexander","family":"Sch\u00f6nhuth","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Marcel","family":"Martin","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ivan G","family":"Costa","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sven","family":"Rahmann","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tobias","family":"Marschall","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2013,4,10]]},"reference":[{"issue":"7319","key":"5769_CR1","doi-asserted-by":"publisher","first-page":"1061","DOI":"10.1038\/nature09534","volume":"467","author":"GP Consortium","year":"2010","unstructured":"Consortium GP: 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038\/nature09534. [http:\/\/dx.doi.org\/10.1038\/nature09534]","journal-title":"Nature"},{"issue":"6","key":"5769_CR2","doi-asserted-by":"publisher","first-page":"659","DOI":"10.1093\/jhered\/esp086","volume":"100","author":"Community of Scientists Genome","year":"2009","unstructured":"Genome 10K Community of Scientists: A proposal to obtain whole-genome sequence for 10 000 vertebrate species. Journal of Heredity. 2009, 100 (6): 659-674.","journal-title":"Journal of Heredity"},{"issue":"7261","key":"5769_CR3","doi-asserted-by":"publisher","first-page":"272","DOI":"10.1038\/nature08250","volume":"461","author":"SB Ng","year":"2009","unstructured":"Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461 (7261): 272-276. 10.1038\/nature08250. [http:\/\/dx.doi.org\/10.1038\/nature08250]","journal-title":"Nature"},{"issue":"11","key":"5769_CR4","doi-asserted-by":"publisher","first-page":"745","DOI":"10.1038\/nrg3031","volume":"12","author":"MJ Bamshad","year":"2011","unstructured":"Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J: Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011, 12 (11): 745-755. 10.1038\/nrg3031. [http:\/\/dx.doi.org\/10.1038\/nrg3031]","journal-title":"Nat Rev Genet"},{"key":"5769_CR5","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1038\/nrg2626","volume":"11","author":"ML Metzker","year":"2010","unstructured":"Metzker ML: Sequencing technologies - the next generation. Nature Reviews Genetics. 2010, 11: 31-46. 10.1038\/nrg2626.","journal-title":"Nature Reviews Genetics"},{"issue":"8","key":"5769_CR6","doi-asserted-by":"publisher","first-page":"R83","DOI":"10.1186\/gb-2009-10-8-r83","volume":"10","author":"M Kircher","year":"2009","unstructured":"Kircher M, Stenzel U, Kelso J: Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biology. 2009, 10 (8): R83-10.1186\/gb-2009-10-8-r83.","journal-title":"Genome Biology"},{"issue":"16","key":"5769_CR7","doi-asserted-by":"publisher","first-page":"e105","DOI":"10.1093\/nar\/gkn425","volume":"36","author":"JC Dohm","year":"2008","unstructured":"Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research. 2008, 36 (16): e105-10.1093\/nar\/gkn425.","journal-title":"Nucleic Acids Research"},{"issue":"13","key":"5769_CR8","doi-asserted-by":"publisher","first-page":"e90","DOI":"10.1093\/nar\/gkr344","volume":"39","author":"K Nakamura","year":"2011","unstructured":"Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, Altaf-Ul-Amin M, Ogasawara N, Kanaya S: Sequence-specific error profile of Illumina sequencers. Nucleic Acids Research. 2011, 39 (13): e90-10.1093\/nar\/gkr344.","journal-title":"Nucleic Acids Research"},{"key":"5769_CR9","doi-asserted-by":"publisher","first-page":"451","DOI":"10.1186\/1471-2105-12-451","volume":"12","author":"F Meacham","year":"2011","unstructured":"Meacham F, Boffelli D, Dhahbi J, Martin D, Singer M, Pachter L: Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics. 2011, 12: 451-10.1186\/1471-2105-12-451.","journal-title":"BMC Bioinformatics"},{"issue":"5","key":"5769_CR10","doi-asserted-by":"publisher","first-page":"491","DOI":"10.1038\/ng.806","volume":"43","author":"MA DePristo","year":"2011","unstructured":"DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 2011, 43 (5): 491-498. 10.1038\/ng.806.","journal-title":"Nature Genetics"},{"issue":"9","key":"5769_CR11","doi-asserted-by":"publisher","first-page":"1297","DOI":"10.1101\/gr.107524.110","volume":"20","author":"A McKenna","year":"2010","unstructured":"McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research. 2010, 20 (9): 1297-1303. 10.1101\/gr.107524.110.","journal-title":"Genome Research"},{"issue":"7","key":"5769_CR12","doi-asserted-by":"publisher","first-page":"476","DOI":"10.1093\/jnci\/94.7.476","volume":"94","author":"T Webb","year":"2002","unstructured":"Webb T: SNPs: can genetic variants control cancer susceptibility?. J Natl Cancer Inst. 2002, 94 (7): 476-478. 10.1093\/jnci\/94.7.476.","journal-title":"J Natl Cancer Inst"},{"key":"5769_CR13","volume-title":"A Guide to Chi-Squared Testing","author":"PE Greenwood","year":"1996","unstructured":"Greenwood PE, Nikulin MS: A Guide to Chi-Squared Testing. 1996, Wiley"},{"key":"5769_CR14","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","volume":"57","author":"Y Benjamini","year":"1995","unstructured":"Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B. 1995, 57: 289-300.","journal-title":"Journal of the Royal Statistical Society Series B"},{"issue":"5","key":"5769_CR15","doi-asserted-by":"publisher","first-page":"589","DOI":"10.1093\/bioinformatics\/btp698","volume":"26","author":"H Li","year":"2010","unstructured":"Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26 (5): 589-595. 10.1093\/bioinformatics\/btp698.","journal-title":"Bioinformatics"},{"issue":"16","key":"5769_CR16","doi-asserted-by":"publisher","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","volume":"25","author":"H Li","year":"2009","unstructured":"Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment\/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093\/bioinformatics\/btp352.","journal-title":"Bioinformatics"},{"key":"5769_CR17","doi-asserted-by":"publisher","first-page":"308","DOI":"10.1093\/nar\/29.1.308","volume":"29","author":"ST Sherry","year":"2001","unstructured":"Sherry ST, Ward M, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001, 29: 308-311. 10.1093\/nar\/29.1.308.","journal-title":"Nucleic Acids Research"},{"key":"5769_CR18","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1038\/nbt.1754","volume":"29","author":"JT Robinson","year":"2011","unstructured":"Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative genomics viewer. Nature Biotechnology. 2011, 29: 24-26. 10.1038\/nbt.1754.","journal-title":"Nature Biotechnology"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-14-S5-S1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T18:02:22Z","timestamp":1715191342000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-14-S5-S1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,4]]},"references-count":18,"journal-issue":{"issue":"S5","published-print":{"date-parts":[[2013,4]]}},"alternative-id":["5769"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-14-s5-s1","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2013,4]]},"assertion":[{"value":"10 April 2013","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"S1"}}