{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T11:44:14Z","timestamp":1753875854646,"version":"3.41.2"},"reference-count":34,"publisher":"Oxford University Press (OUP)","issue":"12","license":[{"start":{"date-parts":[[2023,11,29]],"date-time":"2023-11-29T00:00:00Z","timestamp":1701216000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004410","name":"Scientific and Technological Research Council of Turkey","doi-asserted-by":"publisher","award":["TEYDEB-3190261"],"award-info":[{"award-number":["TEYDEB-3190261"]}],"id":[{"id":"10.13039\/501100004410","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,12,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Technical errors in sequencing or bioinformatics steps and difficulties in alignment at some genomic sites result in false positive (FP) variants. Filtering based on quality metrics is a common method for detecting FP variants, but setting thresholds to reduce FP rates may reduce the number of true positive variants by overlooking the more complex relationships between features. The goal of this study is to develop a machine learning-based model for identifying FPs that integrates quality metrics with genomic features and with the feature interpretability property to provide insights into model results.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We propose a random forest-based model that utilizes genomic features to improve identification of FPs. Further examination of the features shows that the newly introduced features have an important impact on the prediction of variants misclassified by VEF, GATK-CNN, and GARFIELD, recently introduced FP detection systems. We applied cost-sensitive training to avoid errors in misclassification of true variants and developed a model that provides a robust mechanism against misclassification of true variants while increasing the prediction rate of FP variants. This model can be easily re-trained when factors such as experimental protocols might alter the FP distribution. In addition, it has an interpretability mechanism that allows users to understand the impact of features on the model\u2019s predictions.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The software implementation can be found at https:\/\/github.com\/ideateknoloji\/FPDetect.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btad694","type":"journal-article","created":{"date-parts":[[2023,11,29]],"date-time":"2023-11-29T19:09:21Z","timestamp":1701284961000},"source":"Crossref","is-referenced-by-count":4,"title":["Improving the filtering of false positive single nucleotide variations by combining genomic features with quality metrics"],"prefix":"10.1093","volume":"39","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5071-2722","authenticated-orcid":false,"given":"Kaz\u0131m K\u0131van\u00e7","family":"Eren","sequence":"first","affiliation":[{"name":"Department of Computer Engineering, Kocaeli University , Kocaeli 41000, Turkey"}]},{"given":"Esra","family":"\u00c7\u0131nar","sequence":"additional","affiliation":[{"name":"R&D Department, Idea Technology Solutions LLC. , Istanbul 34396, Turkey"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4072-3065","authenticated-orcid":false,"given":"Hamza U","family":"Karakurt","sequence":"additional","affiliation":[{"name":"R&D Department, Idea Technology Solutions LLC. , Istanbul 34396, Turkey"},{"name":"Department of Bioengineering, Gebze Technical University , Kocaeli 41400, Turkey"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8376-1056","authenticated-orcid":false,"given":"Arzucan","family":"\u00d6zg\u00fcr","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Bo\u011fazi\u00e7i University , Istanbul 34342, Turkey"}]}],"member":"286","published-online":{"date-parts":[[2023,11,29]]},"reference":[{"key":"2023120209575090600_btad694-B1","doi-asserted-by":"crossref","first-page":"D789","DOI":"10.1093\/nar\/gku1205","article-title":"Omim.org: online mendelian inheritance in man (omim\u00ae), an online catalog of human genes and genetic disorders","volume":"43","author":"Amberger","year":"2015","journal-title":"Nucleic Acids Res"},{"key":"2023120209575090600_btad694-B2","doi-asserted-by":"crossref","first-page":"456","DOI":"10.1016\/j.jmoldx.2015.03.004","article-title":"Confirming variants in next-generation sequencing panel testing by sanger sequencing","volume":"17","author":"Baudhuin","year":"2015","journal-title":"J Mol Diagn"},{"key":"2023120209575090600_btad694-B3","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach Learn"},{"key":"2023120209575090600_btad694-B4","doi-asserted-by":"crossref","first-page":"104135","DOI":"10.1016\/j.biosystems.2020.104135","article-title":"Probability of change in life: amino acid changes in single nucleotide substitutions","volume":"193\u2013194","author":"Chan","year":"2020","journal-title":"Biosystems"},{"key":"2023120209575090600_btad694-B5","doi-asserted-by":"crossref","first-page":"D106","DOI":"10.1093\/nar\/gkab1051","article-title":"The European nucleotide archive in 2021","volume":"50","author":"Cummins","year":"2021","journal-title":"Nucleic Acids Res"},{"key":"2023120209575090600_btad694-B6","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1101\/gr.210500.116","article-title":"A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree","volume":"27","author":"Eberle","year":"2017","journal-title":"Genome Res"},{"key":"2023120209575090600_btad694-B7","doi-asserted-by":"crossref","first-page":"e16","DOI":"10.1093\/nar\/gks836","article-title":"From next-generation sequencing alignments to accurate comparison and validation of single-nucleotide variants: the pibase software","volume":"41","author":"Forster","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2023120209575090600_btad694-B8","doi-asserted-by":"crossref","first-page":"2060","DOI":"10.1093\/bioinformatics\/btz901","article-title":"Lean and deep models for more accurate filtering of SNP and indel variant calls","volume":"36","author":"Friedman","year":"2020","journal-title":"Bioinformatics"},{"key":"2023120209575090600_btad694-B9","doi-asserted-by":"crossref","first-page":"1255","DOI":"10.1038\/s41436-021-01148-3","article-title":"Reducing sanger confirmation testing through false positive prediction algorithms","volume":"23","author":"Holt","year":"2021","journal-title":"Genet Med"},{"key":"2023120209575090600_btad694-B10","doi-asserted-by":"crossref","first-page":"918","DOI":"10.1101\/gr.176552.114","article-title":"An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data","volume":"25","author":"Jun","year":"2015","journal-title":"Genome Res"},{"key":"2023120209575090600_btad694-B11","doi-asserted-by":"crossref","first-page":"D493","DOI":"10.1093\/nar\/gkh103","article-title":"The UCSC table browser data retrieval tool","volume":"32","author":"Karolchik","year":"2004","journal-title":"Nucleic Acids Res"},{"key":"2023120209575090600_btad694-B12","doi-asserted-by":"crossref","first-page":"310","DOI":"10.1038\/ng.2892","article-title":"A general framework for estimating the relative pathogenicity of human genetic variants","volume":"46","author":"Kircher","year":"2014","journal-title":"Nat Genet"},{"key":"2023120209575090600_btad694-B13","doi-asserted-by":"crossref","first-page":"D1062","DOI":"10.1093\/nar\/gkx1153","article-title":"ClinVar: improving access to variant interpretations and supporting evidence","volume":"46","author":"Landrum","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2023120209575090600_btad694-B14","doi-asserted-by":"crossref","first-page":"D19","DOI":"10.1093\/nar\/gkq1019","article-title":"The sequence read archive","volume":"39","author":"Leinonen","year":"2010","journal-title":"Nucleic Acids Res"},{"key":"2023120209575090600_btad694-B15","doi-asserted-by":"crossref","first-page":"2843","DOI":"10.1093\/bioinformatics\/btu356","article-title":"Toward better understanding of artifacts in variant calling from high-coverage samples","volume":"30","author":"Li","year":"2014","journal-title":"Bioinformatics"},{"key":"2023120209575090600_btad694-B16","doi-asserted-by":"crossref","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with burrows\u2013wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023120209575090600_btad694-B17","doi-asserted-by":"crossref","first-page":"318","DOI":"10.1016\/j.jmoldx.2018.10.009","article-title":"A rigorous interlaboratory examination of the need to confirm next-generation sequencing\u2014detected variants with an orthogonal method in clinical genetic testing","volume":"21","author":"Lincoln","year":"2019","journal-title":"J Mol Diagn"},{"key":"2023120209575090600_btad694-B18","doi-asserted-by":"crossref","first-page":"S119","DOI":"10.1186\/1753-6561-5-S9-S119","article-title":"Evaluating methods for the analysis of rare variants in sequence data","volume":"5","author":"Luedtke","year":"2011","journal-title":"BMC Proc"},{"author":"Lundberg","key":"2023120209575090600_btad694-B19","first-page":"4768"},{"key":"2023120209575090600_btad694-B20","doi-asserted-by":"crossref","first-page":"1297","DOI":"10.1101\/gr.107524.110","article-title":"The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data","volume":"20","author":"McKenna","year":"2010","journal-title":"Genome Res"},{"key":"2023120209575090600_btad694-B21","doi-asserted-by":"crossref","first-page":"923","DOI":"10.1016\/j.jmoldx.2016.07.006","article-title":"Sanger confirmation is required to achieve optimal sensitivity and specificity in next-generation sequencing panel testing","volume":"18","author":"Mu","year":"2016","journal-title":"J Mol Diagn"},{"key":"2023120209575090600_btad694-B22","doi-asserted-by":"crossref","first-page":"1361","DOI":"10.1093\/bioinformatics\/btt172","article-title":"A support vector machine for identification of single-nucleotide polymorphisms from next-generation sequencing data","volume":"29","author":"O'Fallon","year":"2013","journal-title":"Bioinformatics"},{"key":"2023120209575090600_btad694-B23","first-page":"1833","article-title":"Permutation tests for studying classifier performance","volume":"11","author":"Ojala","year":"2010","journal-title":"J Mach Learn Res"},{"key":"2023120209575090600_btad694-B24","first-page":"2825","article-title":"Scikit-learn: machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J Mach Learn Res"},{"key":"2023120209575090600_btad694-B25","doi-asserted-by":"crossref","first-page":"983","DOI":"10.1038\/nbt.4235","article-title":"A universal SNP and small-indel variant caller using deep neural networks","volume":"36","author":"Poplin","year":"2018","journal-title":"Nat Biotechnol"},{"key":"2023120209575090600_btad694-B26","doi-asserted-by":"crossref","first-page":"3038","DOI":"10.1093\/bioinformatics\/bty303","article-title":"Garfield-NGS: genomic variants filtering by deep learning models in NGS","volume":"34","author":"Ravasio","year":"2018","journal-title":"Bioinformatics"},{"key":"2023120209575090600_btad694-B27","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1016\/S0378-3758(00)00115-4","article-title":"Improving predictive inference under covariate shift by weighting the log-likelihood function","volume":"90","author":"Shimodaira","year":"2000","journal-title":"J Stat Plan Inference"},{"key":"2023120209575090600_btad694-B28","doi-asserted-by":"crossref","first-page":"510","DOI":"10.1038\/gim.2013.183","article-title":"Assessing the necessity of confirmatory testing for exome-sequencing results in a clinical molecular diagnostic laboratory","volume":"16","author":"Strom","year":"2014","journal-title":"Genet Med"},{"volume-title":"Approaching (Almost) Any Machine Learning Problem","year":"2020","author":"Thakur","key":"2023120209575090600_btad694-B29"},{"key":"2023120209575090600_btad694-B30","doi-asserted-by":"crossref","first-page":"263","DOI":"10.1186\/s12864-018-4659-0","article-title":"A machine learning model to determine the accuracy of variant calls in capture-based next generation sequencing","volume":"19","author":"van den Akker","year":"2018","journal-title":"BMC Genomics"},{"key":"2023120209575090600_btad694-B31","doi-asserted-by":"crossref","first-page":"2328","DOI":"10.1093\/bioinformatics\/btz952","article-title":"VEF: a variant filtering tool based on ensemble methods","volume":"36","author":"Zhang","year":"2020","journal-title":"Bioinformatics"},{"key":"2023120209575090600_btad694-B32","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1038\/nbt.2835","article-title":"Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls","volume":"32","author":"Zook","year":"2014","journal-title":"Nat Biotechnol"},{"key":"2023120209575090600_btad694-B33","doi-asserted-by":"crossref","first-page":"160025","DOI":"10.1038\/sdata.2016.25","article-title":"Extensive sequencing of seven human genomes to characterize benchmark reference materials","volume":"3","author":"Zook","year":"2016","journal-title":"Sci Data"},{"key":"2023120209575090600_btad694-B34","doi-asserted-by":"crossref","first-page":"561","DOI":"10.1038\/s41587-019-0074-6","article-title":"An open resource for accurately benchmarking small variant and reference calls","volume":"37","author":"Zook","year":"2019","journal-title":"Nat Biotechnol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btad694\/53933310\/btad694.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/12\/btad694\/53974372\/btad694.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/12\/btad694\/53974372\/btad694.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,2]],"date-time":"2023-12-02T09:58:17Z","timestamp":1701511097000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btad694\/7455253"}},"subtitle":[],"editor":[{"given":"Inanc","family":"Birol","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2023,11,29]]},"references-count":34,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2023,12,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btad694","relation":{},"ISSN":["1367-4811"],"issn-type":[{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2023,12,1]]},"published":{"date-parts":[[2023,11,29]]},"article-number":"btad694"}}