{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T06:39:37Z","timestamp":1767595177865,"version":"3.48.0"},"reference-count":39,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T00:00:00Z","timestamp":1767571200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100008762","name":"Genome Canada","doi-asserted-by":"publisher","award":["6548"],"award-info":[{"award-number":["6548"]}],"id":[{"id":"10.13039\/100008762","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Bioinform."],"abstract":"<jats:p>\n                    Accurate variant calling refinement is crucial for distinguishing true genetic variants from technical artifacts in high-throughput sequencing data. While heuristic filtering and manual review are common approaches for refining variants, manual review is time-consuming, and heuristic filtering often lacks optimal solutions, especially for low-coverage data. Traditional variant calling methods often struggle with accuracy, especially in regions of low read coverage, leading to false-positive or false-negative calls. Advances in artificial intelligence, particularly deep learning, offer promising solutions for automating this refinement process. Here, we present a Transformers-based framework for genetic variant refinement that leverages self-attention to model dependencies among variant features and directly processes VCF files, enabling seamless integration with standard pipelines such as BCFTools and GATK4. Trained on 2 million variants from the GIAB (v4.2.1) sample HG003, the framework achieved 89.26% accuracy and a ROC AUC of 0.88. 
Across the tested samples, VariantTransformer improved baseline filtering accuracy by 4%\u201310%, demonstrating consistent gains over the default caller filters. When integrated into conventional variant calling pipelines, VariantTransformer outperformed traditional heuristic filters and, through refinement of existing caller outputs, approached the accuracy achieved by state-of-the-art AI-based variant callers such as DeepVariant, despite not operating as a standalone caller. By positioning this work as a flexible and generalizable framework rather than a single-use model, we highlight the underexplored potential of Transformers for variant refinement in genomics. This study contributes a blueprint for adapting Transformer architectures to a wide range of genomic quality control and filtering tasks. Code is available at:\n                    <jats:ext-link>https:\/\/github.com\/Omar-Abd-Elwahab\/VariantTransformer<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.3389\/fbinf.2025.1694924","type":"journal-article","created":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T06:36:53Z","timestamp":1767595013000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["A Transformers-based framework for refinement of genetic variants"],"prefix":"10.3389","volume":"5","author":[{"given":"Omar","family":"Abdelwahab","sequence":"first","affiliation":[]},{"given":"Davoud","family":"Torkamaneh","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2026,1,5]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1038\/s41588-018-0257-y","article-title":"A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data","volume":"50","author":"Ainscough","year":"2018","journal-title":"Nat. 
Genet."},{"key":"B2","doi-asserted-by":"publisher","first-page":"412","DOI":"10.1093\/bioinformatics\/16.5.412","article-title":"Assessing the accuracy of prediction algorithms for classification: an overview","volume":"16","author":"Baldi","year":"2000","journal-title":"Bioinformatics"},{"key":"B3","first-page":"1877","article-title":"Language models are few-shot learners","author":"Brown","year":"2020"},{"key":"B4","doi-asserted-by":"publisher","first-page":"125","DOI":"10.1186\/1471-2105-15-125","article-title":"Effective filtering strategies to improve data quality from population-based whole exome sequencing studies","volume":"15","author":"Carson","year":"2014","journal-title":"BMC Bioinformatics"},{"key":"B5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12864-019-6413-7","article-title":"The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation","volume":"21","author":"Chicco","year":"2020","journal-title":"BMC Genomics"},{"key":"B6","doi-asserted-by":"publisher","first-page":"023754","DOI":"10.1101\/023754","article-title":"Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines","author":"Cleary","year":"2015","journal-title":"bioRxiv"},{"key":"B7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/gigascience\/giab008","article-title":"Twelve years of SAMtools and BCFtools","volume":"10","author":"Danecek","year":"2021","journal-title":"Gigascience"},{"key":"B8","doi-asserted-by":"publisher","first-page":"167","DOI":"10.1093\/bioinformatics\/btr629","article-title":"Feature-based classifiers for somatic mutation detection in tumour\u2013normal paired sequencing data","volume":"28","author":"Ding","year":"2012","journal-title":"Bioinformatics"},{"key":"B9","doi-asserted-by":"publisher","first-page":"861","DOI":"10.1016\/j.patrec.2005.10.010","article-title":"An introduction to ROC 
analysis","volume":"27","author":"Fawcett","year":"2006","journal-title":"Pattern Recognit. Lett."},{"key":"B10","doi-asserted-by":"publisher","first-page":"115717","DOI":"10.1101\/115717","article-title":"The sentieon genomics tools - a fast and accurate solution to variant calling from next-generation sequence data","author":"Freed","year":"2017","journal-title":"bioRxiv"},{"key":"B11","doi-asserted-by":"publisher","first-page":"22","DOI":"10.1007\/978-3-540-24775-3_5","article-title":"Discriminative methods for multi-labeled classification","volume":"3056","author":"Godbole","year":"2004","journal-title":"Lect. Notes Comput. Sci. Incl."},{"key":"B12","doi-asserted-by":"publisher","first-page":"367","DOI":"10.1016\/j.compbiolchem.2004.09.006","article-title":"Comparing two K-category assignments by a K-category correlation coefficient","volume":"28","author":"Gorodkin","year":"2004","journal-title":"Comput. Biol. Chem."},{"key":"B13","doi-asserted-by":"publisher","first-page":"204","DOI":"10.1038\/nature09764","article-title":"Charting a course for genomic medicine from base pairs to bedside","volume":"470","author":"Green","year":"2011","journal-title":"Nature"},{"key":"B14","doi-asserted-by":"publisher","first-page":"171","DOI":"10.1023\/a:1010920819831","article-title":"A simple generalisation of the area under the ROC curve for multiple class classification problems","volume":"45","author":"Hand","year":"2001","journal-title":"Mach. Learn"},{"key":"B15","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41576-024-00738-6","article-title":"Next-generation data filtering in the genomics era","volume":"2024","author":"Hemstrom","year":"2024","journal-title":"Nat. Rev. Genet."},{"key":"B16","doi-asserted-by":"publisher","first-page":"90","DOI":"10.1109\/mcse.2007.55","article-title":"Matplotlib: a 2D graphics environment","volume":"9","author":"Hunter","year":"2007","journal-title":"Comput. Sci. 
Eng."},{"key":"B17","doi-asserted-by":"publisher","first-page":"e41882","DOI":"10.1371\/journal.pone.0041882","article-title":"A comparison of MCC and CEN error measures in multi-class prediction","volume":"7","author":"Jurman","year":"2012","journal-title":"PLoS One"},{"key":"B18","doi-asserted-by":"publisher","first-page":"555","DOI":"10.1038\/s41587-019-0054-x","article-title":"Best practices for benchmarking germline small-variant calls in human genomes","volume":"37","author":"Krusche","year":"2019","journal-title":"Nat. Biotechnol."},{"key":"B19","doi-asserted-by":"publisher","first-page":"357","DOI":"10.1038\/nmeth.1923","article-title":"Fast gapped-read alignment with bowtie 2","volume":"9","author":"Langmead","year":"2012","journal-title":"Nat. Methods"},{"key":"B20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41598-022-15563-2","article-title":"The evaluation of Bcftools mpileup and GATK HaplotypeCaller for variant calling in non-human species","volume":"12","author":"Lefouili","year":"2022","journal-title":"Sci. Rep."},{"key":"B21","doi-asserted-by":"publisher","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with Burrows\u2013Wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"B22","doi-asserted-by":"publisher","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The sequence alignment\/map format and SAMtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"B23","article-title":"Decoupled weight decay regularization","author":"Loshchilov","year":"2017"},{"key":"B24","doi-asserted-by":"publisher","first-page":"276","DOI":"10.11613\/bm.2012.031","article-title":"Interrater reliability: the kappa statistic","volume":"22","author":"McHugh","year":"2012","journal-title":"Biochem. Med. 
Zagreb."},{"key":"B25","doi-asserted-by":"publisher","first-page":"1297","DOI":"10.1101\/gr.107524.110","article-title":"The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data","volume":"20","author":"McKenna","year":"2010","journal-title":"Genome Res."},{"key":"B26","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1038\/hdy.2016.102","article-title":"From next-generation resequencing reads to a high-quality variant data set","volume":"118","author":"Pfeifer","year":"2016","journal-title":"Heredity"},{"key":"B27","doi-asserted-by":"publisher","first-page":"983","DOI":"10.1038\/nbt.4235","article-title":"A universal SNP and small-indel variant caller using deep neural networks","volume":"36","author":"Poplin","year":"2018","journal-title":"Nat. Biotechnol."},{"key":"B28","doi-asserted-by":"publisher","first-page":"e31","DOI":"10.1158\/0008-5472.CAN-17-0337","article-title":"Variant review with the integrative genomics viewer","volume":"77","author":"Robinson","year":"2017","journal-title":"Cancer Res."},{"key":"B29","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1038\/nbt.1754","article-title":"Integrative genomics viewer","volume":"29","author":"Robinson","year":"2011","journal-title":"Nat. Biotechnol."},{"key":"B30","doi-asserted-by":"publisher","first-page":"912","DOI":"10.1186\/s12864-016-3281-2","article-title":"SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing","volume":"17","author":"Spinella","year":"2016","journal-title":"BMC Genomics"},{"key":"B31","doi-asserted-by":"publisher","first-page":"3","DOI":"10.28092\/j.issn.2095-3941.2016.0004","article-title":"Current practices and guidelines for clinical next-generation sequencing oncology testing","volume":"13","author":"Strom","year":"2016","journal-title":"Cancer Biol. 
Med."},{"key":"B32","doi-asserted-by":"publisher","first-page":"930","DOI":"10.1038\/35103535","article-title":"Accessing genetic variation: genotyping single nucleotide polymorphisms","volume":"2","author":"Syv\u00e4nen","year":"2001","journal-title":"Nat. Rev. Genet."},{"key":"B33","doi-asserted-by":"publisher","first-page":"11.10.33","DOI":"10.1002\/0471250953.bi1110s43","article-title":"From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline","volume":"43","author":"Van der Auwera","year":"2013","journal-title":"Curr. Protoc. Bioinforma."},{"key":"B34","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1145\/2786984.2786995","article-title":"Scikit-learn","volume":"19","author":"Varoquaux","year":"2015","journal-title":"Getmob. Mob. Comput. Commun."},{"key":"B35","first-page":"6000","article-title":"Attention is all you need","author":"Vaswani","year":"2017"},{"key":"B36","doi-asserted-by":"publisher","first-page":"100128","DOI":"10.1016\/j.xgen.2022.100128","article-title":"Benchmarking challenging small variants with linked and long reads","volume":"2","author":"Wagner","year":"2022","journal-title":"Cell Genomics"},{"key":"B37","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-319-24277-4","volume-title":"ggplot2: Elegant Graphics for Data Analysis","author":"Wickham","year":"2016"},{"key":"B38","doi-asserted-by":"publisher","first-page":"1","DOI":"10.48550\/arXiv.1910.03771","article-title":"HuggingFace\u2019s transformers: state-of-the-art natural language processing","author":"Wolf","year":"2019","journal-title":"ArXiv"},{"key":"B39","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/sdata.2016.25","article-title":"Extensive sequencing of seven human genomes to characterize benchmark reference materials","volume":"3","author":"Zook","year":"2016","journal-title":"Sci. 
Data"}],"container-title":["Frontiers in Bioinformatics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2025.1694924\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T06:36:55Z","timestamp":1767595015000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2025.1694924\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,5]]},"references-count":39,"alternative-id":["10.3389\/fbinf.2025.1694924"],"URL":"https:\/\/doi.org\/10.3389\/fbinf.2025.1694924","relation":{},"ISSN":["2673-7647"],"issn-type":[{"value":"2673-7647","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,5]]},"article-number":"1694924"}}