{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:31Z","timestamp":1772138071114,"version":"3.50.1"},"reference-count":34,"publisher":"Oxford University Press (OUP)","issue":"11","license":[{"start":{"date-parts":[[2024,9,19]],"date-time":"2024-09-19T00:00:00Z","timestamp":1726704000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,11,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or Hidden Markov models, and are often coupled with heuristic techniques and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false-positive detections in a typical human whole genome, creating a significant manual review burden.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in a generative fashion identical to modern large language models. We train our model on 37 whole genome sequences from Genome-in-a-Bottle samples, and demonstrate that our method learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3, and DeepVariant, and demonstrate that our method has superior overall accuracy compared to other methods. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Jenever is implemented as a python-based command line tool. Source code is available at https:\/\/github.com\/ARUP-NGS\/jenever\/<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae565","type":"journal-article","created":{"date-parts":[[2024,9,18]],"date-time":"2024-09-18T15:25:30Z","timestamp":1726673130000},"source":"Crossref","is-referenced-by-count":1,"title":["Generative haplotype prediction outperforms statistical methods for small variant detection in next-generation sequencing data"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7185-7894","authenticated-orcid":false,"given":"Brendan","family":"O\u2019Fallon","sequence":"first","affiliation":[{"name":"Institute for Research and Innovation, ARUP Labs , Salt Lake City, UT 84108,","place":["United States"]},{"name":"Institute for Clinical and Experimental Pathology, ARUP Labs , Salt Lake City, UT 84108,","place":["United States"]}]},{"given":"Ashini","family":"Bolia","sequence":"additional","affiliation":[{"name":"Institute for Research and Innovation, ARUP Labs , Salt Lake City, UT 84108,","place":["United States"]}]},{"given":"Jacob","family":"Durtschi","sequence":"additional","affiliation":[{"name":"Institute for Research and Innovation, ARUP Labs , Salt Lake City, UT 84108,","place":["United States"]},{"name":"Institute for Clinical and Experimental Pathology, ARUP Labs , Salt Lake City, UT 84108,","place":["United States"]}]},{"given":"Luobin","family":"Yang","sequence":"additional","affiliation":[{"name":"Institute for Research and Innovation, ARUP Labs , Salt Lake City, UT 84108,","place":["United States"]}]},{"given":"Eric","family":"Fredrickson","sequence":"additional","affiliation":[{"name":"Institute for Research and Innovation, ARUP Labs , Salt Lake City, UT 84108,","place":["United States"]}]},{"given":"Hunter","family":"Best","sequence":"additional","affiliation":[{"name":"Institute for Research and Innovation, ARUP Labs , Salt Lake City, UT 84108,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2024,9,19]]},"reference":[{"key":"2024110904321050700_btae565-B1","first-page":"232","article-title":"DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer","volume":"41","author":"Baid","year":"2023","journal-title":"Nat Biotechnol"},{"key":"2024110904321050700_btae565-B2","doi-asserted-by":"publisher","author":"Behera","DOI":"10.1101\/2024.01.02.573821"},{"key":"2024110904321050700_btae565-B3","author":"Choromanski","year":"2020"},{"key":"2024110904321050700_btae565-B4","doi-asserted-by":"publisher","author":"Cleary","year":"2015","DOI":"10.1101\/023754,"},{"key":"2024110904321050700_btae565-B5","doi-asserted-by":"crossref","first-page":"885","DOI":"10.1038\/s41587-021-00861-3","article-title":"A unified haplotype-based method for accurate and comprehensive variant calling","volume":"39","author":"Cooke","year":"2021","journal-title":"Nat Biotechnol"},{"key":"2024110904321050700_btae565-B6","first-page":"16344","article-title":"FlashAttention: fast and memory-efficient exact attention with IO-awareness","volume":"35","author":"Dao","year":"2022","journal-title":"Adv Neural Inf Process Syst"},{"key":"2024110904321050700_btae565-B7","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1038\/ng.806","article-title":"A framework for variation discovery and genotyping using next-generation DNA sequencing data","volume":"43","author":"DePristo","year":"2011","journal-title":"Nat Genet"},{"key":"2024110904321050700_btae565-B9","author":"Garrison","year":"2012"},{"key":"2024110904321050700_btae565-B10","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1186\/s13073-016-0269-0","article-title":"Medical implications of technical accuracy in genome sequencing","volume":"8","author":"Goldfeder","year":"2016","journal-title":"Genome Med"},{"key":"2024110904321050700_btae565-B11","doi-asserted-by":"crossref","first-page":"025013","DOI":"10.1088\/2632-2153\/ab7e19","article-title":"DAVI: deep learning-based tool for alignment and single nucleotide variant identification","volume":"1","author":"Gupta","year":"2020","journal-title":"Mach Learn Sci Technol"},{"key":"2024110904321050700_btae565-B12","author":"Izmailov","year":"2018"},{"key":"2024110904321050700_btae565-B13","doi-asserted-by":"crossref","first-page":"591","DOI":"10.1038\/s41592-018-0051-x","article-title":"Strelka2: fast and accurate calling of germline and somatic variants","volume":"15","author":"Kim","year":"2018","journal-title":"Nat Methods"},{"key":"2024110904321050700_btae565-B14","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1186\/s13059-020-01993-6","article-title":"Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery","volume":"21","author":"K\u00f6ster","year":"2020","journal-title":"Genome Biol"},{"key":"2024110904321050700_btae565-B15","doi-asserted-by":"crossref","first-page":"555","DOI":"10.1038\/s41587-019-0054-x","article-title":"Best practices for benchmarking germline small-variant calls in human genomes","volume":"37","author":"Krusche","year":"2019","journal-title":"Nat Biotechnol"},{"key":"2024110904321050700_btae565-B16","doi-asserted-by":"crossref","first-page":"2843","DOI":"10.1093\/bioinformatics\/btu356","article-title":"Toward better understanding of artifacts in variant calling from high-coverage samples","volume":"30","author":"Li","year":"2014","journal-title":"Bioinformatics"},{"key":"2024110904321050700_btae565-B18","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The sequence alignment\/map format and SAMtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2024110904321050700_btae565-B19","doi-asserted-by":"crossref","first-page":"e1007556","DOI":"10.1371\/journal.pcbi.1007556","article-title":"ForestQC: quality control on genetic variants from next-generation sequencing data using random forest","volume":"15","author":"Li","year":"2019","journal-title":"PLoS Comput Biol"},{"key":"2024110904321050700_btae565-B21","doi-asserted-by":"crossref","first-page":"998","DOI":"10.1038\/s41467-019-09025-z","article-title":"A multi-task convolutional deep neural network for variant calling in single molecule sequencing","volume":"10","author":"Luo","year":"2019","journal-title":"Nat Commun"},{"key":"2024110904321050700_btae565-B22","doi-asserted-by":"crossref","first-page":"220","DOI":"10.1038\/s42256-020-0167-4","article-title":"Exploring the limit of using a deep neural network on pileup data for germline variant calling","volume":"2","author":"Luo","year":"2020","journal-title":"Nat Mach Intell"},{"key":"2024110904321050700_btae565-B23","doi-asserted-by":"crossref","first-page":"1185","DOI":"10.1038\/nmeth.2221","article-title":"The GEM mapper: fast, accurate and versatile alignment by filtration","volume":"9","author":"Marco-Sola","year":"2012","journal-title":"Nat Methods"},{"key":"2024110904321050700_btae565-B24","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1038\/nrg2986","article-title":"Genotype and SNP calling from next-generation sequencing data","volume":"12","author":"Nielsen","year":"2011","journal-title":"Nat Rev Genet"},{"key":"2024110904321050700_btae565-B25","doi-asserted-by":"publisher","author":"O\u2019Fallon","DOI":"10.1101\/2022.09.12.506413,"},{"key":"2024110904321050700_btae565-B26","volume-title":"Advances in Neural Information Processing Systems","author":"Paszke"},{"key":"2024110904321050700_btae565-B28","doi-asserted-by":"crossref","first-page":"983","DOI":"10.1038\/nbt.4235","article-title":"A universal SNP and small-indel variant caller using deep neural networks","volume":"36","author":"Poplin","year":"2018","journal-title":"Nat Biotechnol"},{"key":"2024110904321050700_btae565-B30","author":"Qi","year":"2021."},{"key":"2024110904321050700_btae565-B31","doi-asserted-by":"publisher","author":"Ramachandran","year":"2020","DOI":"10.1101\/2020.03.23.004473,"},{"key":"2024110904321050700_btae565-B32","first-page":"8583","article-title":"Scaling vision with sparse mixture of experts","volume":"34","author":"Riquelme","year":"2021","journal-title":"Adv Neural Inf Proces Syst"},{"key":"2024110904321050700_btae565-B33","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1162\/tacl_a_00353","article-title":"Efficient content-based sparse attention with routing transformers","volume":"9","author":"Roy","year":"2021","journal-title":"Trans Assoc Comput Linguist"},{"key":"2024110904321050700_btae565-B34","author":"Shaw","year":"2018"},{"key":"2024110904321050700_btae565-B35","author":"Shazeer","year":"2019"},{"key":"2024110904321050700_btae565-B36","doi-asserted-by":"crossref","first-page":"127063","DOI":"10.1016\/j.neucom.2023.127063","article-title":"RoFormer: enhanced transformer with rotary position embedding","volume":"568","author":"Su","year":"2024","journal-title":"Neurocomputing"},{"key":"2024110904321050700_btae565-B37","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv Neural Inf Process Syst"},{"key":"2024110904321050700_btae565-B38","first-page":"1","article-title":"Benchmarking challenging small variants with linked and long reads","volume":"2","author":"Wagner","year":"2022","journal-title":"Cell Genom"},{"key":"2024110904321050700_btae565-B3800","author":"Wang"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae565\/59204016\/btae565.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/11\/btae565\/60549144\/btae565.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/11\/btae565\/60549144\/btae565.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T23:34:06Z","timestamp":1731108846000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae565\/7762102"}},"subtitle":[],"editor":[{"given":"Peter","family":"Robinson","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,9,19]]},"references-count":34,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,11,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae565","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2024.02.27.582327","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,11]]},"published":{"date-parts":[[2024,9,19]]},"article-number":"btae565"}}