{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T04:34:14Z","timestamp":1775104454724,"version":"3.50.1"},"reference-count":40,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2021,7,12]],"date-time":"2021-07-12T00:00:00Z","timestamp":1626048000000},"content-version":"vor","delay-in-days":11,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CCF-1816027"],"award-info":[{"award-number":["CCF-1816027"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100017223","name":"National Energy Research Scientific Computing Center","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100017223","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000015","name":"U.S. Department of Energy","doi-asserted-by":"publisher","award":["DE-AC02-05CH11231"],"award-info":[{"award-number":["DE-AC02-05CH11231"]}],"id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,8,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length \u03b1 with at most \u03b4 differences. This framework leads to a rich set of problems based on the types of variants [e.g. single nucleotide polymorphisms (SNPs), indels or structural variants (SVs)], and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementation when feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple \u03b1 and \u03b4 parameter values corresponding to short and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (\u03b1\u2009=\u200910\u2009kbp, \u03b4\u2009=\u20091000), 99.99% SNPs and 73% SVs can be safely excluded from human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>https:\/\/github.com\/AT-CG\/VF.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab302","type":"journal-article","created":{"date-parts":[[2021,4,25]],"date-time":"2021-04-25T07:10:29Z","timestamp":1619334629000},"page":"i460-i467","source":"Crossref","is-referenced-by-count":10,"title":["A variant selection framework for genome graphs"],"prefix":"10.1093","volume":"37","author":[{"given":"Chirag","family":"Jain","sequence":"first","affiliation":[{"name":"Department of Computational and Data Sciences, Indian Institute of Science , Bangalore, KA 560012, India"}]},{"given":"Neda","family":"Tavakoli","sequence":"additional","affiliation":[{"name":"School of Computational Science and Engineering, Georgia Institute of Technology , Atlanta, GA 30332, USA"}]},{"given":"Srinivas","family":"Aluru","sequence":"additional","affiliation":[{"name":"School of Computational Science and Engineering, Georgia Institute of Technology , Atlanta, GA 30332, USA"}]}],"member":"286","published-online":{"date-parts":[[2021,7,12]]},"reference":[{"key":"2023062410281887400_btab302-B1","doi-asserted-by":"crossref","first-page":"663","DOI":"10.1016\/j.cell.2018.12.019","article-title":"Characterizing the major structural variant alleles of the human genome","volume":"176","author":"Audano","year":"2019","journal-title":"Cell"},{"key":"2023062410281887400_btab302-B2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-019-1774-4","article-title":"Is it time to change the reference genome?","volume":"20","author":"Ballouz","year":"2019","journal-title":"Genome Biol"},{"key":"2023062410281887400_btab302-B3","doi-asserted-by":"crossref","first-page":"i146","DOI":"10.1093\/bioinformatics\/btaa446","article-title":"Distance indexing and seed clustering in sequence graphs","volume":"36","author":"Chang","year":"2020","journal-title":"Bioinformatics"},{"key":"2023062410281887400_btab302-B4","first-page":"118","article-title":"Computational pan-genomics: status, promises and challenges","volume":"19","year":"2018","journal-title":"Brief. Bioinform"},{"key":"2023062410281887400_btab302-B5","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","author":"Consortium","year":"2015","journal-title":"Nature"},{"key":"2023062410281887400_btab302-B6","doi-asserted-by":"crossref","first-page":"2156","DOI":"10.1093\/bioinformatics\/btr330","article-title":"The variant call format and vcftools","volume":"27","author":"Danecek","year":"2011","journal-title":"Bioinformatics"},{"key":"2023062410281887400_btab302-B7","doi-asserted-by":"crossref","first-page":"e109384","DOI":"10.1371\/journal.pone.0109384","article-title":"Indexes of large genome collections on a PC","volume":"9","author":"Danek","year":"2014","journal-title":"PLoS One"},{"key":"2023062410281887400_btab302-B8","doi-asserted-by":"crossref","first-page":"3712","DOI":"10.1093\/bioinformatics\/btaa265","article-title":"Vargas: heuristic-free alignment for assessing linear and graph read aligners","volume":"36","author":"Darby","year":"2020","journal-title":"Bioinformatics"},{"key":"2023062410281887400_btab302-B9","doi-asserted-by":"crossref","first-page":"682","DOI":"10.1038\/ng.3257","article-title":"Improved genome inference in the MHC using a population reference graph","volume":"47","author":"Dilthey","year":"2015","journal-title":"Nat. Genet"},{"key":"2023062410281887400_btab302-B10","doi-asserted-by":"crossref","first-page":"1654","DOI":"10.1038\/ng.3964","article-title":"Graphtyper enables population-scale genotyping using pangenome graphs","volume":"49","author":"Eggertsson","year":"2017","journal-title":"Nat. Genet"},{"key":"2023062410281887400_btab302-B11","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1146\/annurev-genom-120219-080406","article-title":"Pangenome graphs","volume":"21","author":"Eizenga","year":"2020","journal-title":"Annu. Rev. Genomics Hum. Genet"},{"key":"2023062410281887400_btab302-B12","doi-asserted-by":"crossref","first-page":"835","DOI":"10.2140\/pjm.1965.15.835","article-title":"Incidence matrices and interval graphs","volume":"15","author":"Fulkerson","year":"1965","journal-title":"Pac. J. Math"},{"key":"2023062410281887400_btab302-B13","doi-asserted-by":"crossref","first-page":"875","DOI":"10.1038\/nbt.4227","article-title":"Variation graph toolkit improves read mapping by representing genetic variation in the reference","volume":"36","author":"Garrison","year":"2018","journal-title":"Nat. Biotechnol"},{"key":"2023062410281887400_btab302-B14","doi-asserted-by":"crossref","first-page":"i81","DOI":"10.1093\/bioinformatics\/btz341","article-title":"Fully-sensitive seed finding in sequence graphs using a hybrid index","volume":"35","author":"Ghaffaari","year":"2019","journal-title":"Bioinformatics"},{"key":"2023062410281887400_btab302-B15","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1007\/s00453-003-1028-3","article-title":"Fixed-parameter algorithms for closest string and related problems","volume":"37","author":"Gramm","year":"2003","journal-title":"Algorithmica"},{"key":"2023062410281887400_btab302-B16","first-page":"1","article-title":"Bloom filter trie: an alignment-free and reference-free data structure for pan-genome storage","volume":"11","author":"Holley","year":"2016","journal-title":"Algor. Mol. Biol"},{"key":"2023062410281887400_btab302-B17","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1038\/ng.1028","article-title":"De novo assembly and genotyping of variants using colored de bruijn graphs","volume":"44","author":"Iqbal","year":"2012","journal-title":"Nat. Genet"},{"key":"2023062410281887400_btab302-B18","first-page":"104","article-title":"Astarix: fast and optimal sequence-to-graph alignment","author":"Ivanov","year":"2020"},{"key":"2023062410281887400_btab302-B19","first-page":"451","article-title":"Accelerating sequence alignment to graphs","volume-title":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","author":"Jain","year":"2019"},{"key":"2023062410281887400_btab302-B20","first-page":"17:1","article-title":"Validating paired-end read alignments in sequence graphs","author":"Jain","year":"2019"},{"key":"2023062410281887400_btab302-B21","doi-asserted-by":"crossref","first-page":"640","DOI":"10.1089\/cmb.2019.0066","article-title":"On the complexity of sequence-to-graph alignment","volume":"27","author":"Jain","year":"2020","journal-title":"J. Comput. Biol"},{"key":"2023062410281887400_btab302-B22","first-page":"266197","article-title":"Hisat-genotype: next generation genomic analysis platform on a personal computer","author":"Kim","year":"2018","journal-title":"BioRxiv"},{"key":"2023062410281887400_btab302-B23","doi-asserted-by":"crossref","first-page":"500","DOI":"10.1089\/cmb.2019.0309","article-title":"Efficient construction of a complete index for pan-genomics read alignment","volume":"27","author":"Kuhnle","year":"2020","journal-title":"J. Comput. Biol"},{"key":"2023062410281887400_btab302-B24","first-page":"105","article-title":"Using minimum path cover to boost dynamic programming on DAGs: co-linear chaining extended","author":"Kuosmanen","year":"2018"},{"key":"2023062410281887400_btab302-B25","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1016\/S0890-5401(03)00057-9","article-title":"Distinguishing string selection problems","volume":"185","author":"Lanctot","year":"2003","journal-title":"Inf. Comput"},{"key":"2023062410281887400_btab302-B26","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-020-02168-z","article-title":"The design and construction of reference pangenome graphs with minigraph","volume":"21","author":"Li","year":"2020","journal-title":"Genome Biol"},{"key":"2023062410281887400_btab302-B27","doi-asserted-by":"crossref","first-page":"3224","DOI":"10.1093\/bioinformatics\/btw371","article-title":"debga: read alignment with de Bruijn graph-based seed and extension","volume":"32","author":"Liu","year":"2016","journal-title":"Bioinformatics"},{"key":"2023062410281887400_btab302-B28","doi-asserted-by":"crossref","first-page":"222","DOI":"10.1007\/978-3-319-43681-4_18","volume-title":"International Workshop on Algorithms in Bioinformatics","author":"Maciuca","year":"2016"},{"key":"2023062410281887400_btab302-B29","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-019-1828-7","article-title":"Structural variant calling: the long and the short of it","volume":"20","author":"Mahmoud","year":"2019","journal-title":"Genome Biol"},{"key":"2023062410281887400_btab302-B30","doi-asserted-by":"crossref","first-page":"3476","DOI":"10.1093\/bioinformatics\/btu756","article-title":"Splitmem: a graphical algorithm for pan-genome analysis with suffix skips","volume":"30","author":"Marcus","year":"2014","journal-title":"Bioinformatics"},{"key":"2023062410281887400_btab302-B31","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-020-01963-y","article-title":"Chop: haplotype-aware path indexing in population graphs","volume":"21","author":"Mokveld","year":"2020","journal-title":"Genome Biol"},{"key":"2023062410281887400_btab302-B32","doi-asserted-by":"crossref","first-page":"665","DOI":"10.1101\/gr.214155.116","article-title":"Genome graphs and the evolution of genome inference","volume":"27","author":"Paten","year":"2017","journal-title":"Genome Res"},{"key":"2023062410281887400_btab302-B33","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-018-1595-x","article-title":"Forge: prioritizing variants for graph genomes","volume":"19","author":"Pritt","year":"2018","journal-title":"Genome Biol"},{"key":"2023062410281887400_btab302-B34","doi-asserted-by":"crossref","first-page":"i333","DOI":"10.1093\/bioinformatics\/bts378","article-title":"Delly: structural variant discovery by integrated paired-end and split-read analysis","volume":"28","author":"Rausch","year":"2012","journal-title":"Bioinformatics"},{"key":"2023062410281887400_btab302-B35","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-020-02157-2","article-title":"Graphaligner: rapid and versatile sequence-to-graph alignment","volume":"21","author":"Rautiainen","year":"2020","journal-title":"Genome Biol"},{"key":"2023062410281887400_btab302-B36","doi-asserted-by":"crossref","first-page":"R98","DOI":"10.1186\/gb-2009-10-9-r98","article-title":"Simultaneous alignment of short reads against multiple genomes","volume":"10","author":"Schneeberger","year":"2009","journal-title":"Genome Biol"},{"key":"2023062410281887400_btab302-B37","doi-asserted-by":"crossref","first-page":"375","DOI":"10.1109\/TCBB.2013.2297101","article-title":"Indexing graphs for path queries with applications in genome research","volume":"11","author":"Sir\u00e9n","year":"2014","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform"},{"key":"2023062410281887400_btab302-B38","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1093\/bioinformatics\/btz575","article-title":"Haplotype-aware graph indexes","volume":"36","author":"Sir\u00e9n","year":"2020","journal-title":"Bioinformatics"},{"key":"2023062410281887400_btab302-B39","first-page":"259","article-title":"A deterministic linear program solver in current matrix multiplication time","author":"van den Brand","year":"2020"},{"key":"2023062410281887400_btab302-B40","doi-asserted-by":"crossref","first-page":"e127","DOI":"10.1093\/nar\/gks425","article-title":"A new strategy to reduce allelic bias in RNA-seq readmapping","volume":"40","author":"Vijaya","year":"2012","journal-title":"Nucleic Acids Res"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/Supplement_1\/i460\/50694165\/btab302.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/Supplement_1\/i460\/50694165\/btab302.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,24]],"date-time":"2023-06-24T20:20:26Z","timestamp":1687638026000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/Supplement_1\/i460\/6319683"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,1]]},"references-count":40,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2021,8,4]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab302","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2021.02.02.429378","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,7,1]]},"published":{"date-parts":[[2021,7,1]]}}}