{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:05Z","timestamp":1772138045288,"version":"3.50.1"},"reference-count":46,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2021,7,12]],"date-time":"2021-07-12T00:00:00Z","timestamp":1626048000000},"content-version":"vor","delay-in-days":11,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Gordon and Betty Moore Foundation\u2019s Data-Driven Discovery Initiative","award":["GBMF4554"],"award-info":[{"award-number":["GBMF4554"]}]},{"DOI":"10.13039\/100000002","name":"US National Institutes of Health","doi-asserted-by":"crossref","award":["R01GM122935"],"award-info":[{"award-number":["R01GM122935"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000001","name":"US National Science Foundation","doi-asserted-by":"crossref","award":["DBI-1937540"],"award-info":[{"award-number":["DBI-1937540"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,8,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>The size of a genome graph\u2014the space required to store the nodes, node labels and edges\u2014affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. This raises the need for approaches to construct space-efficient genome graphs.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We point out similarities in the string encoding mechanisms of genome graphs and the external pointer macro (EPM) compression model. We present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. The algorithms result in an upper bound on the size of the genome graph constructed in terms of an optimal EPM compression. To further reduce the size of the genome graph, we propose the source assignment problem that optimizes over the equivalent choices during compression and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel\u2013Ziv algorithm. Using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored compacted de Bruijn graphs constructed by Bifrost under the default settings. The RLZ-Graph scales well in terms of running time and graph sizes with an increasing number of human genome sequences compared to Bifrost and variation graphs produced by VGtoolkit.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability<\/jats:title>\n                    <jats:p>The RLZ-Graph software is available at: https:\/\/github.com\/Kingsford-Group\/rlzgraph.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab281","type":"journal-article","created":{"date-parts":[[2021,4,28]],"date-time":"2021-04-28T23:25:17Z","timestamp":1619652317000},"page":"i205-i213","source":"Crossref","is-referenced-by-count":2,"title":["Constructing small genome graphs via string compression"],"prefix":"10.1093","volume":"37","author":[{"given":"Yutong","family":"Qiu","sequence":"first","affiliation":[{"name":"Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, PA 15213, USA"}]},{"given":"Carl","family":"Kingsford","sequence":"additional","affiliation":[{"name":"Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, PA 15213, USA"}]}],"member":"286","published-online":{"date-parts":[[2021,7,12]]},"reference":[{"key":"2023062410180632700_btab281-B1","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","year":"2015","journal-title":"Nature"},{"key":"2023062410180632700_btab281-B2","first-page":", p.1","author":"Alanko","year":"2017"},{"key":"2023062410180632700_btab281-B3","author":"Almodaresi","year":"2017"},{"key":"2023062410180632700_btab281-B4","doi-asserted-by":"crossref","first-page":"485","DOI":"10.1089\/cmb.2019.0322","article-title":"An efficient, scalable, and exact representation of high-dimensional color information enabled using de Bruijn graph search","volume":"27","author":"Almodaresi","year":"2020","journal-title":"J. Comput. Biol"},{"key":"2023062410180632700_btab281-B5","doi-asserted-by":"crossref","first-page":"159","DOI":"10.1186\/s13059-019-1774-4","article-title":"Is it time to change the reference genome?","volume":"20","author":"Ballouz","year":"2019","journal-title":"Genome Biol"},{"key":"2023062410180632700_btab281-B6","doi-asserted-by":"crossref","first-page":"630","DOI":"10.1145\/179812.179818","article-title":"Linear approximation of shortest superstrings","volume":"41","author":"Blum","year":"1994","journal-title":"J. ACM"},{"key":"2023062410180632700_btab281-B7","author":"Chen","year":"2020"},{"key":"2023062410180632700_btab281-B8","doi-asserted-by":"crossref","first-page":"D67","DOI":"10.1093\/nar\/gkv1276","article-title":"GenBank","volume":"44","author":"Clark","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023062410180632700_btab281-B9","first-page":"118","article-title":"Computational pan-genomics: status, promises and challenges","volume":"19","year":"2018","journal-title":"Brief. Bioinformatics"},{"key":"2023062410180632700_btab281-B10","doi-asserted-by":"crossref","first-page":"2979","DOI":"10.1093\/bioinformatics\/btr505","article-title":"Robust relative compression of genomes with random access","volume":"27","author":"Deorowicz","year":"2011","journal-title":"Bioinformatics"},{"key":"2023062410180632700_btab281-B11","doi-asserted-by":"crossref","first-page":"11565","DOI":"10.1038\/srep11565","article-title":"GDC2: compression of large collections of genomes","volume":"5","author":"Deorowicz","year":"2015","journal-title":"Sci. Rep"},{"key":"2023062410180632700_btab281-B12","doi-asserted-by":"crossref","first-page":"682","DOI":"10.1038\/ng.3257","article-title":"Improved genome inference in the MHC using a population reference graph","volume":"47","author":"Dilthey","year":"2015","journal-title":"Nat. Genetics"},{"key":"2023062410180632700_btab281-B13","doi-asserted-by":"crossref","first-page":"14","DOI":"10.1016\/j.tcs.2013.07.024","article-title":"Fast relative Lempel\u2013Ziv self-index for similar sequences","volume":"532","author":"Do","year":"2014","journal-title":"Theor. Comput. Sci"},{"key":"2023062410180632700_btab281-B14","doi-asserted-by":"crossref","first-page":"eabf7117","DOI":"10.1126\/science.abf7117","article-title":"Haplotype-resolved diverse human genomes and integrated analysis of structural variation","volume":"372","author":"Ebert","year":"2021","journal-title":"Science"},{"key":"2023062410180632700_btab281-B15","first-page":"13","article-title":"International Symposium on String Processing and Information Retrieval","author":"Ferrada","year":"2014"},{"key":"2023062410180632700_btab281-B16","first-page":"240","volume-title":".","author":"Gagie","year":"2012"},{"key":"2023062410180632700_btab281-B17","first-page":"160","author":"Gagie","year":"2016"},{"key":"2023062410180632700_btab281-B18","doi-asserted-by":"crossref","first-page":"875","DOI":"10.1038\/nbt.4227","article-title":"Variation graph toolkit improves read mapping by representing genetic variation in the reference","volume":"36","author":"Garrison","year":"2018","journal-title":"Nat. Biotechnol"},{"key":"2023062410180632700_btab281-B19","doi-asserted-by":"crossref","first-page":"326","DOI":"10.1007\/978-3-319-07959-2_28","article-title":"From theory to practice: plug and play with succinct data structures","volume-title":"13th International Symposium on Experimental Algorithms","author":"Gog","year":"2014"},{"key":"2023062410180632700_btab281-B20","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1186\/s13059-020-02135-8","article-title":"Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs","volume":"21","author":"Holley","year":"2020","journal-title":"Genome Biol"},{"key":"2023062410180632700_btab281-B21","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1038\/ng.1028","article-title":"De novo assembly and genotyping of variants using colored de Bruijn graphs","volume":"44","author":"Iqbal","year":"2012","journal-title":"Nat. Genetics"},{"key":"2023062410180632700_btab281-B22","doi-asserted-by":"publisher","first-page":"640","DOI":"10.1089\/cmb.2019.0066","article-title":"On the Complexity of Sequence-to-Graph Alignment","volume":"27","author":"Jain","year":"2020","journal-title":"Journal of Computational Biology"},{"key":"2023062410180632700_btab281-B23","first-page":"302","author":"K\u00e4rkk\u00e4inen","year":"2014"},{"key":"2023062410180632700_btab281-B24","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1007\/978-3-642-16321-0_20","volume-title":"String Processing and Information Retrieval","author":"Kuruppu","year":"2010"},{"key":"2023062410180632700_btab281-B25","first-page":"91","author":"Kuruppu","year":"2011"},{"key":"2023062410180632700_btab281-B26","doi-asserted-by":"crossref","first-page":"2987","DOI":"10.1093\/bioinformatics\/btr509","article-title":"A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data","volume":"27","author":"Li","year":"2011","journal-title":"Bioinformatics"},{"key":"2023062410180632700_btab281-B27","doi-asserted-by":"crossref","first-page":"2103","DOI":"10.1093\/bioinformatics\/btw152","article-title":"Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences","volume":"32","author":"Li","year":"2016","journal-title":"Bioinformatics"},{"key":"2023062410180632700_btab281-B28","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1186\/s13059-020-02168-z","article-title":"The design and construction of reference pangenome graphs with minigraph","volume":"21","author":"Li","year":"2020","journal-title":"Genome Biol"},{"key":"2023062410180632700_btab281-B29","first-page":"7:1","author":"M\u00e4kinen","year":"2020"},{"key":"2023062410180632700_btab281-B30","doi-asserted-by":"crossref","first-page":"4024","DOI":"10.1093\/bioinformatics\/btw609","article-title":"TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes","volume":"33","author":"Minkin","year":"2017","journal-title":"Bioinformatics"},{"key":"2023062410180632700_btab281-B31","doi-asserted-by":"crossref","first-page":"i51","DOI":"10.1093\/bioinformatics\/btz350","article-title":"Building large updatable colored de Bruijn graphs via merging","volume":"35","author":"Muggli","year":"2019","journal-title":"Bioinformatics"},{"key":"2023062410180632700_btab281-B32","first-page":"201","author":"Navarro","year":"2019"},{"key":"2023062410180632700_btab281-B33","first-page":"101378","author":"Novak","year":"2017"},{"key":"2023062410180632700_btab281-B34","doi-asserted-by":"crossref","first-page":"469","DOI":"10.1089\/cmb.2010.0252","article-title":"Cactus graphs for genome comparisons","volume":"18","author":"Paten","year":"2011","journal-title":"J. Comput. Biol"},{"key":"2023062410180632700_btab281-B35","doi-asserted-by":"crossref","first-page":"665","DOI":"10.1101\/gr.214155.116","article-title":"Genome graphs and the evolution of genome inference","volume":"27","author":"Paten","year":"2017","journal-title":"Genome Res"},{"key":"2023062410180632700_btab281-B36","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1016\/0304-3975(81)90075-X","article-title":"The shortest common supersequence problem over binary alphabet is NP-complete","volume":"16","author":"R\u00e4ih\u00e4","year":"1981","journal-title":"Theor. Comput. Sci"},{"key":"2023062410180632700_btab281-B37","doi-asserted-by":"crossref","first-page":"354","DOI":"10.1038\/s41588-018-0316-4","article-title":"Fast and accurate genomic analyses using genome graphs","volume":"51","author":"Rakocevic","year":"2019","journal-title":"Nat. Genetics"},{"key":"2023062410180632700_btab281-B38","first-page":"233","author":"Raman","year":"2002"},{"key":"2023062410180632700_btab281-B39","doi-asserted-by":"crossref","first-page":"243","DOI":"10.1038\/s41576-020-0210-7","article-title":"Pan-genomics in the human genome era","volume":"21","author":"Sherman","year":"2020","journal-title":"Nat. Rev. Genetics"},{"key":"2023062410180632700_btab281-B40","first-page":"13","author":"Sir\u00e9n","year":"2017"},{"key":"2023062410180632700_btab281-B41","doi-asserted-by":"crossref","first-page":"375","DOI":"10.1109\/TCBB.2013.2297101","article-title":"Indexing graphs for path queries with applications in genome research","volume":"11","author":"Sir\u00e9n","year":"2014","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinformatics"},{"key":"2023062410180632700_btab281-B42","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1093\/bioinformatics\/btz575","article-title":"Haplotype-aware graph indexes","volume":"36","author":"Sir\u00e9n","year":"2020","journal-title":"Bioinformatics"},{"key":"2023062410180632700_btab281-B43","author":"Storer","year":"1977"},{"key":"2023062410180632700_btab281-B44","doi-asserted-by":"crossref","first-page":"928","DOI":"10.1145\/322344.322346","article-title":"Data compression via textual substitution","volume":"29","author":"Storer","year":"1982","journal-title":"J. ACM"},{"key":"2023062410180632700_btab281-B45","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/0890-5401(89)90044-8","article-title":"Approximation algorithms for the shortest common superstring problem","volume":"83","author":"Turner","year":"1989","journal-title":"Inform. Comput"},{"key":"2023062410180632700_btab281-B46","doi-asserted-by":"crossref","first-page":"337","DOI":"10.1109\/TIT.1977.1055714","article-title":"A universal algorithm for sequential data compression","volume":"23","author":"Ziv","year":"1977","journal-title":"IEEE Trans. Inform. Theory"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/Supplement_1\/i205\/50694313\/btab281.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/Supplement_1\/i205\/50694313\/btab281.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,24]],"date-time":"2023-06-24T20:18:49Z","timestamp":1687637929000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/Supplement_1\/i205\/6319693"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,1]]},"references-count":46,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2021,8,4]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab281","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2021.02.08.430279","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,7,1]]},"published":{"date-parts":[[2021,7,1]]}}}