{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T00:36:44Z","timestamp":1773275804264,"version":"3.50.1"},"reference-count":36,"publisher":"Oxford University Press (OUP)","issue":"14","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2009,7,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: The continuing exponential accumulation of full genome data, including full diploid human genomes, creates new challenges not only for understanding genomic structure, function and evolution, but also for the storage, navigation and privacy of genomic data. Here, we develop data structures and algorithms for the efficient storage of genomic and other sequence data that may also facilitate querying and protecting the data.<\/jats:p>\n               <jats:p>Results: The general idea is to encode only the differences between a genome sequence and a reference sequence, using absolute or relative coordinates for the location of the differences. These locations and the corresponding differential variants can be encoded into binary strings using various entropy coding methods, from fixed codes such as Golomb and Elias codes, to variables codes, such as Huffman codes. We demonstrate the approach and various tradeoffs using highly variables human mitochondrial genome sequences as a testbed. With only a partial level of optimization, 3615 genome sequences occupying 56 MB in GenBank are compressed down to only 167 KB, achieving a 345-fold compression rate, using the revised Cambridge Reference Sequence as the reference sequence. Using the consensus sequence as the reference sequence, the data can be stored using only 133 KB, corresponding to a 433-fold level of compression, roughly a 23% improvement. Extensions to nuclear genomes and high-throughput sequencing data are discussed.<\/jats:p>\n               <jats:p>Availability: Data are publicly available from GenBank, the HapMap web site, and the MITOMAP database. Supplementary materials with additional results, statistics, and software implementations are available from http:\/\/mammag.web.uci.edu\/bin\/view\/Mitowiki\/ProjectDNACompression.<\/jats:p>\n               <jats:p>Contact: \u00a0pfbaldi@ics.uci.edu<\/jats:p>","DOI":"10.1093\/bioinformatics\/btp319","type":"journal-article","created":{"date-parts":[[2009,5,16]],"date-time":"2009-05-16T00:14:09Z","timestamp":1242432849000},"page":"1731-1738","source":"Crossref","is-referenced-by-count":78,"title":["Data structures and compression algorithms for genomic sequence data"],"prefix":"10.1093","volume":"25","author":[{"given":"Marty C.","family":"Brandon","sequence":"first","affiliation":[{"name":"1 Department of Computer Science, 2 Institute for Genomics and Bioinformatics, 3 Center for Molecular and Mitochondrial Medicine and Genetics and 4 Department of Biological Chemistry, UCI, Irvine, CA 92697, USA"},{"name":"1 Department of Computer Science, 2 Institute for Genomics and Bioinformatics, 3 Center for Molecular and Mitochondrial Medicine and Genetics and 4 Department of Biological Chemistry, UCI, Irvine, CA 92697, USA"},{"name":"1 Department of Computer Science, 2 Institute for Genomics and Bioinformatics, 3 Center for Molecular and Mitochondrial Medicine and Genetics and 4 Department of Biological Chemistry, UCI, Irvine, CA 92697, USA"}]},{"given":"Douglas C.","family":"Wallace","sequence":"additional","affiliation":[{"name":"1 Department of Computer Science, 2 Institute for Genomics and Bioinformatics, 3 Center for Molecular and Mitochondrial Medicine and Genetics and 4 Department of Biological Chemistry, UCI, Irvine, CA 92697, USA"},{"name":"1 Department of Computer Science, 2 Institute for Genomics and Bioinformatics, 3 Center for Molecular and Mitochondrial Medicine and Genetics and 4 Department of Biological Chemistry, UCI, Irvine, CA 92697, USA"},{"name":"1 Department of Computer Science, 2 Institute for Genomics and Bioinformatics, 3 Center for Molecular and Mitochondrial Medicine and Genetics and 4 Department of Biological Chemistry, UCI, Irvine, CA 92697, USA"}]},{"given":"Pierre","family":"Baldi","sequence":"additional","affiliation":[{"name":"1 Department of Computer Science, 2 Institute for Genomics and Bioinformatics, 3 Center for Molecular and Mitochondrial Medicine and Genetics and 4 Department of Biological Chemistry, UCI, Irvine, CA 92697, USA"},{"name":"1 Department of Computer Science, 2 Institute for Genomics and Bioinformatics, 3 Center for Molecular and Mitochondrial Medicine and Genetics and 4 Department of Biological Chemistry, UCI, Irvine, CA 92697, USA"},{"name":"1 Department of Computer Science, 2 Institute for Genomics and Bioinformatics, 3 Center for Molecular and Mitochondrial Medicine and Genetics and 4 Department of Biological Chemistry, UCI, Irvine, CA 92697, USA"}]}],"member":"286","published-online":{"date-parts":[[2009,5,15]]},"reference":[{"key":"2023013112050498300_B1","doi-asserted-by":"crossref","first-page":"457","DOI":"10.1038\/290457a0","article-title":"Sequence and organization of the human mitochondrial genome","volume":"290","author":"Anderson","year":"1981","journal-title":"Nature"},{"key":"2023013112050498300_B2","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1038\/13779","article-title":"Reanalysis and revision of the cambridge reference sequence for human mitochondrial DNA","volume":"2","author":"Andrews","year":"1999","journal-title":"Nat. Genet."},{"key":"2023013112050498300_B3","doi-asserted-by":"crossref","first-page":"2098","DOI":"10.1021\/ci700200n","article-title":"Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval","volume":"47","author":"Baldi","year":"2007","journal-title":"J. Chem. Inf. Model."},{"key":"2023013112050498300_B4","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1007\/11496656_17","article-title":"DNA compression challenge revisited: a dynamic programming approach","volume":"3537","author":"Behzadi","year":"2005","journal-title":"Lect. Notes Comput. Sci."},{"issue":"Database Issue","key":"2023013112050498300_B5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1002\/humu.20801","article-title":"MITOMASTER: a bioinformatics tool for the analysis of mitochondrial DNA sequences","volume":"30","author":"Brandon","year":"2009","journal-title":"Hum. Mutat."},{"key":"2023013112050498300_B6","doi-asserted-by":"crossref","first-page":"D611","DOI":"10.1093\/nar\/gki079","article-title":"MITOMAP: a human mitochondrial genome database - 2004 update","volume":"33","author":"Brandon","year":"2005","journal-title":"Nucleic Acids Res."},{"key":"2023013112050498300_B7","doi-asserted-by":"crossref","first-page":"1696","DOI":"10.1093\/bioinformatics\/18.12.1696","article-title":"DNACompress: fast and effective DNA sequence compression","volume":"18","author":"Chen","year":"2002","journal-title":"Bioinformatics"},{"key":"2023013112050498300_B8","doi-asserted-by":"crossref","first-page":"274","DOI":"10.1093\/bioinformatics\/btn582","article-title":"Human genomes as email attachments","volume":"25","author":"Christley","year":"2009","journal-title":"Bioinformatics"},{"key":"2023013112050498300_B9","doi-asserted-by":"crossref","DOI":"10.1002\/0471200611","volume-title":"Elements of Information Theory","author":"Cover","year":"1991"},{"key":"2023013112050498300_B10","doi-asserted-by":"crossref","first-page":"194","DOI":"10.1109\/TIT.1975.1055349","article-title":"Universal codeword sets and representations of the integers","volume":"21","author":"Elias","year":"1975","journal-title":"IEEE Trans. Inf. Theory"},{"key":"2023013112050498300_B11","doi-asserted-by":"crossref","first-page":"331","DOI":"10.1146\/annurev.genet.40.110405.090448","article-title":"DNA transposons and the evolution of eukaryotic genomes","volume":"41","author":"Feschotte","year":"2007","journal-title":"Ann. Rev. Genet."},{"key":"2023013112050498300_B12","doi-asserted-by":"crossref","first-page":"1241","DOI":"10.1038\/4371241a","article-title":"Genomics: understanding human diversity","volume":"437","author":"Goldstein","year":"2005","journal-title":"Nature"},{"key":"2023013112050498300_B13","doi-asserted-by":"crossref","first-page":"399","DOI":"10.1109\/TIT.1966.1053907","article-title":"Run-length encodings","volume":"12","author":"Golomb","year":"1965","journal-title":"IEEE Trans. Inf. Theory"},{"key":"2023013112050498300_B14","first-page":"161","article-title":"Frequency of a 9-bp deletion in the mitochondrial DNA among Asian populations","volume":"64","author":"Harihara","year":"1992","journal-title":"Hum. Biol."},{"key":"2023013112050498300_B15","doi-asserted-by":"crossref","first-page":"1072","DOI":"10.1126\/science.1105436","article-title":"Whole genome patterns of common DNA variation in three human populations","volume":"307","author":"Hinds","year":"2005","journal-title":"Science"},{"key":"2023013112050498300_B16","doi-asserted-by":"crossref","DOI":"10.1109\/DCC.2008.9","article-title":"Effective compression of monotone and quasi-monotone sequences of integers","volume-title":"Proceedings of the 2008 Data Compression Conference (DCC 08)","author":"Hirschberg","year":"2008"},{"key":"2023013112050498300_B17","doi-asserted-by":"crossref","first-page":"1098","DOI":"10.1109\/JRPROC.1952.273898","article-title":"A method for the construction of minimum redundancy codes","volume":"40","author":"Huffman","year":"1952","journal-title":"Proc. IRE"},{"key":"2023013112050498300_B18","doi-asserted-by":"crossref","first-page":"1497","DOI":"10.1126\/science.1141319","article-title":"Genome-wide mapping of in vivo protein-DNA interactions","volume":"316","author":"Johnson","year":"2007","journal-title":"Science"},{"key":"2023013112050498300_B19","doi-asserted-by":"crossref","first-page":"395","DOI":"10.1126\/science.319.5862.395","article-title":"A plan to capture human diversity in 1000 genomes","volume":"319","author":"Kaiser","year":"2008","journal-title":"Science"},{"key":"2023013112050498300_B20","doi-asserted-by":"crossref","first-page":"e254","DOI":"10.1371\/journal.pbio.0050254","article-title":"The diploid genome sequence of an individual human","volume":"5","author":"Levy","year":"2007","journal-title":"PLoS Biol."},{"key":"2023013112050498300_B21","doi-asserted-by":"crossref","first-page":"D1025","DOI":"10.1093\/nar\/gkn966","article-title":"The YH database: the first Asian diploid genome database","volume":"37","author":"Li","year":"2009","journal-title":"Nucleic Acids Res."},{"key":"2023013112050498300_B22","volume-title":"The Theory of Information and Coding","author":"McEliece","year":"1977"},{"key":"2023013112050498300_B23","doi-asserted-by":"crossref","first-page":"171","DOI":"10.1073\/pnas.0136972100","article-title":"Natural selection shaped regional mtDNA variation in humans","volume":"100","author":"Mishmar","year":"2003","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023013112050498300_B24","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1016\/j.ipl.2006.04.014","article-title":"Binary codes for locally homogeneous sequences","volume":"99","author":"Moffat","year":"2006","journal-title":"Inf. Process. Lett."},{"key":"2023013112050498300_B25","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1023\/A:1013002601898","article-title":"Binary interpolative coding for effective index compression","volume":"3","author":"Moffat","year":"2000","journal-title":"Inf. Retr."},{"key":"2023013112050498300_B26","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1147\/rd.232.0149","article-title":"Arithmetic coding","volume":"23","author":"Rissanen","year":"1979","journal-title":"IBM J. Res. Dev."},{"key":"2023013112050498300_B27","doi-asserted-by":"crossref","first-page":"D823","DOI":"10.1093\/nar\/gkl927","article-title":"An enhanced MITOMAP with a global mtDNA mutational philogeny","volume":"35","author":"Ruiz-Pesini","year":"2007","journal-title":"Nucleic Acids Res."},{"key":"2023013112050498300_B28","doi-asserted-by":"crossref","first-page":"1544","DOI":"10.1126\/science.311.5767.1544","article-title":"The race for the $1000 genome","volume":"311","author":"Service","year":"2006","journal-title":"Science"},{"key":"2023013112050498300_B29","doi-asserted-by":"crossref","first-page":"789","DOI":"10.1038\/nature02168","article-title":"The International HapMap Project","volume":"426","author":"The International HapMap Consortium","year":"2003","journal-title":"Nature"},{"key":"2023013112050498300_B30","doi-asserted-by":"crossref","first-page":"851","DOI":"10.1038\/nature06258","article-title":"A second generation human haplotype map of over 3.1 million SNPs","volume":"449","author":"The International HapMap Consortium","year":"2007","journal-title":"Nature"},{"key":"2023013112050498300_B31","doi-asserted-by":"crossref","first-page":"955","DOI":"10.1098\/rstb.1998.0260","article-title":"Molecular instability in the COII-tRNA(lys) intergenic region of the human mitochondrial genome: multiple origins of the 9-bp deletion and heteroplasmy for expanded repeats","volume":"353","author":"Thomas","year":"1998","journal-title":"Phil. Trans. R. Soc. Lond. B Biol. Sci."},{"key":"2023013112050498300_B32","doi-asserted-by":"crossref","first-page":"727","DOI":"10.1038\/ng1562","article-title":"Fine-scale structural variation of the human genome","volume":"37","author":"Tuzun","year":"2005","journal-title":"Nat. Genet."},{"key":"2023013112050498300_B33","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1038\/nature07484","article-title":"The diploid genome sequence of an Asian individual","volume":"456","author":"Wang","year":"2008","journal-title":"Nature"},{"key":"2023013112050498300_B34","doi-asserted-by":"crossref","first-page":"872","DOI":"10.1038\/nature06884","article-title":"The complete genome of an individual by massively parallel DNA sequencing","volume":"452","author":"Wheeler","year":"2008","journal-title":"Nature"},{"key":"2023013112050498300_B35","doi-asserted-by":"crossref","first-page":"549","DOI":"10.1093\/bioinformatics\/13.5.549","article-title":"Compression of nucleotide databases for fast searching","volume":"13","author":"Williams","year":"1997","journal-title":"Bioinformatics"},{"key":"2023013112050498300_B36","doi-asserted-by":"crossref","first-page":"520","DOI":"10.1145\/214762.214771","article-title":"Arithmetic coding for data compression","volume":"30","author":"Witten","year":"1987","journal-title":"Commun. ACM"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/25\/14\/1731\/48993486\/bioinformatics_25_14_1731.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/25\/14\/1731\/48993486\/bioinformatics_25_14_1731.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,31]],"date-time":"2023-01-31T21:19:57Z","timestamp":1675199997000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/25\/14\/1731\/225235"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,5,15]]},"references-count":36,"journal-issue":{"issue":"14","published-print":{"date-parts":[[2009,7,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btp319","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2009,7,15]]},"published":{"date-parts":[[2009,5,15]]}}}