{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,1]],"date-time":"2025-11-01T13:37:57Z","timestamp":1762004277456},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2010,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>High-throughput sequencing (HTS) technologies play important roles in the life sciences by allowing the rapid parallel sequencing of very large numbers of relatively short nucleotide sequences, in applications ranging from genome sequencing and resequencing to digital microarrays and ChIP-Seq experiments. As experiments scale up, HTS technologies create new bioinformatics challenges for the storage and sharing of HTS data.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>We develop data structures and compression algorithms for HTS data. A processing stage maps short sequences to a reference genome or a large table of sequences. Then the integers representing the short sequence absolute or relative addresses, their length, and the substitutions they may contain are compressed and stored using various entropy coding algorithms, including both old and new fixed codes (e.g. Golomb, Elias Gamma, MOV) and variable codes (e.g. Huffman). The general methodology is illustrated and applied to several HTS data sets. Results show that the information contained in HTS files can be compressed by a factor of 10 or more, depending on the statistical properties of the data sets and various other choices and constraints. 
Our algorithms fare well against general purpose compression programs such as gzip, bzip2 and 7zip; timing results show that our algorithms are consistently faster than the best general purpose compression programs.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusions<\/jats:title>\n            <jats:p>It is not likely that exactly one encoding strategy will be optimal for all types of HTS data. Different experimental conditions are going to generate various data distributions whereby one encoding strategy can be more effective than another. We have implemented some of our encoding algorithms into the software package GenCompress which is available upon request from the authors. With the advent of HTS technology and increasingly new experimental protocols for using the technology, sequence databases are expected to continue rising in size. The methodology we have proposed is general, and these advanced compression techniques should allow researchers to manage and share their HTS data in a more timely fashion.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-11-514","type":"journal-article","created":{"date-parts":[[2010,10,15]],"date-time":"2010-10-15T18:14:21Z","timestamp":1287166461000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":37,"title":["Data structures and compression algorithms for high-throughput sequencing 
technologies"],"prefix":"10.1186","volume":"11","author":[{"given":"Kenny","family":"Daily","sequence":"first","affiliation":[]},{"given":"Paul","family":"Rigor","sequence":"additional","affiliation":[]},{"given":"Scott","family":"Christley","sequence":"additional","affiliation":[]},{"given":"Xiaohui","family":"Xie","sequence":"additional","affiliation":[]},{"given":"Pierre","family":"Baldi","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2010,10,14]]},"reference":[{"key":"4097_CR1","doi-asserted-by":"publisher","first-page":"860","DOI":"10.1038\/35057062","volume":"409","author":"International Human Genome Sequencing Consortium","year":"2001","unstructured":"International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860\u2013921.","journal-title":"Nature"},{"issue":"10","key":"4097_CR2","doi-asserted-by":"publisher","first-page":"e254","DOI":"10.1371\/journal.pbio.0050254","volume":"5","author":"S Levy","year":"2007","unstructured":"Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AWC, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC: The diploid genome sequence of an individual human. 
PLoS Biol 2007, 5(10):e254.","journal-title":"PLoS Biol"},{"issue":"7189","key":"4097_CR3","doi-asserted-by":"publisher","first-page":"872","DOI":"10.1038\/nature06884","volume":"452","author":"DA Wheeler","year":"2008","unstructured":"Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature 2008, 452(7189):872\u20136.","journal-title":"Nature"},{"issue":"7218","key":"4097_CR4","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1038\/nature07517","volume":"456","author":"DR Bentley","year":"2008","unstructured":"Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Cheetham RK, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IMJ, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DMD, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Catenazzi MCE, Chang S, Cooley RN, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fajardo KVF, Furey WS, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, Hoschler K, Hurwitz S, Ivanov DV, 
Johnson MQ, James T, Jones TAH, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, Mccauley PG, Mcnitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ng BL, Novo SM, O'neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Pike AC, Pinkard DC, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Rodriguez AC, Roe PM, Rogers J, Bacigalupo MCR, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Sohna JES, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, Mccooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R, Smith AJ: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456(7218):53\u201359.","journal-title":"Nature"},{"issue":"7218","key":"4097_CR5","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1038\/nature07484","volume":"456","author":"J Wang","year":"2008","unstructured":"Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, Ma L, Li G, Yang Z, Zhang G, Yang B, Yu C, Liang F, Li W, Li S, Li D, Ni P, Ruan J, Li Q, Zhu H, Liu D, Lu Z, Li N, Guo G, Zhang J, Ye J, Fang L, Hao Q, Chen Q, Liang Y, Su Y, San A, Ping C, Yang S, Chen F, Li L, Zhou K, Zheng H, Ren Y, Yang L, Gao Y, Yang G, Li Z, Feng X, Kristiansen K, Wong GKS, Nielsen R, Durbin R, Bolund L, Zhang X, Li S, Yang H, Wang J: The diploid genome sequence of an Asian individual. 
Nature 2008, 456(7218):60\u20135.","journal-title":"Nature"},{"key":"4097_CR6","doi-asserted-by":"publisher","first-page":"395","DOI":"10.1126\/science.319.5862.395","volume":"319","author":"J Kaiser","year":"2008","unstructured":"Kaiser J: A Plan to Capture Human diversity in 1000 Genomes. Science 2008, 319: 395.","journal-title":"Science"},{"key":"4097_CR7","doi-asserted-by":"publisher","first-page":"1544","DOI":"10.1126\/science.311.5767.1544","volume":"311","author":"RF Service","year":"2006","unstructured":"Service RF: The Race for the $1000 Genome. Science 2006, 311: 1544\u20131546.","journal-title":"Science"},{"key":"4097_CR8","doi-asserted-by":"publisher","first-page":"613","DOI":"10.1038\/nmeth0807-613","volume":"4","author":"ER Mardis","year":"2007","unstructured":"Mardis ER: ChIP-seq: welcome to the new frontier. Nature Methods 2007, 4: 613\u2013614.","journal-title":"Nature Methods"},{"key":"4097_CR9","doi-asserted-by":"publisher","first-page":"1518","DOI":"10.1242\/jeb.001370","volume":"209","author":"N Hall","year":"2007","unstructured":"Hall N: Advanced Sequencing Technologies and their Wider Impact in Microbiology. The Journal of Experimental Biology 2007, 209: 1518\u20131525.","journal-title":"The Journal of Experimental Biology"},{"issue":"11","key":"4097_CR10","doi-asserted-by":"publisher","first-page":"1851","DOI":"10.1101\/gr.078212.108","volume":"18","author":"H Li","year":"2008","unstructured":"Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18(11):1851\u20138.","journal-title":"Genome Res"},{"issue":"21","key":"4097_CR11","doi-asserted-by":"publisher","first-page":"2431","DOI":"10.1093\/bioinformatics\/btn416","volume":"24","author":"H Lin","year":"2008","unstructured":"Lin H, Zhang Z, Zhang MQ, Ma B, Li M: ZOOM! Zillions of oligos mapped. 
Bioinformatics 2008, 24(21):2431\u20137.","journal-title":"Bioinformatics"},{"issue":"3","key":"4097_CR12","doi-asserted-by":"publisher","first-page":"R25","DOI":"10.1186\/gb-2009-10-3-r25","volume":"10","author":"B Langmead","year":"2009","unstructured":"Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10(3):R25. [http:\/\/genomebiology.com\/2009\/10\/3\/R25]","journal-title":"Genome Biology"},{"issue":"6","key":"4097_CR13","doi-asserted-by":"publisher","first-page":"875","DOI":"10.1016\/0306-4573(94)90014-0","volume":"30","author":"S Grumbach","year":"1994","unstructured":"Grumbach S, Tahi F: A new challenge for compression algorithms: Genetic sequences. Information Processing & Management 1994, 30(6):875\u2013886.","journal-title":"Information Processing & Management"},{"key":"4097_CR14","first-page":"43","volume":"11","author":"T Matsumoto","year":"2000","unstructured":"Matsumoto T, Sadakane K, Imai H: Biological sequence compression algorithms. Genome informatics 2000, 11: 43\u201352.","journal-title":"Genome informatics"},{"issue":"11","key":"4097_CR15","doi-asserted-by":"publisher","first-page":"1733","DOI":"10.1109\/5.892709","volume":"88","author":"A Apostolico","year":"2000","unstructured":"Apostolico A, Lonardi S: Off-Line Compression by Greedy Textual Substitution. Proceedings of the IEEE 2000, 88(11):1733\u20131744.","journal-title":"Proceedings of the IEEE"},{"key":"4097_CR16","doi-asserted-by":"publisher","first-page":"1696","DOI":"10.1093\/bioinformatics\/18.12.1696","volume":"18","author":"X Chen","year":"2002","unstructured":"Chen X, Li M, Ma B, Tromp J: DNACompress: fast and effective DNA sequence compression. 
Bioinformatics 2002, 18: 1696\u20131698.","journal-title":"Bioinformatics"},{"issue":"14","key":"4097_CR17","doi-asserted-by":"publisher","first-page":"1397","DOI":"10.1002\/spe.619","volume":"34","author":"G Manzini","year":"2004","unstructured":"Manzini G, Rastero M: A simple and fast DNA compressor. Softw Pract Exper 2004, 34(14):1397\u20131411.","journal-title":"Softw Pract Exper"},{"key":"4097_CR18","doi-asserted-by":"publisher","first-page":"242","DOI":"10.1186\/1471-2105-9-242","volume":"9","author":"WTJ White","year":"2008","unstructured":"White WTJ, Hendy MD: Compressing DNA sequence databases with coil. BMC Bioinformatics 2008, 9: 242.","journal-title":"BMC Bioinformatics"},{"key":"4097_CR19","doi-asserted-by":"publisher","first-page":"274","DOI":"10.1093\/bioinformatics\/btn582","volume":"25","author":"S Christley","year":"2008","unstructured":"Christley S, Lu Y, Li C, Xie X: Human Genomes as Email Attachments. Bioinformatics 2008, 25: 274\u2013275.","journal-title":"Bioinformatics"},{"key":"4097_CR20","unstructured":"The gzip home page[http:\/\/www.gzip.org]"},{"key":"4097_CR21","volume-title":"The Theory of Information and Coding","author":"RJ McEliece","year":"1977","unstructured":"McEliece RJ: The Theory of Information and Coding. Reading, MA: Addison-Wesley Publishing Company; 1977."},{"key":"4097_CR22","doi-asserted-by":"publisher","DOI":"10.1002\/0471200611","volume-title":"Elements of Information Theory","author":"TM Cover","year":"1991","unstructured":"Cover TM, Thomas JA: Elements of Information Theory. New York: John Wiley; 1991."},{"issue":"3","key":"4097_CR23","doi-asserted-by":"publisher","first-page":"399","DOI":"10.1109\/TIT.1966.1053907","volume":"12","author":"SW Golomb","year":"1965","unstructured":"Golomb SW: Run-Length Encodings. 
IEEE Transactions on Information Theory 1965, 12(3):399\u2013401.","journal-title":"IEEE Transactions on Information Theory"},{"issue":"2","key":"4097_CR24","doi-asserted-by":"publisher","first-page":"194","DOI":"10.1109\/TIT.1975.1055349","volume":"21","author":"P Elias","year":"1975","unstructured":"Elias P: Universal Codeword Sets and Representations of the Integers. IEEE Transactions on Information Theory 1975, 21(2):194\u2013203.","journal-title":"IEEE Transactions on Information Theory"},{"issue":"6","key":"4097_CR25","doi-asserted-by":"publisher","first-page":"2098","DOI":"10.1021\/ci700200n","volume":"47","author":"P Baldi","year":"2007","unstructured":"Baldi P, Benz RW, Hirschberg D, Swamidass S: Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval. Journal of Chemical Information and Modeling 2007, 47(6):2098\u20132109.","journal-title":"Journal of Chemical Information and Modeling"},{"key":"4097_CR26","doi-asserted-by":"publisher","first-page":"1098","DOI":"10.1109\/JRPROC.1952.273898","volume":"40","author":"D Huffman","year":"1952","unstructured":"Huffman D: A method for the construction of minimum redundancy codes. Proc IRE 1952, 40: 1098\u20131101.","journal-title":"Proc IRE"},{"key":"4097_CR27","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1023\/A:1013002601898","volume":"3","author":"A Moffat","year":"2000","unstructured":"Moffat A, Stuiver L: Binary Interpolative Coding for Effective Index Compression. Inf Retr 2000, 3: 25\u201347.","journal-title":"Inf Retr"},{"key":"4097_CR28","doi-asserted-by":"publisher","first-page":"175","DOI":"10.1016\/j.ipl.2006.04.014","volume":"99","author":"A Moffat","year":"2006","unstructured":"Moffat A, Anh V: Binary codes for locally homogeneous sequences. 
Information Processing Letters 2006, 99: 175\u2013180.","journal-title":"Information Processing Letters"},{"key":"4097_CR29","volume-title":"Proceedings of the 2008 Data Compression Conference (DCC 08)","author":"DS Hirschberg","year":"2008","unstructured":"Hirschberg DS, Baldi P: Effective Compression of Monotone and Quasi-Monotone Sequences of Integers. In Proceedings of the 2008 Data Compression Conference (DCC 08). Los Alamitos, CA: IEEE Computer Society Press; 2008:in press."},{"issue":"2","key":"4097_CR30","doi-asserted-by":"publisher","first-page":"149","DOI":"10.1147\/rd.232.0149","volume":"23","author":"JJ Rissanen","year":"1979","unstructured":"Rissanen JJ, Langdon GG Jr: Arithmetic coding. IBM Journal of Research and Development 1979, 23(2):149\u2013162.","journal-title":"IBM Journal of Research and Development"},{"issue":"6","key":"4097_CR31","doi-asserted-by":"publisher","first-page":"520","DOI":"10.1145\/214762.214771","volume":"30","author":"IH Witten","year":"1987","unstructured":"Witten IH, Neal RM, Cleary JG: Arithmetic Coding for Data Compression. Communications of the ACM 1987, 30(6):520\u2013540.","journal-title":"Communications of the ACM"},{"key":"4097_CR32","volume-title":"Encyclopedia of Algorithms","author":"MY Kao","year":"2007","unstructured":"Kao MY: Encyclopedia of Algorithms. Secaucus, NJ, USA: Springer-Verlag New York, Inc; 2007."},{"key":"4097_CR33","volume-title":"Managing Gigabytes: Compressing and Indexing Documents and Images","author":"I Witten","year":"1999","unstructured":"Witten I, Moffat A, Bell TC: Managing Gigabytes: Compressing and Indexing Documents and Images. Second edition. Morgan Kaufmann; 1999.","edition":"Second"},{"key":"4097_CR34","doi-asserted-by":"publisher","first-page":"1497","DOI":"10.1126\/science.1141319","volume":"316","author":"DS Johnson","year":"2007","unstructured":"Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. 
Science 2007, 316: 1497\u20131502.","journal-title":"Science"},{"key":"4097_CR35","doi-asserted-by":"publisher","first-page":"D1025","DOI":"10.1093\/nar\/gkn966","volume":"37","author":"G Li","year":"2009","unstructured":"Li G, Ma L, Song C, Yang Z, Wang X, Huang H, Li Y, Li R, Zhang X, Yang H, Wang J, Wang J: The YH database: the first Asian diploid genome database. Nucleic Acids Res 2009, 37: D1025\u20138.","journal-title":"Nucleic Acids Res"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-11-514.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T05:29:37Z","timestamp":1630474177000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-11-514"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,10,14]]},"references-count":35,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2010,12]]}},"alternative-id":["4097"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-11-514","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2010,10,14]]},"assertion":[{"value":"28 December 2009","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 October 2010","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 October 2010","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"514"}}