{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T09:57:45Z","timestamp":1775037465697,"version":"3.50.1"},"reference-count":166,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2016,10,14]],"date-time":"2016-10-14T00:00:00Z","timestamp":1476403200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>The ever increasing growth of the production of high-throughput sequencing data poses a serious challenge to the storage, processing and transmission of these data. As frequently stated, it is a data deluge. Compression is essential to address this challenge\u2014it reduces storage space and processing costs, along with speeding up data transmission. In this paper, we provide a comprehensive survey of existing compression approaches, that are specialized for biological data, including protein and DNA sequences. Also, we devote an important part of the paper to the approaches proposed for the compression of different file formats, such as FASTA, as well as FASTQ and SAM\/BAM, which contain quality scores and metadata, in addition to the biological sequences. Then, we present a comparison of the performance of several methods, in terms of compression ratio, memory usage and compression\/decompression time. 
Finally, we present some suggestions for future research on biological data compression.<\/jats:p>","DOI":"10.3390\/info7040056","type":"journal-article","created":{"date-parts":[[2016,10,14]],"date-time":"2016-10-14T09:40:29Z","timestamp":1476438029000},"page":"56","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":84,"title":["A Survey on Data Compression Methods for Biological Sequences"],"prefix":"10.3390","volume":"7","author":[{"given":"Morteza","family":"Hosseini","sequence":"first","affiliation":[{"name":"Institute of Electronics and Informatics Engineering of Aveiro\/Department of Electronics, Telecommunications and Informatics (IEETA\/DETI), University of Aveiro, 3810-193 Aveiro, Portugal"}]},{"given":"Diogo","family":"Pratas","sequence":"additional","affiliation":[{"name":"Institute of Electronics and Informatics Engineering of Aveiro\/Department of Electronics, Telecommunications and Informatics (IEETA\/DETI), University of Aveiro, 3810-193 Aveiro, Portugal"}]},{"given":"Armando","family":"Pinho","sequence":"additional","affiliation":[{"name":"Institute of Electronics and Informatics Engineering of Aveiro\/Department of Electronics, Telecommunications and Informatics (IEETA\/DETI), University of Aveiro, 3810-193 Aveiro, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2016,10,14]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Muir, P., Li, S., Lou, S., Wang, D., Spakowicz, D.J., Salichos, L., Zhang, J., Weinstock, G.M., Isaacs, F., and Rozowsky, J. (2016). The real cost of sequencing: Scaling computation to keep pace with data generation. Genom. 
Biol.","DOI":"10.1186\/s13059-016-0917-0"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"728","DOI":"10.1126\/science.1197891","article-title":"On the future of genomic data","volume":"331","author":"Kahn","year":"2011","journal-title":"Science"},{"key":"ref_3","unstructured":"Alberti, C., Mattavelli, M., Hernandez, A., Chiariglione, L., Xenarios, I., Guex, N., Stockinger, H., Schuepbach, T., Kahlem, P., and Iseli, C. (2015). Investigation on Genomic Information Compression and Storage, ISO. ISO\/IEC JTC 1\/SC 29\/WG 11 N15346."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"390","DOI":"10.1093\/bib\/bbt088","article-title":"Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies","volume":"15","author":"Giancarlo","year":"2014","journal-title":"Brief. Bioinform."},{"key":"ref_5","unstructured":"De Bruijn, N. A Combinatorial Problem. Available online: https:\/\/pure.tue.nl\/ws\/files\/4442708\/597473.pdf."},{"key":"ref_6","first-page":"987","article-title":"How to apply de Bruijn graphs to genome assembly","volume":"29","author":"Compeau","year":"2011","journal-title":"Nat. Biotechnol."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"479","DOI":"10.1093\/bioinformatics\/btq697","article-title":"Succinct data structures for assembling large genomes","volume":"27","author":"Conway","year":"2011","journal-title":"Bioinformatics"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Cao, M., Dix, T., and Allison, L. (2010). A genome alignment algorithm based on compression. BMC Bioinform., 11.","DOI":"10.1186\/1471-2105-11-599"},{"key":"ref_9","unstructured":"Cao, M., Dix, T., Allison, L., and Mears, C. (2007, January 27\u201329). A simple statistical algorithm for biological sequence compression. 
Proceedings of the DCC \u201907: Data Compression Conference, Snowbird, UT, USA."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"46","DOI":"10.9790\/0661-1054651","article-title":"A new approach of protein sequence compression using repeat reduction and ASCII replacement","volume":"10","author":"Mallick","year":"2013","journal-title":"IOSR J. Comput. Eng. (IOSR-JCE)"},{"key":"ref_11","unstructured":"Ward, M. (2014). Virtual Organisms: The Startling World of Artificial Life, Macmillan."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"269","DOI":"10.1016\/0097-8485(94)85023-2","article-title":"Non-globular domains in protein sequences: Automated segmentation using complexity measures","volume":"18","author":"Wootton","year":"1994","journal-title":"Comput. Chem."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Benedetto, D., Caglioti, E., and Chica, C. (2007). Compressing proteomes: The relevance of medium range correlations. EURASIP J. Bioinform. Syst. Biol., 2007.","DOI":"10.1155\/2007\/60723"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"2949","DOI":"10.1007\/s00018-016-2138-9","article-title":"Natural protein sequences are more intrinsically disordered than random sequences","volume":"73","author":"Yu","year":"2016","journal-title":"Cell. Mol. Life Sci."},{"key":"ref_15","unstructured":"The Human Proteome Project. Available online: http:\/\/www.thehpp.org."},{"key":"ref_16","unstructured":"Three sequenced Neanderthal genomes. Available online: http:\/\/cdna.eva.mpg.de\/neandertal."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Nevill-Manning, C., and Witten, I. (1999, January 29\u201331). Protein is incompressible. Proceedings of the DCC \u201999: Data Compression Conference, Snowbird, UT, USA.","DOI":"10.1109\/DCC.1999.755675"},{"key":"ref_18","first-page":"43","article-title":"Biological sequence compression algorithms","volume":"11","author":"Matsumoto","year":"2000","journal-title":"Genom. 
Inform."},{"key":"ref_19","unstructured":"Hategan, A., and Tabus, I. (2004, January 9\u201311). Protein is compressible. Proceedings of the 6th Nordic Signal Processing Symposium, Espoo, Finland."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"653","DOI":"10.1109\/18.382012","article-title":"The context tree weighting method: Basic properties","volume":"41","author":"Willems","year":"1995","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Hategan, A., and Tabus, I. (2007, January 10\u201312). Jointly encoding protein sequences and their secondary structure. Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2007), Tuusula, Finland.","DOI":"10.1109\/GENSIPS.2007.4365849"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1093\/bioinformatics\/btt214","article-title":"Compressive genomics for protein databases","volume":"29","author":"Daniels","year":"2013","journal-title":"Bioinformatics"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"2577","DOI":"10.1002\/bip.360221211","article-title":"Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features","volume":"22","author":"Kabsch","year":"1983","journal-title":"Biopolymers"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"380","DOI":"10.1016\/j.ymeth.2014.01.012","article-title":"Proteome compression via protein domain compositions","volume":"67","author":"Hayashida","year":"2014","journal-title":"Methods"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.cosrev.2011.11.001","article-title":"Textual data compression in computational biology: Algorithmic techniques","volume":"6","author":"Giancarlo","year":"2012","journal-title":"Comput. Sci. Rev."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Zhu, Z., Zhang, Y., Ji, Z., He, S., and Yang, X. (2013). 
High-throughput DNA sequence data compression. Brief. Bioinform., 16.","DOI":"10.1093\/bib\/bbt087"},{"key":"ref_27","first-page":"72","article-title":"DNA lossless compression algorithms: Review","volume":"3","author":"Bakr","year":"2013","journal-title":"Am. J. Bioinform. Res."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"315","DOI":"10.2174\/1574893609666140516010143","article-title":"Trends in genome compression","volume":"9","author":"Wandelt","year":"2014","journal-title":"Curr. Bioinform."},{"key":"ref_29","unstructured":"Grumbach, S., and Tahi, F. (April, January 30). Compression of DNA sequences. Proceedings of the DCC\u201993: Data Compression Conference, Snowbird, UT, USA."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"337","DOI":"10.1109\/TIT.1977.1055714","article-title":"A universal algorithm for sequential data compression","volume":"23","author":"Ziv","year":"1977","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"875","DOI":"10.1016\/0306-4573(94)90014-0","article-title":"A new challenge for compression algorithms: Genetic sequences","volume":"30","author":"Grumbach","year":"1994","journal-title":"Inf. Process. Manag."},{"key":"ref_32","unstructured":"Rivals, E., Delahaye, J., Dauchet, M., and Delgrange, O. (April, January 31). A guaranteed compression scheme for repetitive DNA sequences. Proceedings of the DCC \u201996: Data Compression Conference, Snowbird, UT, USA."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1007\/BF01206331","article-title":"On-line construction of suffix trees","volume":"14","author":"Ukkonen","year":"1995","journal-title":"Algorithmica"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Chen, X., Kwong, S., Li, M., and Delgrange, O. (2000, January 8\u201311). A compression algorithm for DNA sequences and its applications in genome comparison. 
Proceedings of the 4th Annual International Conference of Research in Computational Molecular Biology (RECOMB \u201900), Tokyo, Japan.","DOI":"10.1145\/332306.332352"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"1696","DOI":"10.1093\/bioinformatics\/18.12.1696","article-title":"DNACompress: Fast and effective DNA sequence compression","volume":"18","author":"Chen","year":"2002","journal-title":"Bioinformatics"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"440","DOI":"10.1093\/bioinformatics\/18.3.440","article-title":"PatternHunter: Faster and more sensitive homology search","volume":"18","author":"Ma","year":"2002","journal-title":"Bioinformatics"},{"key":"ref_37","unstructured":"Tabus, I., Korodi, G., and Rissanen, J. (2003, January 25\u201327). DNA sequence compression using the normalized maximum likelihood model for discrete regression. Proceedings of the DCC \u201903: Data Compression Conference, Snowbird, UT, USA."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1145\/1055709.1055711","article-title":"An efficient normalized maximum likelihood algorithm for DNA sequence compression","volume":"23","author":"Korodi","year":"2005","journal-title":"ACM Trans. Inf. Syst."},{"key":"ref_39","first-page":"99","article-title":"A scheme that facilitates searching and partial decompression of textual documents","volume":"1","author":"Gupta","year":"2008","journal-title":"Int. J. Adv. Comput. Eng."},{"key":"ref_40","first-page":"245","article-title":"A novel approach for compressing DNA sequences using semi-statistical compressor","volume":"33","author":"Gupta","year":"2011","journal-title":"Int. J. Comput. Appl."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Pinho, A., Ferreira, P., Neves, A., and Bastos, C. (2011). On the representability of complete genomes by multiple competing finite-context (Markov) models. 
PLoS ONE, 6.","DOI":"10.1371\/journal.pone.0021588"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"643","DOI":"10.1109\/TEVC.2011.2160399","article-title":"DNA sequence compression using adaptive particle swarm optimization-based memetic algorithm","volume":"15","author":"Zhu","year":"2011","journal-title":"IEEE Trans. Evolut. Comput."},{"key":"ref_43","unstructured":"Liang, J., Suganthan, P., and Deb, K. (2005, January 8\u201310). Novel composition test functions for numerical global optimization. Proceedings of the IEEE Swarm Intelligence Symposium (SIS 2005), Pasadena, CA, USA."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"281","DOI":"10.1109\/TEVC.2005.857610","article-title":"Comprehensive learning particle swarm optimizer for global optimization of multimodal functions","volume":"10","author":"Liang","year":"2006","journal-title":"IEEE Trans. Evolut. Comput."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Li, P., Wang, S., Kim, J., Xiong, H., Ohno-Machado, L., and Jiang, X. (2013). DNA-COMPACT: DNA compression based on a pattern-aware contextual modeling technique. PLoS ONE, 8.","DOI":"10.1371\/journal.pone.0080377"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Guo, H., Chen, M., Liu, X., and Xie, M. (2015, January 29\u201331). Genome compression based on Hilbert space filling curve. Proceedings of the 3rd International Conference on Management, Education, Information and Control (MEICI 2015), Shenyang, China.","DOI":"10.2991\/meici-15.2015.294"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"1275","DOI":"10.1109\/TCBB.2015.2430331","article-title":"CoGI: Towards compressing genomes as an image","volume":"12","author":"Xie","year":"2015","journal-title":"IEEE\/ACM Trans. Comput. Biol. 
Bioinform."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"1888","DOI":"10.1109\/26.387415","article-title":"Binary image compression using efficient partitioning into rectangular regions","volume":"43","author":"Mohamed","year":"1995","journal-title":"IEEE Trans. Commun."},{"key":"ref_49","first-page":"1037","article-title":"Optimized context weighting based on the least square algorithm","volume":"Volume 348","author":"Zeng","year":"2016","journal-title":"Wireless Communications, Networking and Applications, Proceedings of the 2014 International Conference on Wireless Communications, Networking and Applications (WCNA 2014)"},{"key":"ref_50","unstructured":"Pratas, D., Pinho, A., and Ferreira, P. (April, January 30). Efficient compression of genomic sequences. Proceedings of the DCC \u201916: Data Compression Conference, Snowbird, UT, USA."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Pinho, A.J., Pratas, D., and Ferreira, P.J. (2011, January 28\u201330). Bacteria DNA sequence compression using a mixture of finite-context models. Proceedings of the 2011 IEEE Statistical Signal Processing Workshop (SSP), Nice, France.","DOI":"10.1109\/SSP.2011.5967637"},{"key":"ref_52","unstructured":"Pratas, D., and Pinho, A.J. (2014, January 1\u20135). Exploring deep Markov models in genomic data compression using sequence pre-analysis. Proceedings of the 2014 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Wandelt, S., and Leser, U. (2012). Adaptive efficient compression of genomes. Algorithms Mol. 
Biol., 7.","DOI":"10.1186\/1748-7188-7-30"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"860","DOI":"10.1093\/bioinformatics\/btr014","article-title":"Compression of DNA sequence reads in FASTQ format","volume":"27","author":"Deorowicz","year":"2011","journal-title":"Bioinformatics"},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"274","DOI":"10.1093\/bioinformatics\/btn582","article-title":"Human genomes as email attachments","volume":"25","author":"Christley","year":"2009","journal-title":"Bioinformatics"},{"key":"ref_56","first-page":"201","article-title":"Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval","volume":"6393","author":"Kuruppu","year":"2010","journal-title":"String Process. Inf. Retr."},{"key":"ref_57","first-page":"91","article-title":"Optimized relative Lempel-Ziv compression of genomes","volume":"113","author":"Kuruppu","year":"2011","journal-title":"Conf. Res. Pract. Inf. Technol. Ser."},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1093\/nar\/gkr009","article-title":"A novel compression tool for efficient storage of genome resequencing data","volume":"39","author":"Wang","year":"2011","journal-title":"Nucleic Acids Res."},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"1098","DOI":"10.1109\/JRPROC.1952.273898","article-title":"A method for the construction of minimum redundancy codes","volume":"40","author":"Huffman","year":"1952","journal-title":"Proc. IRE"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Pinho, A., Pratas, D., and Garcia, S. (2012). GReEn: A tool for efficient compression of genome resequencing data. Nucleic Acids Res., 40.","DOI":"10.1093\/nar\/gkr1124"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"198","DOI":"10.1147\/rd.203.0198","article-title":"Generalized Kraft inequality and arithmetic coding","volume":"20","author":"Rissanen","year":"1976","journal-title":"IBM J. Res. 
Dev."},{"key":"ref_62","doi-asserted-by":"crossref","first-page":"2979","DOI":"10.1093\/bioinformatics\/btr505","article-title":"Robust relative compression of genomes with random access","volume":"27","author":"Deorowicz","year":"2011","journal-title":"Bioinformatics"},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Deorowicz, S., Danek, A., and Niemiec, M. (2015). GDC 2: Compression of large collections of genomes. Sci. Rep., 5.","DOI":"10.1038\/srep11565"},{"key":"ref_64","doi-asserted-by":"crossref","first-page":"928","DOI":"10.1145\/322344.322346","article-title":"Data compression via text substitution","volume":"29","author":"Storer","year":"1982","journal-title":"J. ACM"},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Grossi, R., and Vitter, J. (2000, January 21\u201323). Compressed suffix arrays and suffix trees with applications to text indexing and string matching. Proceedings of the 32nd ACM Symposium on Theory of Computing, Portland, OR, USA.","DOI":"10.1145\/335305.335351"},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1109\/TCBB.2011.82","article-title":"Iterative dictionary construction for compression of large DNA data sets","volume":"9","author":"Kuruppu","year":"2012","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform."},{"key":"ref_67","doi-asserted-by":"crossref","first-page":"430","DOI":"10.1002\/1532-2890(2001)9999:9999<::AID-ASI1084>3.0.CO;2-Z","article-title":"General-purpose compression for efficient retrieval","volume":"52","author":"Cannane","year":"2001","journal-title":"J. Assoc. Inf. Sci. Technol."},{"key":"ref_68","unstructured":"Dai, W., Xiong, H., Jiang, X., and Ohno-Machado, L. (2013, January 20\u201322). An adaptive difference distribution-based coding with hierarchical tree structure for DNA sequence compression. 
Proceedings of the DCC \u201913: Data Compression Conference, Snowbird, UT, USA."},{"key":"ref_69","doi-asserted-by":"crossref","first-page":"396","DOI":"10.1109\/TCOM.1984.1096090","article-title":"Data compression using adaptive coding and partial string matching","volume":"32","author":"Cleary","year":"1984","journal-title":"IEEE Trans. Commun."},{"key":"ref_70","doi-asserted-by":"crossref","first-page":"1275","DOI":"10.1109\/TCBB.2013.122","article-title":"FRESCO: Referential compression of highly-similar sequences","volume":"10","author":"Wandelt","year":"2013","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform."},{"key":"ref_71","first-page":"35","article-title":"Streamlined genome sequence compression using distributed source coding","volume":"13","author":"Jung","year":"2014","journal-title":"Cancer Inform."},{"key":"ref_72","doi-asserted-by":"crossref","first-page":"626","DOI":"10.1109\/TIT.2002.808103","article-title":"Distributed source coding using syndromes (DISCUS): Design and construction","volume":"49","author":"Pradhan","year":"2003","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_73","doi-asserted-by":"crossref","first-page":"3468","DOI":"10.1093\/bioinformatics\/btv399","article-title":"ERGC: An efficient referential genome compression algorithm","volume":"31","author":"Saha","year":"2015","journal-title":"Bioinformatics"},{"key":"ref_74","doi-asserted-by":"crossref","first-page":"1917","DOI":"10.1109\/26.61469","article-title":"Implementing the PPM data compression scheme","volume":"38","author":"Moffat","year":"1990","journal-title":"IEEE Trans. 
Commun."},{"key":"ref_75","doi-asserted-by":"crossref","first-page":"626","DOI":"10.1093\/bioinformatics\/btu698","article-title":"iDoComp: A compression scheme for assembled genomes","volume":"31","author":"Ochoa","year":"2015","journal-title":"Bioinformatics"},{"key":"ref_76","doi-asserted-by":"crossref","first-page":"068102","DOI":"10.1103\/PhysRevLett.89.068102","article-title":"Multiscale entropy analysis of complex physiologic time series","volume":"89","author":"Costa","year":"2002","journal-title":"Phys. Rev. Lett."},{"key":"ref_77","doi-asserted-by":"crossref","first-page":"2039","DOI":"10.1152\/ajpheart.2000.278.6.H2039","article-title":"Physiological time-series analysis using approximate entropy and sample entropy","volume":"278","author":"Richman","year":"2000","journal-title":"Am. J. Physiol. Heart Circ. Physiol."},{"key":"ref_78","doi-asserted-by":"crossref","first-page":"1101","DOI":"10.1109\/10.335859","article-title":"Macromolecular bioactivity: Is it resonant interaction between macromolecules?\u2014Theory and applications","volume":"41","author":"Cosic","year":"1994","journal-title":"IEEE Trans. Biomed. Eng."},{"key":"ref_79","doi-asserted-by":"crossref","first-page":"696","DOI":"10.1109\/TIT.2009.2037052","article-title":"Compression of whole genome alignments","volume":"56","author":"Hanus","year":"2010","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_80","doi-asserted-by":"crossref","first-page":"3189","DOI":"10.1109\/TIT.2012.2236605","article-title":"A compression model for DNA multiple sequence alignment blocks","volume":"59","author":"Matos","year":"2013","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Matos, L., Neves, A., Pratas, D., and Pinho, A. (2015). MAFCO: A compression tool for MAF files. 
PLoS ONE, 10.","DOI":"10.1371\/journal.pone.0116082"},{"key":"ref_82","doi-asserted-by":"crossref","first-page":"2156","DOI":"10.1093\/bioinformatics\/btr330","article-title":"The variant call format and VCFtools","volume":"27","author":"Danecek","year":"2011","journal-title":"Bioinformatics"},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1038\/nmeth.3654","article-title":"Efficient genotype compression and analysis of large genetic-variation data sets","volume":"13","author":"Layer","year":"2016","journal-title":"Nat. Methods"},{"key":"ref_84","first-page":"1435","article-title":"Rapid and sensitive protein similarity searches","volume":"227","author":"Lipman","year":"1985","journal-title":"Science"},{"key":"ref_85","doi-asserted-by":"crossref","first-page":"785","DOI":"10.1007\/s12038-012-9230-6","article-title":"BIND\u2014An algorithm for loss-less compression of nucleotide sequence data","volume":"37","author":"Bose","year":"2012","journal-title":"J. Biosci."},{"key":"ref_86","unstructured":"LZMA. Available online: http:\/\/www.7-zip.org\/sdk.html."},{"key":"ref_87","doi-asserted-by":"crossref","first-page":"2527","DOI":"10.1093\/bioinformatics\/bts467","article-title":"DELIMINATE\u2014A fast and efficient method for loss-less compression of genomic sequences: Sequence analysis","volume":"28","author":"Mohammed","year":"2012","journal-title":"Bioinformatics"},{"key":"ref_88","doi-asserted-by":"crossref","first-page":"2587","DOI":"10.1007\/s10916-011-9731-0","article-title":"Integrating human genome database into electronic health record with sequence alignment and compression mechanism","volume":"36","author":"Chen","year":"2012","journal-title":"J. Med. Syst."},{"key":"ref_89","doi-asserted-by":"crossref","first-page":"238","DOI":"10.1109\/TIT.1987.1057284","article-title":"Robust transmission of unbounded strings using Fibonacci representations","volume":"33","author":"Apostolico","year":"1987","journal-title":"IEEE Trans. 
Inf. Theory"},{"key":"ref_90","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1093\/bioinformatics\/btt594","article-title":"MFCompress: A compression tool for FASTA and multi-FASTA data","volume":"30","author":"Pinho","year":"2013","journal-title":"Bioinformatics"},{"key":"ref_91","doi-asserted-by":"crossref","unstructured":"Benoit, G., Lemaitre, C., Lavenier, D., Drezen, E., Dayris, T., Uricaru, R., and Rizk, G. (2015). Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinform., 16.","DOI":"10.1186\/s12859-015-0709-7"},{"key":"ref_92","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1002\/rsa.20208","article-title":"Less hashing, same performance: Building a better bloom filter","volume":"33","author":"Kirsch","year":"2008","journal-title":"J. Random Struct. Algorithms"},{"key":"ref_93","doi-asserted-by":"crossref","unstructured":"Kim, M., Zhang, X., Ligo, J.G., Farnoud, F., Veeravalli, V.V., and Milenkovic, O. (2016). MetaCRAM: An integrated pipeline for metagenomic taxonomy identification and compression. BMC Bioinform., 17.","DOI":"10.1186\/s12859-016-0932-x"},{"key":"ref_94","doi-asserted-by":"crossref","unstructured":"Wood, D., and Salzberg, S. (2014). Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genom. Biol., 15.","DOI":"10.1186\/gb-2014-15-3-r46"},{"key":"ref_95","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/nmeth.1923","article-title":"Fast gapped-read alignment with bowtie 2","volume":"9","author":"Langmead","year":"2012","journal-title":"Nat. 
Methods"},{"key":"ref_96","doi-asserted-by":"crossref","first-page":"1420","DOI":"10.1093\/bioinformatics\/bts174","article-title":"IDBA-UD: A de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth","volume":"28","author":"Peng","year":"2012","journal-title":"Bioinformatics"},{"key":"ref_97","doi-asserted-by":"crossref","first-page":"399","DOI":"10.1109\/TIT.1966.1053907","article-title":"Run-length encodings","volume":"12","author":"Golomb","year":"1966","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_98","doi-asserted-by":"crossref","first-page":"239","DOI":"10.1109\/TMM.2006.886260","article-title":"Extended golomb code for integer representation","volume":"9","author":"Somasundaram","year":"2007","journal-title":"IEEE Trans. Multimed."},{"key":"ref_99","doi-asserted-by":"crossref","unstructured":"Ochoa, I., Asnani, H., Bharadia, D., Chowdhury, M., Weissman, T., and Yona, G. (2013). Qualcomp: A new lossy compressor for quality scores based on rate distortion theory. BMC Bioinform., 14.","DOI":"10.1186\/1471-2105-14-187"},{"key":"ref_100","doi-asserted-by":"crossref","first-page":"1767","DOI":"10.1093\/nar\/gkp1137","article-title":"The Sanger FASTQ file format for sequences with quality scores, and the Solexa\/Illumina FASTQ variants","volume":"38","author":"Cock","year":"2009","journal-title":"Nucleic Acids Res."},{"key":"ref_101","doi-asserted-by":"crossref","unstructured":"Daily, K., Rigor, P., Christley, S., Xie, X., and Baldi, P. (2010). Data structures and compression algorithms for high-throughput sequencing technologies. BMC Bioinform., 11.","DOI":"10.1186\/1471-2105-11-514"},{"key":"ref_102","doi-asserted-by":"crossref","first-page":"194","DOI":"10.1109\/TIT.1975.1055349","article-title":"Universal codeword sets and representations of the integers","volume":"21","author":"Elias","year":"1975","journal-title":"IEEE Trans. Inf. 
Theory"},{"key":"ref_103","doi-asserted-by":"crossref","first-page":"2098","DOI":"10.1021\/ci700200n","article-title":"Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval","volume":"47","author":"Baldi","year":"2007","journal-title":"J. Chem. Inf. Model."},{"key":"ref_104","doi-asserted-by":"crossref","first-page":"2192","DOI":"10.1093\/bioinformatics\/btq346","article-title":"G-SQZ: Compact encoding of genomic sequence and quality data","volume":"26","author":"Tembe","year":"2010","journal-title":"Bioinformatics"},{"key":"ref_105","doi-asserted-by":"crossref","first-page":"2213","DOI":"10.1093\/bioinformatics\/btu208","article-title":"DSRC 2\u2014Industry-oriented compression of FASTQ files","volume":"30","author":"Roguski","year":"2014","journal-title":"Bioinformatics"},{"key":"ref_106","doi-asserted-by":"crossref","unstructured":"Salomon, D., and Motta, G. (2010). Handbook of Data Compression, Springer.","DOI":"10.1007\/978-1-84882-903-9"},{"key":"ref_107","doi-asserted-by":"crossref","unstructured":"Bhola, V., Bopardikar, A., Narayanan, R., Lee, K., and Ahn, T. (2011, January 12\u201315). No-reference compression of genomic data stored in FASTQ format. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2011), Atlanta, GA, USA.","DOI":"10.1109\/BIBM.2011.110"},{"key":"ref_108","doi-asserted-by":"crossref","unstructured":"Jones, D., Ruzzo, W., Peng, X., and Katze, M. (2012). Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res., 40.","DOI":"10.1093\/nar\/gks754"},{"key":"ref_109","doi-asserted-by":"crossref","first-page":"3051","DOI":"10.1093\/bioinformatics\/bts593","article-title":"SCALCE: Boosting sequence compression algorithms using locally consistent encoding","volume":"28","author":"Hach","year":"2012","journal-title":"Bioinformatics"},{"key":"ref_110","unstructured":"Sahinalp, S., and Vishkin, U. 
(1996, January 14\u201316). Efficient approximate and dynamic matching of patterns using a labeling paradigm. Proceedings of the 37th Annual Symposium on Foundations of Computer Science (FOCS), Burlington, VT, USA."},{"key":"ref_111","unstructured":"Cormode, G., Paterson, M., Sahinalp, S., and Vishkin, U. (2000, January 9\u201311). Communication complexity of document exchange. Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), San Francisco, CA, USA."},{"key":"ref_112","doi-asserted-by":"crossref","unstructured":"Batu, T., Ergun, F., and Sahinalp, S. (2006, January 22\u201324). Oblivious string embeddings and edit distance approximations. Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithm (SODA), Miami, FL, USA.","DOI":"10.1145\/1109557.1109644"},{"key":"ref_113","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1109\/TCBB.2012.160","article-title":"High-throughput compression of FASTQ data with SeqDB","volume":"10","author":"Howison","year":"2013","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform."},{"key":"ref_114","unstructured":"Alted, F. Available online: http:\/\/www.blosc.org."},{"key":"ref_115","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1109\/MCSE.2010.51","article-title":"Why modern CPUs are starving and what can be done about it","volume":"12","author":"Alted","year":"2010","journal-title":"Comput. Sci. Eng."},{"key":"ref_116","doi-asserted-by":"crossref","unstructured":"Bonfield, J., and Mahoney, M. (2013). Compression of FASTQ and SAM format sequencing data. PLoS ONE, 8.","DOI":"10.1371\/journal.pone.0059190"},{"key":"ref_117","unstructured":"Shelwien, E. Available online: http:\/\/compressionratings.com\/i_ctxf.html."},{"key":"ref_118","unstructured":"Mahoney, M. Available online: http:\/\/mattmahoney.net\/dc\/zpaq.html."},{"key":"ref_119","unstructured":"Mahoney, M. (2005). 
Adaptive Weighing of Context Models for Lossless Data Compression, Florida Institute of Technology CS Department. Technical Report CS-2005\u201316."},{"key":"ref_120","doi-asserted-by":"crossref","first-page":"1389","DOI":"10.1093\/bioinformatics\/btu844","article-title":"Disk-based compression of data from genome sequencing","volume":"31","author":"Grabowski","year":"2014","journal-title":"Bioinformatics"},{"key":"ref_121","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"ref_122","doi-asserted-by":"crossref","unstructured":"Movahedi, N., Forouzmand, E., and Chitsaz, H. (2012, January 4\u20137). De novo co-assembly of bacterial genomes from multiple single cells. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2012), Philadelphia, PA, USA.","DOI":"10.1109\/BIBM.2012.6392618"},{"key":"ref_123","unstructured":"Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., and Suri, S. (2013, January 26\u201330). Memory efficient minimum substring partitioning. Proceedings of the 39th international conference on Very Large Data Bases (VLDB 2013), Trento, Italy."},{"key":"ref_124","unstructured":"Chikhi, R., Limasset, A., Jackman, S., Simpson, J., and Medvedev, P. (2014, January 2\u20135). On the representation of de Bruijn graphs. Proceedings of the 18th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2014), Pittsburgh, PA, USA."},{"key":"ref_125","doi-asserted-by":"crossref","first-page":"1569","DOI":"10.1093\/bioinformatics\/btv022","article-title":"KMC 2: Fast and resource-frugal k-mer counting","volume":"31","author":"Deorowicz","year":"2015","journal-title":"Bioinformatics"},{"key":"ref_126","unstructured":"Shkarin, D. (2002, January 2\u20134). PPM: One step to practicality. 
Proceedings of the DCC \u201902: Data Compression Conference, Snowbird, UT, USA."},{"key":"ref_127","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Li, L., Yang, Y., Yang, X., and He, S. (2015). Light-weight reference-based compression of FASTQ data. BMC Bioinform., 16.","DOI":"10.1186\/s12859-015-0628-7"},{"key":"ref_128","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The Sequence Alignment\/Map format and SAMtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"ref_129","unstructured":"The SAM\/BAM Format Specification Working Group Sequence Alignment\/Map Format Specification. Available online: https:\/\/samtools.github.io\/hts-specs\/SAMv1.pdf."},{"key":"ref_130","doi-asserted-by":"crossref","first-page":"734","DOI":"10.1101\/gr.114819.110","article-title":"Efficient storage of high throughput DNA sequencing data using reference-based compression","volume":"21","author":"Fritz","year":"2011","journal-title":"Genom. Res."},{"key":"ref_131","doi-asserted-by":"crossref","unstructured":"Campagne, F., Dorff, K., Chambwe, N., Robinson, J., and Mesirov, J. (2013). Compression of structured high-throughput sequencing data. PLoS ONE, 8.","DOI":"10.1371\/journal.pone.0079871"},{"key":"ref_132","unstructured":"Varda, K. PB. Available online: https:\/\/github.com\/google\/protobuf."},{"key":"ref_133","doi-asserted-by":"crossref","unstructured":"Popitsch, N., and Von Haeseler, A. (2013). NGC: Lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Res., 41.","DOI":"10.1093\/nar\/gks939"},{"key":"ref_134","doi-asserted-by":"crossref","first-page":"1081","DOI":"10.1038\/nmeth.3133","article-title":"DeeZ: Reference-based compression by local assembly","volume":"11","author":"Hach","year":"2014","journal-title":"Nat. Methods"},{"key":"ref_135","unstructured":"gzip. 
Available online: http:\/\/www.gzip.org."},{"key":"ref_136","unstructured":"Rebico. Available online: http:\/\/bioinformatics.ua.pt\/software\/rebico."},{"key":"ref_137","unstructured":"Human (GRC), Available online: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/Homo_sapiens\/Assembled_chromosomes\/seq."},{"key":"ref_138","unstructured":"Chimpanzee, Available online: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/Pan_troglodytes\/Assembled_chromosomes\/seq."},{"key":"ref_139","unstructured":"Rice5. Available online: ftp:\/\/ftp.plantbiology.msu.edu\/pub\/data\/Eukaryotic_Projects\/o_sativa\/annotation_dbs\/pseudomolecules\/version_5.0."},{"key":"ref_140","unstructured":"CAMERA Prokaryotic Nucleotide. Available online: ftp:\/\/ftp.imicrobe.us\/camera\/camera_reference_datasets\/10572.V10.fa.gz."},{"key":"ref_141","unstructured":"ERR174310_1. Available online: ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/fastq\/ERR174\/ERR174310\/ERR174310_1.fastq.gz."},{"key":"ref_142","unstructured":"ERR174310_2. Available online: ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/fastq\/ERR174\/ERR174310\/ERR174310_2.fastq.gz."},{"key":"ref_143","unstructured":"ERR194146_1. Available online: ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/fastq\/ERR194\/ERR194146\/ERR194146_1.fastq.gz."},{"key":"ref_144","unstructured":"ERR194146_2. Available online: ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/fastq\/ERR194\/ERR194146\/ERR194146_2.fastq.gz."},{"key":"ref_145","unstructured":"NA12877_S1. Available online: ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/ERA207\/ERA207860\/bam\/NA12877_S1.bam."},{"key":"ref_146","unstructured":"NA12878_S1. Available online: ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/ERA207\/ERA207860\/bam\/NA12878_S1.bam."},{"key":"ref_147","unstructured":"NA12882_S1. 
Available online: ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/ERA207\/ERA207860\/bam\/NA12882_S1.bam."},{"key":"ref_148","unstructured":"Homo sapiens, GRC Reference Assembly\u2014Chromosome 8, Available online: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/Homo_sapiens\/Assembled_chromosomes\/seq\/hs_ref_GRCh38.p7_chr8.fa.gz."},{"key":"ref_149","unstructured":"Homo sapiens, CHM Reference Assembly\u2014Chromosome 8, Available online: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/Homo_sapiens\/Assembled_chromosomes\/seq\/hs_alt_CHM1_1.1_chr8.fa.gz."},{"key":"ref_150","unstructured":"Homo sapiens, GRC Reference Assembly\u2014Chromosome 11, Available online: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/Homo_sapiens\/Assembled_chromosomes\/seq\/hs_ref_GRCh38.p7_chr11.fa.gz."},{"key":"ref_151","unstructured":"Homo sapiens, CHM Reference Assembly\u2014Chromosome 11, Available online: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/Homo_sapiens\/Assembled_chromosomes\/seq\/hs_alt_CHM1_1.1_chr11.fa.gz."},{"key":"ref_152","unstructured":"Pan troglodytes (Chimpanzee) Reference Assembly, v3.0\u2014Chromosome 11, Available online: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/Pan_troglodytes\/Assembled_chromosomes\/seq\/ptr_ref_Pan_tro_3.0_chr11.fa.gz."},{"key":"ref_153","unstructured":"Pongo abelii (Orangutan) Reference Assembly\u2014Chromosome 11, Available online: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/Pongo_abelii\/Assembled_chromosomes\/seq\/pab_ref_P_pygmaeus_2.0.2_chr11.fa.gz."},{"key":"ref_154","unstructured":"Homo sapiens, GRC Reference Assembly\u2014Chromosome 16, Available online: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/Homo_sapiens\/Assembled_chromosomes\/seq\/hs_ref_GRCh38.p7_chr16.fa.gz."},{"key":"ref_155","unstructured":"Homo sapiens, Korean Reference\u2014Chromosome 16. Available online: ftp:\/\/ftp.kobic.re.kr\/pub\/KOBIC-KoreanGenome\/fasta\/chromosome_16.fa.gz."},{"key":"ref_156","unstructured":"Oryza sativa (Rice), v5.0. 
Available online: ftp:\/\/ftp.plantbiology.msu.edu\/pub\/data\/Eukaryotic_Projects\/o_sativa\/annotation_dbs\/pseudomolecules\/version_5.0."},{"key":"ref_157","unstructured":"Oryza sativa (Rice), v7.0. Available online: ftp:\/\/ftp.plantbiology.msu.edu\/pub\/data\/Eukaryotic_Projects\/o_sativa\/annotation_dbs\/pseudomolecules\/version_7.0."},{"key":"ref_158","unstructured":"Pratas, D. Available online: https:\/\/raw.githubusercontent.com\/pratas\/rebico\/master\/methods.txt."},{"key":"ref_159","doi-asserted-by":"crossref","unstructured":"Li, H. (2015). BGT: Efficient and flexible genotype query across many samples. Bioinformatics.","DOI":"10.1093\/bioinformatics\/btv613"},{"key":"ref_160","doi-asserted-by":"crossref","first-page":"3078","DOI":"10.1093\/bioinformatics\/btu495","article-title":"Compression and fast retrieval of SNP data","volume":"30","author":"Sambo","year":"2014","journal-title":"Bioinformatics"},{"key":"ref_161","doi-asserted-by":"crossref","unstructured":"Cao, M.D., Dix, T.I., and Allison, L. (2010). A genome alignment algorithm based on compression. BMC Bioinform., 11.","DOI":"10.1186\/1471-2105-11-599"},{"key":"ref_162","doi-asserted-by":"crossref","unstructured":"Pratas, D., Silva, R.M., Pinho, A.J., and Ferreira, P.J. (2015). An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci. Rep., 5.","DOI":"10.1038\/srep10203"},{"key":"ref_163","unstructured":"Beller, T., and Ohlebusch, E. (2015). Combinatorial Pattern Matching, Springer."},{"key":"ref_164","doi-asserted-by":"crossref","first-page":"497","DOI":"10.1093\/bioinformatics\/btv603","article-title":"Graphical pan-genome analysis with compressed suffix trees and the Burrows\u2013Wheeler transform","volume":"32","author":"Baier","year":"2015","journal-title":"Bioinformatics"},{"key":"ref_165","doi-asserted-by":"crossref","unstructured":"Pinho, A.J., Garcia, S.P., Pratas, D., and Ferreira, P.J. (2013). DNA sequences at a glance. 
PLoS ONE, 8.","DOI":"10.1371\/journal.pone.0079922"},{"key":"ref_166","doi-asserted-by":"crossref","first-page":"461","DOI":"10.14778\/2735479.2735480","article-title":"MRCSI: Compressing and searching string collections with multiple references","volume":"8","author":"Wandelt","year":"2015","journal-title":"Proc. VLDB Endow."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/7\/4\/56\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T19:32:57Z","timestamp":1760211177000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/7\/4\/56"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,10,14]]},"references-count":166,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2016,12]]}},"alternative-id":["info7040056"],"URL":"https:\/\/doi.org\/10.3390\/info7040056","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,10,14]]}}}