{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T02:07:19Z","timestamp":1772244439221,"version":"3.50.1"},"reference-count":45,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,1,23]],"date-time":"2025-01-23T00:00:00Z","timestamp":1737590400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Bioinform."],"abstract":"<jats:p>Data compression is a challenging and increasingly important problem. As the amount of data generated daily continues to increase, efficient transmission and storage have never been more critical. In this study, a novel encoding algorithm is proposed, motivated by the compression of DNA data and associated characteristics. The proposed algorithm follows a divide-and-conquer approach by scanning the whole genome, classifying subsequences based on similarities in their content, and binning similar subsequences together. The data is then compressed into each bin independently. This approach is different than the currently known approaches: entropy, dictionary, predictive, or transform-based methods. Proof-of-concept performance was evaluated using a benchmark dataset with seventeen genomes ranging in size from kilobytes to gigabytes. The results showed a considerable improvement in the compression of each genome, preserving several megabytes compared to state-of-the-art tools. 
Moreover, the algorithm can be applied to the compression of other data types, mainly text, numbers, images, audio, and video, which are being generated daily in unprecedented volumes.<\/jats:p>","DOI":"10.3389\/fbinf.2024.1489704","type":"journal-article","created":{"date-parts":[[2025,1,23]],"date-time":"2025-01-23T01:59:39Z","timestamp":1737597579000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["A novel lossless encoding algorithm for data compression\u2013genomics data as an exemplar"],"prefix":"10.3389","volume":"4","author":[{"given":"Anas","family":"Al-okaily","sequence":"first","affiliation":[]},{"given":"Abdelghani","family":"Tbakhi","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,1,23]]},"reference":[{"key":"B1","doi-asserted-by":"crossref","first-page":"452","DOI":"10.1109\/ITCC.2001.918838","article-title":"Lipt: a lossless text transform to improve compression","volume-title":"Proceedings international Conference on information Technology: Coding and computing","author":"Awan","year":"2001"},{"key":"B2","doi-asserted-by":"publisher","first-page":"72","DOI":"10.5923\/j.bioinformatics.20130303.04","article-title":"Dna lossless compression algorithms","volume":"3","author":"Bakr","year":"2013","journal-title":"Am. J. Bioinforma. Res."},{"key":"B3","doi-asserted-by":"publisher","first-page":"320","DOI":"10.1145\/5684.5688","article-title":"A locally adaptive data compression scheme","volume":"29","author":"Bentley","year":"1986","journal-title":"Commun. 
ACM"},{"key":"B4","doi-asserted-by":"publisher","first-page":"e59190","DOI":"10.1371\/journal.pone.0059190","article-title":"Compression of fastq and sam format sequencing data","volume":"8","author":"Bonfield","year":"2013","journal-title":"PloS one"},{"key":"B5","volume-title":"A block-sorting lossless data compression algorithm","author":"Burrows","year":"1994"},{"key":"B6","doi-asserted-by":"publisher","first-page":"157","DOI":"10.1109\/tit.1959.1057512","article-title":"A probabilistic model for run-length coding of pictures","volume":"5","author":"Capon","year":"1959","journal-title":"IRE Trans. Inf. Theory"},{"key":"B7","doi-asserted-by":"publisher","first-page":"396","DOI":"10.1109\/tcom.1984.1096090","article-title":"Data compression using adaptive coding and partial string matching","volume":"32","author":"Cleary","year":"1984","journal-title":"IEEE Trans. Commun."},{"key":"B8","doi-asserted-by":"publisher","first-page":"1767","DOI":"10.1093\/nar\/gkp1137","article-title":"The sanger fastq file format for sequences with quality scores, and the solexa\/illumina fastq variants","volume":"38","author":"Cock","year":"2010","journal-title":"Nucleic acids Res."},{"key":"B9","doi-asserted-by":"publisher","first-page":"541","DOI":"10.1093\/comjnl\/30.6.541","article-title":"Data compression using dynamic markov modelling","volume":"30","author":"Cormack","year":"1987","journal-title":"Comput. J."},{"key":"B10","volume-title":"Elements of information theory","author":"Cover","year":"1999"},{"key":"B11","first-page":"2540","article-title":"Asymmetric numeral systems: entropy coding combining speed of huffman coding with compression rate of arithmetic coding","author":"Duda","year":"2013"},{"key":"B12","doi-asserted-by":"publisher","first-page":"194","DOI":"10.1109\/tit.1975.1055349","article-title":"Universal codeword sets and representations of the integers","volume":"21","author":"Elias","year":"1975","journal-title":"IEEE Trans. Inf. 
theory"},{"key":"B13","volume-title":"The transmission of information","author":"Fano","year":"1949"},{"key":"B14","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1016\/0166-218x(93)00116-h","article-title":"Robust universal complete codes for transmission and compression","volume":"64","author":"Fraenkel","year":"1996","journal-title":"Discrete Appl. Math."},{"key":"B15","doi-asserted-by":"publisher","DOI":"10.17487\/RFC3943","article-title":"Transport layer security (TLS) protocol compression using lempel-ziv-stac (LZS)","volume":"3943","author":"Friend","year":"2004","journal-title":"RFC"},{"key":"B16","doi-asserted-by":"crossref","first-page":"628","DOI":"10.1109\/ICICT48043.2020.9112516","article-title":"Comparison of lossless data compression techniques","volume-title":"2020 international Conference on inventive computation technologies (ICICT)","author":"Gopinath","year":"2020"},{"key":"B17","doi-asserted-by":"publisher","first-page":"875","DOI":"10.1016\/0306-4573(94)90014-0","article-title":"A new challenge for compression algorithms: genetic sequences","volume":"30","author":"Grumbach","year":"1994","journal-title":"Inf. Process. and Manag."},{"key":"B18","doi-asserted-by":"publisher","first-page":"56","DOI":"10.3390\/info7040056","article-title":"A survey on data compression methods for biological sequences","volume":"7","author":"Hosseini","year":"2016","journal-title":"Information"},{"key":"B19","doi-asserted-by":"publisher","first-page":"1098","DOI":"10.1109\/jrproc.1952.273898","article-title":"A method for the construction of minimum-redundancy codes","volume":"40","author":"Huffman","year":"1952","journal-title":"Proc. IRE"},{"key":"B20","doi-asserted-by":"publisher","DOI":"10.26483\/ijarcs.v8i3.3086","article-title":"A comparative study and survey on existing dna compression techniques","volume":"8","author":"Jahaan","year":"2017","journal-title":"Int. J. Adv. Res. Comput. 
Sci."},{"key":"B21","article-title":"A survey on lossless and lossy data compression methods","volume":"7","author":"Kavitha","year":"2016","journal-title":"Int. J. Comput. Sci. and Eng. Technol. (IJCSET)"},{"key":"B22","doi-asserted-by":"publisher","first-page":"163","DOI":"10.1016\/0196-6774(85)90036-7","article-title":"Dynamic huffman coding","volume":"6","author":"Knuth","year":"1985","journal-title":"J. algorithms"},{"key":"B23","first-page":"416","article-title":"Comparison of lossless data compression algorithms for text data","volume":"1","author":"Kodituwakku","year":"2010","journal-title":"Indian J. Comput. Sci. Eng."},{"key":"B24","doi-asserted-by":"publisher","first-page":"giaa072","DOI":"10.1093\/gigascience\/giaa072","article-title":"Sequence compression benchmark (scb) database\u2014a comprehensive evaluation of reference-free compressors for fasta-formatted sequences","volume":"9","author":"Kryukov","year":"2020","journal-title":"GigaScience"},{"key":"B25","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1147\/rd.282.0135","article-title":"An introduction to arithmetic coding","volume":"28","author":"Langdon","year":"1984","journal-title":"IBM J. Res. Dev."},{"key":"B26","doi-asserted-by":"publisher","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The sequence alignment\/map format and samtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"B27","doi-asserted-by":"publisher","first-page":"1435","DOI":"10.1126\/science.2983426","article-title":"Rapid and sensitive protein similarity searches","volume":"227","author":"Lipman","year":"1985","journal-title":"Science"},{"key":"B28","article-title":"Adaptive weighing of context models for lossless data compression","author":"Mahoney","year":"2005","journal-title":"Tech. 
Rep."},{"key":"B29","doi-asserted-by":"publisher","first-page":"99","DOI":"10.3390\/a13040099","article-title":"A new lossless dna compression algorithm based on a single-block encoding scheme","volume":"13","author":"Mansouri","year":"2020","journal-title":"Algorithms"},{"key":"B30","first-page":"24","article-title":"Range encoding: an algorithm for removing redundancy from a digitised message","author":"Mart\u00edn","year":"1979","journal-title":"Video Data Rec. Conf."},{"key":"B31","article-title":"Lzo-a real-time data compression library","author":"Oberhumer","year":"2008"},{"key":"B32","doi-asserted-by":"publisher","first-page":"96","DOI":"10.1109\/82.219839","article-title":"High-speed vlsi designs for lempel-ziv-based data compression","volume":"40","author":"Ranganathan","year":"1993","journal-title":"IEEE Trans. Circuits Syst. II Analog Digital Signal Process."},{"key":"B33","first-page":"16","article-title":"Data compression by means of a \u201cbook stack\u201d","volume":"16","author":"Ryabko","year":"1980","journal-title":"Probl. Peredachi Inf."},{"key":"B34","volume-title":"Data compression: the complete reference","author":"Salomon","year":"2004"},{"key":"B35","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1145\/584091.584093","article-title":"A mathematical theory of communication","volume":"5","author":"Shannon","year":"2001","journal-title":"ACM Sigmob. Mob. Comput. Commun. Rev."},{"key":"B36","doi-asserted-by":"publisher","first-page":"928","DOI":"10.1145\/322344.322346","article-title":"Data compression via textual substitution","volume":"29","author":"Storer","year":"1982","journal-title":"J. ACM (JACM)"},{"key":"B37","doi-asserted-by":"publisher","first-page":"607","DOI":"10.1109\/tit.1980.1056237","article-title":"Improved prefix encodings of the natural numbers (corresp.)","volume":"26","author":"Stout","year":"1980","journal-title":"IEEE Trans. Inf. 
Theory"},{"key":"B38","volume-title":"Synthesis of noiseless compression codes","author":"Tunstall","year":"1968"},{"key":"B39","doi-asserted-by":"publisher","first-page":"647","DOI":"10.1016\/j.jksuci.2017.10.007","article-title":"Swarm intelligence based classification rule induction (CRI) framework for qualitative and quantitative approach: an application of bankruptcy prediction and credit risk analysis","volume":"32","author":"Uthayakumar","year":"2018","journal-title":"J. King Saud University-Computer Inf. Sci."},{"key":"B40","doi-asserted-by":"publisher","first-page":"825","DOI":"10.1145\/31846.42227","article-title":"Design and analysis of dynamic huffman codes","volume":"34","author":"Vitter","year":"1987","journal-title":"J. ACM (JACM)"},{"key":"B41","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1109\/mc.1984.1659158","article-title":"A technique for high-performance data compression","volume":"17","author":"Welch","year":"1984","journal-title":"Computer"},{"key":"B42","doi-asserted-by":"publisher","first-page":"653","DOI":"10.1109\/18.382012","article-title":"The context-tree weighting method: basic properties","volume":"41","author":"Willems","year":"1995","journal-title":"IEEE Trans. Inf. theory"},{"key":"B43","doi-asserted-by":"crossref","first-page":"362","DOI":"10.1109\/DCC.1991.213344","article-title":"An extremely fast ziv-lempel data compression algorithm","volume-title":"[1991] proceedings. Data compression conference","author":"Williams","year":"1991"},{"key":"B44","doi-asserted-by":"publisher","first-page":"337","DOI":"10.1109\/tit.1977.1055714","article-title":"A universal algorithm for sequential data compression","volume":"23","author":"Ziv","year":"1977","journal-title":"IEEE Trans. Inf. 
theory"},{"key":"B45","doi-asserted-by":"publisher","first-page":"530","DOI":"10.1109\/tit.1978.1055934","article-title":"Compression of individual sequences via variable-rate coding","volume":"24","author":"Ziv","year":"1978","journal-title":"IEEE Trans. Inf. Theory"}],"container-title":["Frontiers in Bioinformatics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2024.1489704\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,23]],"date-time":"2025-01-23T01:59:49Z","timestamp":1737597589000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2024.1489704\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,23]]},"references-count":45,"alternative-id":["10.3389\/fbinf.2024.1489704"],"URL":"https:\/\/doi.org\/10.3389\/fbinf.2024.1489704","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.08.24.264366","asserted-by":"object"}]},"ISSN":["2673-7647"],"issn-type":[{"value":"2673-7647","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,23]]},"article-number":"1489704"}}