{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T01:13:39Z","timestamp":1780708419038,"version":"3.54.1"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,6,19]],"date-time":"2023-06-19T00:00:00Z","timestamp":1687132800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61972441, 61972112, 61832004"],"award-info":[{"award-number":["61972441, 61972112, 61832004"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100021171","name":"Guangdong Basic and Applied Basic Research Foundation","doi-asserted-by":"crossref","award":["2021B1515020088"],"award-info":[{"award-number":["2021B1515020088"]}],"id":[{"id":"10.13039\/501100021171","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100017610","name":"Shenzhen Science and Technology Innovation Program","doi-asserted-by":"crossref","award":["RCYX20210609104510007, JCYJ20210324131203009, JCYJ20200109113427092, GXWD20201230155427003-20200821172511002"],"award-info":[{"award-number":["RCYX20210609104510007, JCYJ20210324131203009, JCYJ20200109113427092, GXWD20201230155427003-20200821172511002"]}],"id":[{"id":"10.13039\/501100017610","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies","award":["2022B1212010005"],"award-info":[{"award-number":["2022B1212010005"]}]},{"name":"HITSZ-J&A Joint Laboratory of Digital Design and Intelligent 
Fabrication","award":["HITSZ-J&A-2021A01"],"award-info":[{"award-number":["HITSZ-J&A-2021A01"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Storage"],"published-print":{"date-parts":[[2023,8,31]]},"abstract":"<jats:p>Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: \u2460 calculating the rolling hash byte by byte across data chunks and \u2461 applying multiple transforms on all of the calculated rolling hash values.<\/jats:p>\n          <jats:p>\n            \u00a0\u00a0In this article, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and a high compression ratio. Odess first utilizes a novel Subwindow-based Parallel Rolling (SWPR) hash method using Single Instruction Multiple Data\u00a0[\n            <jats:xref ref-type=\"bibr\">1<\/jats:xref>\n            ] (SIMD) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). 
Odess then uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set and quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead).\n          <\/jats:p>\n          <jats:p>\n            Evaluation results show that during the stage of resemblance detection, the Odess approach is \u223c31.4\u00d7 and \u223c7.9\u00d7 faster than the state-of-the-art N-Transform and Finesse (a recent variant of N-Transform\u00a0[\n            <jats:xref ref-type=\"bibr\">39<\/jats:xref>\n            ]), respectively. When considering an end-to-end data reduction storage system, the Odess-based system\u2019s throughput is about 3.20\u00d7 and 1.41\u00d7 higher than the N-Transform- and Finesse-based systems\u2019 throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving \u223c1.22\u00d7 higher compression ratio over Finesse.\n          <\/jats:p>","DOI":"10.1145\/3584663","type":"journal-article","created":{"date-parts":[[2023,2,16]],"date-time":"2023-02-16T11:36:07Z","timestamp":1676547367000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":98,"title":["The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4093-6391","authenticated-orcid":false,"given":"Wen","family":"Xia","sequence":"first","affiliation":[{"name":"Harbin Institute of Technology, Shenzhen; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9117-2590","authenticated-orcid":false,"given":"Lifeng","family":"Pu","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Shenzhen, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5104-8301","authenticated-orcid":false,"given":"Xiangyu","family":"Zou","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Shenzhen, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1235-0502","authenticated-orcid":false,"given":"Philip","family":"Shilane","sequence":"additional","affiliation":[{"name":"Dell Technologies, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8206-6916","authenticated-orcid":false,"given":"Shiyi","family":"Li","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Shenzhen, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1648-0227","authenticated-orcid":false,"given":"Haijun","family":"Zhang","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Shenzhen, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3512-0649","authenticated-orcid":false,"given":"Xuan","family":"Wang","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology, Shenzhen; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,6,19]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2019.09.012"},{"key":"e_1_3_1_3_2","first-page":"1","volume-title":"Proceedings of the Israeli Experimental Systems Conference (SYSTOR\u201909)","author":"Aronovich Lior","year":"2009","unstructured":"Lior Aronovich, Ron Asher, Eitan Bachmat, Haim Bitner, Michael Hirsch, and Shmuel T. Klein. 2009. The design of a similarity based deduplication system. 
In Proceedings of the Israeli Experimental Systems Conference (SYSTOR\u201909). ACM, Haifa, 1\u201314."},{"key":"e_1_3_1_4_2","first-page":"2","volume-title":"Data De-Duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations","author":"Asaro Tony","year":"2007","unstructured":"Tony Asaro and Heidi Biggar. 2007. Data De-Duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations. Technical Report. The Enterprise Strategy Group. 2\u201315."},{"key":"e_1_3_1_5_2","first-page":"21","volume-title":"Proceedings of Compression and Complexity of Sequences","author":"Broder Andrei Z.","year":"1997","unstructured":"Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences. IEEE, Salerno, Italy, 21\u201329."},{"key":"e_1_3_1_6_2","first-page":"1","article-title":"Identifying and filtering near-duplicate documents","volume":"1848","author":"Broder Andrei Z.","year":"1998","unstructured":"Andrei Z. Broder. 1998. Identifying and filtering near-duplicate documents. In Lecture Notes in Computer Science, Vol. 1848, Springer, 1\u201310.","journal-title":"Lecture Notes in Computer Science"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1006\/jcss.1999.1690"},{"key":"e_1_3_1_8_2","first-page":"113","volume-title":"Proceedings of the USENIX Annual Technical Conference (USENIX ATC\u201903).","author":"Douglis Fred","year":"2003","unstructured":"Fred Douglis and Arun Iyengar. 2003. Application-specific delta-encoding via resemblance detection. 
In Proceedings of the USENIX Annual Technical Conference (USENIX ATC\u201903). USENIX Association, San Antonio, TX, 113\u2013126."},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/1081870.1081916"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.5555\/2750482.2750507"},{"key":"e_1_3_1_11_2","first-page":"85","volume-title":"Proceedings of 2008 USENIX Symposium on Operating Systems Design and Implementations (OSDI\u201908)","author":"Gupta Diwaker","year":"2008","unstructured":"Diwaker Gupta, Sangmin Lee, Michael Vrable, Stefan Savage, Alex C. Snoeren, George Varghese, Geoffrey M. Voelker, and Amin Vahdat. 2008. Difference engine: Harnessing memory redundancy in virtual machines. In Proceedings of 2008 USENIX Symposium on Operating Systems Design and Implementations (OSDI\u201908). ACM, New York, NY, 85\u201393."},{"key":"e_1_3_1_12_2","unstructured":"Arne Holst. 2021. Volume of Data\/Information Created Captured Copied and Consumed Worldwide from 2010 to 2025. Statista 2 (2021)."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/JRPROC.1952.273898"},{"key":"e_1_3_1_14_2","first-page":"547","article-title":"Etude de la distribution florale dans une portion des alpes et du jura","volume":"37","author":"Jaccard Paul","year":"1901","unstructured":"Paul Jaccard. 1901. Etude de la distribution florale dans une portion des alpes et du jura. Bulletin de la Societe Vaudoise des Sciences Naturelles 37 (1901), 547\u2013579.","journal-title":"Bulletin de la Societe Vaudoise des Sciences Naturelles"},{"key":"e_1_3_1_15_2","first-page":"21","volume-title":"Proceedings of 4th USENIX Conference on File and Storage Technologies (FAST\u201905)","author":"Jain Navendu","year":"2005","unstructured":"Navendu Jain, Michael Dahlin, and Renu Tewari. 2005. TAPER: Tiered approach for eliminating redundancy in replica synchronization. In Proceedings of 4th USENIX Conference on File and Storage Technologies (FAST\u201905). 
USENIX Association, San Francisco, CA, 21\u201335."},{"key":"e_1_3_1_16_2","first-page":"59","volume-title":"Proceedings of the USENIX Conference on USENIX Annual Technical Conference (USENIX ATC\u201904)","author":"Kulkarni Purushottam","year":"2004","unstructured":"Purushottam Kulkarni, Fred Douglis, Jason D. LaVoie, and John M. Tracey. 2004. Redundancy elimination within large collections of files. In Proceedings of the USENIX Conference on USENIX Annual Technical Conference (USENIX ATC\u201904). USENIX Association, Boston, MA, 59\u201372."},{"key":"e_1_3_1_17_2","first-page":"111","volume-title":"Proceedings of the 7th USENIX Conference on File and Storage Technologies (USENIX FAST\u201909)","author":"Lillibridge Mark","year":"2009","unstructured":"Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezis, and Peter Camble. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (USENIX FAST\u201909). USENIX Association, San Francisco, CA, 111\u2013123."},{"key":"e_1_3_1_18_2","first-page":"256","volume-title":"Proceedings of 12th USENIX Conference on File and Storage Technologies (USENIX FAST\u201914)","author":"Lin Xing","year":"2014","unstructured":"Xing Lin, Guanlin Lu, Fred Douglis, Philip Shilane, and Grant Wallace. 2014. Migratory compression: Coarse-grained data reordering to improve compressibility. In Proceedings of 12th USENIX Conference on File and Storage Technologies (USENIX FAST\u201914). USENIX Association, Santa Clara, CA, 256\u2013273."},{"key":"e_1_3_1_19_2","volume-title":"File System Support for Delta Compression","author":"MacDonald Josh","year":"2000","unstructured":"Josh MacDonald. 2000. File System Support for Delta Compression. Master\u2019s thesis. 
Department of Electrical Engineering and Computer Science, University of California at Berkeley."},{"key":"e_1_3_1_20_2","first-page":"15:1","volume-title":"Proceedings of the 6th International Systems and Storage Conference (SYSTOR\u201913)","author":"Meister Dirk","year":"2013","unstructured":"Dirk Meister, J\u00fcrgen Kaiser, and Andr\u00e9 Brinkmann. 2013. Block locality caching for data deduplication. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR\u201913). ACM, New York, NY, 15:1\u201315:12."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/502059.502052"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3357223.3362731"},{"key":"e_1_3_1_23_2","first-page":"86","volume-title":"Proceedings of the 12th ACM International Conference on Systems and Storage","author":"Ni Fan","year":"2019","unstructured":"Fan Ni, Xing Lin, and Song Jiang. 2019. SS-CDC: A two-stage parallel content-defined chunking for deduplicating backup storage. In Proceedings of the 12th ACM International Conference on Systems and Storage. ACM, New York, NY, 86\u201396."},{"key":"e_1_3_1_24_2","first-page":"15","volume-title":"Proceedings of 4th USENIX Symposium on Network System Design and Implementation (NSDI\u201907)","author":"Pucha Himabindu","year":"2007","unstructured":"Himabindu Pucha, David G. Andersen, and Michael Kaminsky. 2007. Exploiting similarity for multi-source downloads using file handprints. In Proceedings of 4th USENIX Symposium on Network System Design and Implementation (NSDI\u201907). USENIX Association, Cambridge, MA, 15\u201328."},{"key":"e_1_3_1_25_2","first-page":"89","volume-title":"Proceedings of the 1st USENIX Conference on File and Storage Technologies (USENIX FAST\u201902)","author":"Quinlan Sean","year":"2002","unstructured":"Sean Quinlan and Sean Dorward. 2002. Venti: A new approach to archival data storage. 
In Proceedings of the 1st USENIX Conference on File and Storage Technologies (USENIX FAST\u201902). USENIX Association, Monterey, CA, 89\u2013101."},{"key":"e_1_3_1_26_2","volume-title":"Fingerprinting by Random Polynomials","author":"Rabin Michael O.","year":"1981","unstructured":"Michael O. Rabin. 1981. Fingerprinting by Random Polynomials. Technical Report. Center for Research in Computing Technology, Harvard University."},{"key":"e_1_3_1_27_2","first-page":"28","article-title":"The digitization of the world from edge to core","volume":"16","author":"Reinsel David","year":"2018","unstructured":"David Reinsel, John Gantz, and John Rydning. 2018. The digitization of the world from edge to core. IDC White Paper 16 (2018), 28 pages.","journal-title":"IDC White Paper"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/2385603.2385606"},{"key":"e_1_3_1_29_2","first-page":"16","volume-title":"Proceedings of 4th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage\u201912)","author":"Shilane Philip","year":"2012","unstructured":"Philip Shilane, Grant Wallace, Mark Huang, and Windsor Hsu. 2012. Delta compressed and deduplicated storage using stream-informed locality. In Proceedings of 4th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage\u201912). USENIX Association, Boston, MA, 16 pages."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.5555\/2208461.2208465"},{"key":"e_1_3_1_31_2","first-page":"446","volume-title":"Proceedings of 39th IEEE International Conference on Data Engineering (IEEE ICDE\u201913)","author":"Wildani Avani","year":"2013","unstructured":"Avani Wildani, Ethan L. Miller, and Ohad Rodeh. 2013. HANDS: A heuristically arranged non-backup in-line deduplication system. In Proceedings of 39th IEEE International Conference on Data Engineering (IEEE ICDE\u201913). 
IEEE, Brisbane, QLD, Australia, 446\u2013457."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2016.2571298"},{"key":"e_1_3_1_33_2","first-page":"26","volume-title":"Proceedings of the USENIX Annual Technical Conference (USENIX ATC\u201911)","author":"Xia Wen","year":"2011","unstructured":"Wen Xia, Hong Jiang, Dan Feng, and Yu Hua. 2011. SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC\u201911). USENIX Association, Portland, OR, 26\u201330."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2015.2456015"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.peva.2014.07.016"},{"key":"e_1_3_1_36_2","first-page":"101","volume-title":"Proceedings of the USENIX Annual Technical Conference (USENIX ATC\u201916)","author":"Xia Wen","year":"2016","unstructured":"Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Qing Liu, and Yucheng Zhang. 2016. FastCDC: A fast and efficient content-defined chunking approach for data deduplication. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC\u201916). USENIX Association, Denver, CO, 101\u2013114."},{"key":"e_1_3_1_37_2","first-page":"1355","volume-title":"Proceedings of 2017 ACM Conference on Management of Data (ACM SIGMOD\u201917)","author":"Xu Lianghong","year":"2017","unstructured":"Lianghong Xu, Andrew Pavlo, Sudipta Sengupta, and Gregory R. Ganger. 2017. Online deduplication for databases. In Proceedings of 2017 ACM Conference on Management of Data (ACM SIGMOD\u201917). ACM, New York, NY, 1355\u20131368."},{"key":"e_1_3_1_38_2","first-page":"222","volume-title":"Proceedings of 2015 ACM Symposium on Cloud Computing (SoCC\u201915)","author":"Xu Lianghong","year":"2015","unstructured":"Lianghong Xu, Andrew Pavlo, Sudipta Sengupta, Jin Li, and Gregory R. Ganger. 2015. 
Reducing replication bandwidth for distributed document databases. In Proceedings of 2015 ACM Symposium on Cloud Computing (SoCC\u201915). ACM, New York, NY, 222\u2013235."},{"key":"e_1_3_1_39_2","first-page":"804","volume-title":"Proceedings of the 21st International Conference on Data Engineering (ICDE\u201905)","author":"You Lawrence L.","year":"2005","unstructured":"Lawrence L. You, Kristal T. Pollack, and Darrell D. E. Long. 2005. Deep store: An archival storage system architecture. In Proceedings of the 21st International Conference on Data Engineering (ICDE\u201905). IEEE, Tokyo, Japan, 804\u2013815."},{"key":"e_1_3_1_40_2","first-page":"121","volume-title":"Proceedings of 17th USENIX Conference on File and Storage Technologies (USENIX FAST\u201919)","author":"Zhang Yucheng","year":"2019","unstructured":"Yucheng Zhang, Wen Xia, Dan Feng, Hong Jiang, Yu Hua, and Qiang Wang. 2019. Finesse: Fine-grained feature locality based fast resemblance detection for post-deduplication delta compression. In Proceedings of 17th USENIX Conference on File and Storage Technologies (USENIX FAST\u201919). USENIX Association, Boston, MA, 121\u2013128."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.5555\/1515915.1515917"},{"key":"e_1_3_1_42_2","first-page":"269","volume-title":"Proceedings of the 6th USENIX Conference on File and Storage Technologies (USENIX FAST\u201908)","author":"Zhu Benjamin","year":"2008","unstructured":"Benjamin Zhu, Kai Li, and Hugo Patterson. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (USENIX FAST\u201908). 
USENIX Association, San Jose, California, 269\u2013282."},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1977.1055714"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1978.1055934"},{"key":"e_1_3_1_45_2","first-page":"31","volume-title":"CRC Standard Probability and Statistics Tables and Formulae","author":"Zwillinger Daniel","year":"1999","unstructured":"Daniel Zwillinger and Stephen Kokoska. 1999. CRC Standard Probability and Statistics Tables and Formulae. Crc Press, New York, NY, USA. 31\u201332."}],"container-title":["ACM Transactions on Storage"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3584663","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3584663","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T17:49:59Z","timestamp":1750268999000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3584663"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,19]]},"references-count":44,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,8,31]]}},"alternative-id":["10.1145\/3584663"],"URL":"https:\/\/doi.org\/10.1145\/3584663","relation":{},"ISSN":["1553-3077","1553-3093"],"issn-type":[{"value":"1553-3077","type":"print"},{"value":"1553-3093","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,19]]},"assertion":[{"value":"2022-04-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-01-21","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2023-06-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}