{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T17:15:47Z","timestamp":1767978947967,"version":"3.49.0"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"1","funder":[{"name":"Major Key Project of PCL","award":["PCL2024A05"],"award-info":[{"award-number":["PCL2024A05"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62472127"],"award-info":[{"award-number":["62472127"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Shenzhen Science and Technology Program","award":["GXWD20231128111309001, KJZD20231023094701003"],"award-info":[{"award-number":["GXWD20231128111309001, KJZD20231023094701003"]}]},{"DOI":"10.13039\/501100021171","name":"GuangDong Basic and Applied Basic Research Foundation","doi-asserted-by":"crossref","award":["2023A1515110072"],"award-info":[{"award-number":["2023A1515110072"]}],"id":[{"id":"10.13039\/501100021171","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Storage"],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>For data reduction techniques used in storage systems, delta compression is often implemented after deduplication, having been shown to achieve a much higher compression ratio by efficiently detecting and compressing similar data chunks. Unfortunately, existing resemblance detection approaches cannot maintain both high throughput and data reduction ratio simultaneously since they either introduce heavy calculation overhead or generate useless features that reduce the accuracy of resemblance detection. In this article, we propose Argus, a fast and precise resemblance detection approach using two primary techniques to improve data-reduction efficiency significantly. First, Argus utilizes a Bin-Wise Partitioning strategy, which separates the rolling hash values of each data chunk into different subsets according to the suffix bits of the hash value and generates features from these subsets. Thus, Argus generates features more efficiently, which both achieves high detection accuracy and improves feature generation speed. Second, based on the efficient Bin-Wise Partitioning strategy, Argus utilizes fine-grained Gear rolling hash and a Plain Feature strategy to manage the granularity of the content represented in the feature, increasing the probability of feature matching and catching as many similar chunks as possible. Consequently, Argus can achieve better detection accuracy, resulting in a much higher compression ratio than previous works while minimizing the computational overhead for resemblance detection.<\/jats:p>\n                  <jats:p>Our evaluation results driven by several real-world datasets suggest that, compared to the state-of-the-art approaches, Argus achieves up to 1.64 \u00d7 (DeepSketch) and 2.29 \u00d7 (Finesse, Odess, and N-Transform) higher delta compression ratio and achieves up to 19.9 \u00d7 (N-Transform), 5.57 \u00d7 (Finesse), and 1.18 \u00d7 (Odess) faster feature generation speed.<\/jats:p>","DOI":"10.1145\/3747839","type":"journal-article","created":{"date-parts":[[2025,7,14]],"date-time":"2025-07-14T11:43:18Z","timestamp":1752493398000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Argus: A Precise and Efficient Resemblance Detection for Post-Deduplication Delta Compression"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-5358-2082","authenticated-orcid":false,"given":"Han","family":"Xu","sequence":"first","affiliation":[{"name":"Harbin Institute of Technology Shenzhen","place":["Shenzhen, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5104-8301","authenticated-orcid":false,"given":"Xiangyu","family":"Zou","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology Shenzhen","place":["Shenzhen, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-5004-013X","authenticated-orcid":false,"given":"Yunsheng","family":"Dong","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology Shenzhen","place":["Shenzhen, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1235-0502","authenticated-orcid":false,"given":"Philip","family":"Shilane","sequence":"additional","affiliation":[{"name":"Dell Technologies","place":["Newtown, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-7832-0599","authenticated-orcid":false,"given":"Yanqi","family":"Pan","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology Shenzhen","place":["Shenzhen, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7717-6990","authenticated-orcid":false,"given":"Cai","family":"Deng","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology Shenzhen","place":["Shenzhen, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4093-6391","authenticated-orcid":false,"given":"Wen","family":"Xia","sequence":"additional","affiliation":[{"name":"School of Computer Science, Harbin Institute of Technology Shenzhen","place":["Shenzhen, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,1,9]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"21","volume-title":"Proceedings of the of Compression and Complexity of Sequences","author":"Broder Andrei Z.","year":"1997","unstructured":"Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings of the of Compression and Complexity of Sequences. IEEE, Salerno, Italy, 21\u201329."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.14778\/2983200.2983203"},{"key":"e_1_3_1_4_2","article-title":"xxHash: Extremely fast hash algorithm","author":"Collet Yann","year":"2016","unstructured":"Yann Collet. 2016. xxHash: Extremely fast hash algorithm. GitHub https:\/\/github. com\/Cyan4973\/xxHash (2012-2022) (2016).","journal-title":"GitHub https:\/\/github. com\/Cyan4973\/xxHash (2012-2022)"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1921015"},{"key":"e_1_3_1_6_2","first-page":"1071","volume-title":"Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE)","author":"Deng Cai","year":"2022","unstructured":"Cai Deng, Qi Chen, Xiangyu Zou, Erci Xu, Bo Tang, and Wen Xia. 2022. imDedup: A lossless deduplication scheme to eliminate fine-grained redundancy among images. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 1071\u20131084."},{"key":"e_1_3_1_7_2","first-page":"113","volume-title":"Proceedings of the USENIX ATC\u201903","author":"Douglis Fred","year":"2003","unstructured":"Fred Douglis and Arun Iyengar. 2003. Application-specific delta-encoding via resemblance detection. In Proceedings of the USENIX ATC\u201903. USENIX Association, San Antonio, TX, USA, 113\u2013126."},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.5555\/2342821.2342847"},{"key":"e_1_3_1_9_2","unstructured":"Jisung Park Jeonggyun Kim Yeseong Kim Sungjin Lee and Onur Mutlu. 2018. DeepSketch. Retrieved July 18 2025 from https:\/\/github.com\/dgist-datalab\/deepsketch-fast2022"},{"key":"e_1_3_1_10_2","first-page":"85","volume-title":"Proceedings of the OSDI\u201908","author":"Gupta Diwaker","year":"2008","unstructured":"Diwaker Gupta, Sangmin Lee, Michael Vrable, Stefan Savage, Alex C. Snoeren, George Varghese, Geoffrey M Voelker, and Amin Vahdat. 2008. Difference Engine: Harnessing memory redundancy in virtual machines. In Proceedings of the OSDI\u201908. Association for Computing Machinery, New York, NY, USA, 85\u201393."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/JRPROC.1952.273898"},{"key":"e_1_3_1_12_2","first-page":"547","article-title":"Etude de la distribution florale dans une portion des alpes et du jura","volume":"37","author":"Jaccard Paul","year":"1901","unstructured":"Paul Jaccard. 1901. Etude de la distribution florale dans une portion des alpes et du jura. Bulletin de la Societe Vaudoise des Sciences Naturelles 37 (1901), 547\u2013579.","journal-title":"Bulletin de la Societe Vaudoise des Sciences Naturelles"},{"key":"e_1_3_1_13_2","first-page":"14","volume-title":"Proceedings of the FAST\u201905","author":"Jain Navendu","year":"2005","unstructured":"Navendu Jain, Michael Dahlin, and Renu Tewari. 2005. TAPER: Tiered approach for eliminating redundancy in replica synchronization. In Proceedings of the FAST\u201905. USENIX Association, San Francisco, CA, USA, 14 pages."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2024.3511334"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551852"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2015.7301269"},{"key":"e_1_3_1_17_2","volume-title":"File System Support for Delta Compression","author":"MacDonald Josh","year":"2000","unstructured":"Josh MacDonald. 2000. File System Support for Delta Compression. Master\u2019s thesis. Department of Electrical Engineering and Computer Science, University of California at Berkeley."},{"key":"e_1_3_1_18_2","doi-asserted-by":"crossref","unstructured":"Muthitacharoen Athicha Benjie Chen and David Mazieres. 2001. A low-bandwidth network file system. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles. Association for Computing Machinery New York NY USA 172\u2013187.","DOI":"10.1145\/502034.502052"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2024.3435681"},{"key":"e_1_3_1_20_2","first-page":"247","volume-title":"Proceedings of the 20th USENIX Conference on File and Storage Technologies (FAST 22)","author":"Park Jisung","year":"2022","unstructured":"Jisung Park, Jeonggyun Kim, Yeseong Kim, Sungjin Lee, and Onur Mutlu. 2022. DeepSketch: A new machine Learning-Based reference search technique for Post-Deduplication delta compression. In Proceedings of the 20th USENIX Conference on File and Storage Technologies (FAST 22). USENIX Association, Santa Clara, CA, 247\u2013264."},{"key":"e_1_3_1_21_2","doi-asserted-by":"crossref","unstructured":"Gennady Pekhimenko Vivek Seshadri Onur Mutlu Phillip Gibbons Michael Kozuch and Todd Mowry. 2012. Base-delta-immediate compression: Practical data compression for on-chip caches. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. Association for Computing Machinery New York NY USA 377\u2013388.","DOI":"10.1145\/2370816.2370870"},{"key":"e_1_3_1_22_2","first-page":"15","volume-title":"Proceedings of the NSDI\u201907","author":"Pucha Himabindu","year":"2007","unstructured":"Himabindu Pucha, David G. Andersen, and Michael Kaminsky. 2007. Exploiting similarity for multi-source downloads using file handprints. In Proceedings of the NSDI\u201907. USENIX Association, Cambridge, MA, USA, 15\u201328."},{"key":"e_1_3_1_23_2","volume-title":"Fingerprinting by Random Polynomials","author":"Rabin Michael O.","year":"1981","unstructured":"Michael O. Rabin. 1981. Fingerprinting by Random Polynomials. Technical Report. Center for Research in Computing Technology, Harvard University."},{"key":"e_1_3_1_24_2","doi-asserted-by":"crossref","first-page":"68200E","DOI":"10.1117\/12.766171","volume-title":"Proceedings of the Multimedia Content Access: Algorithms and Systems II","author":"Sarkar Anindya","year":"2008","unstructured":"Anindya Sarkar, Pratim Ghosh, Emily Moxley, and B. S. Manjunath. 2008. Video fingerprinting: Features for duplicate and similar video detection and query-based video retrieval. In Proceedings of the Multimedia Content Access: Algorithms and Systems II. International Society for Optics and Photonics, SPIE, 68200E."},{"key":"e_1_3_1_25_2","first-page":"714","volume-title":"Proceedings of the 17th European Conference on Computer Systems (EuroSys \u201922)","author":"Saxena Divyanshu","year":"2022","unstructured":"Divyanshu Saxena, Tao Ji, Arjun Singhvi, Junaid Khalid, and Aditya Akella. 2022. Memory deduplication for serverless computing with medes. In Proceedings of the 17th European Conference on Computer Systems (EuroSys \u201922). Association for Computing Machinery, New York, NY, USA, 714\u2013729."},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/2385603.2385606"},{"key":"e_1_3_1_27_2","first-page":"5","volume-title":"Proceedings of the HotStorage\u201912","author":"Shilane Philip","year":"2012","unstructured":"Philip Shilane, Grant Wallace, Mark Huang, and Windsor Hsu. 2012. Delta Compressed and Deduplicated Storage Using Stream-Informed Locality. In Proceedings of the HotStorage\u201912. USENIX Association, Boston, MA, USA, 5 pages."},{"key":"e_1_3_1_28_2","doi-asserted-by":"crossref","unstructured":"Jan E. Trost. 1986. Statistically nonrepresentative stratified sampling: A sampling technique for qualitative studies. Qualitative Sociology 9 1 (1986) 54\u201357.","DOI":"10.1007\/BF00988249"},{"key":"e_1_3_1_29_2","first-page":"446","volume-title":"Proceedings of the IEEE ICDE\u201913","author":"Wildani Avani","year":"2013","unstructured":"Avani Wildani, Ethan L. Miller, and Ohad Rodeh. 2013. HANDS: A Heuristically Arranged Non-Backup in-Line Deduplication System. In Proceedings of the IEEE ICDE\u201913. IEEE, Brisbane, QLD, Australia, 446\u2013457."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2016.2571298"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.peva.2014.07.016"},{"key":"e_1_3_1_32_2","first-page":"101","volume-title":"Proceedings of the USENIX ATC\u201916","author":"Xia Wen","year":"2016","unstructured":"Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Qing Liu, and Yucheng Zhang. 2016. FastCDC: A fast and efficient content-defined chunking approach for data deduplication. In Proceedings of the USENIX ATC\u201916. USENIX Association, Denver, CO, USA, 101\u2013114."},{"key":"e_1_3_1_33_2","first-page":"1355","volume-title":"Proceedings of the ACM SIGMOD\u201917","author":"Xu Lianghong","year":"2017","unstructured":"Lianghong Xu, Andrew Pavlo, Sudipta Sengupta, and Gregory R. Ganger. 2017. Online Deduplication for Databases. In Proceedings of the ACM SIGMOD\u201917. Association for Computing Machinery, New York, NY, USA, 1355\u20131368."},{"key":"e_1_3_1_34_2","first-page":"121","volume-title":"Proceedings of the FAST\u201919","author":"Zhang Yucheng","year":"2019","unstructured":"Yucheng Zhang, Wen Xia, Dan Feng, Hong Jiang, Yu Hua, and Qiang Wang. 2019. Finesse: Fine-grained feature locality based fast resemblance detection for post-deduplication delta compression. In Proceedings of the FAST\u201919. USENIX Association, Boston, MA, USA, 121\u2013128."},{"key":"e_1_3_1_35_2","first-page":"1","volume-title":"Proceedings of the Middleware\u201903","author":"Zhou Feng","year":"2003","unstructured":"Feng Zhou, Li Zhuang, Ben Y. Zhao, Ling Huang, Anthony D. Joseph, and John Kubiatowicz. 2003. Approximate object location and spam filtering on peer-to-peer systems. In Proceedings of the Middleware\u201903. Springer, Berlin,1\u201320."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1978.1055934"},{"key":"e_1_3_1_37_2","volume-title":"Proceedings of the ICDE\u201921","author":"Zou Xiangyu","year":"2021","unstructured":"Xiangyu Zou, Cai Deng, Wen Xia, Philip Shilane, Haoliang Tan, Haijun Zhang, and Xuan Wang. 2021. Odess: Speeding up resemblance detection for redundancy elimination by fast content-defined sampling. In Proceedings of the ICDE\u201921."},{"key":"e_1_3_1_38_2","first-page":"480","volume-title":"Proceedings of the 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19-22, 2021","author":"Zou Xiangyu","year":"2021","unstructured":"Xiangyu Zou, Cai Deng, Wen Xia, Philip Shilane, Haoliang Tan, Haijun Zhang, and Xuan Wang. 2021. Odess: Speeding up resemblance detection for redundancy elimination by fast content-defined sampling. In Proceedings of the 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19-22, 2021. IEEE, 480\u2013491."},{"key":"e_1_3_1_39_2","first-page":"19","volume-title":"Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"Zou Xiangyu","year":"2022","unstructured":"Xiangyu Zou, Wen Xia, Philip Shilane, Haijun Zhang, and Xuan Wang. 2022. Building a high-performance fine-grained deduplication framework for backup storage with high deduplication ratio. In Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 19\u201336."},{"key":"e_1_3_1_40_2","volume-title":"CRC Standard Probability and Statistics Tables and Formulae","author":"Zwillinger Daniel","year":"1999","unstructured":"Daniel Zwillinger and Stephen Kokoska. 1999. CRC Standard Probability and Statistics Tables and Formulae. CRC Press, New York, NY, USA."}],"container-title":["ACM Transactions on Storage"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3747839","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T13:05:01Z","timestamp":1767963901000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3747839"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,9]]},"references-count":39,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3747839"],"URL":"https:\/\/doi.org\/10.1145\/3747839","relation":{},"ISSN":["1553-3077","1553-3093"],"issn-type":[{"value":"1553-3077","type":"print"},{"value":"1553-3093","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,9]]},"assertion":[{"value":"2024-01-19","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-24","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}