{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,13]],"date-time":"2026-03-13T08:53:11Z","timestamp":1773391991183,"version":"3.50.1"},"reference-count":67,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,6,13]],"date-time":"2023-06-13T00:00:00Z","timestamp":1686614400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF","award":["2212629"],"award-info":[{"award-number":["2212629"]}]},{"name":"NSF","award":["2152908"],"award-info":[{"award-number":["2152908"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2023,6,13]]},"abstract":"<jats:p>Recent studies show that large language models (LLMs) unintentionally memorize part of their training data, which poses serious privacy risks. For example, it has been shown that over 1% of the tokens generated unprompted by an LLM are part of sequences in the training data. However, current studies mainly focus on exact memorization behaviors. In this paper, we propose to evaluate how many generated texts have near-duplicates (e.g., sequences differing by only a couple of tokens out of 100) in the training corpus. A major challenge of conducting this evaluation is the huge computational cost incurred by near-duplicate sequence searches, because modern LLMs are trained on increasingly large corpora containing up to 1 trillion tokens. Worse still, the number of sequences in a text is quadratic in the text length. To address this issue, we develop an efficient and scalable near-duplicate sequence search algorithm. It can find (almost) all the near-duplicate sequences of the query sequence in a large corpus with guarantees. 
Specifically, the algorithm generates and groups the min-hash values of all the sequences in the corpus with at least t tokens (as very short near-duplicates are often irrelevant noise) in time linear in the corpus size. We formally prove that only 2(n+1)\/(t+1) - 1 min-hash values are generated in expectation for a text with n tokens. Thus the index time and size are reasonable. When a query arrives, we find all the sequences sharing enough min-hash values with the query using inverted indexes and prefix filtering. Extensive experiments on several large real-world LLM training corpora show that our near-duplicate sequence search algorithm is efficient and scalable.<\/jats:p>","DOI":"10.1145\/3589324","type":"journal-article","created":{"date-parts":[[2023,6,20]],"date-time":"2023-06-20T20:26:45Z","timestamp":1687292805000},"page":"1-18","source":"Crossref","is-referenced-by-count":6,"title":["Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4182-0075","authenticated-orcid":false,"given":"Zhencan","family":"Peng","sequence":"first","affiliation":[{"name":"Rutgers University, New Brunswick, NJ, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2223-9621","authenticated-orcid":false,"given":"Zhizhi","family":"Wang","sequence":"additional","affiliation":[{"name":"Rutgers University, New Brunswick, NJ, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4596-3850","authenticated-orcid":false,"given":"Dong","family":"Deng","sequence":"additional","affiliation":[{"name":"Rutgers University, New Brunswick, NJ, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,6,20]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation","author":"Agirre Eneko","unstructured":"Eneko Agirre, Carmen Banea, Daniel M. Cer, Mona T. 
Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In SEMEVAL. The Association for Computer Linguistics, 497--511."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00224-004--1155--5"},{"key":"e_1_2_2_3_1","volume-title":"Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041","author":"Artetxe Mikel","year":"2017","unstructured":"Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041 (2017)."},{"key":"e_1_2_2_4_1","volume-title":"Ribeiro-Neto","author":"Baeza-Yates Ricardo A.","year":"1999","unstructured":"Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. 1999. Modern Information Retrieval. ACM Press \/ Addison-Wesley. http:\/\/www.dcc.ufmg.br\/irbook\/"},{"key":"e_1_2_2_5_1","doi-asserted-by":"crossref","unstructured":"Roberto J. Bayardo Yiming Ma and Ramakrishnan Srikant. 2007. Scaling up all pairs similarity search. In WWW. 131--140.","DOI":"10.1145\/1242572.1242591"},{"key":"e_1_2_2_6_1","volume-title":"Carey","author":"Behm Alexander","year":"2011","unstructured":"Alexander Behm, Chen Li, and Michael J. Carey. 2011. Answering approximate string queries on large data sets using external memory. In ICDE. 888--899."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jalgor.2005.08.001"},{"key":"e_1_2_2_8_1","volume-title":"Copy Detection Mechanisms for Digital Documents","author":"Brin Sergey","unstructured":"Sergey Brin, James Davis, and Hector Garcia-Molina. 1995. Copy Detection Mechanisms for Digital Documents. In SIGMOD. ACM Press, 398--409."},{"key":"e_1_2_2_9_1","volume-title":"On the resemblance and containment of documents","author":"Broder Andrei Z.","unstructured":"Andrei Z. Broder. 1997. On the resemblance and containment of documents. In SEQUENCES. 
IEEE, 21--29."},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0169--7552(97)00031--7"},{"key":"e_1_2_2_11_1","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NIPS 33 (2020), 1877--1901.","journal-title":"NIPS"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP46214.2022.9833649"},{"key":"e_1_2_2_13_1","volume-title":"Quantifying Memorization Across Neural Language Models. CoRR abs\/2202.07646","author":"Carlini Nicholas","year":"2022","unstructured":"Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tram\u00e8r, and Chiyuan Zhang. 2022. Quantifying Memorization Across Neural Language Models. CoRR abs\/2202.07646 (2022). arXiv:2202.07646 https:\/\/arxiv.org\/abs\/2202.07646"},{"key":"e_1_2_2_14_1","volume-title":"30th USENIX Security Symposium, USENIX Security 2021","author":"Carlini Nicholas","year":"2021","unstructured":"Nicholas Carlini, Florian Tram\u00e8r, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, \u00dalfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from Large Language Models. In 30th USENIX Security Symposium, USENIX Security 2021, August 11--13, 2021. USENIX Association, 2633--2650. https:\/\/www.usenix.org\/conference\/usenixsecurity21\/presentation\/carlini-extracting"},{"key":"e_1_2_2_15_1","volume-title":"Charles Sutton, Sebastian Gehrmann, et al.","author":"Chowdhery Aakanksha","year":"2022","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. 
Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)."},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/506309.506311"},{"key":"e_1_2_2_17_1","doi-asserted-by":"crossref","unstructured":"Jack G Conrad Xi S Guo and Cindy P Schriber. 2003. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In CIKM. 443--452.","DOI":"10.1145\/956863.956946"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2593675"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/2856318.2856330"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/2977797.2977808"},{"key":"e_1_2_2_21_1","doi-asserted-by":"crossref","unstructured":"Dong Deng Yufei Tao and Guoliang Li. 2018. Overlap Set Similarity Joins with Theoretical Guarantees. In SIGMOD. ACM 905--920.","DOI":"10.1145\/3183713.3183748"},{"key":"e_1_2_2_22_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457548"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/978--3--642--12200--2_16"},{"key":"e_1_2_2_26_1","first-page":"2","article-title":"A New Algorithm for Data Compression","volume":"12","author":"Gage Philip","year":"1994","unstructured":"Philip Gage. 1994. A New Algorithm for Data Compression. C Users J. 12, 2 (feb 1994), 23--38.","journal-title":"C Users J."},{"key":"e_1_2_2_27_1","volume-title":"The Pile: An 800GB Dataset of Diverse Text for Language Modeling. 
CoRR abs\/2101.00027","author":"Gao Leo","year":"2021","unstructured":"Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR abs\/2101.00027 (2021). arXiv:2101.00027 https:\/\/arxiv.org\/abs\/2101.00027"},{"key":"e_1_2_2_28_1","volume-title":"Matthew Lee Hinman, and Roy Russo","author":"Gheorghe Radu","year":"2015","unstructured":"Radu Gheorghe, Matthew Lee Hinman, and Roy Russo. 2015. Elasticsearch in action. Manning Shelter Island, NY."},{"key":"e_1_2_2_29_1","unstructured":"Aaron Gokaslan and Vanya Cohen. [n.d.]. OpenWebText Corpus."},{"key":"e_1_2_2_30_1","volume-title":"Search engine society","author":"Halavais Alexander","unstructured":"Alexander Halavais. 2017. Search engine society. John Wiley & Sons."},{"key":"e_1_2_2_31_1","doi-asserted-by":"crossref","unstructured":"Ossama Abdel Hamid Behshad Behzadi Stefan Christoph and Monika Rauch Henzinger. 2009. Detecting the origin of text segments efficiently. In WWW. ACM 61--70.","DOI":"10.1145\/1526709.1526719"},{"key":"e_1_2_2_32_1","first-page":"7","article-title":"Microsoft SQL server full-text search","volume":"24","author":"Hamilton James R.","year":"2001","unstructured":"James R. Hamilton and Tapas K. Nayak. 2001. Microsoft SQL server full-text search. IEEE Data Eng. Bull. 24, 4 (2001), 7--10.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1002\/asi.10170"},{"key":"e_1_2_2_34_1","unstructured":"Shengyue Ji Guoliang Li Chen Li and Jianhua Feng. 2009. Efficient Interactive Fuzzy Keyword Search. In WWW. 433--439."},{"key":"e_1_2_2_35_1","volume-title":"ICML (Proceedings of Machine Learning Research)","volume":"162","author":"Kandpal Nikhil","year":"2022","unstructured":"Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. 
Deduplicating Training Data Mitigates Privacy Risks in Language Models. In ICML (Proceedings of Machine Learning Research), Vol. 162. PMLR, 10697--10707. https:\/\/proceedings.mlr.press\/v162\/kandpal22a.html"},{"key":"e_1_2_2_36_1","doi-asserted-by":"crossref","unstructured":"Jong Wook Kim K. Sel\u00e7uk Candan and Jun'ichi Tatemura. 2009. Efficient overlap and content reuse detection in blogs and online news articles. In WWW. ACM 81--90.","DOI":"10.1145\/1526709.1526721"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00276"},{"key":"e_1_2_2_38_1","unstructured":"Katherine Lee Daphne Ippolito Andrew Nystrom Chiyuan Zhang Douglas Eck Chris Callison-Burch and Nicholas Carlini. 2022. Deduplicating Training Data Makes Language Models Better. In ACL. 8424--8445."},{"key":"e_1_2_2_39_1","volume-title":"Ullman","author":"Leskovec Jure","year":"2014","unstructured":"Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. 2014. Mining of Massive Datasets, 2nd Ed. Cambridge University Press.","edition":"2"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989379"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/2078331.2078340"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-011-0218-x"},{"key":"e_1_2_2_43_1","volume-title":"Art B. Owen, and Cun-Hui Zhang","author":"Li Ping","year":"2012","unstructured":"Ping Li, Art B. Owen, and Cun-Hui Zhang. 2012. One Permutation Hashing. In NIPS. 3122--3130."},{"key":"e_1_2_2_44_1","volume-title":"USENIX","author":"Manber Udi","year":"1994","unstructured":"Udi Manber. 1994. Finding Similar Files in a Large File System. In USENIX Winter 1994 Technical Conference. USENIX Association, 1--10."},{"key":"e_1_2_2_45_1","volume-title":"How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. CoRR abs\/2111.09509","author":"McCoy R. Thomas","year":"2021","unstructured":"R. 
Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz. 2021. How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. CoRR abs\/2111.09509 (2021). arXiv:2111.09509 https:\/\/arxiv.org\/abs\/2111.09509"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2010-343"},{"key":"e_1_2_2_47_1","volume-title":"Overview of the 2nd International Competition on Plagiarism Detection. In CLEF 2010 LABs and Workshops, Notebook Papers (CEUR Workshop Proceedings)","volume":"1176","author":"Potthast Martin","year":"2010","unstructured":"Martin Potthast, Alberto Barr\u00f3n-Cede\u00f1o, Andreas Eiselt, Benno Stein, and Paolo Rosso. 2010. Overview of the 2nd International Competition on Plagiarism Detection. In CLEF 2010 LABs and Workshops, Notebook Papers (CEUR Workshop Proceedings), Vol. 1176. CEUR-WS.org."},{"key":"e_1_2_2_48_1","doi-asserted-by":"crossref","unstructured":"David MW Powers. 1998. Applications and explanations of Zipf's law. In New methods in language processing and computational natural language learning.","DOI":"10.3115\/1603899.1603924"},{"key":"e_1_2_2_49_1","volume-title":"et al","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al . 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9."},{"key":"e_1_2_2_50_1","volume-title":"Liu","author":"Raffel Colin","year":"2019","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv e-prints (2019). arXiv:1910.10683"},{"key":"e_1_2_2_51_1","volume-title":"Database management systems","author":"Ramakrishnan Raghu","unstructured":"Raghu Ramakrishnan, Johannes Gehrke, and Johannes Gehrke. 2003. 
Database management systems. Vol. 3. McGraw-Hill New York."},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.psychres.2021.114135"},{"key":"e_1_2_2_53_1","volume-title":"Daniel Shawcross Wilkerson, and Alexander Aiken","author":"Schleimer Saul","year":"2003","unstructured":"Saul Schleimer, Daniel Shawcross Wilkerson, and Alexander Aiken. 2003. Winnowing: Local Algorithms for Document Fingerprinting. In SIGMOD. ACM, 76--85."},{"key":"e_1_2_2_54_1","volume-title":"Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368","author":"Liu Peter J","year":"2017","unstructured":"Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368 (2017)."},{"key":"e_1_2_2_55_1","doi-asserted-by":"crossref","unstructured":"Jangwon Seo and W. Bruce Croft. 2008. Local text reuse detection. In SIGIR. ACM 571--578.","DOI":"10.1145\/1390334.1390432"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/223784.223801"},{"key":"e_1_2_2_57_1","volume-title":"International Workshop WebDB'98 (Lecture Notes in Computer Science)","volume":"1590","author":"Shivakumar Narayanan","year":"1998","unstructured":"Narayanan Shivakumar and Hector Garcia-Molina. 1998. Finding Near-Replicas of Documents and Servers on the Web. In The World Wide Web and Databases, International Workshop WebDB'98 (Lecture Notes in Computer Science), Vol. 1590. Springer, 204--212."},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","unstructured":"Mikkel Thorup. 2013. Bottom-k and priority sampling set similarity and subset sums with minimal independence. In STOC. ACM 371--380. https:\/\/doi.org\/10.1145\/2488608.2488655","DOI":"10.1145\/2488608.2488655"},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2205.10770"},{"key":"e_1_2_2_60_1","volume-title":"Attention is all you need. 
NIPS 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. NIPS 30 (2017)."},{"key":"e_1_2_2_61_1","doi-asserted-by":"crossref","unstructured":"Jiannan Wang Guoliang Li and Jianhua Feng. 2012. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD. 85--96.","DOI":"10.1145\/2213836.2213847"},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915211"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","unstructured":"Zhizhi Wang Chaoji Zuo and Dong Deng. 2022. TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection. In SIGMOD. ACM 1146--1159. https:\/\/doi.org\/10.1145\/3514221.3526178","DOI":"10.1145\/3514221.3526178"},{"key":"e_1_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000824.2000825"},{"key":"e_1_2_2_65_1","volume-title":"Proceedings of the 2005 national conference on Digital government research. 78--86","author":"Yang Hui","year":"2005","unstructured":"Hui Yang and Jamie Callan. 2005. Near-duplicate detection for eRulemaking. In Proceedings of the 2005 national conference on Digital government research. 78--86."},{"key":"e_1_2_2_66_1","doi-asserted-by":"crossref","unstructured":"Hui Yang and Jamie Callan. 2006. Near-duplicate detection by instance-level constrained clustering. In ACM SIGIR. 421--428.","DOI":"10.1145\/1148170.1148243"},{"key":"e_1_2_2_67_1","volume-title":"Xlnet: Generalized autoregressive pretraining for language understanding. NIPS 32","author":"Yang Zhilin","year":"2019","unstructured":"Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. 
NIPS 32 (2019)."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3589324","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3589324","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3589324","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:46:14Z","timestamp":1750178774000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3589324"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,13]]},"references-count":67,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,13]]}},"alternative-id":["10.1145\/3589324"],"URL":"https:\/\/doi.org\/10.1145\/3589324","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,13]]}}}