{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,12]],"date-time":"2026-02-12T23:26:37Z","timestamp":1770938797725,"version":"3.50.1"},"reference-count":141,"publisher":"Association for Computing Machinery (ACM)","issue":"9","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62472035, U24B20148, 72201266, 72192843, and 72192844"],"award-info":[{"award-number":["62472035, U24B20148, 72201266, 72192843, and 72192844"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"crossref","award":["2023CX01020"],"award-info":[{"award-number":["2023CX01020"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]},{"name":"State Key Laboratory of Intelligent Manufacturing Equipment and Technology"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2026,7,31]]},"abstract":"<jats:p>As large language models (LLMs) continue to evolve, the scope and diversity of data used for training are expanding significantly. However, the training dataset of LLMs may inevitably contain sensitive information such as personal data or copyrighted material, leading to privacy leakage or copyright infringement risks if the model generates highly similar or identical text to these sources. This has drawn attention to the issue of detecting whether the text data is used for LLM training. To date, research on detecting training data usage in artificial intelligence (AI) models has mainly focused on traditional machine learning (ML) models. However, studies on LLMs remain relatively immature. 
Limited understanding of research progress in this area has hindered the development of more effective detection methods. Therefore, this article aims to address this gap by analyzing how training data for LLMs can be detected. Specifically, we analyze the LLM information available to the detector, the main detection methods, and the determination metrics, and we discuss the technical challenges and potential directions for future research in this field.<\/jats:p>","DOI":"10.1145\/3779430","type":"journal-article","created":{"date-parts":[[2026,1,7]],"date-time":"2026-01-07T20:45:22Z","timestamp":1767818722000},"page":"1-35","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Detecting Training Data For Large Language Models: A Survey"],"prefix":"10.1145","volume":"58","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3863-5832","authenticated-orcid":false,"given":"Chen","family":"Yang","sequence":"first","affiliation":[{"name":"School of Cyberspace Science and Technology, Beijing Institute of Technology","place":["Beijing, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-9841-4217","authenticated-orcid":false,"given":"Junyi","family":"Li","sequence":"additional","affiliation":[{"name":"School of Cyberspace Science and Technology, Beijing Institute of Technology","place":["Beijing, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5234-0830","authenticated-orcid":false,"given":"Shulin","family":"Lan","sequence":"additional","affiliation":[{"name":"School of Economics and Management, University of the Chinese Academy of Sciences","place":["Beijing, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2037-7465","authenticated-orcid":false,"given":"Yingchao","family":"Wang","sequence":"additional","affiliation":[{"name":"School of 
Cyberspace Science and Technology, Beijing Institute of Technology","place":["Beijing, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8220-6525","authenticated-orcid":false,"given":"Hongyang","family":"Du","sequence":"additional","affiliation":[{"name":"Department of Electrical and Electronic Engineering, The University of Hong Kong","place":["Hong Kong, Hong Kong"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-0075-1195","authenticated-orcid":false,"given":"Congcheng","family":"Gong","sequence":"additional","affiliation":[{"name":"School of Cyberspace Science and Technology, Beijing Institute of Technology","place":["Beijing, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-2833-6350","authenticated-orcid":false,"given":"Xingshan","family":"Yao","sequence":"additional","affiliation":[{"name":"School of Cyberspace Science and Technology, Beijing Institute of Technology","place":["Beijing, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7442-7416","authenticated-orcid":false,"given":"Dusit (Tao)","family":"Niyato","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Nanyang Technological University","place":["Singapore, Singapore"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3277-3887","authenticated-orcid":false,"given":"Liehuang","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Cyberspace Science and Technology, Beijing Institute of Technology","place":["Beijing, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,2,12]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya Florencia Leoni Aleman Diogo Almeida Janko 
Altenschmidt Sam Altman Shyamal Anadkat et\u00a0al. 2023. Gpt-4 technical report. arXiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_1_3_2","unstructured":"Ebtesam Almazrouei Hamza Alobeidli Abdulaziz Alshamsi Alessandro Cappelli Ruxandra Cojocaru M\u00e9rouane Debbah \u00c9tienne Goffinet Daniel Hesslow Julien Launay Quentin Malartic et\u00a0al. 2023. The falcon series of open language models. arXiv:2311.16867. Retrieved from https:\/\/arxiv.org\/abs\/2311.16867"},{"key":"e_1_3_1_4_2","volume-title":"Claude","year":"2023","unstructured":"Anthropic. 2023. Claude. Retrieved from https:\/\/www.anthropic.com\/news\/claude-2"},{"key":"e_1_3_1_5_2","first-page":"14","article-title":"A survey on membership inference attacks against machine learning","volume":"6","author":"Bai Yang","year":"2021","unstructured":"Yang Bai, Ting Chen, and Mingyu Fan. 2021. A survey on membership inference attacks against machine learning. Management 6 (2021), 14.","journal-title":"Management"},{"key":"e_1_3_1_6_2","unstructured":"Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv:2309.10305. Retrieved from https:\/\/arxiv.org\/abs\/2309.10305"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2003.08.002"},{"key":"e_1_3_1_8_2","article-title":"Emergent and predictable memorization in large language models","volume":"36","author":"Biderman Stella","year":"2024","unstructured":"Stella Biderman, Usvsn Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raff. 2024. Emergent and predictable memorization in large language models. 
Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_9_2","first-page":"2397","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Biderman Stella","year":"2023","unstructured":"Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O\u2019Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et\u00a0al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of the International Conference on Machine Learning. PMLR, 2397\u20132430."},{"issue":"2","key":"e_1_3_1_10_2","article-title":"Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow","volume":"58","author":"Black Sid","year":"2021","unstructured":"Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow. If you use this Software, Please Cite it Using These Metadata 58, 2 (2021).","journal-title":"If you use this Software, Please Cite it Using These Metadata"},{"key":"e_1_3_1_11_2","unstructured":"Rishi Bommasani Drew A. Hudson Ehsan Adeli Russ Altman Simran Arora Sydney von Arx Michael S. Bernstein Jeannette Bohg Antoine Bosselut Emma Brunskill et\u00a0al. 2021. On the opportunities and risks of foundation models. arXiv:2108.07258. Retrieved from https:\/\/arxiv.org\/abs\/2108.07258"},{"key":"e_1_3_1_12_2","first-page":"2023","article-title":"Artists take new shot at stability, midjourney","volume":"30","author":"Brittain Blake","year":"2023","unstructured":"Blake Brittain. 2023. Artists take new shot at stability, midjourney. Reuters, November 30 (2023), 2023.","journal-title":"Reuters, November"},{"key":"e_1_3_1_13_2","first-page":"21","volume-title":"Proceedings of the Compression and Complexity of SEQUENCES 1997 (Cat. No. 
97TB100171)","author":"Broder Andrei Z.","year":"1997","unstructured":"Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, 21\u201329."},{"key":"e_1_3_1_14_2","unstructured":"T. Brown B. Mann N. Ryder M. Subbiah J. D. Kaplan P. Dhariwal A. Neelakantan P. Shyam G. Sastry A. Askell et\u00a0al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020)."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2015.35"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP46214.2022.9833649"},{"key":"e_1_3_1_17_2","unstructured":"Nicholas Carlini Daphne Ippolito Matthew Jagielski Katherine Lee Florian Tramer and Chiyuan Zhang. 2022. Quantifying memorization across neural language models. arXiv:2202.07646. Retrieved from https:\/\/arxiv.org\/abs\/2202.07646"},{"key":"e_1_3_1_18_2","first-page":"13263","article-title":"The privacy onion effect: Memorization is relative","volume":"35","author":"Carlini Nicholas","year":"2022","unstructured":"Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, and Florian Tramer. 2022. The privacy onion effect: Memorization is relative. Advances in Neural Information Processing Systems 35 (2022), 13263\u201313276.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_19_2","first-page":"267","volume-title":"Proceedings of the 28th USENIX Security Symposium (USENIX Security 19)","author":"Carlini Nicholas","year":"2019","unstructured":"Nicholas Carlini, Chang Liu, \u00dalfar Erlingsson, Jernej Kos, and Dawn Song. 2019. The secret sharer: Evaluating and testing unintended memorization in neural networks. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19). 
267\u2013284."},{"key":"e_1_3_1_20_2","first-page":"2633","volume-title":"Proceedings of the 30th USENIX Security Symposium (USENIX Security 21)","author":"Carlini Nicholas","year":"2021","unstructured":"Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et\u00a0al. 2021. Extracting training data from large language models. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21). 2633\u20132650."},{"key":"e_1_3_1_21_2","doi-asserted-by":"crossref","unstructured":"Kent K. Chang Mackenzie Cramer Sandeep Soni and David Bamman. 2023. Speak memory: An archaeology of books known to chatgpt\/gpt-4. arXiv:2305.00118. Retrieved from https:\/\/arxiv.org\/abs\/2305.00118","DOI":"10.18653\/v1\/2023.emnlp-main.453"},{"issue":"3","key":"e_1_3_1_22_2","first-page":"6","article-title":"Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality","volume":"2","author":"Chiang Wei-Lin","year":"2023","unstructured":"Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, et\u00a0al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https:\/\/vicuna. lmsys. org (accessed 14 April 2023) 2, 3 (2023), 6.","journal-title":"See https:\/\/vicuna. lmsys. org (accessed 14 April 2023)"},{"issue":"240","key":"e_1_3_1_23_2","first-page":"1","article-title":"Palm: Scaling language modeling with pathways","volume":"24","author":"Chowdhery Aakanksha","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et\u00a0al. 2023. Palm: Scaling language modeling with pathways. 
Journal of Machine Learning Research 24, 240 (2023), 1\u2013113.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_1_24_2","article-title":"Deep reinforcement learning from human preferences","volume":"30","author":"Christiano Paul F.","year":"2017","unstructured":"Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_25_2","unstructured":"Junjie Chu Yugeng Liu Ziqing Yang Xinyue Shen Michael Backes and Yang Zhang. 2024. Comprehensive assessment of jailbreak attacks against llms. arXiv:2402.05668. Retrieved from https:\/\/arxiv.org\/abs\/2402.05668"},{"key":"e_1_3_1_26_2","article-title":"RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset","author":"Computer T","unstructured":"T Computer. [n.d.]. RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset. Retrieved from https:\/\/github.com\/taohong0511\/RedPajama-Data\/?tab=readme-ov-file","journal-title":"Retrieved from https:\/\/github.com\/taohong0511\/RedPajama-Data\/?tab=readme-ov-file"},{"key":"e_1_3_1_27_2","volume-title":"Common Crawl","author":"Crawl Common","year":"2023","unstructured":"Common Crawl. 2023. Common Crawl. Retrieved from https:\/\/commoncrawl.org\/"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-71782-7_35"},{"key":"e_1_3_1_29_2","doi-asserted-by":"crossref","unstructured":"Yihong Dong Xue Jiang Huanyu Liu Zhi Jin and Ge Li. 2024. Generalization or memorization: Data contamination and trustworthy evaluation for large language models. arXiv:2402.15938. 
Retrieved from https:\/\/arxiv.org\/abs\/2402.15938","DOI":"10.18653\/v1\/2024.findings-acl.716"},{"key":"e_1_3_1_30_2","unstructured":"Michael Duan Anshuman Suri Niloofar Mireshghallah Sewon Min Weijia Shi Luke Zettlemoyer Yulia Tsvetkov Yejin Choi David Evans and Hannaneh Hajishirzi. 2024. Do membership inference attacks work on large language models? arXiv:2402.07841. Retrieved from https:\/\/arxiv.org\/abs\/2402.07841"},{"key":"e_1_3_1_31_2","unstructured":"Andr\u00e9 V Duarte Xuandong Zhao Arlindo L Oliveira and Lei Li. 2024. De-cop: Detecting copyrighted content in language models training data. arXiv:2402.09910. Retrieved from https:\/\/arxiv.org\/abs\/2402.09910"},{"key":"e_1_3_1_32_2","unstructured":"Ronen Eldan and Yuanzhi Li. 2023. Tinystories: How small can language models be and still speak coherent english? arXiv:2305.07759. Retrieved from https:\/\/arxiv.org\/abs\/2305.07759"},{"key":"e_1_3_1_33_2","unstructured":"Niva Elkin-Koren Uri Hacohen Roi Livni and Shay Moran. 2023. Can copyright be reduced to privacy? arXiv:2305.14822. Retrieved from https:\/\/arxiv.org\/abs\/2305.14822"},{"key":"e_1_3_1_34_2","unstructured":"Angela Fan Mike Lewis and Yann Dauphin. 2018. Hierarchical neural story generation. arXiv:1805.04833. Retrieved from https:\/\/arxiv.org\/abs\/1805.04833"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3357713.3384290"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2018.04.009"},{"key":"e_1_3_1_37_2","unstructured":"Jean-loup Gailly and Mark Adler. 2004. Zlib compression library. (2004)."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579027.3608973"},{"key":"e_1_3_1_39_2","unstructured":"L. Gao S. Biderman S. Black L. Golding T. Hoppe C. Foster J. Phang H. He A. Thite N. Nabeshima et\u00a0al. 2021. The pile: An 800 GB dataset of diverse text for language modeling 2020. arXiv:2101.00027. 
Retrieved from https:\/\/arxiv.org\/abs\/2101.00027"},{"issue":"1","key":"e_1_3_1_40_2","first-page":"1","article-title":"Research progress and challenges of membership inference attacks in machine learning","volume":"12","author":"Gao Ting","year":"2022","unstructured":"Ting Gao. 2022. Research progress and challenges of membership inference attacks in machine learning. Operations Research and Blurring 12, 1 (2022), 1\u201315.","journal-title":"Operations Research and Blurring"},{"issue":"3","key":"e_1_3_1_41_2","first-page":"15","article-title":"Membership inference attacks in black-box machine learning models","volume":"6","author":"Gaoyang Liu","year":"2021","unstructured":"Liu Gaoyang, Li Yutong, Wan Borui, Wang Chen, and Peng Kai. 2021. Membership inference attacks in black-box machine learning models. J. Cyber Secur 6, 3 (2021), 15.","journal-title":"J. Cyber Secur"},{"key":"e_1_3_1_42_2","unstructured":"Sameera Ghayyur Jay Averitt Eric Lin Eric Wallace Apoorvaa Deshpande and Hunter Luthi. 2023. Panel: Privacy challenges and opportunities in LLM-Based chatbot applications. (2023)."},{"key":"e_1_3_1_43_2","unstructured":"Shahriar Golchin and Mihai Surdeanu. 2023. Data contamination quiz: A tool to detect and estimate contamination in large language models. arXiv:2311.06233. Retrieved from https:\/\/arxiv.org\/abs\/2311.06233"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.5120\/11638-7118"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-021-01453-z"},{"key":"e_1_3_1_46_2","article-title":"The times sues openai and microsoft over ai use of copyrighted work","volume":"27","author":"Grynbaum Michael M.","year":"2023","unstructured":"Michael M. Grynbaum and Ryan Mac. 2023. The times sues openai and microsoft over ai use of copyrighted work. 
The New York Times 27 (2023).","journal-title":"The New York Times"},{"key":"e_1_3_1_47_2","article-title":"AG Recommends Clause in Publishing and Distribution Agreements Prohibiting AI Training Uses","author":"Guild The Authors","unstructured":"The Authors Guild. [n.d.]. AG Recommends Clause in Publishing and Distribution Agreements Prohibiting AI Training Uses. Retrieved from https:\/\/authorsguild.org\/news\/model-clause-prohibiting-ai-training\/","journal-title":"Retrieved from https:\/\/authorsguild.org\/news\/model-clause-prohibiting-ai-training\/"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01046"},{"key":"e_1_3_1_49_2","unstructured":"Mingcan Guo Zhongyuan Han Xintian Wang and Jiangao Peng. 2024. Multidimensional text feature analysis: Unveiling the veil of automatically generated text. (2024)."},{"key":"e_1_3_1_50_2","unstructured":"Bikash Gyawali Lucas Anastasiou and Petr Knoth. 2020. Deduplication of scholarly documents using locality sensitive hashing and word embeddings. (2020)."},{"key":"e_1_3_1_51_2","unstructured":"Valentin Hartmann Anshuman Suri Vincent Bindschaedler David Evans Shruti Tople and Robert West. 2023. Sok: Memorization in general-purpose large language models. arXiv:2310.18362. Retrieved from https:\/\/arxiv.org\/abs\/2310.18362"},{"key":"e_1_3_1_52_2","volume-title":"Proceedings of the USENIX Security","author":"He Yu","year":"2025","unstructured":"Yu He, Boheng Li, Liu Liu, Zhongjie Ba, Wei Dong, Yiming Li, Zhan Qin, Kui Ren, and Chun Chen. 2025. Towards label-only membership inference attack against pre-trained large language models. In Proceedings of the USENIX Security."},{"key":"e_1_3_1_53_2","unstructured":"Jordan Hoffmann Sebastian Borgeaud Arthur Mensch Elena Buchatskaya Trevor Cai Eliza Rutherford Diego de Las Casas Lisa Anne Hendricks Johannes Welbl Aidan Clark et\u00a0al. 2022. Training compute-optimal large language models. arXiv:2203.15556. 
Retrieved from https:\/\/arxiv.org\/abs\/2203.15556"},{"key":"e_1_3_1_54_2","first-page":"30016","article-title":"An empirical analysis of compute-optimal large language model training","volume":"35","author":"Hoffmann Jordan","year":"2022","unstructured":"Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et\u00a0al. 2022. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems 35 (2022), 30016\u201330030.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_55_2","unstructured":"Ari Holtzman Jan Buys Li Du Maxwell Forbes and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv:1904.09751. Retrieved from https:\/\/arxiv.org\/abs\/1904.09751"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3523273"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620667"},{"key":"e_1_3_1_58_2","doi-asserted-by":"crossref","unstructured":"Jie Huang Hanyin Shao and Kevin Chen-Chuan Chang. 2022. Are large pre-trained language models leaking your personal information? arXiv:2205.12628. Retrieved from https:\/\/arxiv.org\/abs\/2205.12628","DOI":"10.18653\/v1\/2022.findings-emnlp.148"},{"key":"e_1_3_1_59_2","doi-asserted-by":"crossref","unstructured":"Daphne Ippolito Florian Tram\u00e8r Milad Nasr Chiyuan Zhang Matthew Jagielski Katherine Lee Christopher A Choquette-Choo and Nicholas Carlini. 2022. Preventing verbatim memorization in language models gives a false sense of privacy. arXiv:2210.17546. Retrieved from https:\/\/arxiv.org\/abs\/2210.17546","DOI":"10.18653\/v1\/2023.inlg-main.3"},{"key":"e_1_3_1_60_2","unstructured":"Shotaro Ishihara. 2023. Training data extraction from pre-trained language models: A survey. arXiv:2305.16157. 
Retrieved from https:\/\/arxiv.org\/abs\/2305.16157"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1111\/j.1469-8137.1912.tb05611.x"},{"key":"e_1_3_1_62_2","first-page":"22205","article-title":"Auditing differentially private machine learning: How private is private SGD?","volume":"33","author":"Jagielski Matthew","year":"2020","unstructured":"Matthew Jagielski, Jonathan Ullman, and Alina Oprea. 2020. Auditing differentially private machine learning: How private is private SGD? Advances in Neural Information Processing Systems 33 (2020), 22205\u201322216.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_63_2","unstructured":"Albert Q. Jiang Alexandre Sablayrolles Arthur Mensch Chris Bamford Devendra Singh Chaplot Diego de las Casas Florian Bressand Gianna Lengyel Guillaume Lample Lucile Saulnier et\u00a0al. 2023. Mistral 7B. arXiv:2310.06825. Retrieved from https:\/\/arxiv.org\/abs\/2310.06825"},{"key":"e_1_3_1_64_2","unstructured":"Albert Q. Jiang Alexandre Sablayrolles Antoine Roux Arthur Mensch Blanche Savary Chris Bamford Devendra Singh Chaplot Diego de las Casas Emma Bou Hanna Florian Bressand et\u00a0al. 2024. Mixtral of experts. arXiv:2401.04088. Retrieved from https:\/\/arxiv.org\/abs\/2401.04088"},{"key":"e_1_3_1_65_2","first-page":"10697","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Kandpal Nikhil","year":"2022","unstructured":"Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating training data mitigates privacy risks in language models. In Proceedings of the International Conference on Machine Learning. PMLR, 10697\u201310707."},{"key":"e_1_3_1_66_2","unstructured":"Masahiro Kaneko Youmi Ma Yuki Wata and Naoaki Okazaki. 2024. Sampling-based pseudo-likelihood for membership inference attacks. arXiv:2404.11262. 
Retrieved from https:\/\/arxiv.org\/abs\/2404.11262"},{"key":"e_1_3_1_67_2","doi-asserted-by":"crossref","unstructured":"Antonia Karamolegkou Jiaang Li Li Zhou and Anders S\u00f8gaard. 2023. Copyright violations and large language models. arXiv:2310.13771. Retrieved from https:\/\/arxiv.org\/abs\/2310.13771","DOI":"10.18653\/v1\/2023.emnlp-main.458"},{"key":"e_1_3_1_68_2","unstructured":"Gyuwan Kim Yang Li Evangelia Spiliopoulou Jie Ma Miguel Ballesteros and William Yang Wang. 2024. Detecting training data of large language models via expectation maximization. arXiv:2410.07582. Retrieved from https:\/\/arxiv.org\/abs\/2410.07582"},{"key":"e_1_3_1_69_2","unstructured":"Diederik P. Kingma. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_1_70_2","unstructured":"Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. arXiv:2302.14520. Retrieved from https:\/\/arxiv.org\/abs\/2302.14520"},{"key":"e_1_3_1_71_2","unstructured":"Katherine Lee Daphne Ippolito Andrew Nystrom Chiyuan Zhang Douglas Eck Chris Callison-Burch and Nicholas Carlini. 2021. Deduplicating training data makes language models better. arXiv:2107.06499. Retrieved from https:\/\/arxiv.org\/abs\/2107.06499"},{"key":"e_1_3_1_72_2","unstructured":"Jiwei Li Michel Galley Chris Brockett Jianfeng Gao and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv:1510.03055. Retrieved from https:\/\/arxiv.org\/abs\/1510.03055"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/3422337.3447836"},{"key":"e_1_3_1_74_2","unstructured":"Yuanzhi Li S\u00e9bastien Bubeck Ronen Eldan Allie Del Giorno Suriya Gunasekar and Yin Tat Lee. 2023. Textbooks are all you need ii: Phi-1.5 technical report. arXiv:2309.05463. Retrieved from https:\/\/arxiv.org\/abs\/2309.05463"},{"key":"e_1_3_1_75_2","unstructured":"Yinhan Liu. 
2019. Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692. Retrieved from https:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_3_1_76_2","unstructured":"Yi Liu Gelei Deng Zhengzi Xu Yuekang Li Yaowen Zheng Ying Zhang Lida Zhao Tianwei Zhang and Yang Liu. 2023. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv:2305.13860. Retrieved from https:\/\/arxiv.org\/abs\/2305.13860"},{"key":"e_1_3_1_77_2","doi-asserted-by":"publisher","DOI":"10.1145\/3691620.3695018"},{"key":"e_1_3_1_78_2","unstructured":"Zhenhua Liu Tong Zhu Chuanyuan Tan Haonan Lu Bing Liu and Wenliang Chen. 2024. Probing language models for pre-training data detection. arXiv:2406.01333. Retrieved from https:\/\/arxiv.org\/abs\/2406.01333"},{"key":"e_1_3_1_79_2","unstructured":"Yunhui Long Vincent Bindschaedler Lei Wang Diyue Bu Xiaofeng Wang Haixu Tang Carl A. Gunter and Kai Chen. 2018. Understanding membership inferences on well-generalized learning models. arXiv:1802.04889. Retrieved from https:\/\/arxiv.org\/abs\/1802.04889"},{"key":"e_1_3_1_80_2","article-title":"Fixing weight decay regularization in adam","volume":"5","author":"Loshchilov Ilya","year":"2017","unstructured":"Ilya Loshchilov, Frank Hutter, et\u00a0al. 2017. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101 5 (2017).","journal-title":"arXiv preprint arXiv:1711.05101"},{"key":"e_1_3_1_81_2","unstructured":"Saeed Mahloujifar Huseyin A. Inan Melissa Chase Esha Ghosh and Marcello Hasegawa. 2021. Membership inference on word embedding and beyond. arXiv:2106.11384. Retrieved from https:\/\/arxiv.org\/abs\/2106.11384"},{"key":"e_1_3_1_82_2","unstructured":"Pratyush Maini Hengrui Jia Nicolas Papernot and Adam Dziedzic. 2024. LLM dataset inference: Did you train on my dataset? arXiv:2406.06443. 
Retrieved from https:\/\/arxiv.org\/abs\/2406.06443"},{"key":"e_1_3_1_83_2","doi-asserted-by":"crossref","unstructured":"Justus Mattern Fatemehsadat Mireshghallah Zhijing Jin Bernhard Sch\u00f6lkopf Mrinmaya Sachan and Taylor Berg-Kirkpatrick. 2023. Membership inference attacks against language models via neighbourhood comparison. arXiv:2305.18462. Retrieved from https:\/\/arxiv.org\/abs\/2305.18462","DOI":"10.18653\/v1\/2023.findings-acl.719"},{"key":"e_1_3_1_84_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00567"},{"key":"e_1_3_1_85_2","unstructured":"Sewon Min Suchin Gururangan Eric Wallace Weijia Shi Hannaneh Hajishirzi Noah A Smith and Luke Zettlemoyer. 2023. Silo language models: Isolating legal risk in a nonparametric datastore. arXiv:2308.04430. Retrieved from https:\/\/arxiv.org\/abs\/2308.04430"},{"key":"e_1_3_1_86_2","doi-asserted-by":"crossref","unstructured":"Fatemehsadat Mireshghallah Kartik Goyal Archit Uniyal Taylor Berg-Kirkpatrick and Reza Shokri. 2022. Quantifying privacy risks of masked language models using membership inference attacks. arXiv:2203.03929. Retrieved from https:\/\/arxiv.org\/abs\/2203.03929","DOI":"10.18653\/v1\/2022.emnlp-main.570"},{"key":"e_1_3_1_87_2","first-page":"24950","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Mitchell Eric","year":"2023","unstructured":"Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023. Detectgpt: Zero-shot machine-generated text detection using probability curvature. In Proceedings of the International Conference on Machine Learning. PMLR, 24950\u201324962."},{"key":"e_1_3_1_88_2","unstructured":"Hamid Mozaffari and Virendra J. Marathe. 2024. Semantic membership inference attack against large language models. arXiv:2406.10218. Retrieved from https:\/\/arxiv.org\/abs\/2406.10218"},{"key":"e_1_3_1_89_2","unstructured":"Milad Nasr Nicholas Carlini Jonathan Hayase Matthew Jagielski A. 
Feder Cooper Daphne Ippolito Christopher A. Choquette-Choo Eric Wallace Florian Tram\u00e8r and Katherine Lee. 2023. Scalable extraction of training data from (production) language models. arXiv:2311.17035. Retrieved from https:\/\/arxiv.org\/abs\/2311.17035"},{"key":"e_1_3_1_90_2","doi-asserted-by":"publisher","DOI":"10.1145\/3243734.3243855"},{"key":"e_1_3_1_91_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP40001.2021.00069"},{"key":"e_1_3_1_92_2","unstructured":"Erik Nijkamp Bo Pang Hiroaki Hayashi Lifu Tu Huan Wang Yingbo Zhou Silvio Savarese and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. arXiv:2203.13474. Retrieved from https:\/\/arxiv.org\/abs\/2203.13474"},{"key":"e_1_3_1_93_2","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang Long","year":"2022","unstructured":"Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et\u00a0al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730\u201327744.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_94_2","doi-asserted-by":"crossref","unstructured":"Mustafa Safa Ozdayi Charith Peris Jack FitzGerald Christophe Dupuy Jimit Majmudar Haidar Khan Rahil Parikh and Rahul Gupta. 2023. Controlling the extraction of memorized data from large language models via prompt-tuning. arXiv:2305.11759. Retrieved from https:\/\/arxiv.org\/abs\/2305.11759","DOI":"10.18653\/v1\/2023.acl-short.129"},{"key":"e_1_3_1_95_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4842-6576-5_6"},{"key":"e_1_3_1_96_2","unstructured":"Guilherme Penedo Quentin Malartic Daniel Hesslow Ruxandra Cojocaru Alessandro Cappelli Hamza Alobeidli Baptiste Pannier Ebtesam Almazrouei and Julien Launay. 
2023. The refinedweb dataset for falcon LLM: Outperforming curated corpora with web data and web data only. arXiv:2306.01116. Retrieved from https:\/\/arxiv.org\/abs\/2306.01116"},{"key":"e_1_3_1_97_2","article-title":"Targeted training data extraction\u2014neighborhood comparison-based membership inference attacks in large language models","volume":"14","author":"Peng Kai","year":"2024","unstructured":"Kai Peng. 2024. Targeted training data extraction\u2014neighborhood comparison-based membership inference attacks in large language models. Applied Sciences 14 (2024).","journal-title":"Applied Sciences"},{"key":"e_1_3_1_98_2","doi-asserted-by":"publisher","DOI":"10.26555\/ijain.v4i1.152"},{"key":"e_1_3_1_99_2","unstructured":"A. Radford. 2018. Improving language understanding by generative pre-training. (2018)."},{"issue":"8","key":"e_1_3_1_100_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et\u00a0al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"issue":"140","key":"e_1_3_1_101_2","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1\u201367.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_1_102_2","doi-asserted-by":"crossref","unstructured":"Abhilasha Ravichander Jillian Fisher Taylor Sorensen Ximing Lu Yuchen Lin Maria Antoniak Niloofar Mireshghallah Chandra Bhagavatula and Yejin Choi. 
2025. Information-guided identification of training data imprint in (proprietary) large language models. arXiv:2503.12072. Retrieved from https:\/\/arxiv.org\/abs\/2503.12072","DOI":"10.18653\/v1\/2025.naacl-long.99"},{"key":"e_1_3_1_103_2","article-title":"A meta-analysis of overfitting in machine learning","volume":"32","author":"Roelofs Rebecca","year":"2019","unstructured":"Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. 2019. A meta-analysis of overfitting in machine learning. Advances in Neural Information Processing Systems 32 (2019).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_104_2","unstructured":"Noveen Sachdeva Benjamin Coleman Wang-Cheng Kang Jianmo Ni Lichan Hong Ed H. Chi James Caverlee Julian McAuley and Derek Zhiyuan Cheng. 2024. How to train data-efficient LLMs. arXiv:2402.09668. Retrieved from https:\/\/arxiv.org\/abs\/2402.09668"},{"key":"e_1_3_1_105_2","unstructured":"Victor Sanh Albert Webson Colin Raffel Stephen H. Bach Lintang Sutawika Zaid Alyafeai Antoine Chaffin Arnaud Stiegler Teven Le Scao Arun Raja et\u00a0al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv:2110.08207. Retrieved from https:\/\/arxiv.org\/abs\/2110.08207"},{"key":"e_1_3_1_106_2","article-title":"On the role of data anonymization in machine learning privacy","author":"Senavirathne Navoda","year":"2020","unstructured":"Navoda Senavirathne and Vicenc Torra. 2020. On the role of data anonymization in machine learning privacy. IEEE (2020).","journal-title":"IEEE"},{"key":"e_1_3_1_107_2","first-page":"4596","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Shazeer Noam","year":"2018","unstructured":"Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the International Conference on Machine Learning. 
PMLR, 4596\u20134604."},{"key":"e_1_3_1_108_2","unstructured":"Weijia Shi Anirudh Ajith Mengzhou Xia Yangsibo Huang Daogao Liu Terra Blevins Danqi Chen and Luke Zettlemoyer. 2023. Detecting pretraining data from large language models. arXiv:2310.16789. Retrieved from https:\/\/arxiv.org\/abs\/2310.16789"},{"key":"e_1_3_1_109_2","doi-asserted-by":"publisher","DOI":"10.1109\/SP.2017.41"},{"key":"e_1_3_1_110_2","unstructured":"Luca Soldaini Rodney Kinney Akshita Bhagia Dustin Schwenk David Atkinson Russell Authur Ben Bogin Khyathi Chandu Jennifer Dumas Yanai Elazar et\u00a0al. 2024. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv:2402.00159. Retrieved from https:\/\/arxiv.org\/abs\/2402.00159"},{"key":"e_1_3_1_111_2","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330885"},{"key":"e_1_3_1_112_2","article-title":"Galactica: A large language model for science. arXiv 2022","volume":"10","author":"Taylor Ross","year":"2023","unstructured":"Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2023. Galactica: A large language model for science. arXiv 2022. arXiv preprint arXiv:2211.09085 10 (2023).","journal-title":"arXiv preprint arXiv:2211.09085"},{"key":"e_1_3_1_113_2","unstructured":"MosaicML NLP Team et\u00a0al. 2023. Introducing mpt-7b: A new standard for open-source commercially usable llms."},{"key":"e_1_3_1_114_2","unstructured":"Qwen Team. 2024. Introducing Qwen1.5. Retrieved from https:\/\/qwenlm.github.io\/blog\/qwen1.5\/"},{"key":"e_1_3_1_115_2","first-page":"38274","article-title":"Memorization without overfitting: Analyzing the training dynamics of large language models","volume":"35","author":"Tirumala Kushal","year":"2022","unstructured":"Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. 2022. Memorization without overfitting: Analyzing the training dynamics of large language models. 
Advances in Neural Information Processing Systems 35 (2022), 38274\u201338290.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_116_2","doi-asserted-by":"publisher","DOI":"10.4018\/978-1-60566-766-9.ch011"},{"key":"e_1_3_1_117_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar et\u00a0al. 2023. Llama: Open and efficient foundation language models. arXiv:2302.13971. Retrieved from https:\/\/arxiv.org\/abs\/2302.13971"},{"key":"e_1_3_1_118_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. Retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_1_119_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSC.2019.2897554"},{"key":"e_1_3_1_120_2","first-page":"35277","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Vyas Nikhil","year":"2023","unstructured":"Nikhil Vyas, Sham M. Kakade, and Boaz Barak. 2023. On provable copyright protection for generative models. In Proceedings of the International Conference on Machine Learning. PMLR, 35277\u201335299."},{"key":"e_1_3_1_121_2","doi-asserted-by":"publisher","DOI":"10.3390\/info11090421"},{"issue":"10","key":"e_1_3_1_122_2","first-page":"1","article-title":"A survey on membership inference on training datasets in machine learning","volume":"10","author":"Wang L. L.","year":"2019","unstructured":"L. L. Wang, Peng Zhang, Zheng Yan, and Xiaokang Zhou. 2019. A survey on membership inference on training datasets in machine learning. Cyberspace Security 10, 10 (2019), 1\u20137.","journal-title":"Cyberspace Security"},{"key":"e_1_3_1_123_2","unstructured":"Lauren Watson Chuan Guo Graham Cormode and Alex Sablayrolles. 
2021. On the importance of difficulty calibration in membership inference attacks. arXiv:2111.08440. Retrieved from https:\/\/arxiv.org\/abs\/2111.08440"},{"key":"e_1_3_1_124_2","article-title":"Jailbroken: How does llm safety training fail?","volume":"36","author":"Wei Alexander","year":"2024","unstructured":"Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2024. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_125_2","unstructured":"Jason Wei Maarten Bosma Vincent Y. Zhao Kelvin Guu Adams Wei Yu Brian Lester Nan Du Andrew M. Dai and Quoc V. Le. 2021. Finetuned language models are zero-shot learners. arXiv:2109.01652. Retrieved from https:\/\/arxiv.org\/abs\/2109.01652"},{"key":"e_1_3_1_126_2","unstructured":"Laura Weidinger John Mellor Maribeth Rauh Conor Griffin Jonathan Uesato Po-Sen Huang Myra Cheng Mia Glaese Borja Balle Atoosa Kasirzadeh et\u00a0al. 2021. Ethical and social risks of harm from language models. arXiv:2112.04359. Retrieved from https:\/\/arxiv.org\/abs\/2112.04359"},{"key":"e_1_3_1_127_2","unstructured":"Wikipedia contributors. 2024. Wikipedia \u2014 Wikipedia The Free Encyclopedia. Retrieved September 2 2024 from https:\/\/en.wikipedia.org\/w\/index.php?title=Wikipedia&oldid=1243483768"},{"key":"e_1_3_1_128_2","doi-asserted-by":"publisher","DOI":"10.1109\/JAS.2023.123618"},{"key":"e_1_3_1_129_2","doi-asserted-by":"publisher","DOI":"10.1145\/3603620"},{"key":"e_1_3_1_130_2","article-title":"One LLM is not enough: Harnessing the power of ensemble learning for medical question answering","author":"Yang Han","year":"2023","unstructured":"Han Yang, Mingchen Li, Huixue Zhou, Yongkang Xiao, Qian Fang, and Rui Zhang. 2023. One LLM is not enough: Harnessing the power of ensemble learning for medical question answering. 
medRxiv (2023).","journal-title":"medRxiv"},{"key":"e_1_3_1_131_2","doi-asserted-by":"publisher","DOI":"10.1109\/CSF.2018.00027"},{"key":"e_1_3_1_132_2","unstructured":"Sajjad Zarifzadeh Philippe Cheng-Jie Marc Liu and Reza Shokri. 2023. Low-cost high-power membership inference by boosting relativity. (2023)."},{"key":"e_1_3_1_133_2","article-title":"Defending against neural fake news","volume":"32","author":"Zellers Rowan","year":"2019","unstructured":"Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. Advances in Neural Information Processing Systems 32 (2019).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_134_2","first-page":"39321","article-title":"Counterfactual memorization in neural language models","volume":"36","author":"Zhang Chiyuan","year":"2023","unstructured":"Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tram\u00e8r, and Nicholas Carlini. 2023. Counterfactual memorization in neural language models. Advances in Neural Information Processing Systems 36 (2023), 39321\u201339362.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_135_2","unstructured":"Jingyang Zhang Jingwei Sun Eric Yeats Yang Ouyang Martin Kuo Jianyi Zhang Hao Yang and Hai Li. 2024. Min-K%++: Improved baseline for detecting pre-training data from large language models. arXiv:2404.02936. Retrieved from https:\/\/arxiv.org\/abs\/2404.02936"},{"key":"e_1_3_1_136_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi Victoria Lin et\u00a0al. 2022. Opt: Open pre-trained transformer language models. arXiv:2205.01068. 
Retrieved from https:\/\/arxiv.org\/abs\/2205.01068"},{"key":"e_1_3_1_137_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00632"},{"key":"e_1_3_1_138_2","doi-asserted-by":"crossref","unstructured":"Weichao Zhang Ruqing Zhang Jiafeng Guo Maarten de Rijke Yixing Fan and Xueqi Cheng. 2024. Pretraining data detection for large language models: A divergence-based calibration method. arXiv:2409.14781. Retrieved from https:\/\/arxiv.org\/abs\/2409.14781","DOI":"10.18653\/v1\/2024.emnlp-main.300"},{"key":"e_1_3_1_139_2","unstructured":"Wayne Xin Zhao Kun Zhou Junyi Li Tianyi Tang Xiaolei Wang Yupeng Hou Yingqian Min Beichen Zhang Junjie Zhang Zican Dong et\u00a0al. 2023. A survey of large language models. arXiv:2303.18223. Retrieved from https:\/\/arxiv.org\/abs\/2303.18223"},{"key":"e_1_3_1_140_2","unstructured":"Xuandong Zhao Lei Li and Yu-Xiang Wang. 2022. Provably confidential language modelling. arXiv:2205.01863. Retrieved from https:\/\/arxiv.org\/abs\/2205.01863"},{"key":"e_1_3_1_141_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.35"},{"key":"e_1_3_1_142_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.414"}],"container-title":["ACM Computing 
Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3779430","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,12]],"date-time":"2026-02-12T22:38:31Z","timestamp":1770935911000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3779430"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,12]]},"references-count":141,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2026,7,31]]}},"alternative-id":["10.1145\/3779430"],"URL":"https:\/\/doi.org\/10.1145\/3779430","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,12]]},"assertion":[{"value":"2025-01-03","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}