{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,29]],"date-time":"2025-12-29T04:46:25Z","timestamp":1766983585161,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":44,"publisher":"ACM","funder":[{"name":"The National Research Foundation, Singapore under its National Large Language Models Funding Initiative","award":["AISG-NMLP-2024-002"],"award-info":[{"award-number":["AISG-NMLP-2024-002"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,8,3]]},"DOI":"10.1145\/3711896.3736919","type":"proceedings-article","created":{"date-parts":[[2025,8,3]],"date-time":"2025-08-03T21:05:41Z","timestamp":1754255141000},"page":"3483-3494","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-3501-491X","authenticated-orcid":false,"given":"Chaoqun","family":"Yang","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6931-3182","authenticated-orcid":false,"given":"Xinyu","family":"Lin","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5199-1428","authenticated-orcid":false,"given":"Wenjie","family":"Wang","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6932-4228","authenticated-orcid":false,"given":"Yongqi","family":"Li","sequence":"additional","affiliation":[{"name":"The Hong Kong Polytechnic University, Hong Kong, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0932-8910","authenticated-orcid":false,"given":"Teng","family":"Sun","sequence":"additional","affiliation":[{"name":"Shandong University, Qingdao, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7867-3190","authenticated-orcid":false,"given":"Xianjing","family":"Han","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6097-7807","authenticated-orcid":false,"given":"Tat-Seng","family":"Chua","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]}],"member":"320","published-online":{"date-parts":[[2025,8,3]]},"reference":[{"key":"e_1_3_2_2_1_1","first-page":"17413","article-title":"Scatterbrain: Unifying sparse and low-rank attention","volume":"34","author":"Chen Beidi","year":"2021","unstructured":"Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher R\u00e9. 2021. Scatterbrain: Unifying sparse and low-rank attention. Advances in Neural Information Processing Systems, Vol. 34 (2021), 17413-17426.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_2_1","unstructured":"Guoxuan Chen Han Shi Jiawei Li Yihang Gao Xiaozhe Ren Yimeng Chen Xin Jiang Zhenguo Li Weiyang Liu and Chao Huang. 2024. SepLLM: Accelerate large language models by compressing one segment into one separator. arXiv preprint arXiv:2412.12094(2024)."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.232"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3640457.3688118"},{"key":"e_1_3_2_2_5_1","unstructured":"Yichuan Deng Zhao Song Jing Xiong and Chiwun Yang. 2024. How Sparse Attention Approximates Exact Attention? Your Attention is Naturally n^C-Sparse. 
arXiv preprint arXiv:2404.02690(2024)."},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3462978"},{"key":"e_1_3_2_2_7_1","volume-title":"The 13th International Conference on Learning Representations.","author":"Gu Xiangming","year":"2025","unstructured":"Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. 2025. When attention sink emerges in language models: An empirical view. In The 13th International Conference on Learning Representations."},{"key":"e_1_3_2_2_8_1","volume-title":"The 9th International Conference on Learning Representations.","author":"Hendrycks Dan","year":"2021","unstructured":"Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In The 9th International Conference on Learning Representations."},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.825"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671931"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3583780.3615017"},{"key":"e_1_3_2_2_12_1","first-page":"10146","volume-title":"Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING","author":"Li Lei","year":"2024","unstructured":"Lei Li, Yongfeng Zhang, Dugang Liu, and Li Chen. 2024 e. Large language models for generative recommendation: A survey and visionary discussions. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 10146-10159."},{"key":"e_1_3_2_2_13_1","volume-title":"The 13th International Conference on Learning Representations. arXiv preprint arXiv:2412","author":"Li Pengxiang","year":"2025","unstructured":"Pengxiang Li, Lu Yin, and Shiwei Liu. 2025. 
Mix-LN: Unleashing the power of deeper layers by combining Pre-LN and Post-LN, In The 13th International Conference on Learning Representations. arXiv preprint arXiv:2412.13795."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.391"},{"key":"e_1_3_2_2_15_1","first-page":"22947","article-title":"SnapKV: LLM knows what you are looking for before generation","volume":"37","author":"Li Yuhong","year":"2024","unstructured":"Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024a. SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems, Vol. 37 (2024), 22947-22970.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_16_1","unstructured":"Yongqi Li Xinyu Lin Wenjie Wang Fuli Feng Liang Pang Wenjie Li Liqiang Nie Xiangnan He and Tat-Seng Chua. 2024b. A survey of generative search and recommendation in the era of large language models. arXiv preprint arXiv:2404.16924(2024)."},{"key":"e_1_3_2_2_17_1","volume-title":"Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 7182-7195","author":"Li Zongqian","year":"2024","unstructured":"Zongqian Li, Yinhong Liu, Yixuan Su, and Nigel Collier. 2024c. Prompt compression for large language models: A survey. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 7182-7195."},{"key":"e_1_3_2_2_18_1","unstructured":"Zongqian Li Yixuan Su and Nigel Collier. 2024d. 500xCompressor: Generalized prompt compression for large language models. 
arXiv preprint arXiv:2408.03094(2024)."},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11704-024-40039-z"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3678004"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671884"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657807"},{"key":"e_1_3_2_2_23_1","volume-title":"The 13th International Conference on Learning Representations.","author":"Lin Xinyu","year":"2025","unstructured":"Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. 2025 c. Efficient inference for large language model-based generative recommendation. In The 13th International Conference on Learning Representations."},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3591717"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-short.8"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.aiopen.2023.08.012"},{"key":"e_1_3_2_2_27_1","volume-title":"The 1st Conference on Language Modeling (COLM).","author":"Luohe Shi","year":"2024","unstructured":"Shi Luohe, Hongyi Zhang, Yao Yao, Zuchao Li, et al., 2024. Keep the cost down: A review on methods to optimize LLM's KV-Cache consumption. In The 1st Conference on Language Modeling (COLM)."},{"key":"e_1_3_2_2_28_1","first-page":"19327","article-title":"Learning to compress prompts with gist tokens","volume":"36","author":"Mu Jesse","year":"2023","unstructured":"Jesse Mu, Xiang Li, and Noah Goodman. 2023. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, Vol. 36 (2023), 19327-19352.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_29_1","volume-title":"The 41st International Conference on Machine Learning. 
37396-37412","author":"Nawrot Piotr","year":"2024","unstructured":"Piotr Nawrot, Adrian \u0141a\u0144cucki, Marcin Chochowski, David Tarjan, and Edoardo M Ponti. 2024. Dynamic memory compression: retrofitting LLMs for accelerated inference. In The 41st International Conference on Machine Learning. 37396-37412."},{"key":"e_1_3_2_2_30_1","first-page":"4958","article-title":"Anchor-based large language models","author":"Pang Jianhui","year":"2024","unstructured":"Jianhui Pang, Fanghua Ye, Derek Wong, Xin He, Wanshun Chen, and Longyue Wang. 2024. Anchor-based large language models. In Findings of the Association for Computational Linguistics. 4958-4976.","journal-title":"Findings of the Association for Computational Linguistics."},{"key":"e_1_3_2_2_31_1","first-page":"10299","article-title":"Recommender systems with generative retrieval","volume":"36","author":"Rajput Shashank","year":"2023","unstructured":"Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Tran, Jonah Samost, et al., 2023. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems, Vol. 36 (2023), 10299-10315.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_32_1","unstructured":"Zhenmei Shi Yifei Ming Xuan-Phi Nguyen Yingyu Liang and Shafiq Joty. 2024. Discovering the gems in early layers: Accelerating long-context LLMs with 1000x input token reduction. arXiv preprint arXiv:2409.17422(2024)."},{"key":"e_1_3_2_2_33_1","volume-title":"Leyu Lin, and Ji-Rong Wen.","author":"Sun Wenqi","year":"2024","unstructured":"Wenqi Sun, Ruobing Xie, Junjie Zhang, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2024. Distillation is all you need for practically using different pre-trained recommendation models. arXiv preprint arXiv:2401.00797(2024)."},{"key":"e_1_3_2_2_34_1","unstructured":"Qwen Team. 2024. Qwen2.5: A party of foundation models. 
https:\/\/qwenlm.github.io\/blog\/qwen2.5\/"},{"key":"e_1_3_2_2_35_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971(2023).","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al., 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971(2023)."},{"key":"e_1_3_2_2_36_1","unstructured":"Wenjie Wang Xinyu Lin Fuli Feng Xiangnan He and Tat-Seng Chua. 2023. Generative recommendation: Towards next-generation recommender paradigm. arXiv preprint arXiv:2304.03516(2023)."},{"key":"e_1_3_2_2_37_1","unstructured":"Haotian Wu Yingpeng Du Zhu Sun Tianjun Wei Jie Zhang and Ong Yew Soon. 2024a. A survey on efficient solutions of large language models for recommendation. Authorea Preprints(2024)."},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11280-024-01291-2"},{"key":"e_1_3_2_2_39_1","unstructured":"Yunjia Xi Hangyu Wang Bo Chen Jianghao Lin Menghui Zhu Weiwen Liu Ruiming Tang Weinan Zhang and Yong Yu. 2025. Efficiency unleashed: Inference acceleration for LLM-based recommender systems with speculative decoding. arXiv preprint arXiv:2408.05676(2025)."},{"key":"e_1_3_2_2_40_1","volume-title":"The 12th International Conference on Learning Representations.","author":"Xiao Guangxuan","year":"2024","unstructured":"Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient streaming language models with attention sinks. In The 12th International Conference on Learning Representations."},{"key":"e_1_3_2_2_41_1","volume-title":"The 41st International Conference on Machine Learning. 58484-58509","author":"Zhai Jiaqi","year":"2024","unstructured":"Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Jiayuan He, et al., 2024. 
Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations. In The 41st International Conference on Machine Learning. 58484-58509."},{"key":"e_1_3_2_2_42_1","volume-title":"The 41st International Conference on Machine Learning. 58840-58850","author":"Zhang Yuxin","year":"2024","unstructured":"Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. 2024. CaM: Cache merging for memory-efficient LLMs inference. In The 41st International Conference on Machine Learning. 58840-58850."},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE60146.2024.00118"},{"key":"e_1_3_2_2_44_1","unstructured":"Zixuan Zhou Xuefei Ning Ke Hong Tianyu Fu Jiaming Xu Shiyao Li Yuming Lou Luning Wang Zhihang Yuan Xiuhong Li et al. 2024. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294(2024)."}],"event":{"name":"KDD '25: The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGKDD ACM Special Interest Group on Knowledge Discovery in Data"],"location":"Toronto ON Canada","acronym":"KDD '25"},"container-title":["Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining 
V.2"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3711896.3736919","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,16]],"date-time":"2025-08-16T14:38:36Z","timestamp":1755355116000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711896.3736919"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,3]]},"references-count":44,"alternative-id":["10.1145\/3711896.3736919","10.1145\/3711896"],"URL":"https:\/\/doi.org\/10.1145\/3711896.3736919","relation":{},"subject":[],"published":{"date-parts":[[2025,8,3]]},"assertion":[{"value":"2025-08-03","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}