{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T08:56:16Z","timestamp":1775638576756,"version":"3.50.1"},"reference-count":85,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,6,17]]},"abstract":"<jats:p>\n                    Retrieval-Augmented Generation (RAG) is often used with Large Language Models (LLMs) to infuse domain knowledge or user-specific information. In RAG, given a user query, a retriever extracts chunks of relevant text from a knowledge base. These chunks are sent to an LLM as part of the input prompt. Typically, any given chunk is repeatedly retrieved across user questions. However, currently, for every question, attention layers in LLMs fully compute the Keys and Values (KVs) repeatedly for the input chunks, as state-of-the-art methods cannot reuse KV-caches when chunks appear at arbitrary locations or with arbitrary contexts. Naive reuse leads to output quality degradation. This leads to potentially redundant computations on expensive GPUs and increases latency. In this work, we propose\n                    <jats:sc>Cache-Craft<\/jats:sc>\n                    , a system for managing and reusing precomputed KVs corresponding to the text chunks (which we call\n                    <jats:italic toggle=\"yes\">chunk-caches<\/jats:italic>\n                    ) in RAG-based systems. We present how to identify\n                    <jats:italic toggle=\"yes\">chunk-caches<\/jats:italic>\n                    that are reusable, how to efficiently perform a small fraction of recomputation to\n                    <jats:italic toggle=\"yes\">fix<\/jats:italic>\n                    the cache and maintain output quality, and how to efficiently store and evict\n                    <jats:italic toggle=\"yes\">chunk-caches<\/jats:italic>\n                    in the hardware for maximizing reuse while masking any overheads. With real production workloads as well as synthetic datasets, we show that\n                    <jats:sc>Cache-Craft<\/jats:sc>\n                    reduces redundant computation by\n                    <jats:bold>51%<\/jats:bold>\n                    over SOTA prefix-caching and\n                    <jats:bold>75%<\/jats:bold>\n                    over full recomputation. 
Additionally, with continuous batching on a real production workload, we get a\n                    <jats:bold>1.6\u00d7<\/jats:bold>\n                    speed up in throughput for both the LLama-3-8B and 70B models and a\n                    <jats:bold>2.1\u00d7<\/jats:bold>\n                    and\n                    <jats:bold>2\u00d7<\/jats:bold>\n                    reduction in end-to-end response latency respectively, compared to prefix-caching, while maintaining generation quality.\n                  <\/jats:p>","DOI":"10.1145\/3725273","type":"journal-article","created":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:23:29Z","timestamp":1750281809000},"page":"1-28","source":"Crossref","is-referenced-by-count":6,"title":["Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3290-6328","authenticated-orcid":false,"given":"Shubham","family":"Agarwal","sequence":"first","affiliation":[{"name":"Adobe Research, Bangalore, India"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6908-6823","authenticated-orcid":false,"given":"Sai","family":"Sundaresan","sequence":"additional","affiliation":[{"name":"Adobe Research, Bangalore, India"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-8436-3119","authenticated-orcid":false,"given":"Subrata","family":"Mitra","sequence":"additional","affiliation":[{"name":"Adobe Research, Bangalore, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7229-2944","authenticated-orcid":false,"given":"Debabrata","family":"Mahapatra","sequence":"additional","affiliation":[{"name":"Adobe Research, Bangalore, India"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-2707-9958","authenticated-orcid":false,"given":"Archit","family":"Gupta","sequence":"additional","affiliation":[{"name":"IIT Bombay, Mumbai, India"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-8205-040X","authenticated-orcid":false,"given":"Rounak","family":"Sharma","sequence":"additional","affiliation":[{"name":"IIT Kanpur, Kanpur, India"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-8028-0386","authenticated-orcid":false,"given":"Nirmal Joshua","family":"Kapu","sequence":"additional","affiliation":[{"name":"IIT Kanpur, Kanpur, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5991-2050","authenticated-orcid":false,"given":"Tong","family":"Yu","sequence":"additional","affiliation":[{"name":"Adobe Research, San Jose, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6568-7104","authenticated-orcid":false,"given":"Shiv","family":"Saini","sequence":"additional","affiliation":[{"name":"Adobe Research, Bangalore, India"}]}],"member":"320","published-online":{"date-parts":[[2025,6,18]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"[n. d.]. Amazon EC2 P4d Instances -- AWS. https:\/\/aws.amazon.com\/ec2\/instance-types\/p4\/. (Accessed on 10\/18\/2024)."},{"key":"e_1_2_2_2_1","first-page":"114","article-title":"Keyformer: Kv cache reduction through key tokens selection for efficient generative inference","volume":"6","author":"Adnan Muhammad","year":"2024","unstructured":"Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant Nair, Ilya Soloveychik, and Purushotham Kamath. 2024. Keyformer: Kv cache reduction through key tokens selection for efficient generative inference. Proceedings of Machine Learning and Systems 6 (2024), 114--127.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_2_3_1","volume-title":"Fast Natural Language Based Data Exploration with Samples. 
In Companion of the 2023 International Conference on Management of Data. 155--158","author":"Agarwal Shubham","year":"2023","unstructured":"Shubham Agarwal, Gromit Yeuk-Yin Chan, Shaddy Garg, Tong Yu, and Subrata Mitra. 2023. Fast Natural Language Based Data Exploration with Samples. In Companion of the 2023 International Conference on Management of Data. 155--158."},{"key":"e_1_2_2_4_1","volume-title":"Prompt-Aware Scheduling for Efficient Text-to-Image Inferencing System. arXiv preprint arXiv:2502.06798","author":"Agarwal Shubham","year":"2025","unstructured":"Shubham Agarwal, Saud Iqbal, and Subrata Mitra. 2025. Prompt-Aware Scheduling for Efficient Text-to-Image Inferencing System. arXiv preprint arXiv:2502.06798 (2025)."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/3691825.3691890"},{"key":"e_1_2_2_6_1","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming {Throughput-Latency} Tradeoff in {LLM} Inference with {Sarathi-Serve}. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117--134."},{"key":"e_1_2_2_7_1","volume-title":"ScaleViz: Scaling Visualization Recommendation Models on Large Data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 93--104","author":"Ahmad Ghazi Shazan","year":"2024","unstructured":"Ghazi Shazan Ahmad, Shubham Agarwal, Subrata Mitra, Ryan Rossi, Manav Doshi, Vibhor Porwal, and Syam Manoj Kumar Paila. 2024. ScaleViz: Scaling Visualization Recommendation Models on Large Data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 93--104."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3617232.3624849"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2850347"},{"key":"e_1_2_2_10_1","volume-title":"The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card 1","author":"Anthropic AI","year":"2024","unstructured":"AI Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card 1 (2024)."},{"key":"e_1_2_2_11_1","volume-title":"Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508","author":"Bai Yushi","year":"2023","unstructured":"Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. 2023. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508 (2023)."},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-5817"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2021.3061394"},{"key":"e_1_2_2_14_1","volume-title":"Kendall tau sequence distance: Extending Kendall tau from ranks to sequences. arXiv preprint arXiv:1905.02752","author":"Cicirello Vincent A","year":"2019","unstructured":"Vincent A Cicirello. 2019. Kendall tau sequence distance: Extending Kendall tau from ranks to sequences. arXiv preprint arXiv:1905.02752 (2019)."},{"key":"e_1_2_2_15_1","doi-asserted-by":"crossref","unstructured":"Daniel Crankshaw Gur-Eyal Sela Corey Zumar Xiangxi Mo Joseph E. Gonzalez Ion Stoica and Alexey Tumanov. 2020. InferLine: ML Prediction Pipeline Provisioning and Management for Tight Latency Objectives. 
arXiv:1812.01776 [cs.DC]","DOI":"10.1145\/3419111.3421285"},{"key":"e_1_2_2_16_1","volume-title":"14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)","author":"Crankshaw Daniel","year":"2017","unstructured":"Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. 2017. Clipper: A {Low-Latency} online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 613--627."},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657834"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.52202\/068431-1189"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492"},{"key":"e_1_2_2_20_1","first-page":"30318","article-title":"Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale","volume":"35","author":"Dettmers Tim","year":"2022","unstructured":"Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35 (2022), 30318--30332.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_21_1","volume-title":"Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. arXiv preprint arXiv:2402.09398","author":"Dong Harry","year":"2024","unstructured":"Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, and Beidi Chen. 2024. Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. arXiv preprint arXiv:2402.09398 (2024)."},{"key":"e_1_2_2_22_1","volume-title":"A survey on in-context learning. arXiv preprint arXiv:2301.00234","author":"Dong Qingxiu","year":"2022","unstructured":"Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022)."},{"key":"e_1_2_2_23_1","volume-title":"QAQ: Quality Adaptive Quantization for LLM KV Cache. arXiv preprint arXiv:2403.04643","author":"Dong Shichen","year":"2024","unstructured":"Shichen Dong, Wen Cheng, Jiayu Qin, and Wei Wang. 2024. QAQ: Quality Adaptive Quantization for LLM KV Cache. arXiv preprint arXiv:2403.04643 (2024)."},{"key":"e_1_2_2_24_1","volume-title":"DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161","author":"Dua Dheeru","year":"2019","unstructured":"Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161 (2019)."},{"key":"e_1_2_2_25_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3384402"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476407"},{"key":"e_1_2_2_28_1","volume-title":"International Conference on Machine Learning. PMLR, 10323--10337","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar and Dan Alistarh. 2023. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning. 
PMLR, 10323--10337."},{"key":"e_1_2_2_29_1","volume-title":"AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving. arXiv preprint arXiv:2403.19708","author":"Gao Bin","year":"2024","unstructured":"Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving. arXiv preprint arXiv:2403.19708 (2024)."},{"key":"e_1_2_2_30_1","first-page":"645927","article-title":"Approximate Query Processing: Taming the TeraBytes","volume":"10","author":"Garofalakis Minos N","year":"2001","unstructured":"Minos N Garofalakis and Phillip B Gibbons. 2001. Approximate Query Processing: Taming the TeraBytes.. In VLDB, Vol. 10. 645927--672356.","journal-title":"VLDB"},{"key":"e_1_2_2_31_1","first-page":"325","article-title":"Prompt cache: Modular attention reuse for low-latency inference","volume":"6","author":"Gim In","year":"2024","unstructured":"In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems 6 (2024), 325--338.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_2_32_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Gujarati Arpan","year":"2020","unstructured":"Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving {DNNs} like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 443--462."},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589038"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00060"},{"key":"e_1_2_2_35_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Han Mingcong","year":"2022","unstructured":"Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale preemption for concurrent {GPU-accelerated} {DNN} inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 539--558."},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.coling-main.580"},{"key":"e_1_2_2_37_1","volume-title":"Kurt Keutzer, and Amir Gholami.","author":"Hooper Coleman","year":"2024","unstructured":"Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079 (2024)."},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF02365362"},{"key":"e_1_2_2_39_1","volume-title":"Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736","author":"Jiang Huiqiang","year":"2023","unstructured":"Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736 (2023)."},{"key":"e_1_2_2_40_1","volume-title":"Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. 
arXiv preprint arXiv:2310.06839","author":"Jiang Huiqiang","year":"2023","unstructured":"Huiqiang Jiang, QianhuiWu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. arXiv preprint arXiv:2310.06839 (2023)."},{"key":"e_1_2_2_41_1","volume-title":"RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation. arXiv preprint arXiv:2404.12457","author":"Jin Chao","year":"2024","unstructured":"Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. 2024. RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation. arXiv preprint arXiv:2404.12457 (2024)."},{"key":"e_1_2_2_42_1","first-page":"24101","article-title":"A fast post-training pruning framework for transformers","volume":"35","author":"Kwon Woosuk","year":"2022","unstructured":"Woosuk Kwon, Sehoon Kim, Michael W Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir Gholami. 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems 35 (2022), 24101--24116.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_2_2_44_1","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Lee Wonbeom","year":"2024","unstructured":"Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155--172."},{"key":"e_1_2_2_45_1","unstructured":"Benjamin Lefaudeux Francisco Massa Diana Liskovich Wenhan Xiong Vittorio Caggiano Sean Naren Min Xu Jieru Hu Marta Tintore Susan Zhang Patrick Labatut Daniel Haziza Luca Wehrstedt Jeremy Reizenstein and Grigory Sizov. 2022. xFormers: A modular and hackable Transformer modelling library. https:\/\/github.com\/facebookresearch\/xformers."},{"key":"e_1_2_2_46_1","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive nlp tasks","volume":"33","author":"Lewis Patrick","year":"2020","unstructured":"Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\u00fcttler, Mike Lewis, Wen-tau Yih, Tim Rockt\u00e4schel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459--9474.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00686"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.naacl-long.210"},{"key":"e_1_2_2_49_1","first-page":"16079","article-title":"Cape: Encoding relative positions with continuous augmented positional embeddings","volume":"34","author":"Likhomanenko Tatiana","year":"2021","unstructured":"Tatiana Likhomanenko, Qiantong Xu, Gabriel Synnaeve, Ronan Collobert, and Alex Rogozhnikov. 2021. Cape: Encoding relative positions with continuous augmented positional embeddings. Advances in Neural Information Processing Systems 34 (2021), 16079--16092.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_50_1","volume-title":"Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 
74--81.","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74--81."},{"key":"e_1_2_2_51_1","volume-title":"Ntcir workshop.","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin and FJ Och. 2004. Looking for a few good metrics: ROUGE and its evaluation. In Ntcir workshop."},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00638"},{"key":"e_1_2_2_53_1","volume-title":"Optimizing llm queries in relational workloads. arXiv preprint arXiv:2403.05821","author":"Liu Shu","year":"2024","unstructured":"Shu Liu, Asim Biswal, Audrey Cheng, Xiangxi Mo, Shiyi Cao, Joseph E Gonzalez, Ion Stoica, and Matei Zaharia. 2024. Optimizing llm queries in relational workloads. arXiv preprint arXiv:2403.05821 (2024)."},{"key":"e_1_2_2_54_1","volume-title":"Cachegen: Fast context loading for language model applications. arXiv preprint arXiv:2310.07240","author":"Liu Yuhan","year":"2023","unstructured":"Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, et al. 2023. Cachegen: Fast context loading for language model applications. arXiv preprint arXiv:2310.07240 (2023)."},{"key":"e_1_2_2_55_1","volume-title":"Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems 36","author":"Liu Zichang","year":"2024","unstructured":"Zichang Liu, Aditya Desai, Fangshuo Liao,WeitaoWang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2024. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_2_56_1","volume-title":"RECON: Training-Free Acceleration for Text-to-Image Synthesis with Retrieval of Concept Prompt Trajectories. In European Conference on Computer Vision. Springer, 288--306","author":"Lu Chen-Yi","year":"2024","unstructured":"Chen-Yi Lu, Shubham Agarwal, Md Mehrab Tanjim, Kanak Mahadik, Anup Rao, Subrata Mitra, Shiv Kumar Saini, Saurabh Bagchi, and Somali Chaterji. 2024. RECON: Training-Free Acceleration for Text-to-Image Synthesis with Retrieval of Concept Prompt Trajectories. In European Conference on Computer Vision. Springer, 288--306."},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01492"},{"key":"e_1_2_2_58_1","volume-title":"Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems 36","author":"Mu Jesse","year":"2024","unstructured":"Jesse Mu, Xiang Li, and Noah Goodman. 2024. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_2_59_1","doi-asserted-by":"crossref","unstructured":"Ramesh Nallapati Bowen Zhou Caglar Gulcehre Bing Xiang et al. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023 (2016).","DOI":"10.18653\/v1\/K16-1028"},{"key":"e_1_2_2_60_1","volume-title":"Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745","author":"Narayan Shashi","year":"2018","unstructured":"Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! 
topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745 (2018)."},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2014.6844463"},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196905"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507738"},{"key":"e_1_2_2_64_1","volume-title":"100,000 questions for machine comprehension of text. arXiv preprint arXiv:1606.05250","author":"Rajpurkar P","year":"2016","unstructured":"P Rajpurkar. 2016. Squad: 100,000 questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)."},{"key":"e_1_2_2_65_1","unstructured":"Machel Reid Nikolay Savinov Denis Teplyashin Dmitry Lepikhin Timothy Lillicrap Jean-baptiste Alayrac Radu Soricut Angeliki Lazaridou Orhan Firat Julian Schrittwieser et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)."},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.14778\/3151106.3151108"},{"key":"e_1_2_2_67_1","unstructured":"Francisco Romero Qian Li Neeraja J. Yadwadkar and Christos Kozyrakis. 2020. INFaaS: A Model-less and Managed Inference Serving System. arXiv:1905.13348 [cs.DC] https:\/\/arxiv.org\/abs\/1905.13348"},{"key":"e_1_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359658"},{"key":"e_1_2_2_69_1","volume-title":"17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Shi Yining","year":"2023","unstructured":"Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. 2023. Welder: Scheduling deep learning memory access via tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 701--718."},{"key":"e_1_2_2_70_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.127063"},{"key":"e_1_2_2_71_1","volume-title":"Triton: Open-source GPU programming for neural networks. https:\/\/openai.com\/index\/triton\/.","author":"Tillet Philippe","year":"2021","unstructured":"Philippe Tillet. 2021. Triton: Open-source GPU programming for neural networks. https:\/\/openai.com\/index\/triton\/."},{"key":"e_1_2_2_72_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_2_2_73_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00475"},{"key":"e_1_2_2_74_1","volume-title":"Attention is all you need. Advances in Neural Information Processing Systems","author":"Vaswani A","year":"2017","unstructured":"A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems (2017)."},{"key":"e_1_2_2_75_1","doi-asserted-by":"publisher","DOI":"10.14778\/3625054.3625068"},{"key":"e_1_2_2_76_1","doi-asserted-by":"publisher","DOI":"10.14778\/3397230.3397247"},{"key":"e_1_2_2_77_1","volume-title":"Md Rizwan Parvez, and Graham Neubig","author":"Wang Zhiruo","year":"2023","unstructured":"Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. 2023. Learning to filter context for retrieval-augmented generation. 
arXiv preprint arXiv:2311.08377 (2023)."},{"key":"e_1_2_2_78_1","volume-title":"2018 USENIX Annual Technical Conference (USENIX ATC 18)","author":"Xu Ran","year":"2018","unstructured":"Ran Xu, Jinkyu Koo, Rakesh Kumar, Peter Bai, Subrata Mitra, Sasa Misailovic, and Saurabh Bagchi. 2018. {VideoChef}: Efficient Approximation for Streaming Video Processing Pipelines. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 43--56."},{"key":"e_1_2_2_79_1","volume-title":"International Conference on Machine Learning. PMLR, 11648--11658","author":"Yan Yu","year":"2021","unstructured":"Yu Yan, Jiusheng Chen, Weizhen Qi, Nikhil Bhendawade, Yeyun Gong, Nan Duan, and Ruofei Zhang. 2021. Elattention: Memory efficient lossless attention for generation. In International Conference on Machine Learning. PMLR, 11648--11658."},{"key":"e_1_2_2_80_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521--538."},{"key":"e_1_2_2_81_1","volume-title":"14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)","author":"Zhang Haoyu","year":"2017","unstructured":"Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J Freedman. 2017. Live video analytics at scale with approximation and {Delay-Tolerance}. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 377--392."},{"key":"e_1_2_2_82_1","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Zhang Hong","year":"2023","unstructured":"Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. {SHEPHERD}: Serving {DNNs} in the wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 787--808."},{"key":"e_1_2_2_83_1","unstructured":"Zhenyu Zhang Ying Sheng Tianyi Zhou Tianlong Chen Lianmin Zheng Ruisi Cai Zhao Song Yuandong Tian Christopher R\u00e9 Clark Barrett et al. 2024. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_2_84_1","unstructured":"Lianmin Zheng Wei-Lin Chiang Ying Sheng Tianle Li Siyuan Zhuang Zhanghao Wu Yonghao Zhuang Zhuohan Li Zi Lin Eric P Xing et al. 2023. Lmsys-chat-1m: A large-scale real-world llm conversation dataset. arXiv preprint arXiv:2309.11998 (2023)."},{"key":"e_1_2_2_85_1","unstructured":"Lianmin Zheng Liangsheng Yin Zhiqiang Xie Jeff Huang Chuyue Sun Cody_Hao Yu Shiyi Cao Christos Kozyrakis Ion Stoica Joseph E Gonzalez et al. 2023. Efficiently Programming Large Language Models using SGLang. 
(2023)."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3725273","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T18:55:05Z","timestamp":1774983305000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3725273"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,17]]},"references-count":85,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,6,17]]}},"alternative-id":["10.1145\/3725273"],"URL":"https:\/\/doi.org\/10.1145\/3725273","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,17]]}}}