{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,15]],"date-time":"2025-08-15T01:07:40Z","timestamp":1755220060735,"version":"3.43.0"},"reference-count":38,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGOPS Oper. Syst. Rev."],"published-print":{"date-parts":[[2025,8,4]]},"abstract":"<jats:p>Transformer-based Large Language Models (LLMs) heavily depend on the KV cache for efficient handling of long context sequences. However, the size of the KV cache grows linearly with the input sequence length, increasingly straining system memory, computational resources, bandwidth, and latency during decoding. Although recent research has proposed various techniques to compress the KV cache (targeting either storage or computational efficiency), few methods effectively achieve both simultaneously. Additionally, existing methods primarily rely on heuristic-driven approaches, lacking comprehensive insights into token selection criteria, and often significantly compromise model accuracy under strict KV cache token budget constraints (e.g., keeping 512 tokens). Building upon our recent work, RocketKV, this paper introduces EMPIRIC as an oracle-based vision study, which explicitly defines theoretical bounds for accuracy, computation, and storage in KV cache compression. By analyzing intrinsic patterns in KV cache attention heads, EMPIRIC provides novel insights into effective token pruning without accuracy degradation. This work clarifies the overlooked elements critical to KV cache compression during decoding and optimally balances computational efficiency, storage optimization, inference latency, and accuracy. 
We envision that EMPIRIC will guide future research efforts toward creating scalable, efficient KV cache compression techniques, significantly improving inference performance for long context LLM inference.<\/jats:p>","DOI":"10.1145\/3759441.3759448","type":"journal-article","created":{"date-parts":[[2025,8,6]],"date-time":"2025-08-06T14:43:44Z","timestamp":1754491424000},"page":"46-54","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["EMPIRIC: Exploring Missing Pieces in KV Cache Compression for Reducing Computation, Storage, and Latency in Long-Context LLM Inference"],"prefix":"10.1145","volume":"59","author":[{"given":"Payman","family":"Behnam","sequence":"first","affiliation":[{"name":"NVIDIA, Georgia Tech"}]},{"given":"Yaosheng","family":"Fu","sequence":"additional","affiliation":[{"name":"NVIDIA"}]},{"given":"Ritchie","family":"Zhao","sequence":"additional","affiliation":[{"name":"NVIDIA"}]},{"given":"Po-An","family":"Tsai","sequence":"additional","affiliation":[{"name":"NVIDIA"}]},{"given":"Zhididng","family":"Yu","sequence":"additional","affiliation":[{"name":"NVIDIA"}]},{"given":"Alexey","family":"Tumanov","sequence":"additional","affiliation":[{"name":"Georgia Tech"}]}],"member":"320","published-online":{"date-parts":[[2025,8,6]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"AI@META. Llama 3 model card."},{"key":"e_1_2_1_2_1","volume-title":"Rocketkv: Accelerating longcontext llm inference via two-stage kv cache compression. arXiv preprint arXiv:2502.14051","author":"BEHNAM P.","year":"2025","unstructured":"BEHNAM, P., FU, Y., ZHAO, R., TSAI, P.-A., YU, Z., AND TUMANOV, A. Rocketkv: Accelerating longcontext llm inference via two-stage kv cache compression. arXiv preprint arXiv:2502.14051 (2025)."},{"key":"e_1_2_1_3_1","volume-title":"Longformer: The long-document transformer. 
arXiv preprint arXiv:2004.05150","author":"BELTAGY I.","year":"2020","unstructured":"BELTAGY, I., PETERS, M. E., AND COHAN, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020)."},{"key":"e_1_2_1_4_1","volume-title":"Magicpig: Lsh sampling for efficient llm generation. arXiv preprint arXiv:2407.09876","author":"CHEN Z.","year":"2024","unstructured":"CHEN, Z., SADHUKHAN, R., YE, Z., ZHOU, Y., ZHANG, J., NOLTE, N., TIAN, Y., DOUZE, M., BOTTOU, L., JIA, Z., AND CHEN, B. Magicpig: Lsh sampling for efficient llm generation. arXiv preprint arXiv:2407.09876 (2024)."},{"key":"e_1_2_1_5_1","volume-title":"A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282","author":"CHENG Y.","year":"2017","unstructured":"CHENG, Y., WANG, D., ZHOU, P., AND ZHANG, T. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017)."},{"key":"e_1_2_1_6_1","volume-title":"Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132","author":"CHIANG W.-L.","year":"2024","unstructured":"CHIANG, W.-L., ZHENG, L., SHENG, Y., ANGELOPOULOS, A. N., LI, T., LI, D., ZHANG, H., ZHU, B., JORDAN, M., GONZALEZ, J. E., ET AL. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132 (2024)."},{"key":"e_1_2_1_7_1","first-page":"16344","article-title":"Fast and memory-efficient exact attention with io-awareness","volume":"35","author":"DAO T.","year":"2022","unstructured":"DAO, T., FU, D., ERMON, S., RUDRA, A., AND R\u00c9, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344-16359.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_8_1","volume-title":"Model tells you what to discard: Adaptive kv cache compression for llms. 
arXiv preprint arXiv:2310.01801","author":"GE S.","year":"2023","unstructured":"GE, S., ZHANG, Y., LIU, L., ZHANG, M., HAN, J., AND GAO, J. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801 (2023)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.sustainlp-1.5"},{"key":"e_1_2_1_10_1","volume-title":"Advancing transformer architecture in longcontext large language models: A comprehensive survey. arXiv preprint arXiv:2311.12351","author":"HUANG Y.","year":"2023","unstructured":"HUANG, Y., XU, J., LAI, J., JIANG, Z., CHEN, T., LI, Z., YAO, Y., MA, X., YANG, L., CHEN, H., ET AL. Advancing transformer architecture in longcontext large language models: A comprehensive survey. arXiv preprint arXiv:2311.12351 (2023)."},{"key":"e_1_2_1_11_1","volume-title":"arXiv preprint arXiv:2312.06550","author":"JIANG A. Q.","year":"2023","unstructured":"JIANG, A. Q., SABLAYROLLES, A., MENSCH, A., BAMFORD, C., CHAPLOT, D. S., CASAS, D. D. L., BRESSAND, F., LENGYEL, G., LAMPLE, G., SAULNIER, L., ET AL. Mistral 7b. arXiv preprint arXiv:2312.06550 (2023)."},{"key":"e_1_2_1_12_1","volume-title":"Accelerating prefilling for long-context llms via dynamic sparse attention. arXiv preprint arXiv:2407.02490","author":"JIANG H.","year":"2024","unstructured":"JIANG, H., LI, Y., ZHANG, C., WU, Q., LUO, X., AHN, S., HAN, Z., ABDI, A. H., LI, D., AND LIN, CHIN-YEW, E. A. Minference 1.0: Accelerating prefilling for long-context llms via dynamic sparse attention. arXiv preprint arXiv:2407.02490 (2024)."},{"key":"e_1_2_1_13_1","volume-title":"Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. arXiv preprint arXiv:2410.18038","author":"KAMATH A. K.","year":"2024","unstructured":"KAMATH, A. K., PRABHU, R., MOHAN, J., PETER, S., RAMJEE, R., AND PANWAR, A. Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. 
arXiv preprint arXiv:2410.18038 (2024)."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_2_1_15_1","volume-title":"NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following","author":"LI D.","year":"2023","unstructured":"LI, D., SHAO, R., XIE, A., SHENG, Y., ZHENG, L., GONZALEZ, J., STOICA, I., MA, X., AND ZHANG, H. How long can context length of open-source llms truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following (2023)."},{"key":"e_1_2_1_16_1","volume-title":"Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469","author":"LI Y.","year":"2024","unstructured":"LI, Y., HUANG, Y., YANG, B., VENKITESH, B., LOCATELLI, A., YE, H., CAI, T., LEWIS, P., AND CHEN, D. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469 (2024)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2021.07.045"},{"key":"e_1_2_1_18_1","volume-title":"Base of rope bounds context length. arXiv preprint arXiv:2405.14591","author":"MEN X.","year":"2024","unstructured":"MEN, X., XU, M., WANG, B., ZHANG, Q., LIN, H., HAN, X., AND CHEN, W. Base of rope bounds context length. arXiv preprint arXiv:2405.14591 (2024)."},{"key":"e_1_2_1_19_1","volume-title":"In-context learning and induction heads. arXiv preprint arXiv:2209.11895","author":"OLSSON C.","year":"2022","unstructured":"OLSSON, C., ELHAGE, N., NANDA, N., JOSEPH, N., DASSARMA, N., HENIGHAN, T., MANN, B., ASKELL, A., BAI, Y., CHEN, A., ET AL. In-context learning and induction heads. arXiv preprint arXiv:2209.11895 (2022)."},{"key":"e_1_2_1_20_1","volume-title":"The what, why, and how of context length extension techniques in large language models-a detailed survey. arXiv preprint arXiv:2401.07872","author":"PAWAR S.","year":"2024","unstructured":"PAWAR, S., TONMOY, S., ZAMAN, S., JAIN, V., CHADHA, A., AND DAS, A. 
The what, why, and how of context length extension techniques in large language models-a detailed survey. arXiv preprint arXiv:2401.07872 (2024)."},{"key":"e_1_2_1_21_1","volume-title":"vattention: Dynamic memory management for serving llms without pagedattention. arXiv preprint arXiv:2405.04437","author":"PRABHU R.","year":"2024","unstructured":"PRABHU, R., NAYAK, A., MOHAN, J., RAMJEE, R., AND PANWAR, A. vattention: Dynamic memory management for serving llms without pagedattention. arXiv preprint arXiv:2405.04437 (2024)."},{"key":"e_1_2_1_22_1","volume-title":"Compressive transformers for long-range sequence modelling. arXiv preprint","author":"RAE J. W.","year":"2019","unstructured":"RAE, J. W., POTAPENKO, A., JAYAKUMAR, S. M., HILLIER, C., AND LILLICRAP, T. P. Compressive transformers for long-range sequence modelling. arXiv preprint (2019)."},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the 41st International Conference on Machine Learning (ICML)","author":"RIBAR L.","year":"2024","unstructured":"RIBAR, L., CHELOMBIEV, I., HUDLASS-GALLEY, L., BLAKE, C., LUSCHI, C., AND ORR, D. Sparq attention: Bandwidth-efficient llm inference. In Proceedings of the 41st International Conference on Machine Learning (ICML) (2024)."},{"key":"e_1_2_1_24_1","volume-title":"Selfattention with relative position representations. arXiv preprint arXiv:1803.02155","author":"SHAW P.","year":"2018","unstructured":"SHAW, P., USZKOREIT, J., AND VASWANI, A. Selfattention with relative position representations. arXiv preprint arXiv:1803.02155 (2018)."},{"key":"e_1_2_1_25_1","unstructured":"SHENOY V. AND KIELY P. A guide to llm inference and performance n.d."},{"key":"e_1_2_1_26_1","volume-title":"Conference on Language Modeling(COLM)","author":"SHI L.","year":"2024","unstructured":"SHI, L., ZHANG, H., YAO, Y., LI, Z., AND ZHAO, H. Keep the cost down: A review on methods to optimize llm's kv-cache consumption. 
Conference on Language Modeling(COLM) (2024)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.127063"},{"key":"e_1_2_1_28_1","volume-title":"Razorattention: Efficient kv cache compression through retrieval heads. arXiv preprint arXiv:2407.15891","author":"TANG H.","year":"2024","unstructured":"TANG, H., LIN, Y., LIN, J., HAN, Q., HONG, S., YAO, Y., AND WANG, G. Razorattention: Efficient kv cache compression through retrieval heads. arXiv preprint arXiv:2407.15891 (2024)."},{"key":"e_1_2_1_29_1","volume-title":"S. QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference. In Proceedings of the International Conference on Machine Learning (ICML)","author":"TANG J.","year":"2024","unstructured":"TANG, J., ZHAO, Y., ZHU, K., XIAO, G., KASIKCI, B., AND HAN, S. QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference. In Proceedings of the International Conference on Machine Learning (ICML) (2024)."},{"key":"e_1_2_1_30_1","volume-title":"Scaling laws vs model architectures: How does inductive bias influence scaling? arXiv preprint arXiv:2207.10551","author":"TAY Y.","year":"2022","unstructured":"TAY, Y., DEHGHANI, M., ABNAR, S., CHUNG, H. W., FEDUS, W., RAO, J., NARANG, S., TRAN, V. Q., YOGATAMA, D., AND METZLER, D. Scaling laws vs model architectures: How does inductive bias influence scaling? arXiv preprint arXiv:2207.10551 (2022)."},{"key":"e_1_2_1_31_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"VASWANI A.","year":"2017","unstructured":"VASWANI, A., SHAZEER, N., PARMAR, N., USZKOREIT, J., JONES, L., GOMEZ, A. N., KAISER,., AND POLOSUKHIN, I. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_1_32_1","unstructured":"VERGE T. Google launches gemini the ai model it hopes will take down gpt-4."},{"key":"e_1_2_1_33_1","volume-title":"Retrieval head mechanistically explains long-context factuality. 
arXiv preprint arXiv:2404.15574","author":"WU W.","year":"2024","unstructured":"WU, W., WANG, Y., XIAO, G., PENG, H., AND FU, Y. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574 (2024)."},{"key":"e_1_2_1_34_1","volume-title":"ET AL. Streamingllm: Efficient processing of streaming data with large language models. arXiv preprint arXiv:2304.05678","author":"XIAO G.","year":"2023","unstructured":"XIAO, G., ET AL. Streamingllm: Efficient processing of streaming data with large language models. arXiv preprint arXiv:2304.05678 (2023)."},{"key":"e_1_2_1_35_1","volume-title":"Kv cache compression, but what must we give in return? a comprehensive benchmark of long context capable approaches. arXiv preprint arXiv:2407.01527","author":"XIAO G.","year":"2024","unstructured":"XIAO, G., TANG, J., GUO, J., TANG, H., YANG, S., ZUO, J., FU, Y., AND HAN, S. Kv cache compression, but what must we give in return? a comprehensive benchmark of long context capable approaches. arXiv preprint arXiv:2407.01527 (2024)."},{"key":"e_1_2_1_36_1","volume-title":"Duoattention: Efficient long-context llm inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819","author":"XIAO G.","year":"2024","unstructured":"XIAO, G., TANG, J., ZUO, J., GUO, J., YANG, S., TANG, H., FU, Y., AND HAN, S. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819 (2024)."},{"key":"e_1_2_1_37_1","volume-title":"Llm inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363","author":"YUAN Z.","year":"2024","unstructured":"YUAN, Z., SHANG, Y., ZHOU, Y., DONG, Z., ZHOU, Z., XUE, C., WU, B., LI, Z., GU, Q., AND LEE, Y. J. Llm inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363 (2024)."},{"key":"e_1_2_1_38_1","volume-title":"H2o: Hybrid hierarchical optimization for long-context language models. 
arXiv preprint arXiv:2403.11234","author":"ZHANG Y.","year":"2024","unstructured":"ZHANG, Y., ET AL. H2o: Hybrid hierarchical optimization for long-context language models. arXiv preprint arXiv:2403.11234 (2024)."}],"container-title":["ACM SIGOPS Operating Systems Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3759441.3759448","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,7]],"date-time":"2025-08-07T19:50:11Z","timestamp":1754596211000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3759441.3759448"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,4]]},"references-count":38,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,8,4]]}},"alternative-id":["10.1145\/3759441.3759448"],"URL":"https:\/\/doi.org\/10.1145\/3759441.3759448","relation":{},"ISSN":["0163-5980"],"issn-type":[{"type":"print","value":"0163-5980"}],"subject":[],"published":{"date-parts":[[2025,8,4]]},"assertion":[{"value":"2025-08-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}