{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T01:11:59Z","timestamp":1780708319172,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":83,"publisher":"ACM","license":[{"start":{"date-parts":[[2025,3,30]],"date-time":"2025-03-30T00:00:00Z","timestamp":1743292800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,3,30]]},"DOI":"10.1145\/3669940.3707256","type":"proceedings-article","created":{"date-parts":[[2025,2,6]],"date-time":"2025-02-06T12:28:01Z","timestamp":1738844881000},"page":"1133-1150","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":25,"title":["vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-7266-5522","authenticated-orcid":false,"given":"Ramya","family":"Prabhu","sequence":"first","affiliation":[{"name":"Microsoft Research, Bangalore, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5313-1328","authenticated-orcid":false,"given":"Ajay","family":"Nayak","sequence":"additional","affiliation":[{"name":"Indian Institute of Science, Bangalore, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-5260-3203","authenticated-orcid":false,"given":"Jayashree","family":"Mohan","sequence":"additional","affiliation":[{"name":"Microsoft Research, Bangalore, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0007-6040","authenticated-orcid":false,"given":"Ramachandran","family":"Ramjee","sequence":"additional","affiliation":[{"name":"Microsoft Research, Bangalore, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0621-4412","authenticated-orcid":false,"given":"Ashish","family":"Panwar","sequence":"additional","affiliation":[{"name":"Microsoft Research, Bangalore, India"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,3,30]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2022. FlashAttention. https:\/\/github.com\/Dao-AILab\/flash-attention."},{"key":"e_1_3_2_1_2_1","unstructured":"2023. Flash-Decoding for long-context inference. https:\/\/crfm.stanford.edu\/2023\/10\/12\/flashdecoding.html."},{"key":"e_1_3_2_1_3_1","unstructured":"2023. FlashInfer: Kernel Library for LLM Serving. https:\/\/github.com\/flashinfer-ai\/flashinfer."},{"key":"e_1_3_2_1_4_1","unstructured":"2023. LightLLM: A Light and Fast Inference Service for LLM. https:\/\/github.com\/ModelTC\/lightllm."},{"key":"e_1_3_2_1_5_1","unstructured":"2023. Performance decay when using paged attention. https:\/\/github.com\/NVIDIA\/TensorRT-LLM\/issues\/75."},{"key":"e_1_3_2_1_6_1","unstructured":"2023. TensorRT-LLM: A TensorRT Toolbox for Optimized Large Language Model Inference. https:\/\/github.com\/NVIDIA\/TensorRT-LLM."},{"key":"e_1_3_2_1_7_1","unstructured":"2023. Use optimized kernels for MQA\/GQA. https:\/\/github.com\/vllmproject\/vllm\/issues\/1880."},{"key":"e_1_3_2_1_8_1","unstructured":"2024. Add support for small page sizes. https:\/\/github.com\/Dao-AILab\/flash-attention\/pull\/824."},{"key":"e_1_3_2_1_9_1","unstructured":"2024. Amazon CodeWhisperer. https:\/\/aws.amazon.com\/codewhisperer\/."},{"key":"e_1_3_2_1_10_1","unstructured":"2024. Bing AI. https:\/\/www.bing.com\/chat."},{"key":"e_1_3_2_1_11_1","unstructured":"2024. ccdv\/arxiv-summarization. https:\/\/huggingface.co\/datasets\/ccdv\/arxiv-summarization."},{"key":"e_1_3_2_1_12_1","unstructured":"2024. CUDA Toolkit Documentation: Virtual Memory Management. https:\/\/docs.nvidia.com\/cuda\/cuda-driver-api\/group__CUDA__VA.html."},{"key":"e_1_3_2_1_13_1","unstructured":"2024. Custom CUDA kernels for KV cache copy operations. https:\/\/github.com\/vllm-project\/vllm\/blob\/main\/csrc\/cache_kernels.cu."},{"key":"e_1_3_2_1_14_1","unstructured":"2024. Custom strides to support non-contiguous KVcache. https:\/\/github.com\/flashinfer-ai\/flashinfer\/commit\/ 85b1878996a29814f674ee5000facb1e2e763d9a."},{"key":"e_1_3_2_1_15_1","unstructured":"2024. Faster Transformer. https:\/\/github.com\/NVIDIA\/FasterTransformer."},{"key":"e_1_3_2_1_16_1","unstructured":"2024. [Feature]: FlashAttention 3 support. https:\/\/github.com\/vllmproject\/vllm\/issues\/6348#issuecomment-2540969988."},{"key":"e_1_3_2_1_17_1","unstructured":"2024. Fix eager mode performance. https:\/\/github.com\/vllm-project\/vllm\/pull\/2377."},{"key":"e_1_3_2_1_18_1","unstructured":"2024. Github Copilot. https:\/\/github.com\/features\/copilot."},{"key":"e_1_3_2_1_19_1","unstructured":"2024. Google Bard. https:\/\/bard.google.com."},{"key":"e_1_3_2_1_20_1","unstructured":"2024. Implement Page KV Cache. https:\/\/github.com\/Dao-AILab\/flashattention\/commit\/54e80a3829c6d2337570d01e78ebd9529c02d342."},{"key":"e_1_3_2_1_21_1","unstructured":"2024. Meta-Llama-3-8B. https:\/\/huggingface.co\/meta-llama\/Meta-Llama-3-8B."},{"key":"e_1_3_2_1_22_1","unstructured":"2024. Pascal MMU Format Changes. https:\/\/nvidia.github.io\/opengpu-doc\/pascal\/gp100-mmu-format.pdf."},{"key":"e_1_3_2_1_23_1","unstructured":"2024. PoC of dAttention support (based on vAttention). https:\/\/github. com\/vllm-project\/vllm\/pull\/9078."},{"key":"e_1_3_2_1_24_1","unstructured":"2024. Refactor Attention Take 2. https:\/\/github.com\/vllm-project\/vllm\/pull\/3462."},{"key":"e_1_3_2_1_25_1","unstructured":"2024. Replit Ghostwriter. https:\/\/replit.com\/site\/ghostwriter."},{"key":"e_1_3_2_1_26_1","unstructured":"2024. [Roadmap] vLLM Roadmap Q4 2024 #9006. https:\/\/github.com\/vllm-project\/vllm\/issues\/9006#issue-2559831134."},{"key":"e_1_3_2_1_27_1","unstructured":"2024. Separate attention backends. https:\/\/github.com\/vllm-project\/vllm\/pull\/3005\/."},{"key":"e_1_3_2_1_28_1","unstructured":"2024. Text Generation Inference. https:\/\/huggingface.co\/textgeneration-inference."},{"key":"e_1_3_2_1_29_1","unstructured":"2024. Tile primitives for speedy kernels. https:\/\/github.com\/HazyResearch\/ThunderKittens."},{"key":"e_1_3_2_1_30_1","unstructured":"2024. Use FlashInfer for Decoding. https:\/\/github.com\/vllm-project\/vllm\/pull\/4353."},{"key":"e_1_3_2_1_31_1","unstructured":"2024. VMM KV cache for NVIDIA GPUs. https:\/\/github.com\/vllmproject\/vllm\/pull\/6102."},{"key":"e_1_3_2_1_32_1","unstructured":"2024. Yi-34B-200K. https:\/\/huggingface.co\/01-ai\/Yi-34B-200K."},{"key":"e_1_3_2_1_33_1","unstructured":"2024. Yi-6B-200K. https:\/\/huggingface.co\/01-ai\/Yi-6B-200K."},{"key":"e_1_3_2_1_34_1","volume-title":"Proceedings of The Seventh Annual Conference on Machine Learning and Systems, 2024","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S Gulavani, Ramachandran Ramjee, and Alexey Tumanov. 2024. Vidur: A Large-Scale Simulation Framework For LLM Inference. Proceedings of The Seventh Annual Conference on Machine Learning and Systems, 2024, Santa Clara (2024)."},{"key":"e_1_3_2_1_35_1","volume-title":"Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 117--134. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/agrawal"},{"key":"e_1_3_2_1_36_1","volume-title":"SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv:2308.16369 [cs.LG]","author":"Agrawal Amey","year":"2023","unstructured":"Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv:2308.16369 [cs.LG]"},{"key":"e_1_3_2_1_37_1","volume-title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245 [cs.CL]","author":"Ainslie Joshua","year":"2023","unstructured":"Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr\u00f3n, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245 [cs.CL]"},{"key":"e_1_3_2_1_38_1","volume-title":"Longformer: The Long-Document Transformer. arXiv:2004.05150 [cs.CL]","author":"Beltagy Iz","year":"2020","unstructured":"Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. arXiv:2004.05150 [cs.CL]"},{"key":"e_1_3_2_1_39_1","unstructured":"Ganesh Bikshandi and Jay Shah. 2023. A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library. arXiv:2312.11918 [cs.LG]"},{"key":"e_1_3_2_1_40_1","unstructured":"Rewon Child Scott Gray Alec Radford and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. arXiv:1904.10509 [cs.LG]"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","unstructured":"Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra Adam Roberts Paul Barham Hyung Won Chung Charles Sutton Sebastian Gehrmann Parker Schuh Kensen Shi Sasha Tsvyashchenko Joshua Maynez Abhishek Rao Parker Barnes Yi Tay Noam Shazeer Vinodkumar Prabhakaran Emily Reif Nan Du Ben Hutchinson Reiner Pope James Bradbury Jacob Austin Michael Isard Guy Gur-Ari Pengcheng Yin Toju Duke Anselm Levskaya Sanjay Ghemawat Sunipa Dev Henryk Michalewski Xavier Garcia Vedant Misra Kevin Robinson Liam Fedus Denny Zhou Daphne Ippolito David Luan Hyeontaek Lim Barret Zoph Alexander Spiridonov Ryan Sepassi David Dohan Shivani Agrawal Mark Omernick Andrew M. Dai Thanumalayan Sankaranarayana Pillai Marie Pellat Aitor Lewkowycz Erica Moreira Rewon Child Oleksandr Polozov Katherine Lee Zongwei Zhou Xuezhi Wang Brennan Saeta Mark Diaz Orhan Firat Michele Catasta Jason Wei Kathy Meier-Hellstern Douglas Eck Jeff Dean Slav Petrov and Noah Fiedel. 2022. PaLM: Scaling Language Modeling with Pathways. CoRR abs\/2204.02311 (2022). https:\/\/doi.org\/10.48550\/arXiv.2204.02311 arXiv:2204.02311","DOI":"10.48550\/arXiv.2204.02311"},{"key":"e_1_3_2_1_42_1","unstructured":"Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691 [cs.LG]"},{"key":"e_1_3_2_1_43_1","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"Dao Tri","year":"2024","unstructured":"Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2024. FLASHATTENTION: fast and memory-efficient exact attention with IO-awareness. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS '22). Curran Associates Inc., Red Hook, NY, USA, Article 1189, 16 pages."},{"key":"e_1_3_2_1_44_1","volume-title":"M\u00e9lange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity. arXiv:2404.14527 [cs.DC] https:\/\/arxiv.org\/abs\/2404.14527","author":"Griggs Tyler","year":"2024","unstructured":"Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. 2024. M\u00e9lange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity. arXiv:2404.14527 [cs.DC] https:\/\/arxiv.org\/abs\/2404.14527"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640423"},{"key":"e_1_3_2_1_46_1","volume-title":"Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He.","author":"Holmes Connor","year":"2024","unstructured":"Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. arXiv:2401.08671 [cs.PF]"},{"key":"e_1_3_2_1_47_1","volume-title":"Proceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.)","volume":"6","author":"Hong Ke","year":"2024","unstructured":"Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, kangdi chen, Yuhan Dong, and Yu Wang. 2024. FlashDecoding: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics. In Proceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.), Vol. 6. 148--161. https:\/\/proceedings.mlsys.org\/paper_files\/paper\/2024\/file\/ 5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf"},{"key":"e_1_3_2_1_48_1","volume-title":"Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. arXiv preprint arXiv:2401.11181","author":"Hu Cunchen","year":"2024","unstructured":"Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al. 2024. Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. arXiv preprint arXiv:2401.11181 (2024)."},{"key":"e_1_3_2_1_49_1","unstructured":"Aditya K Kamath Ramya Prabhu Jayashree Mohan Simon Peter Ramachandran Ramjee and Ashish Panwar. 2024. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference. arXiv:2410.18038 [cs.LG] https:\/\/arxiv.org\/abs\/2410.18038"},{"key":"e_1_3_2_1_50_1","unstructured":"Ferdi Kossmann Bruce Fontaine Daya Khudia Michael Cafarella and Samuel Madden. 2024. Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs. arXiv:2410.17840 [cs.LG] https:\/\/arxiv.org\/abs\/2410.17840"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_52_1","volume-title":"Coordinated and Efficient Huge Page Management with Ingens. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)","author":"Kwon Youngjin","year":"2016","unstructured":"Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, Savannah, GA, 705--721. https:\/\/www.usenix.org\/conference\/osdi16\/ technical-sessions\/presentation\/kwon"},{"key":"e_1_3_2_1_53_1","unstructured":"Wonbeom Lee Jungi Lee Junghwan Seo and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. arXiv:2406.19707 [cs.LG] https:\/\/arxiv.org\/abs\/2406.19707"},{"key":"e_1_3_2_1_54_1","unstructured":"Da Ma Lu Chen Situo Zhang Yuxun Miao Su Zhu Zhi Chen Hongshen Xu Hanqi Li Shuai Fan Lei Pan and Kai Yu. 2024. Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity. arXiv:2412.02252 [cs.CL] https:\/\/arxiv.org\/abs\/2412.02252"},{"key":"e_1_3_2_1_55_1","volume-title":"MAPLE: A Framework for Active Preference Learning Guided by Large Language Models. arXiv:2412.07207 [cs.LG] https:\/\/arxiv.org\/abs\/2412.07207","author":"Mahmud Saaduddin","year":"2024","unstructured":"Saaduddin Mahmud, Mason Nakamura, and Shlomo Zilberstein. 2024. MAPLE: A Framework for Active Preference Learning Guided by Large Language Models. arXiv:2412.07207 [cs.LG] https:\/\/arxiv.org\/abs\/2412.07207"},{"key":"e_1_3_2_1_56_1","volume-title":"Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs. arXiv:2406.01566 [cs.LG] https:\/\/arxiv.org\/abs\/2406.01566","author":"Mei Yixuan","year":"2024","unstructured":"Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2024. Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs. arXiv:2406.01566 [cs.LG] https:\/\/arxiv.org\/abs\/2406.01566"},{"key":"e_1_3_2_1_57_1","unstructured":"Xupeng Miao Chunan Shi Jiangfei Duan Xiaoli Xi Dahua Lin Bin Cui and Zhihao Jia. 2023. SpotServe: Serving Generative Large Language Models on Preemptible Instances. arXiv:2311.15566 [cs.DC] https:\/\/arxiv.org\/abs\/2311.15566"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3433210.3453077"},{"key":"e_1_3_2_1_59_1","unstructured":"Matthew Nicely and NVIDIA. 2024. Accelerating Transformers with NVIDIA cuDNN 9. https:\/\/developer.nvidia.com\/blog\/acceleratingtransformers-with-nvidia-cudnn-9\/"},{"key":"e_1_3_2_1_60_1","unstructured":"NVIDIA. 2024. CUDA C Programming Guide. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html"},{"key":"e_1_3_2_1_61_1","unstructured":"OpenAI. 2023. GPT-4 Technical Report. CoRR abs\/2303.08774 (2023). https:\/\/doi.org\/10.48550\/arXiv.2303.08774 arXiv:2303.08774"},{"key":"e_1_3_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304064"},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00019"},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541942"},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00036"},{"key":"e_1_3_2_1_66_1","volume-title":"Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers. arXiv:2405.10480 [cs.AR] https:\/\/arxiv.org\/abs\/2405.10480","author":"Sanovar Rya","year":"2024","unstructured":"Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor R\u00fchle, and Saravan Rajmohan. 2024. Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers. arXiv:2405.10480 [cs.AR] https:\/\/arxiv.org\/abs\/2405.10480"},{"key":"e_1_3_2_1_67_1","unstructured":"Jay Shah Ganesh Bikshandi Ying Zhang Vijay Thakkar Pradeep Ramani and Tri Dao. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. (2024)."},{"key":"e_1_3_2_1_68_1","unstructured":"Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150 [cs.NE]"},{"key":"e_1_3_2_1_69_1","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. 2023. FlexGen: high-throughput generative inference of large language models with a single GPU. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML'23). JMLR.org, Article 1288, 23 pages."},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00036"},{"key":"e_1_3_2_1_71_1","volume-title":"Loki: Low-rank Keys for Efficient Sparse Attention. arXiv:2406.02542 [cs.LG] https:\/\/arxiv.org\/abs\/2406.02542","author":"Singhania Prajwal","year":"2024","unstructured":"Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, and Abhinav Bhatele. 2024. Loki: Low-rank Keys for Efficient Sparse Attention. arXiv:2406.02542 [cs.LG] https:\/\/arxiv.org\/abs\/2406.02542"},{"key":"e_1_3_2_1_72_1","unstructured":"Foteini Strati Sara Mcallister Amar Phanishayee Jakub Tarnawski and Ana Klimovic. 2024. D\u00e9j\u00e0Vu: KV-cache Streaming for Fast Faulttolerant Generative LLM Serving. arXiv:2403.01876 [cs.DC] https:\/\/arxiv.org\/abs\/2403.01876"},{"key":"e_1_3_2_1_73_1","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_3_2_1_74_1","unstructured":"Guan Wang Sijie Cheng Xianyuan Zhan Xiangang Li Sen Song and Yang Liu. 2023. OpenChat: Advancing Open-source Language Models with Mixed-Quality Data. arXiv:2309.11235 [cs.CL]"},{"key":"e_1_3_2_1_75_1","unstructured":"Bingyang Wu Yinmin Zhong Zili Zhang Gang Huang Xuanzhe Liu and Xin Jin. 2023. Fast Distributed Inference Serving for Large Language Models. arXiv:2305.05920 [cs.LG]"},{"key":"e_1_3_2_1_76_1","unstructured":"Mengdi Wu Xinhao Cheng Oded Padon and Zhihao Jia. 2024. A Multi-Level Superoptimizer for Tensor Programs. arXiv:2405.05751"},{"key":"e_1_3_2_1_77_1","unstructured":"Zihao Ye Lequn Chen Ruihang Lai Yilong Zhao Size Zheng Junru Shao Bohan Hou Hongyi Jin Yifei Zuo Liangsheng Yin Tianqi Chen and Luis Ceze. 2024. Accelerating Self-Attentions for LLM Serving with FlashInfer. https:\/\/flashinfer.ai\/2024\/02\/02\/introduce-flashinfer.html"},{"key":"e_1_3_2_1_78_1","volume-title":"Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521--538. https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/yu"},{"key":"e_1_3_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/3576915.3616672"},{"key":"e_1_3_2_1_80_1","volume-title":"Levine (Eds.)","volume":"36","author":"Zhang Zhenyu","year":"2023","unstructured":"Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R\u00e9, Clark Barrett, Zhangyang \"Atlas\" Wang, and Beidi Chen. 2023. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 34661--34710. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2023\/file\/6ceefa7b15572587b78ecfcebb2827f8-Paper-Conference.pdf"},{"key":"e_1_3_2_1_81_1","volume-title":"Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng.","author":"Zheng Lianmin","year":"2024","unstructured":"Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104 [cs.AI] https:\/\/arxiv.org\/abs\/2312.07104"},{"key":"e_1_3_2_1_82_1","volume-title":"DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 193--210. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/zhongyinmin"},{"key":"e_1_3_2_1_83_1","unstructured":"Kan Zhu Yilong Zhao Liangyu Zhao Gefei Zuo Yile Gu Dedong Xie Yufei Gao Qinyu Xu Tian Tang Zihao Ye Keisuke Kamahori Chien-Yu Lin Stephanie Wang Arvind Krishnamurthy and Baris Kasikci. 2024. NanoFlow: Towards Optimal Large Language Model Serving Throughput. arXiv:2408.12757 [cs.DC] https:\/\/arxiv.org\/abs\/2408.12757"}],"event":{"name":"ASPLOS '25: 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","location":"Rotterdam Netherlands","acronym":"ASPLOS '25","sponsor":["SIGPLAN ACM Special Interest Group on Programming Languages","SIGOPS ACM Special Interest Group on Operating Systems","SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3669940.3707256","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3669940.3707256","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T14:52:17Z","timestamp":1755787937000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3669940.3707256"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,30]]},"references-count":83,"alternative-id":["10.1145\/3669940.3707256","10.1145\/3669940"],"URL":"https:\/\/doi.org\/10.1145\/3669940.3707256","relation":{},"subject":[],"published":{"date-parts":[[2025,3,30]]},"assertion":[{"value":"2025-03-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}