{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T20:51:09Z","timestamp":1777063869492,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":68,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,4,27]]},"DOI":"10.1145\/3767295.3769328","type":"proceedings-article","created":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T20:20:04Z","timestamp":1777062004000},"page":"497-513","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-7397-6311","authenticated-orcid":false,"given":"Junyi","family":"Chen","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-3465-4023","authenticated-orcid":false,"given":"Chuheng","family":"Du","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9710-6116","authenticated-orcid":false,"given":"Renyuan","family":"Liu","sequence":"additional","affiliation":[{"name":"George Mason University, Fairfax, VA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7446-1430","authenticated-orcid":false,"given":"Shuochao","family":"Yao","sequence":"additional","affiliation":[{"name":"George Mason University, Fairfax, VA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-3112-8215","authenticated-orcid":false,"given":"Dingtian","family":"Yan","sequence":"additional","affiliation":[{"name":"China Telecom Corporation Limited Shanghai Branch, Shanghai, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-2128-3096","authenticated-orcid":false,"given":"Jiang","family":"Liao","sequence":"additional","affiliation":[{"name":"China Telecom Corporation Limited Shanghai Branch, Shanghai, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7643-7239","authenticated-orcid":false,"given":"Shengzhong","family":"Liu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0965-9058","authenticated-orcid":false,"given":"Fan","family":"Wu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6934-1685","authenticated-orcid":false,"given":"Guihai","family":"Chen","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,4,26]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774","author":"Achiam Josh","year":"2023","unstructured":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023."},{"key":"e_1_3_2_1_2_1","first-page":"134","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathiserve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 117\u2013134, 2024."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.678"},{"key":"e_1_3_2_1_4_1","first-page":"15","volume-title":"SC22: International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Aminabadi Reza Yazdani","unstructured":"Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1\u201315. IEEE, 2022."},{"key":"e_1_3_2_1_5_1","volume-title":"Qwen technical report. arXiv preprint arXiv:2309.16609","author":"Bai Jinze","year":"2023","unstructured":"Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023."},{"key":"e_1_3_2_1_6_1","volume-title":"Language models are few-shot learners. Advances in neural information processing systems (NeurIPS), 33:1877\u20131901","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems (NeurIPS), 33:1877\u20131901, 2020."},{"key":"e_1_3_2_1_7_1","first-page":"11267","volume-title":"Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers)","author":"Chen Junyi","year":"2025","unstructured":"Junyi Chen, Shihao Bai, Zaijun Wang, Siyu Wu, Chuheng Du, Hailong Yang, Ruihao Gong, Shengzhong Liu, Fan Wu, and Guihai Chen. Pre3: Enabling deterministic pushdown automata for faster structured LLM generation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers), pages 11253\u201311267, Vienna, Austria, July 2025. Association for Computational Linguistics."},{"key":"e_1_3_2_1_8_1","volume-title":"Slos-serve: Optimized serving of multi-slo llms. arXiv preprint arXiv:2504.08784","author":"Chen Siyuan","year":"2025","unstructured":"Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, and Phillip B Gibbons. Slos-serve: Optimized serving of multi-slo llms. arXiv preprint arXiv:2504.08784, 2025."},{"key":"e_1_3_2_1_9_1","volume-title":"Slice-level scheduling for high throughput and load balanced llm serving. arXiv preprint arXiv:2406.13511","author":"Cheng Ke","year":"2024","unstructured":"Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, and Sheng Zhang. Slice-level scheduling for high throughput and load balanced llm serving. arXiv preprint arXiv:2406.13511, 2024."},{"key":"e_1_3_2_1_10_1","volume-title":"Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference. arXiv preprint arXiv:2402.09398","author":"Dong Harry","year":"2024","unstructured":"Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, and Beidi Chen. Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference. arXiv preprint arXiv:2402.09398, 2024."},{"key":"e_1_3_2_1_11_1","first-page":"224","article-title":"Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-experts models","volume":"6","author":"Du Zhixu","year":"2024","unstructured":"Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai Li, and Yiran Chen. Sida: Sparsity-inspired data-aware serving for efficient and scalable large mixture-of-experts models. Proceedings of Machine Learning and Systems, 6:224\u2013238, 2024.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3689031.3717481"},{"key":"e_1_3_2_1_13_1","first-page":"126","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC)","author":"Gao Bin","year":"2024","unstructured":"Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. Cost-Efficient large language model serving for multi-turn conversations with Cached Attention. In 2024 USENIX Annual Technical Conference (USENIX ATC), pages 111\u2013126, Santa Clara, CA, July 2024. USENIX Association."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3689031.3696072"},{"key":"e_1_3_2_1_15_1","volume-title":"Proceedings of Machine Learning and Systems (MLSys), 6:325\u2013338","author":"Gim In","year":"2024","unstructured":"In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems (MLSys), 6:325\u2013338, 2024."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3676641.3716011"},{"key":"e_1_3_2_1_17_1","volume-title":"The llama 3 herd of models. arXiv preprint arXiv:2407.21783","author":"Grattafiori Aaron","year":"2024","unstructured":"Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024."},{"key":"e_1_3_2_1_18_1","volume-title":"Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948","author":"Guo Daya","year":"2025","unstructured":"Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3663384.3663398"},{"key":"e_1_3_2_1_20_1","volume-title":"Proceedings of Machine Learning and Systems (MLSys), 6:148\u2013161","author":"Hong Ke","year":"2024","unstructured":"Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. Flashdecoding++: Faster large language model inference with asynchronization, flat gemm optimization, and heuristics. Proceedings of Machine Learning and Systems (MLSys), 6:148\u2013161, 2024."},{"key":"e_1_3_2_1_21_1","volume-title":"P\/d-serve: Serving disaggregated large language model at scale. arXiv preprint arXiv:2408.08147","author":"Jin Yibo","year":"2024","unstructured":"Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, et al. P\/d-serve: Serving disaggregated large language model at scale. arXiv preprint arXiv:2408.08147, 2024."},{"key":"e_1_3_2_1_22_1","volume-title":"Is the gpu half-empty or half-full? practical scheduling techniques for llms. arXiv preprint arXiv:2410.17840","author":"Kossmann Ferdi","year":"2024","unstructured":"Ferdi Kossmann, Bruce Fontaine, Daya Khudia, Michael Cafarella, and Samuel Madden. Is the gpu half-empty or half-full? practical scheduling techniques for llms. arXiv preprint arXiv:2410.17840, 2024."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_24_1","first-page":"959","volume-title":"2023 USENIX Annual Technical Conference (USENIX ATC)","author":"Li Jiamin","year":"2023","unstructured":"Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed {MoE} training and inference with lina. In 2023 USENIX Annual Technical Conference (USENIX ATC), pages 945\u2013959, 2023."},{"key":"e_1_3_2_1_25_1","volume-title":"A speed odyssey for deployable quantization of llms. arXiv preprint arXiv:2311.09550","author":"Li Qingyuan","year":"2023","unstructured":"Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Liang Li, Yifan Lu, Xiangxiang Chu, Yerui Sun, and Yuchen Xie. A speed odyssey for deployable quantization of llms. arXiv preprint arXiv:2311.09550, 2023."},{"key":"e_1_3_2_1_26_1","volume-title":"Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems (NeurIPS), 37:22947\u201322970","author":"Li Yuhong","year":"2024","unstructured":"Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems (NeurIPS), 37:22947\u201322970, 2024."},{"key":"e_1_3_2_1_27_1","volume-title":"Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434","author":"Liu Aixin","year":"2024","unstructured":"Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024."},{"key":"e_1_3_2_1_28_1","volume-title":"Minicache: Kv cache compression in depth dimension for large language models. Advances in Neural Information Processing Systems (NeurIPS), 37:139997\u2013140031","author":"Liu Akide","year":"2024","unstructured":"Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Reza Haffari, and Bohan Zhuang. Minicache: Kv cache compression in depth dimension for large language models. Advances in Neural Information Processing Systems (NeurIPS), 37:139997\u2013140031, 2024."},{"key":"e_1_3_2_1_29_1","volume-title":"Andes: Defining and enhancing quality-of-experience in llm-based text streaming services. arXiv preprint arXiv:2404.16283","author":"Liu Jiachen","year":"2024","unstructured":"Jiachen Liu, Jae-Won Chung, Zhiyu Wu, Fan Lai, Myungjin Lee, and Mosharaf Chowdhury. Andes: Defining and enhancing quality-of-experience in llm-based text streaming services. arXiv preprint arXiv:2404.16283, 2024."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3662006.3662063"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-017-08652-0"},{"key":"e_1_3_2_1_32_1","first-page":"52342","article-title":"Exploiting the persistence of importance hypothesis for llm kv cache compression at test time","volume":"36","author":"Liu Zichang","year":"2023","unstructured":"Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36:52342\u201352364, 2023.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_33_1","volume-title":"Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750","author":"Liu Zirui","year":"2024","unstructured":"Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640383"},{"key":"e_1_3_2_1_35_1","volume-title":"Accessed","author":"AI.","year":"2025","unstructured":"OpenAI. What are tokens and how to count them? https:\/\/help.openai.com\/en\/articles\/4936856-what-are-tokens-and-how-to-count-them, 2023. Accessed: September 11, 2025."},{"key":"e_1_3_2_1_36_1","first-page":"2025","year":"2025","unstructured":"OpenAI. Sharegpt, 2025. Accessed: 2025-05-16.","journal-title":"Sharegpt"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00019"},{"key":"e_1_3_2_1_38_1","volume-title":"Proceedings of Machine Learning and Systems (MLSys), 5:606\u2013624","author":"Pope Reiner","year":"2023","unstructured":"Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems (MLSys), 5:606\u2013624, 2023."},{"key":"e_1_3_2_1_39_1","first-page":"170","volume-title":"23rd USENIX Conference on File and Storage Technologies (FAST 25)","author":"Qin Ruoyu","year":"2025","unstructured":"Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation\u2014a {KVCache-centric} architecture for serving {LLM} chatbot. In 23rd USENIX Conference on File and Storage Technologies (FAST 25), pages 155\u2013170, 2025."},{"issue":"8","key":"e_1_3_2_1_40_1","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.","journal-title":"OpenAI blog"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.3389\/fcomp.2023.1208550"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.infsof.2024.107610"},{"key":"e_1_3_2_1_43_1","first-page":"31116","volume-title":"International Conference on Machine Learning (ICML)","author":"Sheng Ying","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning (ICML), pages 31094\u201331116. PMLR, 2023."},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695964"},{"key":"e_1_3_2_1_45_1","volume-title":"Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (OSDI), USA","author":"Sun Biao","year":"2024","unstructured":"Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: dynamic scheduling for large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (OSDI), USA, 2024. USENIX Association."},{"key":"e_1_3_2_1_46_1","volume-title":"et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023."},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3627508.3638344"},{"key":"e_1_3_2_1_48_1","volume-title":"Understanding user experience in large language model interactions. arXiv preprint arXiv:2401.08329","author":"Wang Jiayin","year":"2024","unstructured":"Jiayin Wang, Weizhi Ma, Peijie Sun, Min Zhang, and Jian-Yun Nie. Understanding user experience in large language model interactions. arXiv preprint arXiv:2401.08329, 2024."},{"key":"e_1_3_2_1_49_1","volume-title":"Amelie Chi Zhou, et al. Burstgpt: A real-world workload dataset to optimize llm serving systems. arXiv preprint arXiv:2401.17644","author":"Wang Yuxin","year":"2024","unstructured":"Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, et al. Burstgpt: A real-world workload dataset to optimize llm serving systems. arXiv preprint arXiv:2401.17644, 2024."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695948"},{"key":"e_1_3_2_1_51_1","volume-title":"Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920","author":"Wu Bingyang","year":"2023","unstructured":"Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023."},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3689031.3717455"},{"key":"e_1_3_2_1_53_1","volume-title":"fast and slow: Cognitive load-aware streaming for efficient llm serving. arXiv preprint arXiv:2504.17999","author":"Xiao Chang","year":"2025","unstructured":"Chang Xiao and Brenda Yang. Streaming, fast and slow: Cognitive load-aware streaming for efficient llm serving. arXiv preprint arXiv:2504.17999, 2025."},{"key":"e_1_3_2_1_54_1","volume-title":"Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023."},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3703187.3703197"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3688399"},{"key":"e_1_3_2_1_57_1","volume-title":"Nova: Real-time agentic vision-language model serving with adaptive cross-stage parallelization","author":"Xu Yuhang","year":"2025","unstructured":"Yuhang Xu, Shengzhong Liu, Dong Zhang, Bingheng Yan, Fan Wu, and Guihai Chen. Nova: Real-time agentic vision-language model serving with adaptive cross-stage parallelization, 2025."},{"key":"e_1_3_2_1_58_1","volume-title":"Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115","author":"Yang An","year":"2024","unstructured":"An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024."},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3689031.3696098"},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3689031.3717468"},{"key":"e_1_3_2_1_61_1","volume-title":"Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition. arXiv preprint arXiv:2402.15220","author":"Ye Lu","year":"2024","unstructured":"Lu Ye, Ze Tao, Yong Huang, and Yang Li. Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition. arXiv preprint arXiv:2402.15220, 2024."},{"key":"e_1_3_2_1_62_1","first-page":"538","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 521\u2013538, Carlsbad, CA, July 2022."},{"key":"e_1_3_2_1_63_1","volume-title":"Voltanallm: Feedback-driven frequency control and state-space routing for energy-efficient llm serving. arXiv preprint arXiv:2509.04827","author":"Yu Jiahuan","year":"2025","unstructured":"Jiahuan Yu, Aryan Taneja, Junfeng Lin, and Minjia Zhang. Voltanallm: Feedback-driven frequency control and state-space routing for energy-efficient llm serving. arXiv preprint arXiv:2509.04827, 2025."},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3689031.3696086"},{"key":"e_1_3_2_1_65_1","volume-title":"H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems (NeurIPS), 36:34661\u201334710","author":"Zhang Zhenyu","year":"2023","unstructured":"Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R\u00e9, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems (NeurIPS), 36:34661\u201334710, 2023."},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"crossref","unstructured":"Lianmin Zheng Liangsheng Yin Zhiqiang Xie Chuyue Livia Sun Jeff Huang Cody Hao Yu Shiyi Cao Christos Kozyrakis Ion Stoica Joseph E Gonzalez et al. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems (NeurIPS) 37:62557\u201362583 2024.","DOI":"10.52202\/079017-2000"},{"key":"e_1_3_2_1_67_1","first-page":"210","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 193\u2013210, 2024."},{"key":"e_1_3_2_1_68_1","first-page":"765","volume-title":"19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25)","author":"Zhu Kan","year":"2025","unstructured":"Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, et al. {NanoFlow}: Towards optimal large language model serving throughput. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 749\u2013765, 2025."}],"event":{"name":"EUROSYS '26: 21st European Conference on Computer Systems","location":"McEwan Hall\/The University of Edinburgh Edinburgh Scotland UK","acronym":"EUROSYS '26","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"]},"container-title":["Proceedings of the 21st European Conference on Computer Systems"],"original-title":[],"deposited":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T20:21:30Z","timestamp":1777062090000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3767295.3769328"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,26]]},"references-count":68,"alternative-id":["10.1145\/3767295.3769328","10.1145\/3767295"],"URL":"https:\/\/doi.org\/10.1145\/3767295.3769328","relation":{},"subject":[],"published":{"date-parts":[[2026,4,26]]},"assertion":[{"value":"2026-04-26","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}