{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T00:19:45Z","timestamp":1777421985581,"version":"3.51.4"},"reference-count":91,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,6,17]]},"abstract":"<jats:p>Large language model (LLM) inference serving systems are essential to various LLM-based applications. As demand for LLM services continues to grow, scaling these systems to handle high request rates while meeting latency Service-Level Objectives (SLOs), referred to as effective throughput, becomes critical. However, existing systems often struggle to improve effective throughput, primarily due to a significant decline in Time To First Token (TTFT) SLO attainment. We identify two major causes of this bottleneck: (1) memory-intensive KV cache that limits batch size expansion under GPU memory constraints, and (2) rigid batch composition enforced by the default First-Come-First-Serve scheduling policy. In this paper, we introduce Apt-Serve, a scalable framework designed to enhance effective throughput in LLM inference serving. Apt-Serve features a new hybrid cache scheme that combines KV cache with a memory-efficient hidden cache for reusable input hidden state vectors, allowing large batch sizes and improving request concurrency. Based on the hybrid cache, Apt-Serve employs an adaptive runtime scheduling mechanism that dynamically optimizes batch composition. We formally define the adaptive scheduling optimization problem and propose an efficient algorithm with theoretical guarantees. Extensive evaluations on three real-world datasets and LLMs ranging from 13B to 66B parameters demonstrate that Apt-Serve achieves up to 8.8x improvement in effective throughput compared to the state-of-the-art inference serving systems.<\/jats:p>","DOI":"10.1145\/3725394","type":"journal-article","created":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:23:29Z","timestamp":1750281809000},"page":"1-28","source":"Crossref","is-referenced-by-count":2,"title":["Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0413-9005","authenticated-orcid":false,"given":"Shihong","family":"Gao","sequence":"first","affiliation":[{"name":"The Hong Kong University of Science and Technology, Hong Kong SAR, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8560-5006","authenticated-orcid":false,"given":"Xin","family":"Zhang","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology, Hong Kong SAR, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8364-3674","authenticated-orcid":false,"given":"Yanyan","family":"Shen","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8257-5806","authenticated-orcid":false,"given":"Lei","family":"Chen","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China and The Hong Kong University of Science and Technology, Hong Kong SAR, China"}]}],"member":"320","published-online":{"date-parts":[[2025,6,18]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2022. Character ai. 
https:\/\/character.ai."},{"key":"e_1_2_1_2_1","unstructured":"2022. Perplexity ai. https:\/\/www.perplexity.ai\/."},{"key":"e_1_2_1_3_1","unstructured":"2023. Amazon codewhispere. https:\/\/aws.amazon.com\/codewhisperer\/."},{"key":"e_1_2_1_4_1","unstructured":"2023. Anthropic claude. https:\/\/claude.ai."},{"key":"e_1_2_1_5_1","unstructured":"2023. Bing ai. https:\/\/www.bing.com\/chat."},{"key":"e_1_2_1_6_1","unstructured":"2023. Komo. https:\/\/komo.ai\/."},{"key":"e_1_2_1_7_1","unstructured":"2023. NCCL: The NVIDIA Collective Communication Library. https:\/\/developer.nvidia.com\/nccl."},{"key":"e_1_2_1_8_1","unstructured":"2023. Replit ghostwriter. https:\/\/replit.com\/site\/ghostwriter."},{"key":"e_1_2_1_9_1","unstructured":"2023. ShareGPT Team. https:\/\/sharegpt.com\/."},{"key":"e_1_2_1_10_1","volume-title":"Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.","author":"Achiam Josh","year":"2023","unstructured":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)."},{"key":"e_1_2_1_11_1","volume-title":"Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375","author":"Agarap AF","year":"2018","unstructured":"AF Agarap. 2018. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375 (2018)."},{"key":"e_1_2_1_12_1","volume-title":"Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117--134."},{"key":"e_1_2_1_13_1","volume-title":"et al","author":"Bai Yushi","year":"2023","unstructured":"Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al . 2023. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508 (2023)."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2000.839442"},{"key":"e_1_2_1_15_1","volume-title":"Lin (Eds.)","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877--1901. 
https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/file\/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588706"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/3675034.3675052"},{"key":"e_1_2_1_18_1","volume-title":"Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al .","author":"Chen Mark","year":"2021","unstructured":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al . 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3681954.3681956"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536360.2536367"},{"key":"e_1_2_1_21_1","first-page":"1","article-title":"Palm: Scaling language modeling with pathways","volume":"24","author":"Chowdhery Aakanksha","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1--113.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jss.2023.111734"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.52202\/068431-1189"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407857"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/3489496.3489500"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3639323"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.52202\/079017-1882"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/3641204.3641215"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3654977"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE60146.2024.00236"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1516360.1516441"},{"key":"e_1_2_1_32_1","volume-title":"Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al.","author":"Holmes Connor","year":"2024","unstructured":"Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al. 2024. Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference. arXiv preprint arXiv:2401.08671 (2024)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517848"},{"key":"e_1_2_1_34_1","first-page":"18015","article-title":"S3: Increasing GPU Utilization during Generative Inference for Higher Throughput","volume":"36","author":"Jin Yunho","year":"2023","unstructured":"Yunho Jin, Chun-Feng Wu, David Brooks, and Gu-Yeon Wei. 2023. S3: Increasing GPU Utilization during Generative Inference for Higher Throughput. Advances in Neural Information Processing Systems 36 (2023), 18015--18027.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/3425879.3425881"},{"key":"e_1_2_1_36_1","volume-title":"Scaling laws for neural language models. 
arXiv preprint arXiv:2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457542"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589308"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588737"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3583140.3583150"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3648160.3648176"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE60146.2024.00123"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672274"},{"key":"e_1_2_1_45_1","volume-title":"Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems 36","author":"Liu Zichang","year":"2024","unstructured":"Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2024. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_1_46_1","volume-title":"International Conference on Machine Learning. PMLR, 22137--22176","author":"Liu Zichang","year":"2023","unstructured":"Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. 2023. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning. PMLR, 22137--22176."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183751"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588913"},{"key":"e_1_2_1_49_1","volume-title":"Knapsack problems: algorithms and computer implementations","author":"Martello Silvano","unstructured":"Silvano Martello and Paolo Toth. 1990. Knapsack problems: algorithms and computer implementations. John Wiley & Sons, Inc."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452773"},{"key":"e_1_2_1_51_1","volume-title":"Towards efficient generative large language model serving: A survey from algorithms to systems. arXiv preprint arXiv:2312.15234","author":"Miao Xupeng","year":"2023","unstructured":"Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao Jia. 2023. Towards efficient generative large language model serving: A survey from algorithms to systems. 
arXiv preprint arXiv:2312.15234 (2023)."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640411"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.14778\/3598581.3598604"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517902"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570697"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.14778\/3489496.3489511"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/3446095.3446100"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640383"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00019"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.14778\/3538598.3538614"},{"key":"e_1_2_1_61_1","first-page":"606","article-title":"Efficiently scaling transformer inference","volume":"5","author":"Pope Reiner","year":"2023","unstructured":"Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems 5 (2023), 606--624.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_62_1","volume-title":"Mooncake: Kimi's KVCache-centric Architecture for LLM Serving. arXiv preprint arXiv:2407.00079","author":"Qin Ruoyu","year":"2024","unstructured":"Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2024. Mooncake: Kimi's KVCache-centric Architecture for LLM Serving. arXiv preprint arXiv:2407.00079 (2024)."},{"key":"e_1_2_1_63_1","volume-title":"Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3639307"},{"key":"e_1_2_1_65_1","volume-title":"Powerinfer: Fast large language model serving with a consumer-grade gpu. arXiv preprint arXiv:2312.12456","author":"Song Yixin","year":"2023","unstructured":"Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2023. Powerinfer: Fast large language model serving with a consumer-grade gpu. arXiv preprint arXiv:2312.12456 (2023)."},{"key":"e_1_2_1_66_1","volume-title":"Llumnix: Dynamic Scheduling for Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Sun Biao","year":"2024","unstructured":"Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 173--191."},{"key":"e_1_2_1_67_1","volume-title":"et al","author":"Team Gemini","year":"2023","unstructured":"Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al . 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)."},{"key":"e_1_2_1_68_1","volume-title":"Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.14778\/3579075.3579083"},{"key":"e_1_2_1_70_1","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457260"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/3003665.3003669"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2024.3370614"},{"key":"e_1_2_1_74_1","unstructured":"Jason Wei Yi Tay Rishi Bommasani Colin Raffel Barret Zoph Sebastian Borgeaud Dani Yogatama Maarten Bosma Denny Zhou Donald Metzler et al. 2022. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022)."},{"key":"e_1_2_1_75_1","volume-title":"Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920","author":"Wu Bingyang","year":"2023","unstructured":"Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. 2023. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920 (2023)."},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.14778\/3626292.3626303"},{"key":"e_1_2_1_77_1","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC 24)","author":"Xia Haojun","year":"2024","unstructured":"Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, et al. 2024. Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 699--713."},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3709669"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.14778\/3554821.3554887"},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.14778\/3547305.3547310"},{"key":"e_1_2_1_81_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521--538."},{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00248"},{"key":"e_1_2_1_83_1","volume-title":"Pqcache: Product quantization-based kvcache for long context llm inference. 
arXiv preprint arXiv:2407.12820","author":"Zhang Hailin","year":"2024","unstructured":"Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. 2024. Pqcache: Product quantization-based kvcache for long context llm inference. arXiv preprint arXiv:2407.12820 (2024)."},{"key":"e_1_2_1_84_1","volume-title":"Xi Victoria Lin, et al","author":"Zhang Susan","year":"2022","unstructured":"Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al . 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)."},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM54844.2022.00083"},{"key":"e_1_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589311"},{"key":"e_1_2_1_87_1","unstructured":"Zhenyu Zhang Ying Sheng Tianyi Zhou Tianlong Chen Lianmin Zheng Ruisi Cai Zhao Song Yuandong Tian Christopher R\u00e9 Clark Barrett et al. 2024. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_1_88_1","doi-asserted-by":"publisher","DOI":"10.1145\/3709703"},{"key":"e_1_2_1_89_1","volume-title":"DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193--210."},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.14778\/3461535.3461547"},{"key":"e_1_2_1_91_1","doi-asserted-by":"publisher","DOI":"10.1145\/3580305.3599442"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3725394","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T18:54:54Z","timestamp":1774983294000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3725394"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,17]]},"references-count":91,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,6,17]]}},"alternative-id":["10.1145\/3725394"],"URL":"https:\/\/doi.org\/10.1145\/3725394","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,17]]}}}