{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,14]],"date-time":"2026-07-14T02:51:13Z","timestamp":1783997473524,"version":"3.55.0"},"publisher-location":"New York, NY, USA","reference-count":81,"publisher":"ACM","funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["2326894, 2425655"],"award-info":[{"award-number":["2326894, 2425655"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,19]]},"DOI":"10.1145\/3772052.3772215","type":"proceedings-article","created":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:19:00Z","timestamp":1768321140000},"page":"88-101","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Oneiros: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7092-2401","authenticated-orcid":false,"given":"Ruihao","family":"Li","sequence":"first","affiliation":[{"name":"The University of Texas at Austin, Austin, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-6509-4019","authenticated-orcid":false,"given":"Shagnik","family":"Pal","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0422-2516","authenticated-orcid":false,"given":"Vineeth Narayan","family":"Pullu","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, 
USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5538-8829","authenticated-orcid":false,"given":"Prasoon","family":"Sinha","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0401-3685","authenticated-orcid":false,"given":"Jeeho","family":"Ryoo","sequence":"additional","affiliation":[{"name":"Fairleigh Dickinson University, Vancouver, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8747-5214","authenticated-orcid":false,"given":"Lizy K.","family":"John","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-7556-3069","authenticated-orcid":false,"given":"Neeraja J.","family":"Yadwadkar","sequence":"additional","affiliation":[{"name":"The University of Texas at Austin, Austin, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,1,13]]},"reference":[{"key":"e_1_3_2_1_1_1","first-page":"114","article-title":"Keyformer: Kv cache reduction through key tokens selection for efficient generative inference","volume":"6","author":"Adnan Muhammad","year":"2024","unstructured":"Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant Nair, Ilya Soloveychik, and Purushotham Kamath. 2024. Keyformer: Kv cache reduction through key tokens selection for efficient generative inference. Proceedings of Machine Learning and Systems 6 (2024), 114\u2013127.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_2_1","unstructured":"Amey Agrawal Nitin Kedia Ashish Panwar Jayashree Mohan Nipun Kwatra Bhargav S Gulavani Alexey Tumanov and Ramachandran Ramjee. 2024. 
Taming throughput-latency tradeoff in llm inference with sarathi-serve. In OSDI."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3617232.3624849"},{"key":"e_1_3_2_1_4_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Bai Zhihao","year":"2020","unstructured":"Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. 2020. {PipeSwitch}: Fast pipelined context switching for deep learning applications. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 499\u2013514."},{"key":"e_1_3_2_1_5_1","volume-title":"Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201","author":"Chan Chi-Min","year":"2023","unstructured":"Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201 (2023)."},{"key":"e_1_3_2_1_6_1","volume-title":"2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"Choi Seungbeom","year":"2022","unstructured":"Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. 2022. Serving heterogeneous machine learning models on {Multi-GPU} servers with {Spatio-Temporal} sharing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 199\u2013216."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132772"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3232559"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3419111.3421284"},{"key":"e_1_3_2_1_10_1","volume-title":"MuxServe: flexible spatial-temporal multiplexing for multiple LLM serving. arXiv preprint arXiv:2404.02015","author":"Duan Jiangfei","year":"2024","unstructured":"Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024. 
MuxServe: flexible spatial-temporal multiplexing for multiple LLM serving. arXiv preprint arXiv:2404.02015 (2024)."},{"key":"e_1_3_2_1_11_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)."},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO61859.2024.00020"},{"key":"e_1_3_2_1_13_1","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 135\u2013153","author":"Fu Yao","year":"2024","unstructured":"Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. Serverlessllm: Low-latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 135\u2013153."},{"key":"e_1_3_2_1_14_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Gujarati Arpan","year":"2020","unstructured":"Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving {DNNs} like clockwork: Performance predictability from the bottom up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 443\u2013462."},{"key":"e_1_3_2_1_15_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Hadary Ori","year":"2020","unstructured":"Ori Hadary, Luke Marshall, Ishai Menache, Abhisek Pan, Esaias E Greeff, David Dion, Star Dorminey, Shailesh Joshi, Yang Chen, Mark Russinovich, et al. 2020. Protean:{ VM} allocation service at scale. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 845\u2013861."},{"key":"e_1_3_2_1_16_1","volume-title":"Proceedings of the 2024 ACM Symposium on Cloud Computing. 
460\u2013469","author":"Han Bing-Shiun","year":"2024","unstructured":"Bing-Shiun Han, Tathagata Paul, Zhenhua Liu, and Anshul Gandhi. 2024. KACE: Kernel-Aware Colocation for Efficient GPU Spatial Sharing. In Proceedings of the 2024 ACM Symposium on Cloud Computing. 460\u2013469."},{"key":"e_1_3_2_1_17_1","volume-title":"LLM multi-agent systems: Challenges and open problems. arXiv preprint arXiv:2402.03578","author":"Han Shanshan","year":"2024","unstructured":"Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, Zhaozhuo Xu, and Chaoyang He. 2024. LLM multi-agent systems: Challenges and open problems. arXiv preprint arXiv:2402.03578 (2024)."},{"key":"e_1_3_2_1_18_1","volume-title":"Bandwidth Characterization of DeepSpeed on Distributed Large Language Model Training. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 241\u2013256","author":"Hanindhito Bagus","year":"2024","unstructured":"Bagus Hanindhito, Bhavesh Patel, and Lizy K John. 2024. Bandwidth Characterization of DeepSpeed on Distributed Large Language Model Training. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 241\u2013256."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2775054.2694384"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2015.15"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378530"},{"key":"e_1_3_2_1_22_1","volume-title":"Neo: Saving gpu memory crisis with cpu offloading for online llm inference.","author":"Jiang Xuanlin","year":"2025","unstructured":"Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. 2025. Neo: Saving gpu memory crisis with cpu offloading for online llm inference. (2025)."},{"key":"e_1_3_2_1_23_1","volume-title":"Compute or load kv cache? why not both? 
arXiv preprint arXiv:2410.03065","author":"Jin Shuowei","year":"2024","unstructured":"Shuowei Jin, Xueshen Liu, Qingzhao Zhang, and Z Morley Mao. 2024. Compute or load kv cache? why not both? arXiv preprint arXiv:2410.03065 (2024)."},{"key":"e_1_3_2_1_24_1","volume-title":"16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)","author":"Kaffes Kostis","year":"2019","unstructured":"Kostis Kaffes, Timothy Chong, Jack Tigar Humphries, Adam Belay, David Mazi\u00e8res, and Christos Kozyrakis. 2019. Shinjuku: Preemptive Scheduling for {\u03bcsecond-scale} Tail Latency. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 345\u2013360."},{"key":"e_1_3_2_1_25_1","volume-title":"Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. arXiv preprint arXiv:2410.18038","author":"Kamath Aditya K","year":"2024","unstructured":"Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2024. Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. arXiv preprint arXiv:2410.18038 (2024)."},{"key":"e_1_3_2_1_26_1","unstructured":"Jiin Kim Byeongjun Shin Jinha Chung and Minsoo Rhu. 2025. The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective. arXiv:2506.04301 [cs.LG] https:\/\/arxiv.org\/abs\/2506.04301 arXiv preprint."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41406.2024.00048"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2845861"},{"key":"e_1_3_2_1_30_1","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155\u2013172.","author":"Lee Wonbeom","unstructured":"Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. 
{InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155\u2013172."},{"key":"e_1_3_2_1_31_1","volume-title":"More agents is all you need. arXiv preprint arXiv:2402.05120","author":"Li Junyou","year":"2024","unstructured":"Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, and Deheng Ye. 2024. More agents is all you need. arXiv preprint arXiv:2402.05120 (2024)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240302.3240315"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/s44336-024-00009-2"},{"key":"e_1_3_2_1_34_1","volume-title":"17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Li Zhuohan","year":"2023","unstructured":"Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. 2023. {AlpaServe}: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 663\u2013679."},{"key":"e_1_3_2_1_35_1","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Lim Gangmuk","year":"2021","unstructured":"Gangmuk Lim, Jeongseob Ahn, Wencong Xiao, Youngjin Kwon, and Myeongjae Jeon. 2021. Zico: Efficient {GPU} memory sharing for concurrent {DNN} training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 161\u2013175."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672274"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640411"},{"key":"e_1_3_2_1_38_1","volume-title":"Proceedings of the 8th MLSys Conference.","author":"Na Seonjin","year":"2025","unstructured":"Seonjin Na, Geonhwa Jeong, Byung Hoon Ahn, Aaron Jezghani, Jeffrey Young, Christopher J Hughes, Tushar Krishna, and Hyesoon Kim. 2025. 
FlexInfer: Flexible LLM Inference with CPU Computations. In Proceedings of the 8th MLSys Conference."},{"key":"e_1_3_2_1_39_1","unstructured":"Nvidia. 2025. Nvidia A100 GPU Architecture. https:\/\/images.nvidia.com\/aem-dam\/en-zz\/Solutions\/data-center\/nvidia-ampere-architecture-whitepaper.pdf."},{"key":"e_1_3_2_1_40_1","unstructured":"Nvidia. 2025. Nvidia GB200 GPU Architecture. https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-gb200\/."},{"key":"e_1_3_2_1_41_1","unstructured":"Nvidia. 2025. Nvidia GH200 GPU Architecture. https:\/\/www.nvidia.com\/en-us\/data-center\/grace-hopper-superchip\/."},{"key":"e_1_3_2_1_42_1","unstructured":"Nvidia. 2025. Nvidia H100 GPU Architecture. https:\/\/resources.nvidia.com\/en-us-tensor-core\/gtc22-whitepaper-hopper."},{"key":"e_1_3_2_1_43_1","unstructured":"Nvidia. 2025. NVIDIA Multi-Instance GPU. https:\/\/www.nvidia.com\/en-us\/technologies\/multi-instance-gpu\/."},{"key":"e_1_3_2_1_44_1","unstructured":"Nvidia. 2025. NVIDIA Multi-Process Service. https:\/\/docs.nvidia.com\/deploy\/mps\/index.html\/."},{"key":"e_1_3_2_1_45_1","volume-title":"2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 970\u2013982","author":"Park Sang-Soo","year":"2024","unstructured":"Sang-Soo Park, KyungSoo Kim, Jinin So, Jin Jung, Jonggeon Lee, Kyoungwan Woo, Nayeon Kim, Younghyun Lee, Hyungyo Kim, Yongsuk Kwon, et al. 2024. An lpddr-based cxl-pnm platform for tco-efficient inference of transformer-based large language models. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 970\u2013982."},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00019"},{"key":"e_1_3_2_1_47_1","volume-title":"Proceedings of the 2024 ACM Symposium on Cloud Computing. 
18\u201335","author":"Patke Archit","year":"2024","unstructured":"Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2024. Queue management for slo-oriented large language model serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing. 18\u201335."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378505"},{"key":"e_1_3_2_1_49_1","volume-title":"Proceedings of Machine Learning and Systems 5","author":"Pope Reiner","year":"2023","unstructured":"Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems 5 (2023)."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"crossref","unstructured":"Ramya Prabhu Ajay Nayak Jayashree Mohan Ramachandran Ramjee and Ashish Panwar. 2025. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. In ASPLOS.","DOI":"10.1145\/3669940.3707256"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132780"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00024"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406703"},{"key":"e_1_3_2_1_55_1","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. {Zero-offload}: Democratizing {billion-scale} model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 
551\u2013564."},{"key":"e_1_3_2_1_56_1","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Romero Francisco","year":"2021","unstructured":"Francisco Romero, Qian Li, Neeraja J Yadwadkar, and Christos Kozyrakis. 2021. {INFaaS}: Automated model-less inference serving. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 397\u2013411."},{"key":"e_1_3_2_1_57_1","unstructured":"Mohammad Shahrad Rodrigo Fonseca Inigo Goiri Gohar Chaudhry Paul Batum Jason Cooke Eduardo Laureano Colby Tresness Mark Russinovich and Ricardo Bianchini. 2020. Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider. In 2020 USENIX annual technical conference (USENIX ATC 20). 205\u2013218."},{"key":"e_1_3_2_1_58_1","volume-title":"International Conference on Machine Learning. PMLR, 31094\u201331116","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. 2023. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning. PMLR, 31094\u201331116."},{"key":"e_1_3_2_1_59_1","volume-title":"2024 ACM\/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 437\u2013451","author":"Stojkovic Jovan","year":"2024","unstructured":"Jovan Stojkovic, Pulkit A Misra, \u00cd\u00f1igo Goiri, Sam Whitlock, Esha Choukse, Mayukh Das, Chetan Bansal, Jason Lee, Zoey Sun, Haoran Qiu, et al. 2024. SmartOClock: Workload-and risk-aware overclocking in the cloud. In 2024 ACM\/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 437\u2013451."},{"key":"e_1_3_2_1_60_1","volume-title":"Dynamollm: Designing llm inference clusters for performance and energy efficiency. 
arXiv preprint arXiv:2408.00741","author":"Stojkovic Jovan","year":"2024","unstructured":"Jovan Stojkovic, Chaojie Zhang, \u00cd\u00f1igo Goiri, Josep Torrellas, and Esha Choukse. 2024. Dynamollm: Designing llm inference clusters for performance and energy efficiency. arXiv preprint arXiv:2408.00741 (2024)."},{"key":"e_1_3_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3629578"},{"key":"e_1_3_2_1_62_1","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Sun Biao","year":"2024","unstructured":"Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 173\u2013191."},{"key":"e_1_3_2_1_63_1","volume-title":"Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314","author":"Talebirad Yashar","year":"2023","unstructured":"Yashar Talebirad and Amirhossein Nadiri. 2023. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314 (2023)."},{"key":"e_1_3_2_1_64_1","unstructured":"Rohan Taori Ishaan Gulrajani Tianyi Zhang Yann Dubois Xuechen Li Carlos Guestrin Percy Liang and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model."},{"key":"e_1_3_2_1_65_1","unstructured":"ShareGPT Team. 2025. https:\/\/sharegpt.com\/."},{"key":"e_1_3_2_1_66_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et al. 2023. Llama 2: Open foundation and fine-tuned chat models. 
arXiv preprint arXiv:2307.09288 (2023)."},{"key":"e_1_3_2_1_67_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]"},{"key":"e_1_3_2_1_68_1","unstructured":"Fali Wang Zhiwei Zhang Xianren Zhang Zongyu Wu Tzuhao Mo Qiuhao Lu Wanjing Wang Rui Li Junjie Xu Xianfeng Tang et al. 2024. A comprehensive survey of small language models in the era of large language models: Techniques enhancements applications collaboration with llms and trustworthiness. arXiv preprint arXiv:2411.03350 (2024)."},{"key":"e_1_3_2_1_69_1","volume-title":"Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services. 450\u2013463","author":"Wang Qipeng","year":"2022","unstructured":"Qipeng Wang, Mengwei Xu, Chao Jin, Xinran Dong, Jinliang Yuan, Xin Jin, Gang Huang, Yunxin Liu, and Xuanzhe Liu. 2022. Melon: Breaking the memory wall for resource-efficient on-device machine learning. 
In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services. 450\u2013463."},{"key":"e_1_3_2_1_70_1","volume-title":"Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration. arXiv preprint arXiv:2407.13126","author":"Wang Tianyu","year":"2024","unstructured":"Tianyu Wang, Sheng Li, Bingyao Li, Yue Dai, Ao Li, Geng Yuan, Yufei Ding, Youtao Zhang, and Xulong Tang. 2024. Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration. arXiv preprint arXiv:2407.13126 (2024)."},{"key":"e_1_3_2_1_71_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Xiao Wencong","year":"2020","unstructured":"Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. {AntMan}: Dynamic scaling on {GPU} clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 533\u2013548."},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2022.3232715"},{"key":"e_1_3_2_1_73_1","volume-title":"Pie: Pooling CPU Memory for LLM Inference. arXiv preprint arXiv:2411.09317","author":"Xu Yi","year":"2024","unstructured":"Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, and Ion Stoica. 2024. Pie: Pooling CPU Memory for LLM Inference. arXiv preprint arXiv:2411.09317 (2024)."},{"key":"e_1_3_2_1_74_1","volume-title":"LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention. arXiv preprint arXiv:2502.14866","author":"Yang Shang","year":"2025","unstructured":"Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han. 2025. LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention. 
arXiv preprint arXiv:2502.14866 (2025)."},{"key":"e_1_3_2_1_75_1","volume-title":"Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, and Dhabaleswar K Panda.","author":"Yao Jinghan","year":"2024","unstructured":"Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, and Dhabaleswar K Panda. 2024. Training ultra long context language model with fully pipelined distributed transformer. arXiv preprint arXiv:2408.16978 (2024)."},{"key":"e_1_3_2_1_76_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for { Transformer-Based} generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521\u2013538."},{"key":"e_1_3_2_1_77_1","volume-title":"Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving. arXiv preprint arXiv:2505.04021","author":"Yu Shan","year":"2025","unstructured":"Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, et al. 2025. Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving. arXiv preprint arXiv:2505.04021 (2025)."},{"key":"e_1_3_2_1_78_1","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC 24)","author":"Yuan Tailing","year":"2024","unstructured":"Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang. 2024. Accelerating the training of large language models using efficient activation rematerialization and optimal hybrid parallelism. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 
545\u2013561."},{"key":"e_1_3_2_1_79_1","volume-title":"Jenga: Effective Memory Management for Serving LLM with Heterogeneity.","author":"Zhang Chen","year":"2025","unstructured":"Chen Zhang, Kuntai Du, Shu Liu, Woosuk Kwon, Xiangxi Mo, Yufeng Wang, Xiaoxuan Liu, Kaichao You, Zhuohan Li, Mingsheng Long, et al. 2025. Jenga: Effective Memory Management for Serving LLM with Heterogeneity. (2025), 446\u2013461."},{"key":"e_1_3_2_1_80_1","volume-title":"Xi Victoria Lin, et al","author":"Zhang Susan","year":"2022","unstructured":"Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)."},{"key":"e_1_3_2_1_81_1","volume-title":"20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Zhao Kevin","year":"2023","unstructured":"Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, and Thomas E Anderson. 2023. Scalable tail latency estimation for data center networks. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 
685\u2013702."}],"event":{"name":"SoCC '25: ACM Symposium on Cloud Computing","location":"Online USA","acronym":"SoCC '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","SIGMOD ACM Special Interest Group on Management of Data"]},"container-title":["Proceedings of the 2025 ACM Symposium on Cloud Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3772052.3772215","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:23:37Z","timestamp":1768321417000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3772052.3772215"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,19]]},"references-count":81,"alternative-id":["10.1145\/3772052.3772215","10.1145\/3772052"],"URL":"https:\/\/doi.org\/10.1145\/3772052.3772215","relation":{},"subject":[],"published":{"date-parts":[[2025,11,19]]},"assertion":[{"value":"2026-01-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}