{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T23:08:02Z","timestamp":1768345682500,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":76,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,19]]},"DOI":"10.1145\/3772052.3772206","type":"proceedings-article","created":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:19:00Z","timestamp":1768321140000},"page":"1-15","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Understanding Diffusion Model Serving in Production: A Top-Down Analysis of Workload, Scheduling, and Resource Efficiency"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4809-9543","authenticated-orcid":false,"given":"Yanying","family":"Lin","sequence":"first","affiliation":[{"name":"Shenzhen Institute of Advanced Integration Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China and University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-4764-2601","authenticated-orcid":false,"given":"Shuaipeng","family":"Wu","sequence":"additional","affiliation":[{"name":"Southern University of Science and Technology, Shenzhen, Guangdong, China, Shenzhen Institute of Advanced Integration Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China and AIOS Team, Alibaba Group Inc, Hangzhou, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3064-5841","authenticated-orcid":false,"given":"Shutian","family":"Luo","sequence":"additional","affiliation":[{"name":"University of Virginia, Charlottesville, Virginia, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9359-9571","authenticated-orcid":false,"given":"Hong","family":"Xu","sequence":"additional","affiliation":[{"name":"The 
Chinese University of Hong Kong, Hong Kong, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7548-6223","authenticated-orcid":false,"given":"Haiying","family":"Shen","sequence":"additional","affiliation":[{"name":"University of Virginia, Charlottesville, Virginia, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-4165-0387","authenticated-orcid":false,"given":"Chong","family":"Ma","sequence":"additional","affiliation":[{"name":"AIOS Team, Alibaba Group Inc, Hangzhou, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-1443-7920","authenticated-orcid":false,"given":"Min","family":"Shen","sequence":"additional","affiliation":[{"name":"AIOS Team, Alibaba Group Inc, Hangzhou, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-1725-6466","authenticated-orcid":false,"given":"Le","family":"Chen","sequence":"additional","affiliation":[{"name":"AIOS Team, Alibaba Group Inc, Hangzhou, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9480-0356","authenticated-orcid":false,"given":"Chengzhong","family":"Xu","sequence":"additional","affiliation":[{"name":"University of Macau, Macau, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-2028-0780","authenticated-orcid":false,"given":"Lin","family":"Qu","sequence":"additional","affiliation":[{"name":"AIOS Team, Alibaba Group Inc, Hangzhou, Zhejiang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6133-407X","authenticated-orcid":false,"given":"Kejiang","family":"Ye","sequence":"additional","affiliation":[{"name":"Shenzhen Institute of Advanced Integration Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China"}]}],"member":"320","published-online":{"date-parts":[[2026,1,13]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis et al. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. 
https:\/\/arxiv.org\/abs\/1603.04467v2."},{"key":"e_1_3_2_1_2_1","volume-title":"Plutus: Bandwidth-Efficient Memory Security for GPUs. In IEEE International Symposium on High-Performance Computer Architecture, HPCA 2023","author":"Abdullah Rahaf","year":"2023","unstructured":"Rahaf Abdullah, Huiyang Zhou, and Amro Awad. 2023. Plutus: Bandwidth-Efficient Memory Security for GPUs. In IEEE International Symposium on High-Performance Computer Architecture, HPCA 2023, Montreal, QC, Canada, February 25 - March 1, 2023. 543\u2013555."},{"key":"e_1_3_2_1_3_1","volume-title":"Firecracker: Lightweight Virtualization for Serverless Applications. In 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020","author":"Agache Alexandru","year":"2020","unstructured":"Alexandru Agache, Marc Brooker, Alexandra Iordache, Anthony Liguori, Rolf Neugebauer, Phil Piwonka, and Diana-Maria Popa. 2020. Firecracker: Lightweight Virtualization for Serverless Applications. In 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020, Santa Clara, CA, USA, February 25-27, 2020. 419\u2013434."},{"key":"e_1_3_2_1_4_1","volume-title":"Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models. In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024","author":"Agarwal Shubham","year":"2024","unstructured":"Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, and Shiv Kumar Saini. 2024. Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models. In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, Santa Clara, CA, April 15-17, 2024."},{"key":"e_1_3_2_1_5_1","volume-title":"Proceedings of the 27th ACM Symposium on Operating Systems Principles. Huntsville Ontario Canada, 353\u2013369","author":"Aghayev Abutalib","year":"2019","unstructured":"Abutalib Aghayev, Sage Weil, Michael Kuchnik, Mark Nelson, Gregory R. 
Ganger, and George Amvrosiadis. 2019. File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. Huntsville Ontario Canada, 353\u2013369."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3617232.3624849"},{"key":"e_1_3_2_1_7_1","volume-title":"Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018","author":"Akkus Istemi Ekin","year":"2018","unstructured":"Istemi Ekin Akkus, Ruichuan Chen, Ivica Rimac, Manuel Stein, Klaus Satzke, Andre Beck, Paarijaat Aditya, and Volker Hilt. 2018. SAND: Towards High-Performance Serverless Computing. In Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018, Boston, MA, USA, July 11-13, 2018. 923\u2013935."},{"key":"e_1_3_2_1_8_1","volume-title":"Fork in the Road: Reflections and Optimizations for Cold Start Latency in Production Serverless Systems. In 19th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2025","author":"Chai Xiaohu","year":"2025","unstructured":"Xiaohu Chai, Tianyu Zhou, Keyang Hu, Jianfeng Tan, Tiwei Bie, Anqi Shen, Dawei Shen, Qi Xing, et al. 2025. Fork in the Road: Reflections and Optimizations for Cold Start Latency in Production Serverless Systems. In 19th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2025, Boston, MA, USA, July 7-9, 2025. 199\u2013218."},{"key":"e_1_3_2_1_9_1","volume-title":"Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547 [cs]","author":"Chen Lequn","year":"2023","unstructured":"Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2023. Punica: Multi-Tenant LoRA Serving. 
arXiv:2310.18547 [cs]"},{"key":"e_1_3_2_1_10_1","volume-title":"Proceedings of the 2022 USENIX Annual Technical Conference, USENIX ATC 2022","author":"Choi Seungbeom","year":"2022","unstructured":"Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. 2022. Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing. In Proceedings of the 2022 USENIX Annual Technical Conference, USENIX ATC 2022, Carlsbad, CA, USA, July 11-13, 2022. 199\u2013216."},{"key":"e_1_3_2_1_11_1","volume-title":"Proceedings of the 26th Symposium on Operating Systems Principles. Shanghai China, 153\u2013167","author":"Cortez Eli","year":"2017","unstructured":"Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. 2017. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms. In Proceedings of the 26th Symposium on Operating Systems Principles. Shanghai China, 153\u2013167."},{"key":"e_1_3_2_1_12_1","volume-title":"QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. ACM Transactions on Computer Systems (Dec","author":"Delimitrou Christina","year":"2013","unstructured":"Christina Delimitrou and Christos Kozyrakis. 2013. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. ACM Transactions on Computer Systems (Dec. 2013), 1\u201334."},{"key":"e_1_3_2_1_13_1","volume-title":"Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. Lausanne Switzerland, 467\u2013481","author":"Du Dong","year":"2020","unstructured":"Dong Du, Tianyi Yu, Yubin Xia, Binyu Zang, Guanglu Yan, Chenggang Qin, Qixuan Wu, and Haibo Chen. 2020. Catalyzer: Sub-millisecond Startup for Serverless Computing with Initialization-less Booting. 
In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. Lausanne Switzerland, 467\u2013481."},{"key":"e_1_3_2_1_14_1","unstructured":"Jiangfei Duan Runyu Lu Haojie Duanmu Xiuhong Li Xingcheng Zhang Dahua Lin Ion Stoica and Hao Zhang. 2024. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. arXiv:2404.02015 [cs]"},{"key":"e_1_3_2_1_15_1","volume-title":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Virtual Event Republic of Korea, 431\u2013445","author":"Fan Shiqing","year":"2021","unstructured":"Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, et al. 2021. DAPPLE: A Pipelined Data Parallel Approach for Training Large Models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Virtual Event Republic of Korea, 431\u2013445."},{"key":"e_1_3_2_1_16_1","volume-title":"ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024","author":"Fu Yao","year":"2024","unstructured":"Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024. 135\u2013153."},{"key":"e_1_3_2_1_17_1","volume-title":"Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019","author":"Gu Juncheng","year":"2019","unstructured":"Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. 
In 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019, Boston, MA, February 26-28, 2019. USA, 485\u2013500."},{"key":"e_1_3_2_1_18_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020","author":"Gujarati Arpan","year":"2020","unstructured":"Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4-6, 2020. 443\u2013462."},{"key":"e_1_3_2_1_19_1","volume-title":"Cocktail: A Multidimensional Optimization for Model Serving in Cloud. In 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022","author":"Gunasekaran Jashwant Raj","year":"2022","unstructured":"Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Bikash Sharma, Mahmut Taylan Kandemir, and Chita R. Das. 2022. Cocktail: A Multidimensional Optimization for Model Serving in Cloud. In 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022, Renton, WA, USA, April 4-6, 2022. 1041\u20131057."},{"key":"e_1_3_2_1_20_1","volume-title":"Microsecond-Scale Preemption for Concurrent GPU-accelerated DNN Inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022","author":"Han Mingcong","year":"2022","unstructured":"Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-Scale Preemption for Concurrent GPU-accelerated DNN Inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022. 539\u2013558."},{"key":"e_1_3_2_1_21_1","volume-title":"EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models. 
In IEEE International Symposium on High Performance Computer Architecture, HPCA 2025","author":"Heo Jaehoon","year":"2025","unstructured":"Jaehoon Heo, Adiwena Putra, Jieon Yoon, Sungwoong Yune, Hangyeol Lee, Ji-Hoon Kim, and Joo-Young Kim. 2025. EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models. In IEEE International Symposium on High Performance Computer Architecture, HPCA 2025, Las Vegas, NV, USA, March 1-5, 2025. 324\u2013337."},{"key":"e_1_3_2_1_22_1","unstructured":"Cunchen Hu Heyang Huang Junhao Hu Jiang Xu Xusheng Chen Tao Xie Chenxi Wang Sa Wang et al. 2024. MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool. arXiv:2406.17565 [cs]"},{"key":"e_1_3_2_1_23_1","unstructured":"Edward J. Hu Yelong Shen Phillip Wallis Zeyuan Allen-Zhu Yuanzhi Li Shean Wang Lu Wang and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs]"},{"key":"e_1_3_2_1_24_1","volume-title":"Proceedings of the 2025 USENIX Annual Technical Conference, USENIX ATC 2025","author":"Hu Junhao","year":"2025","unstructured":"Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, et al. 2025. DEEPSERVE: Serverless Large Language Model Serving at Scale. In Proceedings of the 2025 USENIX Annual Technical Conference, USENIX ATC 2025, Boston, MA, USA, July 7-9, 2025. 57\u201372."},{"key":"e_1_3_2_1_25_1","volume-title":"Characterization of Large Language Model Development in the Datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024","author":"Hu Qinghao","year":"2024","unstructured":"Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, et al. 2024. Characterization of Large Language Model Development in the Datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, Santa Clara, CA, April 15-17, 2024. 
709\u2013729."},{"key":"e_1_3_2_1_26_1","volume-title":"Proceedings of the 29th Symposium on Operating Systems Principles. Koblenz Germany, 642\u2013657","author":"Subramanya Suhas Jayaram","unstructured":"Suhas Jayaram Subramanya, Daiyaan Arfeen, Shouxu Lin, Aurick Qiao, Zhihao Jia, and Gregory R. Ganger. 2023. Sia: Heterogeneity-aware, Goodput-Optimized ML-cluster Scheduling. In Proceedings of the 29th Symposium on Operating Systems Principles. Koblenz Germany, 642\u2013657."},{"key":"e_1_3_2_1_27_1","volume-title":"Proceedings of the 26th Symposium on Operating Systems Principles. Shanghai China, 121\u2013136","author":"Jin Xin","year":"2017","unstructured":"Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soul\u00e9, Jeongkeun Lee, Nate Foster, Changhoon Kim, and Ion Stoica. 2017. NetCache: Balancing Key-Value Stores with Fast In-Network Caching. In Proceedings of the 26th Symposium on Operating Systems Principles. Shanghai China, 121\u2013136."},{"key":"e_1_3_2_1_28_1","volume-title":"IEEE International Symposium on High Performance Computer Architecture, HPCA 2025","author":"Kim Sungbin","year":"2025","unstructured":"Sungbin Kim, Hyunwuk Lee, Wonho Cho, Mincheol Park, and Won Woo Ro. 2025. Ditto: Accelerating Diffusion Model via Temporal Value Similarity. In IEEE International Symposium on High Performance Computer Architecture, HPCA 2025, Las Vegas, NV, USA, March 1-5, 2025. 338\u2013352."},{"key":"e_1_3_2_1_29_1","volume-title":"Cambricon-D: Full-Network Differential Acceleration for Diffusion Models. In 51st ACM\/IEEE Annual International Symposium on Computer Architecture, ISCA 2024","author":"Kong Weihao","year":"2024","unstructured":"Weihao Kong, Yifan Hao, Qi Guo, Yongwei Zhao, Xinkai Song, Xiaqing Li, Mo Zou, Zidong Du, et al. 2024. Cambricon-D: Full-Network Differential Acceleration for Diffusion Models. In 51st ACM\/IEEE Annual International Symposium on Computer Architecture, ISCA 2024, Buenos Aires, Argentina, June 29 - July 3, 2024. 
903\u2013914."},{"key":"e_1_3_2_1_30_1","volume-title":"Sourav Sengupta, Puneet Gupta, and Arindam Mallik.","author":"Kundu Joyjit","year":"2024","unstructured":"Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, and Arindam Mallik. 2024. Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference. arXiv:2407.14645"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_32_1","volume-title":"InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024","author":"Lee Wonbeom","year":"2024","unstructured":"Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024. 155\u2013172."},{"key":"e_1_3_2_1_33_1","volume-title":"Proceedings of the 2022 USENIX Annual Technical Conference, USENIX ATC 2022","author":"Li Jie","year":"2022","unstructured":"Jie Li, Laiping Zhao, Yanan Yang, Kunlin Zhan, and Keqiu Li. 2022. Tetris: Memory-efficient Serverless Inference through Tensor Sharing. In Proceedings of the 2022 USENIX Annual Technical Conference, USENIX ATC 2022, Carlsbad, CA, USA, July 11-13, 2022."},{"key":"e_1_3_2_1_34_1","unstructured":"Suyi Li Hanfeng Lu Tianyuan Wu Minchen Yu Qizhen Weng Xusheng Chen Yizhou Shan Binhang Yuan et al. 2024. CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference. arXiv:2401.11240 [cs]"},{"key":"e_1_3_2_1_35_1","volume-title":"AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. 
In 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023","author":"Li Zhuohan","year":"2023","unstructured":"Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, et al. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, Boston, MA, USA, July 10-12, 2023. 663\u2013679."},{"key":"e_1_3_2_1_36_1","unstructured":"Yanying Lin Shijie Peng Chengzhi Lu Chengzhong Xu and Kejiang Ye. 2025. FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters. arXiv:2510.11938 [cs]"},{"key":"e_1_3_2_1_37_1","volume-title":"Proceedings of the Eighteenth European Conference on Computer Systems. Rome Italy, 416\u2013432","author":"Lu Chengzhi","year":"2023","unstructured":"Chengzhi Lu, Huanle Xu, Kejiang Ye, Guoyao Xu, Liping Zhang, Guodong Yang, and Chengzhong Xu. 2023. Understanding and Optimizing Workloads for Unified Resource Management in Large Cloud Platforms. In Proceedings of the Eighteenth European Conference on Computer Systems. Rome Italy, 416\u2013432."},{"key":"e_1_3_2_1_38_1","volume-title":"Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis. In SoCC '21: ACM Symposium on Cloud Computing","author":"Luo Shutian","year":"2021","unstructured":"Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang, Yu Ding, Jian He, et al. 2021. Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis. In SoCC '21: ACM Symposium on Cloud Computing, Seattle, WA, USA, November 1 - 4, 2021. 412\u2013426."},{"key":"e_1_3_2_1_39_1","volume-title":"Themis: Fair and Efficient GPU Cluster Scheduling. 
In 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020","author":"Mahajan Kshiteej","year":"2020","unstructured":"Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, and Shuchi Chawla. 2020. Themis: Fair and Efficient GPU Cluster Scheduling. In 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020, Santa Clara, CA, USA, February 25-27, 2020. USA, 289\u2013304."},{"key":"e_1_3_2_1_40_1","volume-title":"Vyas Sekar, and Justine Sherry.","author":"Manousis Antonis","year":"2020","unstructured":"Antonis Manousis, Rahul Anand Sharma, Vyas Sekar, and Justine Sherry. 2020. Contention-Aware Performance Prediction For Virtualized Network Functions. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication. Virtual Event USA, 270\u2013282."},{"key":"e_1_3_2_1_41_1","volume-title":"Taming Performance Variability. In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018","author":"Maricq Aleksander","year":"2018","unstructured":"Aleksander Maricq, Dmitry Duplyakin, Ivo Jimenez, Carlos Maltzahn, Ryan Stutsman, Robert Ricci, and Ana Klimovic. 2018. Taming Performance Variability. In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October 8-10, 2018. 409\u2013425."},{"key":"e_1_3_2_1_42_1","volume-title":"Christine Cheng, Cody Coleman, Greg Diamos, David Kanter, Paulius Micikevicius, David Patterson, et al.","author":"Mattson Peter","year":"2020","unstructured":"Peter Mattson, Vijay Janapa Reddi, Christine Cheng, Cody Coleman, Greg Diamos, David Kanter, Paulius Micikevicius, David Patterson, et al. 2020. MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance. 
IEEE Micro (March 2020), 8\u201316."},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640411"},{"key":"e_1_3_2_1_44_1","volume-title":"Proceedings of the 27th ACM Symposium on Operating Systems Principles. Huntsville Ontario Canada, 1\u201315","author":"Narayanan Deepak","year":"2019","unstructured":"Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. Huntsville Ontario Canada, 1\u201315."},{"key":"e_1_3_2_1_45_1","volume-title":"Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In 51st ACM\/IEEE Annual International Symposium on Computer Architecture, ISCA 2024","author":"Patel Pratyush","year":"2024","unstructured":"Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, \u00cd\u00f1igo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In 51st ACM\/IEEE Annual International Symposium on Computer Architecture, ISCA 2024, Buenos Aires, Argentina, June 29 - July 3, 2024. 118\u2013132."},{"key":"e_1_3_2_1_46_1","volume-title":"Proceedings of the 2024 ACM Symposium on Cloud Computing","author":"Patke Archit","year":"2024","unstructured":"Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2024. Queue Management for SLO-Oriented Large Language Model Serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing. New York, NY, USA, 18\u201335."},{"key":"e_1_3_2_1_47_1","volume-title":"Low Latency Geo-distributed Data Analytics. ACM SIGCOMM Computer Communication Review (Sept","author":"Pu Qifan","year":"2015","unstructured":"Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya Akella, Paramvir Bahl, and Ion Stoica. 2015. 
Low Latency Geo-distributed Data Analytics. ACM SIGCOMM Computer Communication Review (Sept. 2015), 421\u2013434."},{"key":"e_1_3_2_1_48_1","volume-title":"Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021","author":"Qiao Aurick","year":"2021","unstructured":"Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. 2021. Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2021, July 14-16, 2021."},{"key":"e_1_3_2_1_49_1","volume-title":"Proceedings of the ACM Symposium on Cloud Computing. Seattle WA USA, 122\u2013137","author":"Romero Francisco","year":"2021","unstructured":"Francisco Romero, Gohar Irfan Chaudhry, \u00cd\u00f1igo Goiri, Pragna Gopa, Paul Batum, Neeraja J. Yadwadkar, Rodrigo Fonseca, Christos Kozyrakis, et al. 2021. Faa$T: A Transparent Auto-Scaling Cache for Serverless Applications. In Proceedings of the ACM Symposium on Cloud Computing. Seattle WA USA, 122\u2013137."},{"key":"e_1_3_2_1_50_1","volume-title":"Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. Lausanne Switzerland, 753\u2013767","author":"Roy Rohan Basu","year":"2022","unstructured":"Rohan Basu Roy, Tirthak Patel, and Devesh Tiwari. 2022. IceBreaker: Warming Serverless Functions Better with Heterogeneity. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. Lausanne Switzerland, 753\u2013767."},{"key":"e_1_3_2_1_51_1","volume-title":"Proceedings of the 1st Workshop on Machine Learning and Systems. 
Online United Kingdom, 15\u201323","author":"Santhanam Keshav","year":"2021","unstructured":"Keshav Santhanam, Siddharth Krishna, Ryota Tomioka, Andrew Fitzgibbon, and Tim Harris. 2021. DistIR: An Intermediate Representation for Optimizing Distributed Neural Networks. In Proceedings of the 1st Workshop on Machine Learning and Systems. Online United Kingdom, 15\u201323."},{"key":"e_1_3_2_1_52_1","volume-title":"Proceedings of the 2020 USENIX Annual Technical Conference, USENIX ATC 2020","author":"Shahrad Mohammad","year":"2020","unstructured":"Mohammad Shahrad, Rodrigo Fonseca, \u00cd\u00f1igo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, et al. 2020. Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider. In Proceedings of the 2020 USENIX Annual Technical Conference, USENIX ATC 2020, July 15-17, 2020. 205\u2013218."},{"key":"e_1_3_2_1_53_1","volume-title":"Proceedings of the 29th Symposium on Operating Systems Principles. Koblenz Germany, 675\u2013691","author":"Shen Jiacheng","unstructured":"Jiacheng Shen, Pengfei Zuo, Xuchuan Luo, Yuxin Su, Jiazhen Gu, Hao Feng, Yangfan Zhou, and Michael R. Lyu. 2023. Ditto: An Elastic and Adaptive Memory-Disaggregated Caching System. In Proceedings of the 29th Symposium on Operating Systems Principles. Koblenz Germany, 675\u2013691."},{"key":"e_1_3_2_1_54_1","unstructured":"Ying Sheng Shiyi Cao Dacheng Li Coleman Hooper Nicholas Lee Shuo Yang Christopher Chou Banghua Zhu et al. 2024. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv:2311.03285 [cs]"},{"key":"e_1_3_2_1_55_1","volume-title":"Fairness in Serving Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024","author":"Sheng Ying","year":"2024","unstructured":"Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, and Ion Stoica. 2024. Fairness in Serving Large Language Models. 
In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024. 965\u2013988."},{"key":"e_1_3_2_1_56_1","volume-title":"USHER: Holistic Interference Avoidance for Resource Optimized ML Inference. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024","author":"Shubha Sudipta Saha","year":"2024","unstructured":"Sudipta Saha Shubha, Haiying Shen, and Anand Iyer. 2024. USHER: Holistic Interference Avoidance for Resource Optimized ML Inference. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024. 947\u2013964."},{"key":"e_1_3_2_1_57_1","volume-title":"Ekko: A Large-Scale Deep Learning Recommender System with Low-Latency Model Update. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022","author":"Sima Chijun","year":"2022","unstructured":"Chijun Sima, Yao Fu, Man-Kit Sit, Liyi Guo, Xuri Gong, Feng Lin, Junyu Wu, Yongsheng Li, et al. 2022. Ekko: A Large-Scale Deep Learning Recommender System with Low-Latency Model Update. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022. 821\u2013839."},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613169"},{"key":"e_1_3_2_1_59_1","volume-title":"Proceedings of the Workshop on Hot Topics in Operating Systems. Ann Arbor Michigan, 26\u201332","author":"Stoica Ion","year":"2021","unstructured":"Ion Stoica and Scott Shenker. 2021. From Cloud Computing to Sky Computing. In Proceedings of the Workshop on Hot Topics in Operating Systems. Ann Arbor Michigan, 26\u201332."},{"key":"e_1_3_2_1_60_1","volume-title":"DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. 
In IEEE International Symposium on High Performance Computer Architecture, HPCA 2025","author":"Stojkovic Jovan","year":"2025","unstructured":"Jovan Stojkovic, Chaojie Zhang, \u00cd\u00f1igo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In IEEE International Symposium on High Performance Computer Architecture, HPCA 2025, Las Vegas, NV, USA, March 1-5, 2025. 1348\u20131362."},{"key":"e_1_3_2_1_61_1","volume-title":"Llumnix: Dynamic Scheduling for Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024","author":"Sun Biao","year":"2024","unstructured":"Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024. 173\u2013191."},{"key":"e_1_3_2_1_62_1","volume-title":"MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, et al. 2022. MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022, Renton, WA, USA, April 4-6, 2022. 945\u2013960."},{"key":"e_1_3_2_1_63_1","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024","author":"Wu Bingyang","year":"2024","unstructured":"Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving. 
In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024. 911\u2013927."},{"key":"e_1_3_2_1_64_1","volume-title":"Rock: Serving Multimodal Models in Cloud with Heterogeneous-Aware Resource Orchestration for Thousands of LoRA Adapters. In 2025 IEEE International Conference on Cluster Computing (CLUSTER). 1\u201313","author":"Wu Shuaipeng","year":"2025","unstructured":"Shuaipeng Wu, Yanying Lin, Shijie Peng, Wenyan Chen, Chong Ma, Min Shen, Le Chen, Chengzhong Xu, et al. 2025. Rock: Serving Multimodal Models in Cloud with Heterogeneous-Aware Resource Orchestration for Thousands of LoRA Adapters. In 2025 IEEE International Conference on Cluster Computing (CLUSTER). 1\u201313."},{"key":"e_1_3_2_1_65_1","volume-title":"Proceedings of the 2024 USENIX Annual Technical Conference, USENIX ATC 2024","author":"Xia Haojun","year":"2024","unstructured":"Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, et al. 2024. Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs. In Proceedings of the 2024 USENIX Annual Technical Conference, USENIX ATC 2024, Santa Clara, CA, USA, July 10-12, 2024. 699\u2013713."},{"key":"e_1_3_2_1_66_1","volume-title":"AntMan: Dynamic Scaling on GPU Clusters for Deep Learning. In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020","author":"Xiao Wencong","year":"2020","unstructured":"Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, et al. 2020. AntMan: Dynamic Scaling on GPU Clusters for Deep Learning. In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4-6, 2020. 533\u2013548."},{"key":"e_1_3_2_1_67_1","volume-title":"High-Throughput Inference. 
In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. Lausanne Switzerland, 768\u2013781","author":"Yang Yanan","year":"2022","unstructured":"Yanan Yang, Laiping Zhao, Yiming Li, Huanyu Zhang, Jie Li, Mingyang Zhao, Xingzhen Chen, and Keqiu Li. 2022. INFless: A Native Serverless System for Low-Latency, High-Throughput Inference. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. Lausanne Switzerland, 768\u2013781."},{"key":"e_1_3_2_1_68_1","volume-title":"Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022. 521\u2013538."},{"key":"e_1_3_2_1_69_1","volume-title":"Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"1","author":"Zeng Shaoxun","year":"2025","unstructured":"Shaoxun Zeng, Minhui Xie, Shiwei Gao, Youmin Chen, and Youyou Lu. 2025. Medusa: Accelerating Serverless LLM Inference with Materialization. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ASPLOS 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025. 653\u2013668."},{"key":"e_1_3_2_1_70_1","volume-title":"SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023","author":"Zhang Hong","year":"2023","unstructured":"Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 
2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, Boston, MA, April 17-19, 2023. 787\u2013808."},{"key":"e_1_3_2_1_71_1","doi-asserted-by":"crossref","unstructured":"Lvmin Zhang Anyi Rao and Maneesh Agrawala. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. arXiv:2302.05543 [cs]","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"e_1_3_2_1_72_1","volume-title":"Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, SOSP 2025, Lotte Hotel World","author":"Zhang Yuxuan","year":"2025","unstructured":"Yuxuan Zhang and Sebastian Angel. 2025. Quilt: Resource-aware Merging of Serverless Workflows. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, SOSP 2025, Lotte Hotel World, Seoul, Republic of Korea, October 13-16, 2025. 907\u2013927."},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1145\/3477132.3483580"},{"key":"e_1_3_2_1_74_1","volume-title":"51st ACM\/IEEE Annual International Symposium on Computer Architecture, ISCA 2024","author":"Zhao Youpeng","year":"2024","unstructured":"Youpeng Zhao, Di Wu, and Jun Wang. 2024. ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching. In 51st ACM\/IEEE Annual International Symposium on Computer Architecture, ISCA 2024, Buenos Aires, Argentina, June 29-July 3, 2024. 1005\u20131017."},{"key":"e_1_3_2_1_75_1","volume-title":"Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, et al. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. 
In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022. 559\u2013578."},{"key":"e_1_3_2_1_76_1","volume-title":"Shiyi Cao, Christos Kozyrakis, et al.","author":"Zheng Lianmin","year":"2024","unstructured":"Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, et al. 2024. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104 [cs]"}],"event":{"name":"SoCC '25: ACM Symposium on Cloud Computing","location":"Online USA","acronym":"SoCC '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","SIGMOD ACM Special Interest Group on Management of Data"]},"container-title":["Proceedings of the 2025 ACM Symposium on Cloud Computing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3772052.3772206","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T16:25:25Z","timestamp":1768321525000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3772052.3772206"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,19]]},"references-count":76,"alternative-id":["10.1145\/3772052.3772206","10.1145\/3772052"],"URL":"https:\/\/doi.org\/10.1145\/3772052.3772206","relation":{},"subject":[],"published":{"date-parts":[[2025,11,19]]},"assertion":[{"value":"2026-01-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}