{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T01:53:53Z","timestamp":1773194033156,"version":"3.50.1"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:p>\n                    Global cloud service providers handle inference workloads for Large Language Models (LLMs) that span latency-sensitive (e.g., chatbots) and insensitive (e.g., report writing) tasks, resulting in diverse and often conflicting Service Level Agreement (SLA) requirements. Managing such mixed workloads is challenging due to the complexity of the inference serving stack, which encompasses multiple models, GPU hardware, and global data centers. Existing solutions often silo such fast and slow tasks onto separate GPU resource pools with different SLAs, but this leads to significant under-utilization of expensive accelerators due to load mismatch. In this article, we characterize the LLM serving workloads at Microsoft Office 365, one of the largest users of LLMs within Microsoft Azure cloud with over 10 million requests per day, and highlight key observations across workloads in different data center regions and across time. This is one of the first such public studies of Internet-scale LLM workloads. We use these insights to propose\n                    <jats:sc>SageServe<\/jats:sc>\n                    , a comprehensive LLM serving framework that dynamically adapts to workload demands using multi-timescale control knobs. 
It combines short-term request routing to data centers with long-term scaling of GPU VMs and model placement with higher lead times, and co-optimizes the routing and resource allocation problem using a traffic forecast model and an Integer Linear Programming (ILP) solution. We evaluate\n                    <jats:sc>SageServe<\/jats:sc>\n                    through real runs and realistic simulations on 10 million production requests across three regions and four open-source models. We achieve up to 25% savings in GPU-hours compared to the current baseline deployment and reduce GPU-hour wastage due to inefficient auto-scaling by 80%, resulting in a potential monthly cost savings of up to $2.5 million, while maintaining tail latency and meeting SLAs. The workload traces, our simulator harness and the\n                    <jats:sc>SageServe<\/jats:sc>\n                    scheduler are available at https:\/\/github.com\/shashwatj07\/SageServe.\n                  <\/jats:p>","DOI":"10.1145\/3771576","type":"journal-article","created":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T20:07:03Z","timestamp":1764706023000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["SAGESERVE: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2526-5780","authenticated-orcid":false,"given":"Shashwat","family":"Jaiswal","sequence":"first","affiliation":[{"name":"University of Illinois, Urbana-Champaign, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-2617-6251","authenticated-orcid":false,"given":"Kunal","family":"Jain","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology, Atlanta, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4140-7774","authenticated-orcid":false,"given":"Yogesh","family":"Simmhan","sequence":"additional","affiliation":[{"name":"Indian 
Institute of Science, Bangalore, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6296-0395","authenticated-orcid":false,"given":"Anjaly","family":"Parayil","sequence":"additional","affiliation":[{"name":"Microsoft Research, Bangalore, India"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-7068-5627","authenticated-orcid":false,"given":"Ankur","family":"Mallick","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4019-5327","authenticated-orcid":false,"given":"Rujia","family":"Wang","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9387-5886","authenticated-orcid":false,"given":"Renee St.","family":"Amant","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0102-8139","authenticated-orcid":false,"given":"Chetan","family":"Bansal","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8957-7628","authenticated-orcid":false,"given":"Victor","family":"Ruhle","sequence":"additional","affiliation":[{"name":"Microsoft Research, Cambridge, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-4412-1252","authenticated-orcid":false,"given":"Anoop","family":"Kulkarni","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-8558-5954","authenticated-orcid":false,"given":"Steve","family":"Kofsky","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0204-7187","authenticated-orcid":false,"given":"Saravan","family":"Rajmohan","sequence":"additional","affiliation":[{"name":"Microsoft 365, Redmond, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,12,2]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n.d.]. ChatGPT. 
http:\/\/chat.openai.com."},{"key":"e_1_2_1_2_1","unstructured":"[n.d.]. Copilot. http:\/\/copilot.microsoft.com."},{"key":"e_1_2_1_3_1","unstructured":"[n.d.]. Gemini. http:\/\/gemini.google.com."},{"key":"e_1_2_1_4_1","first-page":"351","article-title":"Vidur: A large-scale simulation framework for llm inference","volume":"6","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S Gulavani, Ramachandran Ramjee, and Alexey Tumanov. 2024. Vidur: A large-scale simulation framework for llm inference. Proceedings of Machine Learning and Systems 6 (2024), 351-366.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_2_1_5_1","first-page":"117","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming {Throughput-Latency} Tradeoff in {LLM} Inference with {Sarathi-Serve}. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117-134."},{"key":"e_1_2_1_6_1","unstructured":"AWS. 2024. Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs). https:\/\/aws.amazon.com\/blogs\/machine-learning\/introducing-fast-model-loader-in-sagemaker-inference-accelerate-autoscaling-for-your-large-language-models-llms-part-1\/."},{"key":"e_1_2_1_7_1","unstructured":"BigScience. [n.d.]. Introducing The World's Largest Open Multilingual Language Model: BLOOM [Online]. https:\/\/bigscience.huggingface.co\/blog\/bloom."},{"key":"e_1_2_1_8_1","unstructured":"Rishi Bommasani Drew A. Hudson Ehsan Adeli Russ Altman Simran Arora Sydney von Arx Michael S. Bernstein Jeannette Bohg Antoine Bosselut Emma Brunskill Erik Brynjolfsson S. 
Buch Dallas Card Rodrigo Castellon Niladri S. Chatterji Annie S. Chen Kathleen A. Creel Jared Davis Dora Demszky Chris Donahue Moussa Doumbouya Esin Durmus Stefano Ermon John Etchemendy Kawin Ethayarajh Li Fei-Fei Chelsea Finn Trevor Gale Lauren E. Gillespie Karan Goel Noah D. Goodman Shelby Grossman Neel Guha Tatsunori Hashimoto Peter Henderson John Hewitt Daniel E. Ho Jenny Hong Kyle Hsu Jing Huang Thomas F. Icard Saahil Jain Dan Jurafsky Pratyusha Kalluri Siddharth Karamcheti Geoff Keeling Fereshte Khani O. Khattab Pang Wei Koh Mark S. Krass Ranjay Krishna Rohith Kuditipudi Ananya Kumar Faisal Ladhak Mina Lee Tony Lee Jure Leskovec Isabelle Levent Xiang Lisa Li Xuechen Li Tengyu Ma Ali Malik Christopher D. Manning Suvir P. Mirchandani Eric Mitchell Zanele Munyikwa Suraj Nair Avanika Narayan Deepak Narayanan Benjamin Newman Allen Nie Juan Carlos Niebles Hamed Nilforoshan J. F. Nyarko Giray Ogut Laurel Orr Isabel Papadimitriou Joon Sung Park Chris Piech Eva Portelance Christopher Potts Aditi Raghunathan Robert Reich Hongyu Ren Frieda Rong Yusuf H. Roohani Camilo Ruiz Jack Ryan Christopher R\u00e9 Dorsa Sadigh Shiori Sagawa Keshav Santhanam Andy Shih Krishna Parasuram Srinivasan Alex Tamkin Rohan Taori Armin W. Thomas Florian Tram\u00e8r Rose E. Wang William Wang Bohan Wu Jiajun Wu Yuhuai Wu Sang Michael Xie Michihiro Yasunaga Jiaxuan You Matei A. Zaharia Michael Zhang Tianyi Zhang Xikun Zhang Yuhui Zhang Lucia Zheng Kaitlyn Zhou and Percy Liang. 2021. On the Opportunities and Risks of Foundation Models. ArXiv (2021). https:\/\/crfm.stanford.edu\/assets\/report.pdf"},{"key":"e_1_2_1_9_1","unstructured":"Google Cloud. 2025. From LLMs to image generation: Accelerate inference workloads with AI Hypercomputer. 
https:\/\/cloud.google.com\/blog\/products\/compute\/ai-hypercomputer-inference-updates-for-google-cloud-tpu-and-gpu."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2097"},{"key":"e_1_2_1_11_1","volume-title":"CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning. arXiv:2503.13517","author":"Cui Hao","year":"2025","unstructured":"Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean Phing VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael P Brenner, Viren Jain, Sameera Ponda, and Subhashini Venugopalan. 2025. CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning. arXiv:2503.13517 [cs.CL] https:\/\/arxiv.org\/abs\/2503.13517"},{"key":"e_1_2_1_12_1","first-page":"16344","article-title":"Flashattention: Fast and memory-efficient exact attention with io-awareness","volume":"35","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344-16359.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_13_1","first-page":"135","volume-title":"ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Fu Yao","year":"2024","unstructured":"Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. 
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 135-153."},{"key":"e_1_2_1_14_1","first-page":"111","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC 24)","author":"Gao Bin","year":"2024","unstructured":"Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. {Cost-Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 111-126."},{"key":"e_1_2_1_15_1","volume-title":"Melange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity. arXiv preprint arXiv:2404.14527","author":"Griggs Tyler","year":"2024","unstructured":"Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. 2024. Melange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity. arXiv preprint arXiv:2404.14527 (2024)."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.peva.2007.06.012"},{"key":"e_1_2_1_17_1","first-page":"845","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Hadary Ori","year":"2020","unstructured":"Ori Hadary, Luke Marshall, Ishai Menache, Abhisek Pan, Esaias E Greeff, David Dion, Star Dorminey, Shailesh Joshi, Yang Chen, Mark Russinovich, et al. 2020. Protean:{VM} allocation service at scale. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 
845-861."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3712003"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3721146.3721947"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_2_1_21_1","first-page":"663","volume-title":"17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)","author":"Li Zhuohan","year":"2023","unstructured":"Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. 2023. {AlpaServe}: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 663-679."},{"key":"e_1_2_1_22_1","volume-title":"RingAttention with Blockwise Transformers for Near-Infinite Context. In The Twelfth International Conference on Learning Representations.","author":"Liu Hao","unstructured":"Hao Liu, Matei Zaharia, and Pieter Abbeel. [n.d.]. RingAttention with Blockwise Transformers for Near-Infinite Context. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_23_1","volume-title":"Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services. arXiv preprint arXiv:2404.16283","author":"Liu Jiachen","year":"2024","unstructured":"Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, and Mosharaf Chowdhury. 2024. Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services. arXiv preprint arXiv:2404.16283 (2024)."},{"key":"e_1_2_1_24_1","volume-title":"The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292","author":"Lu Chris","year":"2024","unstructured":"Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. 2024. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. 
arXiv:2408.06292 [cs.AI] https:\/\/arxiv.org\/abs\/2408.06292"},{"key":"e_1_2_1_25_1","volume-title":"Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs. arXiv preprint arXiv:2406.01566","author":"Mei Yixuan","year":"2024","unstructured":"Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2024. Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs. arXiv preprint arXiv:2406.01566 (2024)."},{"key":"e_1_2_1_26_1","volume-title":"Towards efficient generative large language model serving: A survey from algorithms to systems. Comput. Surveys","author":"Miao Xupeng","year":"2023","unstructured":"Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao Jia. 2023. Towards efficient generative large language model serving: A survey from algorithms to systems. Comput. Surveys (2023)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640411"},{"key":"e_1_2_1_28_1","unstructured":"Microsoft. 2024. Online endpoint deployment for real-time inferencing. https:\/\/learn.microsoft.com\/en-us\/azure\/machine-learning\/concept-endpoints-online."},{"key":"e_1_2_1_29_1","unstructured":"Microsoft. 2024. Run Azure OpenAI models in batch endpoints to compute embeddings. https:\/\/learn.microsoft.com\/enus\/azure\/machine-learning\/how-to-use-batch-model-openai-embeddings."},{"key":"e_1_2_1_30_1","unstructured":"Microsoft. 2025. Azure OpenAI Batch API."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2025.3544940"},{"key":"e_1_2_1_32_1","volume-title":"Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving. arXiv preprint arXiv:2405.06856","author":"Nie Chengyi","year":"2024","unstructured":"Chengyi Nie, Rodrigo Fonseca, and Zhenhua Liu. 2024. Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving. 
arXiv preprint arXiv:2405.06856 (2024)."},{"key":"e_1_2_1_33_1","unstructured":"OpenAI. 2024. Batch API."},{"key":"e_1_2_1_34_1","unstructured":"OpenAI. 2024. Streaming API. https:\/\/platform.openai.com\/docs\/api-reference\/streaming."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00019"},{"key":"e_1_2_1_36_1","unstructured":"Archit Patke, Dhemath Reddy, Saurabh Jha, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2025. Hierarchical Autoscaling for Large Language Model Serving with Chiron. arXiv:2501.08090 [cs.DC] https:\/\/arxiv.org\/abs\/2501.08090"},{"key":"e_1_2_1_37_1","unstructured":"Ye Qi. 2025. Scaling Large Language Model Serving Infrastructure at Meta. https:\/\/www.infoq.com\/presentations\/llm-meta\/."},{"key":"e_1_2_1_38_1","volume-title":"ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving. arXiv preprint arXiv:2410.01228","author":"Qiao Yifan","year":"2024","unstructured":"Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, and Harry Xu. 2024. ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving. arXiv preprint arXiv:2410.01228 (2024)."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLOUD.2011.42"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3342195.3387524"},{"key":"e_1_2_1_41_1","unstructured":"P. Schmid, O. Sanseviero, P. Cuenca, and L. Tunstall. [n.d.]. Llama 2 is here - Get it on Hugging Face [Online]. https:\/\/huggingface.co\/blog\/llama2."},{"key":"e_1_2_1_42_1","first-page":"965","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Sheng Ying","year":"2024","unstructured":"Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E Gonzalez, and Ion Stoica. 2024. Fairness in serving large language models. 
In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 965-988."},{"key":"e_1_2_1_43_1","series-title":"Time series analysis and its applications: with R examples (2017)","volume-title":"ARIMA models","author":"Shumway Robert H","unstructured":"Robert H Shumway and David S Stoffer. 2017. ARIMA models. Time series analysis and its applications: with R examples (2017), 75-163."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3676641.3716025"},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation","author":"Sun Biao","year":"2024","unstructured":"Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: dynamic scheduling for large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI'24). USENIX Association, USA, Article 10, 19 pages."},{"key":"e_1_2_1_46_1","unstructured":"Top500. 2025. Top 500 Supercomputing List. https:\/\/www.top500.org\/system\/180236\/."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2018.2889851"},{"key":"e_1_2_1_48_1","first-page":"443","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Wang Ao","year":"2021","unstructured":"Ao Wang, Shuai Chang, Huangshi Tian, Hongqi Wang, Haoran Yang, Huiba Li, Rui Du, and Yue Cheng. 2021. {FaaSNet}: Scalable and fast provisioning of custom serverless container runtimes at alibaba cloud function compute. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 443-457."},{"key":"e_1_2_1_49_1","volume-title":"OpenChat: Advancing Open-source Language Models with Mixed-Quality Data. 
In The Twelfth International Conference on Learning Representations.","author":"Wang Guan","unstructured":"Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. [n.d.]. OpenChat: Advancing Open-source Language Models with Mixed-Quality Data. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_50_1","volume-title":"Burstgpt: A real-world workload dataset to optimize llm serving systems. arXiv preprint arXiv:2401.17644","author":"Wang Yuxin","year":"2024","unstructured":"Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, et al. 2024. Burstgpt: A real-world workload dataset to optimize llm serving systems. arXiv preprint arXiv:2401.17644 (2024)."},{"key":"e_1_2_1_51_1","unstructured":"HPC Wire. 2024. AWS Delivers the AI Heat: Project Rainier and GenAI Innovations Lead the Way. https:\/\/www.hpcwire.com\/2024\/12\/05\/aws-delivers-the-ai-heat-project-rainier-and-genai-innovations-lead-the-way\/."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695948"},{"key":"e_1_2_1_53_1","volume-title":"Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920","author":"Wu Bingyang","year":"2023","unstructured":"Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. 2023. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920 (2023)."},{"key":"e_1_2_1_54_1","first-page":"521","volume-title":"Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 
USENIX Association, Carlsbad, CA, 521-538. https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/yu"},{"key":"e_1_2_1_55_1","volume-title":"AFlow: Automating Agentic Workflow Generation. In The Thirteenth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=z5uVAKwmjf","author":"Zhang Jiayi","year":"2025","unstructured":"Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. 2025. AFlow: Automating Agentic Workflow Generation. In The Thirteenth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=z5uVAKwmjf"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3708530"}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3771576","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,3]],"date-time":"2025-12-03T17:26:37Z","timestamp":1764782797000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3771576"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12]]},"references-count":56,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["10.1145\/3771576"],"URL":"https:\/\/doi.org\/10.1145\/3771576","relation":{},"ISSN":["2476-1249"],"issn-type":[{"value":"2476-1249","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12]]},"assertion":[{"value":"2025-12-02","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}