{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,15]],"date-time":"2026-01-15T15:08:28Z","timestamp":1768489708008,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":44,"publisher":"ACM","license":[{"start":{"date-parts":[[2025,3,30]],"date-time":"2025-03-30T00:00:00Z","timestamp":1743292800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,3,30]]},"DOI":"10.1145\/3721146.3721947","type":"proceedings-article","created":{"date-parts":[[2025,4,1]],"date-time":"2025-04-01T17:42:05Z","timestamp":1743529325000},"page":"19-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Performance Aware LLM Load Balancer for Mixed Workloads"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-2617-6251","authenticated-orcid":false,"given":"Kunal","family":"Jain","sequence":"first","affiliation":[{"name":"Microsoft, Bangalore, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6296-0395","authenticated-orcid":false,"given":"Anjaly","family":"Parayil","sequence":"additional","affiliation":[{"name":"Microsoft Research, Bangalore, India"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-7068-5627","authenticated-orcid":false,"given":"Ankur","family":"Mallick","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0371-5522","authenticated-orcid":false,"given":"Esha","family":"Choukse","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3631-9024","authenticated-orcid":false,"given":"Xiaoting","family":"Qin","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0472-9168","authenticated-orcid":false,"given":"Jue","family":"Zhang","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2591-4012","authenticated-orcid":false,"given":"\u00cd\u00f1igo","family":"Goiri","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4019-5327","authenticated-orcid":false,"given":"Rujia","family":"Wang","sequence":"additional","affiliation":[{"name":"Microsoft, Chicago, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0102-8139","authenticated-orcid":false,"given":"Chetan","family":"Bansal","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8957-7628","authenticated-orcid":false,"given":"Victor","family":"R\u00fchle","sequence":"additional","affiliation":[{"name":"Microsoft Research, Cambridge, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-4412-1252","authenticated-orcid":false,"given":"Anoop","family":"Kulkarni","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-8558-5954","authenticated-orcid":false,"given":"Steve","family":"Kofsky","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0204-7187","authenticated-orcid":false,"given":"Saravan","family":"Rajmohan","sequence":"additional","affiliation":[{"name":"Microsoft 365, Redmond, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,4]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Daniel Adiwardana Minh-Thang Luong David R So Jamie Hall Noah Fiedel Romal Thoppilan Zi Yang Apoorv Kulshreshtha Gaurav Nemade Yifeng Lu et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977 (2020)."},{"key":"e_1_3_2_1_2_1","unstructured":"Amey Agrawal Nitin Kedia Ashish Panwar Jayashree Mohan Nipun Kwatra Bhargav S. Gulavani Alexey Tumanov and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. arXiv:2403.02310 [cs.LG]"},{"key":"e_1_3_2_1_3_1","volume-title":"Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369","author":"Agrawal Amey","year":"2023","unstructured":"Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. 2023. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369 (2023)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ParCompTech.2013.6621389"},{"key":"e_1_3_2_1_5_1","volume-title":"Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al.","author":"Chen Mark","year":"2021","unstructured":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)."},{"key":"e_1_3_2_1_6_1","first-page":"13550","article-title":"Heuristic-guided reinforcement learning","volume":"34","author":"Cheng Ching-An","year":"2021","unstructured":"Ching-An Cheng, Andrey Kolobov, and Adith Swaminathan. 2021. Heuristic-guided reinforcement learning. Advances in Neural Information Processing Systems 34 (2021), 13550--13563.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_7_1","first-page":"16344","article-title":"Flashattention: Fast and memory-efficient exact attention with io-awareness","volume":"35","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344--16359.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-4418"},{"key":"e_1_3_2_1_9_1","volume-title":"On Efficient Approximate Queries over Machine Learning Models. arXiv preprint arXiv:2206.02845","author":"Ding Dujian","year":"2022","unstructured":"Dujian Ding, Sihem Amer-Yahia, and Laks VS Lakshmanan. 2022. On Efficient Approximate Queries over Machine Learning Models. arXiv preprint arXiv:2206.02845 (2022)."},{"key":"e_1_3_2_1_10_1","volume-title":"Laks VS Lakshmanan, and Ahmed Hassan Awadallah","author":"Ding Dujian","year":"2024","unstructured":"Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. 2024. Hybrid LLM: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618 (2024)."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/p19-1346"},{"key":"e_1_3_2_1_12_1","unstructured":"Google. [n. d.]. Vertex AI. https:\/\/cloud.google.com\/vertex-ai."},{"key":"e_1_3_2_1_13_1","volume-title":"Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. arXiv preprint arXiv:2401.11181","author":"Hu Cunchen","year":"2024","unstructured":"Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al. 2024. Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. arXiv preprint arXiv:2401.11181 (2024)."},{"key":"e_1_3_2_1_14_1","unstructured":"HuggingFace. [n. d.]. Hugging Face Inference API. https:\/\/huggingface.co\/inference-api."},{"key":"e_1_3_2_1_15_1","volume-title":"Efficient Reinforcement Learning for Routing Jobs in Heterogeneous Queueing Systems. arXiv preprint arXiv:2402.01147","author":"Jali Neharika","year":"2024","unstructured":"Neharika Jali, Guannan Qu, Weina Wang, and Gauri Joshi. 2024. Efficient Reinforcement Learning for Routing Jobs in Heterogeneous Queueing Systems. arXiv preprint arXiv:2402.01147 (2024)."},{"key":"e_1_3_2_1_16_1","volume-title":"Learned Best-Effort LLM Serving. arXiv preprint arXiv:2401.07886","author":"Jha Siddharth","year":"2024","unstructured":"Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, and Kurt Keutzer. 2024. Learned Best-Effort LLM Serving. arXiv preprint arXiv:2401.07886 (2024)."},{"key":"e_1_3_2_1_17_1","volume-title":"Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al.","author":"Jiang Albert Q","year":"2024","unstructured":"Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)."},{"key":"e_1_3_2_1_18_1","volume-title":"Thirty-seventh Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=zUYfbdNl1m","author":"Jin Yunho","year":"2023","unstructured":"Yunho Jin, Chun-Feng Wu, David Brooks, and Gu-Yeon Wei. 2023. $S^3$: Increasing GPU Utilization during Generative Inference for Higher Throughput. In Thirty-seventh Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=zUYfbdNl1m"},{"key":"e_1_3_2_1_19_1","volume-title":"The Eleventh International Conference on Learning Representations.","author":"Kag Anil","year":"2022","unstructured":"Anil Kag, Igor Fedorov, Aditya Gangrade, Paul Whatmough, and Venkatesh Saligrama. 2022. Efficient Edge Inference by Selective Query. In The Eleventh International Conference on Learning Representations."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_21_1","volume-title":"BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models. arXiv preprint arXiv:2404.18322","author":"Li Jiamin","year":"2024","unstructured":"Jiamin Li, Le Xu, Hong Xu, and Aditya Akella. 2024. BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models. arXiv preprint arXiv:2404.18322 (2024)."},{"key":"e_1_3_2_1_22_1","unstructured":"Bin Lin Tao Peng Chen Zhang Minmin Sun Lanbo Li Hanyu Zhao Wencong Xiao Qi Xu Xiafei Qiu Shen Li et al. 2024. Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache. arXiv preprint arXiv:2401.02669 (2024)."},{"key":"e_1_3_2_1_23_1","volume-title":"Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services. arXiv preprint arXiv:2404.16283","author":"Liu Jiachen","year":"2024","unstructured":"Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, and Mosharaf Chowdhury. 2024. Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services. arXiv preprint arXiv:2404.16283 (2024)."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/2002472.2002491"},{"key":"e_1_3_2_1_25_1","volume-title":"Proceedings of the Nineteenth European Conference on Computer Systems. 1016--1038","author":"Mendoza Daniel","year":"2024","unstructured":"Daniel Mendoza, Francisco Romero, and Caroline Trippel. 2024. Model Selection for Latency-Critical Inference Serving. In Proceedings of the Nineteenth European Conference on Computer Systems. 1016--1038."},{"key":"e_1_3_2_1_26_1","unstructured":"Microsoft. [n. d.]. Azure AI Studio. https:\/\/ai.azure.com\/."},{"key":"e_1_3_2_1_27_1","volume-title":"Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665","author":"Ong Isaac","year":"2024","unstructured":"Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. 2024. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665 (2024)."},{"key":"e_1_3_2_1_28_1","unstructured":"OpenAI. [n. d.]. OpenAI Platform. https:\/\/platform.openai.com\/overview."},{"key":"e_1_3_2_1_30_1","volume-title":"Splitwise: Efficient generative LLM inference using phase splitting. arXiv:2311.18677 [cs.AR]","author":"Patel Pratyush","year":"2023","unstructured":"Pratyush Patel, Esha Choukse, Chaojie Zhang, \u00cd\u00f1igo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. 2023. Splitwise: Efficient generative LLM inference using phase splitting. arXiv:2311.18677 [cs.AR]"},{"key":"e_1_3_2_1_31_1","unstructured":"Archit Patke Dhemath Reddy Saurabh Jha Haoran Qiu Christian Pinto Shengkun Cui Chandra Narayanaswami Zbigniew Kalbarczyk and Ravishankar Iyer. 2024. One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving. arXiv:2407.00047 [cs.DC] https:\/\/arxiv.org\/abs\/2407.00047"},{"key":"e_1_3_2_1_32_1","volume-title":"vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. arXiv preprint arXiv:2405.04437","author":"Prabhu Ramya","year":"2024","unstructured":"Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, and Ashish Panwar. 2024. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. arXiv preprint arXiv:2405.04437 (2024)."},{"key":"e_1_3_2_1_33_1","volume-title":"Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction. arXiv preprint arXiv:2404.08509","author":"Qiu Haoran","year":"2024","unstructured":"Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T Kalbarczyk, Tamer Ba\u015far, and Ravishankar K Iyer. 2024. Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction. arXiv preprint arXiv:2404.08509 (2024)."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"crossref","unstructured":"Stephen Roller Emily Dinan Naman Goyal Da Ju Mary Williamson Yinhan Liu Jing Xu Myle Ott Kurt Shuster Eric M Smith et al. 2020. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637 (2020).","DOI":"10.18653\/v1\/2021.eacl-main.24"},{"key":"e_1_3_2_1_36_1","volume-title":"Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623","author":"Spector Benjamin","year":"2023","unstructured":"Benjamin Spector and Chris Re. 2023. Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623 (2023)."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2022.3231981"},{"key":"e_1_3_2_1_38_1","volume-title":"Llumnix: Dynamic Scheduling for Large Language Model Serving. arXiv preprint arXiv:2406.03243","author":"Sun Biao","year":"2024","unstructured":"Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. arXiv preprint arXiv:2406.03243 (2024)."},{"key":"e_1_3_2_1_39_1","volume-title":"Barto","author":"Sutton Richard S.","year":"2018","unstructured":"Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA."},{"key":"e_1_3_2_1_40_1","volume-title":"Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)","author":"Tiedemann J\u00f6rg","year":"2012","unstructured":"J\u00f6rg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet U\u011fur Do\u011fan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association (ELRA), Istanbul, Turkey, 2214--2218. http:\/\/www.lrec-conf.org\/proceedings\/lrec2012\/pdf\/463_Paper.pdf"},{"key":"e_1_3_2_1_41_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]"},{"key":"e_1_3_2_1_42_1","volume-title":"Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920","author":"Wu Bingyang","year":"2023","unstructured":"Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. 2023. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920 (2023)."},{"key":"e_1_3_2_1_43_1","volume-title":"Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521--538. https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/yu"},{"key":"e_1_3_2_1_44_1","unstructured":"Zhenyu Zhang Ying Sheng Tianyi Zhou Tianlong Chen Lianmin Zheng Ruisi Cai Zhao Song Yuandong Tian Christopher R\u00e9 Clark Barrett et al. 2024. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_3_2_1_45_1","volume-title":"DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. arXiv preprint arXiv:2401.09670","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. arXiv preprint arXiv:2401.09670 (2024)."}],"event":{"name":"EuroMLSys '25: 5th Workshop on Machine Learning and Systems","location":"World Trade Center Rotterdam Netherlands","acronym":"EuroMLSys '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"]},"container-title":["Proceedings of the 5th Workshop on Machine Learning and Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3721146.3721947","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3721146.3721947","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:57:39Z","timestamp":1750298259000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3721146.3721947"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,30]]},"references-count":44,"alternative-id":["10.1145\/3721146.3721947","10.1145\/3721146"],"URL":"https:\/\/doi.org\/10.1145\/3721146.3721947","relation":{},"subject":[],"published":{"date-parts":[[2025,3,30]]},"assertion":[{"value":"2025-04-01","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}