{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,3]],"date-time":"2025-12-03T17:30:48Z","timestamp":1764783048539,"version":"3.46.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T00:00:00Z","timestamp":1764633600000},"content-version":"vor","delay-in-days":1,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["2148224, CNS-2212202"],"award-info":[{"award-number":["2148224, CNS-2212202"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:p>With the growing use of Large Language Model (LLM)-based tools like ChatGPT, Perplexity, and Gemini across industries, there is a rising need for efficient LLM inference systems. These systems handle requests with a unique two-phase computation structure: a prefill-phase that processes the full input prompt and a decode-phase that autoregressively generates tokens one at a time. This structure calls for new strategies for routing and scheduling requests.<\/jats:p>\n                  <jats:p>In this paper, we take a comprehensive approach to this challenge by developing a theoretical framework that models routing and scheduling in LLM inference systems. We identify two key design principles-optimal tiling and dynamic resource allocation-that are essential for achieving high throughput. Guided by these principles, we propose the Resource-Aware Dynamic (RAD) scheduler and prove that it achieves throughput optimality under mild conditions. To address practical Service Level Objectives (SLOs) such as serving requests with different Time Between Token (TBT) constraints, we design the SLO-Aware LLM Inference (SLAI) scheduler. SLAI uses real-time measurements to prioritize decode requests that are close to missing their TBT deadlines and reorders prefill requests based on known prompt lengths to further reduce the Time To First Token (TTFT) delays.<\/jats:p>\n                  <jats:p>We evaluate SLAI on the openchat_shareGPT4 dataset using the Mistral-7B model on an NVIDIA RTX ADA 6000 GPU. Compared to Sarathi-Serve, SLAI reduces the median TTFT by 53% and increases the maximum serving capacity by 26% such that median TTFT is below 0.5 seconds, while meeting tail TBT latency constraints. 
The complete source code is available at: https:\/\/github.com\/agrimUT\/SLAI.<\/jats:p>","DOI":"10.1145\/3771574","type":"journal-article","created":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T20:07:03Z","timestamp":1764706023000},"page":"1-43","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Optimal Scheduling Algorithms for LLM Inference: Theory and Practice"],"prefix":"10.1145","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-1550-8683","authenticated-orcid":false,"given":"Agrim","family":"Bari","sequence":"first","affiliation":[{"name":"Chandra Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3477-4947","authenticated-orcid":false,"given":"Parikshit","family":"Hegde","sequence":"additional","affiliation":[{"name":"Chandra Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1498-494X","authenticated-orcid":false,"given":"Gustavo","family":"de Veciana","sequence":"additional","affiliation":[{"name":"Chandra Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,12,2]]},"reference":[{"volume-title":"Introducing AMD CDNA\u2122 3 Architecture. White Paper. Advanced Micro Devices","year":"2025","key":"e_1_2_1_1_1","unstructured":"2023. Introducing AMD CDNA\u2122 3 Architecture. White Paper. Advanced Micro Devices, Inc. https:\/\/www.amd.com\/content\/dam\/amd\/en\/documents\/instinct-tech-docs\/white-papers\/amd-cdna-3-white-paper.pdf Accessed: 2025-06-18."},{"key":"e_1_2_1_2_1","first-page":"117","volume-title":"Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 117-134. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/agrawal"},{"key":"e_1_2_1_3_1","volume-title":"Yury Zemlyanskiy, Federico Lebr\u00f3n, and Sumit Sanghai.","author":"Ainslie Joshua","year":"2023","unstructured":"Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebr\u00f3n, and Sumit Sanghai. 2023. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245 (2023)."},{"key":"e_1_2_1_4_1","unstructured":"NVIDIA Corporation. 2025. NVIDIA Dynamo. https:\/\/developer.nvidia.com\/dynamo. Accessed: 2025-07-07."},{"key":"e_1_2_1_5_1","unstructured":"Tri Dao Daniel Y. Fu Stefano Ermon Atri Rudra and Christopher R\u00e9. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. 
arXiv:2205.14135 [cs.LG] https:\/\/arxiv.org\/abs\/2205.14135"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177728976"},{"key":"e_1_2_1_7_1","volume-title":"Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He.","author":"Holmes Connor","year":"2024","unstructured":"Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. arXiv:2401.08671 [cs.PF] https:\/\/arxiv.org\/abs\/2401.08671"},{"key":"e_1_2_1_8_1","unstructured":"Ke Hong Guohao Dai Jiaming Xu Qiuli Mao Xiuhong Li Jun Liu Kangdi Chen Yuhan Dong and Yu Wang. 2024. FlashDecoding++: Faster Large Language Model Inference on GPUs. arXiv:2311.01282 [cs.LG] https:\/\/arxiv.org\/abs\/2311.01282"},{"key":"e_1_2_1_9_1","volume-title":"Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen.","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: efficient training of giant neural networks using pipeline parallelism. Curran Associates Inc., Red Hook, NY, USA."},{"key":"e_1_2_1_10_1","unstructured":"Intel Corporation. 2024. Intel\u00ae Gaudi\u00ae 3 AI Accelerator White Paper. Technical Report. Intel Corporation. https:\/\/www.intel.com\/content\/www\/us\/en\/content-details\/817486\/intel-gaudi-3-ai-accelerator-white-paper.html Accessed: 2025-06-18."},{"key":"e_1_2_1_11_1","unstructured":"Kunal Jain Anjaly Parayil Ankur Mallick Esha Choukse Xiaoting Qin Jue Zhang \u00cd\u00f1igo Goiri Rujia Wang Chetan Bansal Victor R\u00fchle Anoop Kulkarni Steve Kofsky and Saravan Rajmohan. 2025. Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing. arXiv:2408.13510 [cs.DC] https:\/\/arxiv.org\/abs\/2408.13510"},{"key":"e_1_2_1_12_1","unstructured":"Albert Q. Jiang Alexandre Sablayrolles Arthur Mensch Chris Bamford Devendra Singh Chaplot Diego de las Casas Florian Bressand Gianna Lengyel Guillaume Lample Lucile Saulnier L\u00e9lio Renard Lavaud Marie-Anne Lachaux Pierre Stock Teven Le Scao Thibaut Lavril Thomas Wang Timoth\u00e9e Lacroix and William El Sayed. 2023. Mistral 7B. arXiv:2310.06825 [cs.CL] https:\/\/arxiv.org\/abs\/2310.06825"},{"key":"e_1_2_1_13_1","volume-title":"Hydragen: High-Throughput LLM Inference with Shared Prefixes. arXiv:2402.05099 [cs.LG] https:\/\/arxiv.org\/abs\/2402.05099","author":"Juravsky Jordan","year":"2024","unstructured":"Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher R\u00e9, and Azalia Mirhoseini. 2024. Hydragen: High-Throughput LLM Inference with Shared Prefixes. arXiv:2402.05099 [cs.LG] https:\/\/arxiv.org\/abs\/2402.05099"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3676641.3715996"},{"key":"e_1_2_1_15_1","volume-title":"Joseph E. Gonzalez, Hao Zhang, and Ion Stoica.","author":"Kwon Woosuk","year":"2023","unstructured":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. 
arXiv:2309.06180 [cs.LG] https:\/\/arxiv.org\/abs\/2309.06180"},{"key":"e_1_2_1_16_1","unstructured":"Yaniv Leviathan Matan Kalman and Yossi Matias. 2023. Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192 [cs.LG] https:\/\/arxiv.org\/abs\/2211.17192"},{"key":"e_1_2_1_17_1","volume-title":"Throughput-optimal scheduling algorithms for llm inference and ai agents. arXiv preprint arXiv:2504.07347","author":"Li Yueying","year":"2025","unstructured":"Yueying Li, Jim Dai, and Tianyi Peng. 2025. Throughput-optimal scheduling algorithms for llm inference and ai agents. arXiv preprint arXiv:2504.07347 (2025)."},{"key":"e_1_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Yuhan Liu Hanchen Li Yihua Cheng Siddhant Ray Yuyang Huang Qizheng Zhang Kuntai Du Jiayi Yao Shan Lu Ganesh Ananthanarayanan Michael Maire Henry Hoffmann Ari Holtzman and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. arXiv:2310.07240 [cs.NI] https:\/\/arxiv.org\/abs\/2310.07240","DOI":"10.1145\/3651890.3672274"},{"key":"e_1_2_1_19_1","volume-title":"Challenges and Open Problems. arXiv preprint arXiv:2503.07545 (March","author":"Mitzenmacher Michael","year":"2025","unstructured":"Michael Mitzenmacher and Rana Shahout. 2025. Queueing, Predictions, and LLMs: Challenges and Open Problems. arXiv preprint arXiv:2503.07545 (March 2025). arXiv:2503.07545 [cs.AI] https:\/\/arxiv.org\/abs\/2503.07545"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3721146.3721962"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476209"},{"key":"e_1_2_1_22_1","volume-title":"Accessed","author":"NVIDIA.","year":"2025","unstructured":"NVIDIA. [n.d.]. Matrix Multiplication Background User's Guide. https:\/\/docs.nvidia.com\/deeplearning\/performance\/dl-performance-matrix-multiplication\/index.html. Accessed: February 27, 2025."},{"key":"e_1_2_1_23_1","unstructured":"NVIDIA. 2025. FasterTransformer. https:\/\/github.com\/NVIDIA\/FasterTransformer"},{"key":"e_1_2_1_24_1","unstructured":"NVIDIA. 2025. TensorRT-LLM: A TensorRT toolbox for optimized large-language-model inference. https:\/\/github.com\/NVIDIA\/TensorRT-LLM"},{"key":"e_1_2_1_25_1","unstructured":"NVIDIA Corporation. 2020. NVIDIA A100 Tensor Core GPU Architecture. Technical Report. NVIDIA Corporation. https:\/\/images.nvidia.com\/aem-dam\/en-zz\/Solutions\/data-center\/nvidia-ampere-architecture-whitepaper.pdf Version 1.0."},{"key":"e_1_2_1_26_1","unstructured":"OpenAI. 2025. ChatGPT. https:\/\/chat.openai.com"},{"key":"e_1_2_1_27_1","volume-title":"Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150","author":"Shazeer Noam","year":"2019","unstructured":"Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150 (2019)."},{"key":"e_1_2_1_28_1","series-title":"SIAM review 35, 2","volume-title":"A review of regenerative processes","author":"Sigman Karl","year":"1993","unstructured":"Karl Sigman and Ronald W Wolff. 1993. A review of regenerative processes. SIAM review 35, 2 (1993), 269-288."},{"key":"e_1_2_1_29_1","first-page":"173","volume-title":"Llumnix: Dynamic Scheduling for Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Sun Biao","year":"2024","unstructured":"Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. 
Llumnix: Dynamic Scheduling for Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 173-191. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/sun-biao"},{"key":"e_1_2_1_30_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_1_31_1","unstructured":"Guan Wang Sijie Cheng Xianyuan Zhan Xiangang Li Sen Song and Yang Liu. 2024. OpenChat: Advancing Open-source Language Models with Mixed-Quality Data. arXiv:2309.11235 [cs.CL] https:\/\/arxiv.org\/abs\/2309.11235"},{"key":"e_1_2_1_32_1","unstructured":"Zihao Ye Lequn Chen Ruihang Lai Wuwei Lin Yineng Zhang Stephanie Wang Tianqi Chen Baris Kasikci Vinod Grover Arvind Krishnamurthy and Luis Ceze. 2025. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. arXiv:2501.01005 [cs.DC] https:\/\/arxiv.org\/abs\/2501.01005"},{"key":"e_1_2_1_33_1","first-page":"521","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521-538."},{"key":"e_1_2_1_34_1","volume-title":"BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching. arXiv preprint arXiv:2411.16102","author":"Zhao Yilong","year":"2024","unstructured":"Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, and Ion Stoica. 2024. BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching. arXiv preprint arXiv:2411.16102 (2024)."},{"key":"e_1_2_1_35_1","volume-title":"Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng.","author":"Zheng Lianmin","year":"2024","unstructured":"Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104 [cs.AI] https:\/\/arxiv.org\/abs\/2312.07104"},{"key":"e_1_2_1_36_1","first-page":"193","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 
193-210."}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3771574","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3771574","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,3]],"date-time":"2025-12-03T17:27:02Z","timestamp":1764782822000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3771574"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12]]},"references-count":36,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["10.1145\/3771574"],"URL":"https:\/\/doi.org\/10.1145\/3771574","relation":{},"ISSN":["2476-1249"],"issn-type":[{"type":"electronic","value":"2476-1249"}],"subject":[],"published":{"date-parts":[[2025,12]]},"assertion":[{"value":"2025-12-02","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}