{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T15:30:11Z","timestamp":1773588611859,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":76,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,3,22]]},"DOI":"10.1145\/3779212.3790133","type":"proceedings-article","created":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T13:55:26Z","timestamp":1773150926000},"page":"255-273","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["BlendServe: Optimizing Offline Inference with Resource-Aware Batching"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-7523-7400","authenticated-orcid":false,"given":"Yilong","family":"Zhao","sequence":"first","affiliation":[{"name":"University of California, Berkeley, Berkeley, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-9950-8129","authenticated-orcid":false,"given":"Shuo","family":"Yang","sequence":"additional","affiliation":[{"name":"University of California, Berkeley, Berkeley, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-3462-3292","authenticated-orcid":false,"given":"Kan","family":"Zhu","sequence":"additional","affiliation":[{"name":"University of Washington, Seattle, WA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6611-4612","authenticated-orcid":false,"given":"Lianmin","family":"Zheng","sequence":"additional","affiliation":[{"name":"University of California, Berkeley, Berkeley, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6122-8998","authenticated-orcid":false,"given":"Baris","family":"Kasikci","sequence":"additional","affiliation":[{"name":"University of Washington, Seattle, WA, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-3651-6973","authenticated-orcid":false,"given":"Yifan","family":"Qiao","sequence":"additional","affiliation":[{"name":"University of California, Berkeley, Berkeley, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3082-7872","authenticated-orcid":false,"given":"Yang","family":"Zhou","sequence":"additional","affiliation":[{"name":"University of California, Davis, Sacramento, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-6163-0569","authenticated-orcid":false,"given":"Jiarong","family":"Xing","sequence":"additional","affiliation":[{"name":"Rice University, Houston, TX, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5373-0088","authenticated-orcid":false,"given":"Ion","family":"Stoica","sequence":"additional","affiliation":[{"name":"University of California, Berkeley, Berkeley, CA, USA"}]}],"member":"320","published-online":{"date-parts":[[2026,3,22]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Amey Agrawal Nitin Kedia Ashish Panwar Jayashree Mohan Nipun Kwatra Bhargav S. Gulavani Alexey Tumanov and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. arXiv:2403.02310 [cs.LG] https:\/\/arxiv.org\/abs\/2403.02310"},{"key":"e_1_3_2_1_2_1","volume-title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245 [cs.CL] https:\/\/arxiv.org\/abs\/2305.13245","author":"Ainslie Joshua","year":"2023","unstructured":"Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr\u00f3n, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245 [cs.CL] https:\/\/arxiv.org\/abs\/2305.13245"},{"key":"e_1_3_2_1_3_1","unstructured":"Loubna Ben Allal Anton Lozhkov and Daniel van Strien. 2024. Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models -- huggingface.co. 
https:\/\/huggingface.co\/blog\/cosmopedia. [Accessed 25-10-2024]."},{"key":"e_1_3_2_1_4_1","unstructured":"Anthropic. 2024. Introducing the Message Batches API -- anthropic.com. https:\/\/www.anthropic.com\/news\/message-batches-api. [Accessed 20-10-2024]."},{"key":"e_1_3_2_1_5_1","unstructured":"Anyscale. 2024. LLM offline batch inference with Ray Data and vLLM | Anyscale Docs -- docs.anyscale.com. https:\/\/docs.anyscale.com\/examples\/batch-llm\/. [Accessed 26-10-2024]."},{"key":"e_1_3_2_1_6_1","unstructured":"I\u00f1aki Arango Ayush Noori Yepeng Huang Rana Shahout and Minlan Yu. 2025. Prefix and Output Length-Aware Scheduling for Efficient Online LLM Inference. In Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts Quantization Hardware and Inference. https:\/\/openreview.net\/forum?id=DOZiCWyK0N"},{"key":"e_1_3_2_1_7_1","unstructured":"AWS. 2024. Supported Regions and models for batch inference - Amazon Bedrock -- docs.aws.amazon.com. https:\/\/docs.aws.amazon.com\/bedrock\/latest\/userguide\/batch-inference-supported.html. [Accessed 26-10-2024]."},{"key":"e_1_3_2_1_8_1","unstructured":"Jinze Bai Shuai Bai Yunfei Chu Zeyu Cui Kai Dang Xiaodong Deng Yang Fan Wenbin Ge Yu Han Fei Huang Binyuan Hui Luo Ji Mei Li Junyang Lin Runji Lin Dayiheng Liu Gao Liu Chengqiang Lu Keming Lu Jianxin Ma Rui Men Xingzhang Ren Xuancheng Ren Chuanqi Tan Sinan Tan Jianhong Tu Peng Wang Shijie Wang Wei Wang Shengguang Wu Benfeng Xu Jin Xu An Yang Hao Yang Jian Yang Shusheng Yang Yang Yao Bowen Yu Hongyi Yuan Zheng Yuan Jianwei Zhang Xingxuan Zhang Yichang Zhang Zhenru Zhang Chang Zhou Jingren Zhou Xiaohuan Zhou and Tianhang Zhu. 2023. Qwen Technical Report. arXiv:2309.16609 [cs.CL] https:\/\/arxiv.org\/abs\/2309.16609"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Yushi Bai Xin Lv Jiajie Zhang Hongchang Lyu Jiankai Tang Zhidian Huang Zhengxiao Du Xiao Liu Aohan Zeng Lei Hou Yuxiao Dong Jie Tang and Juanzi Li. 2024. 
LongBench: A Bilingual Multitask Benchmark for Long Context Understanding. arXiv:2308.14508 [cs.CL] https:\/\/arxiv.org\/abs\/2308.14508","DOI":"10.18653\/v1\/2024.acl-long.172"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cor.2021.105692"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651379"},{"key":"e_1_3_2_1_12_1","volume-title":"Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547 [cs.DC] https:\/\/arxiv.org\/abs\/2310.18547","author":"Chen Lequn","year":"2023","unstructured":"Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2023. Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547 [cs.DC] https:\/\/arxiv.org\/abs\/2310.18547"},{"key":"e_1_3_2_1_13_1","unstructured":"Jean-Baptiste Cordonnier Andreas Loukas and Martin Jaggi. 2021. Multi-Head Attention: Collaborate Instead of Concatenate. arXiv:2006.16362 [cs.LG] https:\/\/arxiv.org\/abs\/2006.16362"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","unstructured":"Weihao Cui Yukang Chen Han Zhao Ziyi Xu Quan Chen Xusheng Chen Zhou Yangjie Shixuan Sun and Minyi Guo. 2025. Optimizing SLO-oriented LLM Serving with PD-Multiplexing. https:\/\/doi.org\/10.48550\/arXiv.2504.14489","DOI":"10.48550\/arXiv.2504.14489"},{"key":"e_1_3_2_1_15_1","unstructured":"Databricks. 2024. Introducing Simple Fast and Scalable Batch LLM Inference on Mosaic AI Model Serving -- databricks.com. https:\/\/www.databricks.com\/blog\/introducing-simple-fast-and-scalable-batch-llm-inference-mosaic-ai-model-serving. [Accessed 26-10-2024]."},{"key":"e_1_3_2_1_16_1","unstructured":"DeepSeek-AI: Xiao Bi Deli Chen Guanting Chen Shanhuang Chen Damai Dai Chengqi Deng Honghui Ding Kai Dong Qiushi Du Zhe Fu Huazuo Gao Kaige Gao Wenjun Gao Ruiqi Ge Kang Guan Daya Guo Jianzhong Guo Guangbo Hao Zhewen Hao Ying He Wenjie Hu Panpan Huang Erhang Li Guowei Li Jiashi Li Yao Li Y. K. Li Wenfeng Liang Fangyun Lin A. X. 
Liu Bo Liu Wen Liu Xiaodong Liu Xin Liu Yiyuan Liu Haoyu Lu Shanghao Lu Fuli Luo Shirong Ma Xiaotao Nie Tian Pei Yishi Piao Junjie Qiu Hui Qu Tongzheng Ren Zehui Ren Chong Ruan Zhangli Sha Zhihong Shao Junxiao Song Xuecheng Su Jingxiang Sun Yaofeng Sun Minghui Tang Bingxuan Wang Peiyi Wang Shiyu Wang Yaohui Wang Yongji Wang Tong Wu Y. Wu Xin Xie Zhenda Xie Ziwei Xie Yiliang Xiong Hanwei Xu R. X. Xu Yanhong Xu Dejian Yang Yuxiang You Shuiping Yu Xingkai Yu B. Zhang Haowei Zhang Lecong Zhang Liyue Zhang Mingchuan Zhang Minghua Zhang Wentao Zhang Yichao Zhang Chenggang Zhao Yao Zhao Shangyan Zhou Shunfeng Zhou Qihao Zhu and Yuheng Zou. 2024. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv:2401.02954 [cs.CL] https:\/\/arxiv.org\/abs\/2401.02954"},{"key":"e_1_3_2_1_17_1","unstructured":"Jiangfei Duan Runyu Lu Haojie Duanmu Xiuhong Li Xingcheng Zhang Dahua Lin Ion Stoica and Hao Zhang. 2024. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. arXiv:2404.02015 [cs.DC] https:\/\/arxiv.org\/abs\/2404.02015"},{"key":"e_1_3_2_1_18_1","unstructured":"Yichao Fu Siqi Zhu Runlong Su Aurick Qiao Ion Stoica and Hao Zhang. 2024. Efficient LLM Scheduling by Learning to Rank. arXiv:2408.15792 [cs.LG] https:\/\/arxiv.org\/abs\/2408.15792"},{"key":"e_1_3_2_1_19_1","unstructured":"Dan Hendrycks Collin Burns Steven Basart Andy Zou Mantas Mazeika Dawn Song and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs.CY] https:\/\/arxiv.org\/abs\/2009.03300"},{"key":"e_1_3_2_1_20_1","volume-title":"Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He.","author":"Holmes Connor","year":"2024","unstructured":"Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. 
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. arXiv:2401.08671 [cs.PF] https:\/\/arxiv.org\/abs\/2401.08671"},{"key":"e_1_3_2_1_21_1","volume-title":"Samyam Rajbhandari, and Yuxiong He.","author":"Jacobs Sam Ade","year":"2023","unstructured":"Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. 2023. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. arXiv:2309.14509 [cs.LG] https:\/\/arxiv.org\/abs\/2309.14509"},{"key":"e_1_3_2_1_22_1","volume-title":"NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference. arXiv:2411.01142 [cs.DC] https:\/\/arxiv.org\/abs\/2411.01142","author":"Jiang Xuanlin","year":"2024","unstructured":"Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. 2024. NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference. arXiv:2411.01142 [cs.DC] https:\/\/arxiv.org\/abs\/2411.01142"},{"key":"e_1_3_2_1_23_1","volume-title":"Hydragen: High-Throughput LLM Inference with Shared Prefixes. arXiv:2402.05099 [cs.LG] https:\/\/arxiv.org\/abs\/2402.05099","author":"Juravsky Jordan","year":"2024","unstructured":"Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher R\u00e9, and Azalia Mirhoseini. 2024. Hydragen: High-Throughput LLM Inference with Shared Prefixes. arXiv:2402.05099 [cs.LG] https:\/\/arxiv.org\/abs\/2402.05099"},{"key":"e_1_3_2_1_24_1","volume-title":"Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models. arXiv:2402.07033 [cs.LG] https:\/\/arxiv.org\/abs\/2402.07033","author":"Kamahori Keisuke","year":"2024","unstructured":"Keisuke Kamahori, Yile Gu, Kan Zhu, and Baris Kasikci. 2024. Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models. arXiv:2402.07033 [cs.LG] https:\/\/arxiv.org\/abs\/2402.07033"},{"key":"e_1_3_2_1_25_1","volume-title":"Joseph E. 
Gonzalez, Hao Zhang, and Ion Stoica.","author":"Kwon Woosuk","year":"2023","unstructured":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180 [cs.LG] https:\/\/arxiv.org\/abs\/2309.06180"},{"key":"e_1_3_2_1_26_1","volume-title":"Parrot: Efficient Serving of LLM-based Applications with Semantic Variable. arXiv:2405.19888 [cs.LG] https:\/\/arxiv.org\/abs\/2405.19888","author":"Lin Chaofan","year":"2024","unstructured":"Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient Serving of LLM-based Applications with Semantic Variable. arXiv:2405.19888 [cs.LG] https:\/\/arxiv.org\/abs\/2405.19888"},{"key":"e_1_3_2_1_27_1","volume-title":"Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration. arXiv:2504.19516 [cs.DC] https:\/\/arxiv.org\/abs\/2504.19516","author":"Lin Zejia","year":"2025","unstructured":"Zejia Lin, Hongxin Xu, Guanyi Chen, Xianwei Zhang, and Yutong Lu. 2025. Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration. arXiv:2504.19516 [cs.DC] https:\/\/arxiv.org\/abs\/2504.19516"},{"key":"e_1_3_2_1_28_1","unstructured":"Hao Liu Wilson Yan Matei Zaharia and Pieter Abbeel. 2024b. World Model on Million-Length Video And Language With Blockwise RingAttention. arXiv:2402.08268 [cs.LG] https:\/\/arxiv.org\/abs\/2402.08268"},{"key":"e_1_3_2_1_29_1","unstructured":"Hao Liu Matei Zaharia and Pieter Abbeel. 2023. Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv:2310.01889 [cs.CL] https:\/\/arxiv.org\/abs\/2310.01889"},{"key":"e_1_3_2_1_30_1","unstructured":"Shu Liu Asim Biswal Audrey Cheng Xiangxi Mo Shiyi Cao Joseph E. Gonzalez Ion Stoica and Matei Zaharia. 2024a. Optimizing LLM Queries in Relational Workloads. 
arXiv:2403.05821 [cs.LG] https:\/\/arxiv.org\/abs\/2403.05821"},{"key":"e_1_3_2_1_31_1","unstructured":"Jiasen Lu Christopher Clark Sangho Lee Zichen Zhang Savya Khosla Ryan Marten Derek Hoiem and Aniruddha Kembhavi. 2023. Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action. arXiv:2312.17172 [cs.CV] https:\/\/arxiv.org\/abs\/2312.17172"},{"key":"e_1_3_2_1_32_1","unstructured":"Jiasen Lu Christopher Clark Rowan Zellers Roozbeh Mottaghi and Aniruddha Kembhavi. 2022. Unified-IO: A Unified Model for Vision Language and Multi-Modal Tasks. arXiv:2206.08916 [cs.CV] https:\/\/arxiv.org\/abs\/2206.08916"},{"key":"e_1_3_2_1_33_1","first-page":"881","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Ma Lingxiao","year":"2020","unstructured":"Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 881-897. https:\/\/www.usenix.org\/conference\/osdi20\/presentation\/ma"},{"key":"e_1_3_2_1_34_1","unstructured":"Meta-Team. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_2_1_35_1","unstructured":"Microsoft. 2023. GitHub Copilot \u00b7 Your AI pair programmer -- github.com. https:\/\/github.com\/features\/copilot. [Accessed 28-10-2024]."},{"key":"e_1_3_2_1_36_1","unstructured":"Kepan Nan Rui Xie Penghao Zhou Tiehan Fan Zhenheng Yang Zhijie Chen Xiang Li Jian Yang and Ying Tai. 2024. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation. arXiv:2407.02371 [cs.CV] https:\/\/arxiv.org\/abs\/2407.02371"},{"key":"e_1_3_2_1_37_1","unstructured":"OpenAI. 2022. Introducing ChatGPT. https:\/\/openai.com\/index\/chatgpt\/. 
[Accessed 20-10-2024]."},{"key":"e_1_3_2_1_38_1","unstructured":"OpenAI. 2024. Introducing Batch API. https:\/\/platform.openai.com\/docs\/guides\/batch. [Accessed 20-10-2024]."},{"key":"e_1_3_2_1_39_1","unstructured":"Yiwei Qin Xuefeng Li Haoyang Zou Yixiu Liu Shijie Xia Zhen Huang Yixin Ye Weizhe Yuan Hector Liu Yuanzhi Li and Pengfei Liu. 2024. O1 Replication Journey: A Strategic Progress Report - Part 1. arXiv:2410.18982 [cs.AI] https:\/\/arxiv.org\/abs\/2410.18982"},{"key":"e_1_3_2_1_40_1","unstructured":"ShareGPT. 2023. ShareGPT. https:\/\/huggingface.co\/datasets\/anon8231489123\/ShareGPT_Vicuna_unfiltered."},{"key":"e_1_3_2_1_41_1","unstructured":"Ying Sheng Shiyi Cao Dacheng Li Banghua Zhu Zhuohan Li Danyang Zhuo Joseph E. Gonzalez and Ion Stoica. 2024. Fairness in Serving Large Language Models. arXiv:2401.00588 [cs.AI] https:\/\/arxiv.org\/abs\/2401.00588"},{"key":"e_1_3_2_1_42_1","unstructured":"Ying Sheng Lianmin Zheng Binhang Yuan Zhuohan Li Max Ryabinin Daniel Y. Fu Zhiqiang Xie Beidi Chen Clark Barrett Joseph E. Gonzalez Percy Liang Christopher R\u00e9 Ion Stoica and Ce Zhang. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. arXiv:2303.06865 [cs.LG] https:\/\/arxiv.org\/abs\/2303.06865"},{"key":"e_1_3_2_1_43_1","unstructured":"Xiaoxiang Shi Colin Cai Junjia Du and Zhihao Jia. 2025. Nexus:Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving. arXiv:2507.06608 [cs.DC] https:\/\/arxiv.org\/abs\/2507.06608"},{"key":"e_1_3_2_1_44_1","unstructured":"Mohammad Shoeybi Mostofa Patwary Raul Puri Patrick LeGresley Jared Casper and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL] https:\/\/arxiv.org\/abs\/1909.08053"},{"key":"e_1_3_2_1_45_1","unstructured":"Charlie Snell Jaehoon Lee Kelvin Xu and Aviral Kumar. 2024. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. 
arXiv:2408.03314 [cs.LG] https:\/\/arxiv.org\/abs\/2408.03314"},{"key":"e_1_3_2_1_46_1","unstructured":"Yixin Song Zeyu Mi Haotong Xie and Haibo Chen. 2023. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. arXiv:2312.12456 [cs.LG] https:\/\/arxiv.org\/abs\/2312.12456"},{"key":"e_1_3_2_1_47_1","volume-title":"Preble: Efficient Distributed Prompt Scheduling for LLM Serving. arXiv:2407.00023 [cs.DC] https:\/\/arxiv.org\/abs\/2407.00023","author":"Srivatsa Vikranth","year":"2024","unstructured":"Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. 2024a. Preble: Efficient Distributed Prompt Scheduling for LLM Serving. arXiv:2407.00023 [cs.DC] https:\/\/arxiv.org\/abs\/2407.00023"},{"key":"e_1_3_2_1_48_1","unstructured":"Vikranth Srivatsa Dongming Li Yiying Zhang and Reyna Abhyankar. 2024b. MLSys @ WukLab - Can Scheduling Overhead Dominate LLM Inference Performance? A Study of CPU Scheduling Overhead on Two Popular LLM Inference Systems -- mlsys.wuklab.io. https:\/\/mlsys.wuklab.io\/posts\/scheduling_overhead\/. [Accessed 25-10-2024]."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"crossref","unstructured":"Jovan Stojkovic Chaojie Zhang \u00cd\u00f1igo Goiri Josep Torrellas and Esha Choukse. 2024. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. arXiv:2408.00741 [cs.AI] https:\/\/arxiv.org\/abs\/2408.00741","DOI":"10.1109\/HPCA61900.2025.00102"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3629578"},{"key":"e_1_3_2_1_51_1","volume-title":"Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. arXiv:2406.10774 [cs.CL] https:\/\/arxiv.org\/abs\/2406.10774","author":"Tang Jiaming","year":"2024","unstructured":"Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. 
arXiv:2406.10774 [cs.CL] https:\/\/arxiv.org\/abs\/2406.10774"},{"key":"e_1_3_2_1_52_1","unstructured":"DeepSeek Team. 2024. Context Caching with SSD Offloading. https:\/\/api-docs.deepseek.com\/guides\/kv_cache. [Accessed 19-08-2025]."},{"key":"e_1_3_2_1_53_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL] https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_2_1_54_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Lukasz Kaiser and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL] https:\/\/arxiv.org\/abs\/1706.03762"},{"key":"e_1_3_2_1_55_1","unstructured":"Xinlong Wang Xiaosong Zhang Zhengxiong Luo Quan Sun Yufeng Cui Jinsheng Wang Fan Zhang Yueze Wang Zhen Li Qiying Yu Yingli Zhao Yulong Ao Xuebin Min Tao Li Boya Wu Bo Zhao Bowen Zhang Liangdong Wang Guang Liu Zheqi He Xi Yang Jingjing Liu Yonghua Lin Tiejun Huang and Zhongyuan Wang. 2024b. Emu3: Next-Token Prediction is All You Need. 
arXiv:2409.18869 [cs.CV] https:\/\/arxiv.org\/abs\/2409.18869"},{"key":"e_1_3_2_1_56_1","volume-title":"Amelie Chi Zhou, and Xiaowen Chu","author":"Wang Yuxin","year":"2024","unstructured":"Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. 2024a. BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems. arXiv:2401.17644"},{"key":"e_1_3_2_1_57_1","volume-title":"MIO: A Foundation Model on Multimodal Tokens. arXiv:2409.17692 [cs.CL] https:\/\/arxiv.org\/abs\/2409.17692","author":"Wang Zekun","year":"2024","unstructured":"Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, and Wenhao Huang. 2024c. MIO: A Foundation Model on Multimodal Tokens. arXiv:2409.17692 [cs.CL] https:\/\/arxiv.org\/abs\/2409.17692"},{"key":"e_1_3_2_1_58_1","volume-title":"Chi, Quoc Le, and Denny Zhou","author":"Wei Jason","year":"2023","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL] https:\/\/arxiv.org\/abs\/2201.11903"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/1498765.1498785"},{"key":"e_1_3_2_1_60_1","unstructured":"Bingyang Wu Yinmin Zhong Zili Zhang Shengyu Liu Fangyue Liu Yuanhang Sun Gang Huang Xuanzhe Liu and Xin Jin. 2024c. Fast Distributed Inference Serving for Large Language Models. arXiv:2305.05920 [cs.LG] https:\/\/arxiv.org\/abs\/2305.05920"},{"key":"e_1_3_2_1_61_1","volume-title":"Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. 
arXiv:2410.13848 [cs.CV] https:\/\/arxiv.org\/abs\/2410.13848","author":"Wu Chengyue","year":"2024","unstructured":"Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. 2024a. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. arXiv:2410.13848 [cs.CV] https:\/\/arxiv.org\/abs\/2410.13848"},{"key":"e_1_3_2_1_62_1","unstructured":"Yecheng Wu Zhuoyang Zhang Junyu Chen Haotian Tang Dacheng Li Yunhao Fang Ligeng Zhu Enze Xie Hongxu Yin Li Yi Song Han and Yao Lu. 2024b. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation. arXiv:2409.04429 [cs.CV] https:\/\/arxiv.org\/abs\/2409.04429"},{"key":"e_1_3_2_1_63_1","unstructured":"Fuzhao Xue Yukang Chen Dacheng Li Qinghao Hu Ligeng Zhu Xiuyu Li Yunhao Fang Haotian Tang Shang Yang Zhijian Liu Ethan He Hongxu Yin Pavlo Molchanov Jan Kautz Linxi Fan Yuke Zhu Yao Lu and Song Han. 2024. LongVILA: Scaling Long-Context Visual Language Models for Long Videos. arXiv:2408.10188 [cs.CV] https:\/\/arxiv.org\/abs\/2408.10188"},{"key":"e_1_3_2_1_64_1","unstructured":"Shunyu Yao Dian Yu Jeffrey Zhao Izhak Shafran Thomas L. Griffiths Yuan Cao and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601 [cs.CL] https:\/\/arxiv.org\/abs\/2305.10601"},{"key":"e_1_3_2_1_65_1","volume-title":"LIMO: Less is More for Reasoning. arXiv:2502.03387 [cs.CL] https:\/\/arxiv.org\/abs\/2502.03387","author":"Ye Yixin","year":"2025","unstructured":"Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. 2025. LIMO: Less is More for Reasoning. arXiv:2502.03387 [cs.CL] https:\/\/arxiv.org\/abs\/2502.03387"},{"key":"e_1_3_2_1_66_1","volume-title":"Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding. 
https:\/\/flashinfer.ai\/2024\/02\/02\/cascade-inference.html","author":"Ye Zihao","year":"2024","unstructured":"Zihao Ye, Ruihang Lai, Bo-Ru Lu, Chien-Yu Lin, Size Zheng, Lequn Chen, Tianqi Chen, and Luis Ceze. 2024. Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding. https:\/\/flashinfer.ai\/2024\/02\/02\/cascade-inference.html"},{"key":"e_1_3_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3688351.3689164"},{"key":"e_1_3_2_1_68_1","first-page":"521","volume-title":"Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521-538. https:\/\/www.usenix.org\/conference\/osdi22\/presentation\/yu"},{"key":"e_1_3_2_1_69_1","unstructured":"Ted Zadouri Hubert Strauss and Tri Dao. 2025. Hardware-Efficient Attention for Fast Decoding. arXiv:2505.21487 [cs.LG] https:\/\/arxiv.org\/abs\/2505.21487"},{"key":"e_1_3_2_1_70_1","unstructured":"Wenting Zhao Xiang Ren Jack Hessel Claire Cardie Yejin Choi and Yuntian Deng. 2024c. WildChat: 1M ChatGPT Interaction Logs in the Wild. arXiv:2405.01470 [cs.CL] https:\/\/arxiv.org\/abs\/2405.01470"},{"key":"e_1_3_2_1_71_1","unstructured":"Xuanlei Zhao Bin Jia Haotian Zhou Ziming Liu Shenggan Cheng and Yang You. 2024a. HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices. arXiv:2403.01164 [cs.PF] https:\/\/arxiv.org\/abs\/2403.01164"},{"key":"e_1_3_2_1_72_1","volume-title":"Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. 
arXiv:2310.19102 [cs.LG] https:\/\/arxiv.org\/abs\/2310.19102","author":"Zhao Yilong","year":"2024","unstructured":"Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024b. Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. arXiv:2310.19102 [cs.LG] https:\/\/arxiv.org\/abs\/2310.19102"},{"key":"e_1_3_2_1_73_1","volume-title":"Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng.","author":"Zheng Lianmin","year":"2024","unstructured":"Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104 [cs.AI] https:\/\/arxiv.org\/abs\/2312.07104"},{"key":"e_1_3_2_1_74_1","unstructured":"Zhen Zheng Xin Ji Taosong Fang Fanghao Zhou Chuanjie Liu and Gang Peng. 2025. BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching. arXiv:2412.03594 [cs.CL] https:\/\/arxiv.org\/abs\/2412.03594"},{"key":"e_1_3_2_1_75_1","unstructured":"Yinmin Zhong Shengyu Liu Junda Chen Jianbo Hu Yibo Zhu Xuanzhe Liu Xin Jin and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. arXiv:2401.09670 [cs.DC] https:\/\/arxiv.org\/abs\/2401.09670"},{"key":"e_1_3_2_1_76_1","first-page":"749","volume-title":"NanoFlow: Towards Optimal Large Language Model Serving Throughput. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25)","author":"Zhu Kan","year":"2025","unstructured":"Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, et al., 2025. NanoFlow: Towards Optimal Large Language Model Serving Throughput. 
In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 749-765."}],"event":{"name":"ASPLOS '26: 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems","location":"Pittsburgh PA USA","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","SIGPLAN ACM Special Interest Group on Programming Languages","SIGARCH ACM Special Interest Group on Computer Architecture","SIGBED ACM Special Interest Group on Embedded Systems"]},"container-title":["Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2"],"original-title":[],"deposited":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T13:56:54Z","timestamp":1773583014000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3779212.3790133"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,22]]},"references-count":76,"alternative-id":["10.1145\/3779212.3790133","10.1145\/3779212"],"URL":"https:\/\/doi.org\/10.1145\/3779212.3790133","relation":{},"subject":[],"published":{"date-parts":[[2026,3,22]]},"assertion":[{"value":"2026-03-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}