{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T11:32:34Z","timestamp":1774351954330,"version":"3.50.1"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"name":"Strategic Priority Research Program of Chinese Academy of Sciences","award":["XDA0320000 and XDA0320300"],"award-info":[{"award-number":["XDA0320000 and XDA0320300"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62090022, U24B6012 and 62172388"],"award-info":[{"award-number":["62090022, U24B6012 and 62172388"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100002858","name":"China Postdoctoral Science Foundation","doi-asserted-by":"crossref","award":["2024M762550"],"award-info":[{"award-number":["2024M762550"]}],"id":[{"id":"10.13039\/501100002858","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Shaanxi Postdoctoral Research Foundation","award":["2024BSHSDZZ102"],"award-info":[{"award-number":["2024BSHSDZZ102"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:p>Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in ShuffleInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. 
Results show that ShuffleInfer improves time-to-first-token (TTFT), job completion time (JCT), and inference efficiency in terms of performance per dollar by a large margin, e.g., it uses 38% fewer resources while lowering average TTFT and average JCT by 97% and 47%, respectively.<\/jats:p>","DOI":"10.1145\/3732941","type":"journal-article","created":{"date-parts":[[2025,4,30]],"date-time":"2025-04-30T11:24:17Z","timestamp":1746012257000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-0605-3792","authenticated-orcid":false,"given":"Cunchen","family":"Hu","sequence":"first","affiliation":[{"name":"State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-7613-1336","authenticated-orcid":false,"given":"Heyang","family":"Huang","sequence":"additional","affiliation":[{"name":"State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5202-2795","authenticated-orcid":false,"given":"Liangliang","family":"Xu","sequence":"additional","affiliation":[{"name":"Institute of Mathematics and Interdisciplinary Sciences, Xidian University","place":["Xi'an, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2807-9780","authenticated-orcid":false,"given":"Xusheng","family":"Chen","sequence":"additional","affiliation":[{"name":"Huawei Cloud","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1451-3101","authenticated-orcid":false,"given":"Chenxi","family":"Wang","sequence":"additional","affiliation":[{"name":"State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4673-8957","authenticated-orcid":false,"given":"Jiang","family":"Xu","sequence":"additional","affiliation":[{"name":"Huawei Cloud","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5392-1458","authenticated-orcid":false,"given":"Shuang","family":"Chen","sequence":"additional","affiliation":[{"name":"Huawei Cloud","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-4510-8321","authenticated-orcid":false,"given":"Hao","family":"Feng","sequence":"additional","affiliation":[{"name":"Huawei Cloud","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9629-0860","authenticated-orcid":false,"given":"Sa","family":"Wang","sequence":"additional","affiliation":[{"name":"State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6565-5276","authenticated-orcid":false,"given":"Yungang","family":"Bao","sequence":"additional","affiliation":[{"name":"State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences","place":["Beijing, 
China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1953-1392","authenticated-orcid":false,"given":"Ninghui","family":"Sun","sequence":"additional","affiliation":[{"name":"State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences","place":["BeiJing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9519-0546","authenticated-orcid":false,"given":"Yizhou","family":"Shan","sequence":"additional","affiliation":[{"name":"Huawei Cloud","place":["ShangHai, China"]}]}],"member":"320","published-online":{"date-parts":[[2025,7]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Amey Agrawal Nitin Kedia Ashish Panwar Jayashree Mohan Nipun Kwatra Bhargav Gulavani Alexey Tumanov and Ramachandran Ramjee. 2024. Taming throughput-latency tradeoff in LLM inference with sarathi-serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201924)."},{"key":"e_1_3_1_3_2","unstructured":"Amey Agrawal Ashish Panwar Jayashree Mohan Nipun Kwatra Bhargav S Gulavani and Ramachandran Ramjee. 2023. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv:2308.16369. Retrieved from https:\/\/arxiv.org\/abs\/2308.16369"},{"key":"e_1_3_1_4_2","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Aminabadi Reza Yazdani","year":"2022","unstructured":"Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et\u00a0al. 2022. DeepSpeed-inference: Enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis."},{"key":"e_1_3_1_5_2","unstructured":"AWS Bedrock. 2024. LLM Parameter. Retrieved 5 January 2024 from https:\/\/docs.aws.amazon.com\/bedrock\/latest\/userguide\/inference-parameters.html"},{"key":"e_1_3_1_6_2","unstructured":"Xueying Du Mingwei Liu Kaixin Wang Hanlin Wang Junwei Liu Yixuan Chen Jiayi Feng Chaofeng Sha Xin Peng and Yiling Lou. 2021. Evaluating large language models in class-level code generation. In Proceedings of the IEEE\/ACM 46th International Conference on Software Engineering (ICSE\u201924)."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00027"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/2619239.2626315"},{"key":"e_1_3_1_9_2","unstructured":"Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR\u201924)."},{"key":"e_1_3_1_10_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Proceedings of the Advances in Neural Information Processing Systems."},{"key":"e_1_3_1_11_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Dettmers Tim","year":"2022","unstructured":"Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale. 
In Proceedings of the Advances in Neural Information Processing Systems."},{"key":"e_1_3_1_12_2","unstructured":"Tim Dettmers Ruslan Svirschevski Vage Egiazarian Denis Kuznedelev Elias Frantar Saleh Ashkboos Alexander Borzunov Torsten Hoefler and Dan Alistarh. 2024. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. In The Twelfth International Conference on Learning Representations (ICLR\u201924)."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3580305.3599572"},{"key":"e_1_3_1_14_2","volume-title":"Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation","author":"Dragojevi\u0107 Aleksandar","year":"2014","unstructured":"Aleksandar Dragojevi\u0107, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. 2014. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation."},{"key":"e_1_3_1_15_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Frantar Elias","year":"2022","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. OPTQ: Accurate quantization for generative pre-trained transformers. In Proceedings of the 11th International Conference on Learning Representations."},{"key":"e_1_3_1_16_2","unstructured":"Yao Fu. 2024. Challenges in deploying long-context transformers: A theoretical peak performance analysis. arXiv:2405.08944. Retrieved from https:\/\/arxiv.org\/abs\/2405.08944"},{"key":"e_1_3_1_17_2","unstructured":"Yao Fu Leyang Xue Yeqi Huang Andrei-Octavian Brabete Dmitrii Ustiugov Yuvraj Patel and Luo Mai. 2024. ServerlessLLM: Locality-enhanced serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201924)."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503222.3507762"},{"key":"e_1_3_1_19_2","unstructured":"HiAscend. 2023. Atlas 900 AI Cluster. Retrieved 5 January 2024 from https:\/\/www.hiascend.com\/en\/hardware\/cluster"},{"key":"e_1_3_1_20_2","unstructured":"HiAscend. 2023. CANN aclrtMemcpy. Retrieved 5 January 2024 from https:\/\/www.hiascend.com\/document\/detail\/en\/canncommercial\/601\/inferapplicationdev\/aclcppdevg\/aclcppdevg_03_0081.html"},{"key":"e_1_3_1_21_2","unstructured":"Ke Hong Guohao Dai Jiaming Xu Qiuli Mao Xiuhong Li Jun Liu Kangdi Chen Yuhan Dong and Yu Wang. 2024. FlashDecoding++: Faster large language model inference with asynchronization, flat GEMM optimization, and heuristics. Proceedings of Machine Learning and Systems 6 (2024), 148\u2013161."},{"key":"e_1_3_1_22_2","unstructured":"Cunchen Hu Heyang Huang Junhao Hu Jiang Xu Xusheng Chen Tao Xie Chenxi Wang Sa Wang Yungang Bao Ninghui Sun et\u00a0al. 2024. MemServe: Context caching for disaggregated LLM serving with elastic memory pool. arXiv:2406.17565. Retrieved from https:\/\/arxiv.org\/abs\/2406.17565"},{"key":"e_1_3_1_23_2","unstructured":"Hugging Face. 2022. Retrieved 5 January 2024 from https:\/\/huggingface.co\/docs\/transformers\/model_doc\/opt##transformers.OPTForSequenceClassification"},{"key":"e_1_3_1_24_2","unstructured":"Hugging Face. 2023. Summarization. Retrieved 5 January 2024 from https:\/\/huggingface.co\/datasets\/ZhongshengWang\/Alpaca-pubmed-summarization"},{"key":"e_1_3_1_25_2","unstructured":"Hugging Face. 2023. Writing. 
Retrieved 5 January 2024 from https:\/\/huggingface.co\/datasets\/lancexiao\/write_doc_sft_v1"},{"key":"e_1_3_1_26_2","volume-title":"Proceedings of the Workshop on Efficient Systems for Foundation Models","author":"Isik Berivan","year":"2023","unstructured":"Berivan Isik, Hermann Kumbong, Wanyi Ning, Xiaozhe Yao, Sanmi Koyejo, and Ce Zhang. 2023. GPT-Zip: Deep compression of finetuned large language models. In Proceedings of the Workshop on Efficient Systems for Foundation Models."},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2021.3138923"},{"key":"e_1_3_1_28_2","article-title":"The promise and peril of generative AI","author":"Jo A","year":"2023","unstructured":"A Jo. 2023. The promise and peril of generative AI. Nature 614, 1 (2023), 214\u2013216.","journal-title":"Nature"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_1_30_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Li Zhuohan","year":"2020","unstructured":"Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joey Gonzalez. 2020. Train big, then compress: Rethinking model size for efficient training and inference of transformers. In Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_1_31_2","volume-title":"Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation","author":"Li Zhuohan","year":"2023","unstructured":"Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, et\u00a0al. 2023. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation."},{"key":"e_1_3_1_32_2","unstructured":"Bin Lin Chen Zhang Tao Peng Hanyu Zhao Wencong Xiao Minmin Sun Anmin Liu Zhipeng Zhang Lanbo Li Xiafei Qiu et\u00a0al. 2024. Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache. arXiv:2401.02669. Retrieved from https:\/\/arxiv.org\/abs\/2401.02669"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672274"},{"key":"e_1_3_1_34_2","volume-title":"Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation","author":"Ma Lingxiao","year":"2020","unstructured":"Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation."},{"key":"e_1_3_1_35_2","unstructured":"NGINX. 2001. NGINX. Retrieved 5 January 2024 from https:\/\/www.nginx.com\/blog\/nginx-power-of-two-choices-load-balancing-algorithm\/"},{"key":"e_1_3_1_36_2","unstructured":"NVIDIA. 2019. GPU Direct. Retrieved 5 January 2024 from https:\/\/developer.nvidia.com\/gpudirect"},{"key":"e_1_3_1_37_2","unstructured":"NVIDIA. 2020. NCCL. Retrieved 5 January 2024 from https:\/\/docs.nvidia.com\/deeplearning\/nccl\/user-guide\/docs\/overview.html"},{"key":"e_1_3_1_38_2","unstructured":"NVIDIA. 2022. CUDA Runtime API Memory Management. Retrieved 5 January 2024 from https:\/\/docs.nvidia.com\/cuda\/cuda-runtime-api\/group__CUDART__MEMORY.html"},{"key":"e_1_3_1_39_2","unstructured":"NVIDIA. 2021. FasterTransformer. 
Retrieved 5 January 2024 from https:\/\/github.com\/NVIDIA\/FasterTransformer"},{"key":"e_1_3_1_40_2","unstructured":"NVIDIA. 2024. Triton Inference Server. Retrieved 5 January 2024 from https:\/\/developer.nvidia.com\/"},{"key":"e_1_3_1_41_2","unstructured":"Charles Packer Sarah Wooders Kevin Lin Vivian Fang Shishir G. Patil Ion Stoica and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as operating systems. arXiv:2310.08560. Retrieved from https:\/\/arxiv.org\/abs\/2310.08560"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00019"},{"key":"e_1_3_1_43_2","article-title":"Efficiently scaling transformer inference","author":"Pope Reiner","year":"2023","unstructured":"Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems 5 (2023), 606\u2013624.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_44_2","unstructured":"Ruoyu Qin Zheming Li Weiran He Jialei Cui Feng Ren Mingxing Zhang Yongwei Wu Weimin Zheng and Xinran Xu. 2025. Mooncake: Trading more storage for less computation\u2014a KVCache-centric architecture for serving LLM chatbot. In 23rd USENIX Conference on File and Storage Technologies (FAST\u201925)."},{"key":"e_1_3_1_45_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_1_46_2","unstructured":"Ray Serve. 2023. Ray Serve. Retrieved 5 January 2024 from https:\/\/docs.ray.io\/en\/latest\/serve\/index.html"},{"key":"e_1_3_1_47_2","unstructured":"ShareGPT Team. 2023. ShareGPT. Retrieved 5 January 2024 from https:\/\/sharegpt.com\/"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.5555\/3691938.3691990"},{"key":"e_1_3_1_49_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-throughput generative inference of large language models with a single GPU. In Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_1_50_2","unstructured":"Foteini Strati Sara McAllister Amar Phanishayee Jakub Tarnawski and Ana Klimovic. 2024. D\u00e9j\u00e0Vu: KV-cache streaming for fast fault-tolerant generative LLM serving. arXiv:2403.01876. Retrieved from https:\/\/arxiv.org\/abs\/2403.01876"},{"key":"e_1_3_1_51_2","unstructured":"Rohan Taori Ishaan Gulrajani Tianyi Zhang Yann Dubois Xuechen Li Carlos Guestrin Percy Liang and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. Retrieved 15 July 2024 from https:\/\/github.com\/tatsu-lab\/stanford_alpaca"},{"key":"e_1_3_1_52_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. 
Retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_1_53_2","doi-asserted-by":"crossref","unstructured":"Xiaohui Wang Ying Xiong Yang Wei Mingxuan Wang and Lei Li. 2021. LightSeq: A high performance inference library for transformers. In Proceedings of NAACL-HLT 2021: Industry Papers, 113\u2013120.","DOI":"10.18653\/v1\/2021.naacl-industry.15"},{"key":"e_1_3_1_54_2","unstructured":"Wikipedia. 2014. NVLink. Retrieved 5 January 2024 from https:\/\/en.wikipedia.org\/wiki\/NVLink"},{"key":"e_1_3_1_55_2","unstructured":"Bingyang Wu Yinmin Zhong Zili Zhang Gang Huang Xuanzhe Liu and Xin Jin. 2023. Fast distributed inference serving for large language models. arXiv:2305.05920. Retrieved from https:\/\/arxiv.org\/abs\/2305.05920"},{"key":"e_1_3_1_56_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the International Conference on Machine Learning."},{"key":"e_1_3_1_57_2","unstructured":"Zhewei Yao Cheng Li Xiaoxia Wu Stephen Youn and Yuxiong He. 2023. A comprehensive study on post-training quantization for large language models. arXiv:2303.08302. Retrieved from https:\/\/arxiv.org\/abs\/2303.08302"},{"key":"e_1_3_1_58_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Yao Zhewei","year":"2022","unstructured":"Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. In Proceedings of the Advances in Neural Information Processing Systems."},{"key":"e_1_3_1_59_2","volume-title":"Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for transformer-based generative models. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation."},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS54959.2023.00042"},{"key":"e_1_3_1_61_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi Victoria Lin et\u00a0al. 2022. OPT: Open pre-trained transformer language models. arXiv:2205.01068. Retrieved from https:\/\/arxiv.org\/abs\/2205.01068"},{"key":"e_1_3_1_62_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Zheng Zangwei","year":"2024","unstructured":"Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, and Yang You. 2024. Response length perception and sequence scheduling: An LLM-empowered LLM inference pipeline. In Proceedings of the Advances in Neural Information Processing Systems."},{"key":"e_1_3_1_63_2","unstructured":"Yinmin Zhong Shengyu Liu Junda Chen Jianbo Hu Yibo Zhu Xuanzhe Liu Xin Jin and Hao Zhang. 2024. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. 
In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201924)."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3732941","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,1]],"date-time":"2025-07-01T12:32:04Z","timestamp":1751373124000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3732941"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,30]]},"references-count":62,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6,30]]}},"alternative-id":["10.1145\/3732941"],"URL":"https:\/\/doi.org\/10.1145\/3732941","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,30]]},"assertion":[{"value":"2024-07-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-04-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-01","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}