{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T16:39:54Z","timestamp":1778085594740,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":63,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,10,23]],"date-time":"2023-10-23T00:00:00Z","timestamp":1698019200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,10,23]]},"DOI":"10.1145\/3600006.3613165","type":"proceedings-article","created":{"date-parts":[[2023,10,3]],"date-time":"2023-10-03T14:44:17Z","timestamp":1696344257000},"page":"611-626","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1040,"title":["Efficient Memory Management for Large Language Model Serving with PagedAttention"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-8870-4892","authenticated-orcid":false,"given":"Woosuk","family":"Kwon","sequence":"first","affiliation":[{"name":"UC Berkeley, Berkeley, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-1534-9106","authenticated-orcid":false,"given":"Zhuohan","family":"Li","sequence":"additional","affiliation":[{"name":"UC Berkeley, Berkeley, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-3787-0316","authenticated-orcid":false,"given":"Siyuan","family":"Zhuang","sequence":"additional","affiliation":[{"name":"UC Berkeley, Berkeley, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1883-2126","authenticated-orcid":false,"given":"Ying","family":"Sheng","sequence":"additional","affiliation":[{"name":"UC Berkeley and Stanford University, Berkeley, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6611-4612","authenticated-orcid":false,"given":"Lianmin","family":"Zheng","sequence":"additional","affiliation":[{"name":"UC Berkeley, Berkeley, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9298-6254","authenticated-orcid":false,"given":"Cody Hao","family":"Yu","sequence":"additional","affiliation":[{"name":"Independent Researcher, Berkeley, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2921-956X","authenticated-orcid":false,"given":"Joseph","family":"Gonzalez","sequence":"additional","affiliation":[{"name":"UC Berkeley, Berkeley, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-8392-3977","authenticated-orcid":false,"given":"Hao","family":"Zhang","sequence":"additional","affiliation":[{"name":"UC San Diego, La Jolla, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5373-0088","authenticated-orcid":false,"given":"Ion","family":"Stoica","sequence":"additional","affiliation":[{"name":"UC Berkeley, Berkeley, United States of America"}]}],"member":"320","published-online":{"date-parts":[[2023,10,23]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, et al.","author":"Aminabadi Reza Yazdani","year":"2022","unstructured":"Reza Yazdani Aminabadi , Samyam Rajbhandari , Minjia Zhang , Ammar Ahmad Awan , Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, et al. 2022 . DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale . arXiv preprint arXiv:2207.00032 (2022). 
Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, et al. 2022. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv preprint arXiv:2207.00032 (2022)."},{"key":"e_1_3_2_1_2_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E Hinton","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba , Jamie Ryan Kiros, and Geoffrey E Hinton . 2016 . Layer normalization. arXiv preprint arXiv:1607.06450 (2016). Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_3_2_1_3_1","volume-title":"A neural probabilistic language model. Advances in neural information processing systems 13","author":"Bengio Yoshua","year":"2000","unstructured":"Yoshua Bengio , R\u00e9jean Ducharme , and Pascal Vincent . 2000. A neural probabilistic language model. Advances in neural information processing systems 13 ( 2000 ). Yoshua Bengio, R\u00e9jean Ducharme, and Pascal Vincent. 2000. A neural probabilistic language model. Advances in neural information processing systems 13 (2000)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W16-2301"},{"key":"e_1_3_2_1_5_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell etal 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901.  Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_3_2_1_6_1","volume-title":"Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al.","author":"Chen Mark","year":"2021","unstructured":"Mark Chen , Jerry Tworek , Heewoo Jun , Qiming Yuan , Henrique Ponde de Oliveira Pinto , Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021 . Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021). Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)."},{"key":"e_1_3_2_1_7_1","volume-title":"Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen , Bing Xu , Chiyuan Zhang , and Carlos Guestrin . 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 ( 2016 ). Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016)."},{"key":"e_1_3_2_1_8_1","volume-title":"Xing","author":"Chiang Wei-Lin","year":"2023","unstructured":"Wei-Lin Chiang , Zhuohan Li , Zi Lin , Ying Sheng , Zhanghao Wu , Hao Zhang , Lianmin Zheng , Siyuan Zhuang , Yonghao Zhuang , Joseph E. Gonzalez , Ion Stoica , and Eric P . Xing . 2023 . Vicuna : An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality . 
https:\/\/lmsys.org\/blog\/2023-03-30-vicuna\/ Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https:\/\/lmsys.org\/blog\/2023-03-30-vicuna\/"},{"key":"e_1_3_2_1_9_1","volume-title":"Charles Sutton, Sebastian Gehrmann, et al.","author":"Chowdhery Aakanksha","year":"2022","unstructured":"Aakanksha Chowdhery , Sharan Narang , Jacob Devlin , Maarten Bosma , Gaurav Mishra , Adam Roberts , Paul Barham , Hyung Won Chung , Charles Sutton, Sebastian Gehrmann, et al. 2022 . Palm : Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022). Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3419111.3421285"},{"key":"e_1_3_2_1_11_1","volume-title":"Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)","author":"Crankshaw Daniel","year":"2017","unstructured":"Daniel Crankshaw , Xin Wang , Guilio Zhou , Michael J Franklin , Joseph E Gonzalez , and Ion Stoica . 2017 . Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) . 613--627. Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 613--627."},{"key":"e_1_3_2_1_12_1","volume-title":"DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"Cui Weihao","year":"2022","unstructured":"Weihao Cui , Han Zhao , Quan Chen , Hao Wei , Zirui Li , Deze Zeng , Chao Li , and Minyi Guo . 2022 . DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22) . 183--198. Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, Deze Zeng, Chao Li, and Minyi Guo. 2022. DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 183--198."},{"key":"e_1_3_2_1_13_1","first-page":"16344","article-title":"Flashattention: Fast and memory-efficient exact attention with io-awareness","volume":"35","author":"Dao Tri","year":"2022","unstructured":"Tri Dao , Dan Fu , Stefano Ermon , Atri Rudra , and Christopher R\u00e9 . 2022 . Flashattention: Fast and memory-efficient exact attention with io-awareness . Advances in Neural Information Processing Systems 35 (2022), 16344 -- 16359 . Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344--16359.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441578"},{"key":"e_1_3_2_1_15_1","unstructured":"FastAPI. 2023. 
FastAPI. https:\/\/github.com\/tiangolo\/fastapi.  FastAPI. 2023. FastAPI. https:\/\/github.com\/tiangolo\/fastapi."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3190508.3190541"},{"key":"e_1_3_2_1_17_1","first-page":"6","article-title":"Ai and memory wall","volume":"1","author":"Gholami Amir","year":"2021","unstructured":"Amir Gholami , Zhewei Yao , Sehoon Kim , Michael W Mahoney , and Kurt Keutzer . 2021 . Ai and memory wall . RiseLab Medium Post 1 (2021), 6 . Amir Gholami, Zhewei Yao, Sehoon Kim, Michael W Mahoney, and Kurt Keutzer. 2021. Ai and memory wall. RiseLab Medium Post 1 (2021), 6.","journal-title":"RiseLab Medium Post"},{"key":"e_1_3_2_1_18_1","unstructured":"Github. 2022. https:\/\/github.com\/features\/copilot  Github. 2022. https:\/\/github.com\/features\/copilot"},{"key":"e_1_3_2_1_19_1","unstructured":"Google. 2023. https:\/\/bard.google.com\/  Google. 2023. https:\/\/bard.google.com\/"},{"key":"e_1_3_2_1_20_1","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Gujarati Arpan","year":"2020","unstructured":"Arpan Gujarati , Reza Karimi , Safya Alzayat , Wei Hao , Antoine Kaufmann , Ymir Vigfusson , and Jonathan Mace . 2020 . Serving {DNNs} like Clockwork: Performance Predictability from the Bottom Up . In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) . 443--462. Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving {DNNs} like Clockwork: Performance Predictability from the Bottom Up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 443--462."},{"key":"e_1_3_2_1_21_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Han Mingcong","year":"2022","unstructured":"Mingcong Han , Hanze Zhang , Rong Chen , and Haibo Chen . 2022 . Microsecond-scale Preemption for Concurrent {GPU-accelerated}{DNN} Inferences . In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) . 539--558. Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale Preemption for Concurrent {GPU-accelerated}{DNN} Inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 539--558."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378530"},{"key":"e_1_3_2_1_24_1","first-page":"497","article-title":"Checkmate: Breaking the memory wall with optimal tensor rematerialization","volume":"2","author":"Jain Paras","year":"2020","unstructured":"Paras Jain , Ajay Jain , Aniruddha Nrusimha , Amir Gholami , Pieter Abbeel , Joseph Gonzalez , Kurt Keutzer , and Ion Stoica . 2020 . Checkmate: Breaking the memory wall with optimal tensor rematerialization . Proceedings of Machine Learning and Systems 2 (2020), 497 -- 511 . Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497--511.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TEC.1962.5219356"},{"key":"e_1_3_2_1_26_1","volume-title":"The power of scale for parameter-efficient prompt tuning. 
arXiv preprint arXiv:2104.08691","author":"Lester Brian","year":"2021","unstructured":"Brian Lester , Rami Al-Rfou , and Noah Constant . 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 ( 2021 ). Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021)."},{"key":"e_1_3_2_1_27_1","volume-title":"Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190","author":"Li Xiang Lisa","year":"2021","unstructured":"Xiang Lisa Li and Percy Liang . 2021 . Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021). Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021)."},{"key":"e_1_3_2_1_28_1","unstructured":"Zhuohan Li Lianmin Zheng Yinmin Zhong Vincent Liu Ying Sheng Xin Jin Yanping Huang Zhifeng Chen Hao Zhang Joseph E Gonzalez etal 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. arXiv preprint arXiv:2302.11665 (2023).  Zhuohan Li Lianmin Zheng Yinmin Zhong Vincent Liu Ying Sheng Xin Jin Yanping Huang Zhifeng Chen Hao Zhang Joseph E Gonzalez et al. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. arXiv preprint arXiv:2302.11665 (2023)."},{"key":"e_1_3_2_1_29_1","volume-title":"Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 881--897","author":"Ma Lingxiao","year":"2020","unstructured":"Lingxiao Ma , Zhiqiang Xie , Zhi Yang , Jilong Xue , Youshan Miao , Wei Cui , Wenxiang Hu , Fan Yang , Lintao Zhang , and Lidong Zhou . 2020 . Rammer: Enabling holistic deep learning compiler optimizations with rtasks . In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 881--897 . Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling holistic deep learning compiler optimizations with rtasks. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 881--897."},{"key":"e_1_3_2_1_30_1","unstructured":"NVIDIA. [n. d.]. Triton Inference Server. https:\/\/developer.nvidia.com\/nvidia-triton-inference-server.  NVIDIA. [n. d.]. Triton Inference Server. https:\/\/developer.nvidia.com\/nvidia-triton-inference-server."},{"key":"e_1_3_2_1_31_1","unstructured":"NVIDIA. 2023. FasterTransformer. https:\/\/github.com\/NVIDIA\/FasterTransformer.  NVIDIA. 2023. FasterTransformer. https:\/\/github.com\/NVIDIA\/FasterTransformer."},{"key":"e_1_3_2_1_32_1","volume-title":"NCCL: The NVIDIA Collective Communication Library. https:\/\/developer.nvidia.com\/nccl.","author":"NVIDIA.","year":"2023","unstructured":"NVIDIA. 2023 . NCCL: The NVIDIA Collective Communication Library. https:\/\/developer.nvidia.com\/nccl. NVIDIA. 2023. NCCL: The NVIDIA Collective Communication Library. https:\/\/developer.nvidia.com\/nccl."},{"key":"e_1_3_2_1_33_1","volume-title":"Tensorflow-serving: Flexible, high-performance ml serving. arXiv preprint arXiv:1712.06139","author":"Olston Christopher","year":"2017","unstructured":"Christopher Olston , Noah Fiedel , Kiril Gorovoy , Jeremiah Harmsen , Li Lao , Fangwei Li , Vinu Rajashekhar , Sukriti Ramesh , and Jordan Soyke . 2017 . Tensorflow-serving: Flexible, high-performance ml serving. 
arXiv preprint arXiv:1712.06139 (2017). Christopher Olston, Noah Fiedel, Kiril Gorovoy, Jeremiah Harmsen, Li Lao, Fangwei Li, Vinu Rajashekhar, Sukriti Ramesh, and Jordan Soyke. 2017. Tensorflow-serving: Flexible, high-performance ml serving. arXiv preprint arXiv:1712.06139 (2017)."},{"key":"e_1_3_2_1_34_1","unstructured":"OpenAI. 2020. https:\/\/openai.com\/blog\/openai-api  OpenAI. 2020. https:\/\/openai.com\/blog\/openai-api"},{"key":"e_1_3_2_1_35_1","unstructured":"OpenAI. 2022. https:\/\/openai.com\/blog\/chatgpt  OpenAI. 2022. https:\/\/openai.com\/blog\/chatgpt"},{"key":"e_1_3_2_1_36_1","unstructured":"OpenAI. 2023. https:\/\/openai.com\/blog\/custom-instructions-for-chatgpt  OpenAI. 2023. https:\/\/openai.com\/blog\/custom-instructions-for-chatgpt"},{"key":"e_1_3_2_1_38_1","unstructured":"LMSYS ORG. 2023. Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B. https:\/\/lmsys.org\/blog\/2023-06-22-leaderboard\/.  LMSYS ORG. 2023. Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B. https:\/\/lmsys.org\/blog\/2023-06-22-leaderboard\/."},{"key":"e_1_3_2_1_39_1","volume-title":"Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke , Sam Gross , Francisco Massa , Adam Lerer , James Bradbury , Gregory Chanan , Trevor Killeen , Zeming Lin , Natalia Gimelshein , Luca Antiga , 2019 . Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019). Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_2_1_40_1","volume-title":"POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging. In International Conference on Machine Learning. PMLR, 17573--17583","author":"Patil Shishir G","year":"2022","unstructured":"Shishir G Patil , Paras Jain , Prabal Dutta , Ion Stoica , and Joseph Gonzalez . 2022 . POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging. In International Conference on Machine Learning. PMLR, 17573--17583 . Shishir G Patil, Paras Jain, Prabal Dutta, Ion Stoica, and Joseph Gonzalez. 2022. POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging. In International Conference on Machine Learning. PMLR, 17573--17583."},{"key":"e_1_3_2_1_41_1","volume-title":"Efficiently Scaling Transformer Inference. arXiv preprint arXiv:2211.05102","author":"Pope Reiner","year":"2022","unstructured":"Reiner Pope , Sholto Douglas , Aakanksha Chowdhery , Jacob Devlin , James Bradbury , Anselm Levskaya , Jonathan Heek , Kefan Xiao , Shivani Agrawal , and Jeff Dean . 2022. Efficiently Scaling Transformer Inference. arXiv preprint arXiv:2211.05102 ( 2022 ). Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2022. Efficiently Scaling Transformer Inference. arXiv preprint arXiv:2211.05102 (2022)."},{"key":"e_1_3_2_1_42_1","volume-title":"USENIX Annual Technical Conference. 
551--564","author":"Ren Jie","year":"2021","unstructured":"Jie Ren , Samyam Rajbhandari , Reza Yazdani Aminabadi , Olatunji Ruwase , Shuangyan Yang , Minjia Zhang , Dong Li , and Yuxiong He . 2021 . ZeRO-Offload: Democratizing Billion-Scale Model Training .. In USENIX Annual Technical Conference. 551--564 . Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training.. In USENIX Annual Technical Conference. 551--564."},{"key":"e_1_3_2_1_43_1","unstructured":"Reuters. 2023. https:\/\/www.reuters.com\/technology\/tech-giants-ai-like-bing-bard-poses-billion-dollar-search-problem-2023-02-22\/  Reuters. 2023. https:\/\/www.reuters.com\/technology\/tech-giants-ai-like-bing-bard-poses-billion-dollar-search-problem-2023-02-22\/"},{"key":"e_1_3_2_1_44_1","unstructured":"Amazon Web Services. 2023. https:\/\/aws.amazon.com\/bedrock\/  Amazon Web Services. 2023. https:\/\/aws.amazon.com\/bedrock\/"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359658"},{"key":"e_1_3_2_1_46_1","unstructured":"Ying Sheng Lianmin Zheng Binhang Yuan Zhuohan Li Max Ryabinin Daniel Y Fu Zhiqiang Xie Beidi Chen Clark Barrett Joseph E Gonzalez etal 2023. High-throughput Generative Inference of Large Language Models with a Single GPU. arXiv preprint arXiv:2303.06865 (2023).  Ying Sheng Lianmin Zheng Binhang Yuan Zhuohan Li Max Ryabinin Daniel Y Fu Zhiqiang Xie Beidi Chen Clark Barrett Joseph E Gonzalez et al. 2023. High-throughput Generative Inference of Large Language Models with a Single GPU. arXiv preprint arXiv:2303.06865 (2023)."},{"key":"e_1_3_2_1_47_1","volume-title":"Megatron-lm: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi , Mostofa Patwary , Raul Puri , Patrick LeGresley , Jared Casper , and Bryan Catanzaro . 2019 . Megatron-lm: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019). Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2210.12924"},{"key":"e_1_3_2_1_49_1","volume-title":"Sequence to sequence learning with neural networks. Advances in neural information processing systems 27","author":"Sutskever Ilya","year":"2014","unstructured":"Ilya Sutskever , Oriol Vinyals , and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 ( 2014 ). Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014)."},{"key":"e_1_3_2_1_50_1","volume-title":"Hashimoto","author":"Taori Rohan","year":"2023","unstructured":"Rohan Taori , Ishaan Gulrajani , Tianyi Zhang , Yann Dubois , Xuechen Li , Carlos Guestrin , Percy Liang , and Tatsunori B . Hashimoto . 2023 . Stanford Alpaca : An Instruction-following LLaMA model. https:\/\/github.com\/tatsu-lab\/stanford_alpaca. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. 
https:\/\/github.com\/tatsu-lab\/stanford_alpaca."},{"key":"e_1_3_2_1_51_1","unstructured":"ShareGPT Team. 2023. https:\/\/sharegpt.com\/  ShareGPT Team. 2023. https:\/\/sharegpt.com\/"},{"key":"e_1_3_2_1_52_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron , Thibaut Lavril , Gautier Izacard , Xavier Martinet , MarieAnne Lachaux , Timoth\u00e9e Lacroix , Baptiste Rozi\u00e8re , Naman Goyal , Eric Hambro , Faisal Azhar , 2023 . Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023). Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, MarieAnne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_3_2_1_53_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , \u0141ukasz Kaiser , and Illia Polosukhin . 2017. Attention is all you need. Advances in neural information processing systems 30 ( 2017 ). Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_54_1","volume-title":"2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"Wang Jing","year":"2022","unstructured":"Jing Wang , Youyou Lu , Qing Wang , Minhui Xie , Keji Huang , and Jiwu Shu . 2022 . Pacman: An Efficient Compaction Approach for {Log-Structured} {Key-Value} Store on Persistent Memory . In 2022 USENIX Annual Technical Conference (USENIX ATC 22) . 773--788. Jing Wang, Youyou Lu, Qing Wang, Minhui Xie, Keji Huang, and Jiwu Shu. 2022. Pacman: An Efficient Compaction Approach for {Log-Structured} {Key-Value} Store on Persistent Memory. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 773--788."},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178491"},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-industry.15"},{"key":"e_1_3_2_1_57_1","volume-title":"Self-Instruct: Aligning Language Model with Self Generated Instructions. arXiv preprint arXiv:2212.10560","author":"Wang Yizhong","year":"2022","unstructured":"Yizhong Wang , Yeganeh Kordi , Swaroop Mishra , Alisa Liu , Noah A Smith , Daniel Khashabi , and Hannaneh Hajishirzi . 2022. Self-Instruct: Aligning Language Model with Self Generated Instructions. arXiv preprint arXiv:2212.10560 ( 2022 ). Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning Language Model with Self Generated Instructions. arXiv preprint arXiv:2212.10560 (2022)."},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_1_59_1","unstructured":"Yonghui Wu Mike Schuster Zhifeng Chen Quoc V Le Mohammad Norouzi Wolfgang Macherey Maxim Krikun Yuan Cao Qin Gao Klaus Macherey etal 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).  
Yonghui Wu Mike Schuster Zhifeng Chen Quoc V Le Mohammad Norouzi Wolfgang Macherey Maxim Krikun Yuan Cao Qin Gao Klaus Macherey et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)."},{"key":"e_1_3_2_1_60_1","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu , Joo Seong Jeong , Geon-Woo Kim , Soojeong Kim , and Byung-Gon Chun . 2022 . Orca: A Distributed Serving System for {Transformer-Based} Generative Models . In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) . 521--538. Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for {Transformer-Based} Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521--538."},{"key":"e_1_3_2_1_61_1","volume-title":"SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Zhang Hong","year":"2023","unstructured":"Hong Zhang , Yupeng Tang , Anurag Khandelwal , and Ion Stoica . 2023 . SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23) . USENIX Association, Boston, MA, 787--808. https:\/\/www.usenix.org\/conference\/nsdi23\/presentation\/zhang-hong Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787--808. https:\/\/www.usenix.org\/conference\/nsdi23\/presentation\/zhang-hong"},{"key":"e_1_3_2_1_62_1","volume-title":"Xi Victoria Lin, et al","author":"Zhang Susan","year":"2022","unstructured":"Susan Zhang , Stephen Roller , Naman Goyal , Mikel Artetxe , Moya Chen , Shuohui Chen , Christopher Dewan , Mona Diab , Xian Li , Xi Victoria Lin, et al . 2022 . Opt : Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022). Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)."},{"key":"e_1_3_2_1_63_1","volume-title":"Alpa: Automating Inter-and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng , Zhuohan Li , Hao Zhang , Yonghao Zhuang , Zhifeng Chen , Yanping Huang , Yida Wang , Yuanzhong Xu , Danyang Zhuo , Eric P Xing , 2022 . Alpa: Automating Inter-and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) . 559--578. Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating Inter-and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559--578."},{"key":"e_1_3_2_1_64_1","volume-title":"PetS: A Unified Framework for Parameter-Efficient Transformers Serving. 
In 2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"Zhou Zhe","year":"2022","unstructured":"Zhe Zhou , Xuechao Wei , Jiejing Zhang , and Guangyu Sun . 2022 . PetS: A Unified Framework for Parameter-Efficient Transformers Serving. In 2022 USENIX Annual Technical Conference (USENIX ATC 22) . 489--504. Zhe Zhou, Xuechao Wei, Jiejing Zhang, and Guangyu Sun. 2022. PetS: A Unified Framework for Parameter-Efficient Transformers Serving. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 489--504."}],"event":{"name":"SOSP '23: 29th Symposium on Operating Systems Principles","location":"Koblenz Germany","acronym":"SOSP '23","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems","USENIX"]},"container-title":["Proceedings of the 29th Symposium on Operating Systems Principles"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3600006.3613165","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:49Z","timestamp":1750178209000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3600006.3613165"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,23]]},"references-count":63,"alternative-id":["10.1145\/3600006.3613165","10.1145\/3600006"],"URL":"https:\/\/doi.org\/10.1145\/3600006.3613165","relation":{},"subject":[],"published":{"date-parts":[[2023,10,23]]},"assertion":[{"value":"2023-10-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}