{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T16:50:34Z","timestamp":1780764634702,"version":"3.54.1"},"reference-count":375,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. 
The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.<\/jats:p>","DOI":"10.1145\/3754448","type":"journal-article","created":{"date-parts":[[2025,7,25]],"date-time":"2025-07-25T09:27:07Z","timestamp":1753435627000},"page":"1-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":28,"title":["Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems"],"prefix":"10.1145","volume":"58","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9371-8358","authenticated-orcid":false,"given":"Xupeng","family":"Miao","sequence":"first","affiliation":[{"name":"Purdue University","place":["West Lafayette, United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5406-0736","authenticated-orcid":false,"given":"Gabriele","family":"Oliaro","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University","place":["Pittsburgh, United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-8409-2717","authenticated-orcid":false,"given":"Zhihao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University","place":["Pittsburgh, United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-3375-497X","authenticated-orcid":false,"given":"Xinhao","family":"Cheng","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University","place":["Pittsburgh, United 
States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6894-6554","authenticated-orcid":false,"given":"Hongyi","family":"Jin","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University","place":["Pittsburgh, United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5744-3940","authenticated-orcid":false,"given":"Tianqi","family":"Chen","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University","place":["Pittsburgh, United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1270-5185","authenticated-orcid":false,"given":"Zhihao","family":"Jia","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University","place":["Pittsburgh, United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,9,4]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"2020. ByteDance Effective Transformer. Retrieved November 25 2023 from https:\/\/github.com\/bytedance\/effective_transformer. (2020). Commit: e406421."},{"key":"e_1_3_1_3_2","unstructured":"2021. NVIDIA FasterTransformer. Retrieved November 25 2023 from https:\/\/github.com\/NVIDIA\/FasterTransformer. (2021). Commit: df4a753."},{"key":"e_1_3_1_4_2","unstructured":"2022. DeepSpeed Inference. Retrieved November 25 2023 from https:\/\/github.com\/microsoft\/DeepSpeed. (2022). Commit: 2afa1c7."},{"key":"e_1_3_1_5_2","unstructured":"2022. NVIDIA H100 Tensor Core GPU Architecture. Retrieved November 25 2023 from https:\/\/resources.nvidia.com\/en-us-tensor-core\/gtc22-whitepaper-hopper. (2022)."},{"key":"e_1_3_1_6_2","unstructured":"2023. AnyScale LLMPerf leaderboard. Retrieved December 23 2023 from https:\/\/github.com\/ray-project\/llmperf-leaderboard. (2023)."},{"key":"e_1_3_1_7_2","unstructured":"2023. AWS Inferentia. 
Retrieved from https:\/\/aws.amazon.com\/blogs\/machine-learning\/deploy-large-language-models-on-aws-inferentia2-using-large-model-inference-containers\/. (2023)."},{"key":"e_1_3_1_8_2","unstructured":"2023. ChatGLM2-6B. Retrieved from https:\/\/huggingface.co\/THUDM\/chatglm2-6b. (2023)."},{"key":"e_1_3_1_9_2","unstructured":"2023. CTranslate2. Retrieved November 25 2023 from https:\/\/github.com\/OpenNMT\/CTranslate2. (2023). Commit: d963499."},{"key":"e_1_3_1_10_2","unstructured":"2023. DeepSpeed-FastGen. Retrieved November 25 2023 from https:\/\/github.com\/microsoft\/DeepSpeed\/tree\/master\/blogs\/deepspeed-fastgen. (2023)."},{"key":"e_1_3_1_11_2","unstructured":"2023. DeepSpeed-Inference v.s. ZeRO-Inference. Retrieved November 25 2023 from https:\/\/github.com\/microsoft\/DeepSpeed\/issues\/4234. (2023)."},{"key":"e_1_3_1_12_2","unstructured":"2023. DeepSpeed-MII. Retrieved November 25 2023 from https:\/\/github.com\/microsoft\/DeepSpeed-MII. (2023). Commit: f34b772."},{"key":"e_1_3_1_13_2","unstructured":"2023. FlexFlow-Serve. Retrieved November 25 2023 from https:\/\/github.com\/Flexflow\/FlexFlow\/tree\/inference. (2023). Commit: 672cdad."},{"key":"e_1_3_1_14_2","unstructured":"2023. FlexGen. Retrieved November 25 2023 from https:\/\/github.com\/FMInference\/FlexGen. (2023). Commit: d34f7b4."},{"key":"e_1_3_1_15_2","unstructured":"2023. ggml. Retrieved November 25 2023 from https:\/\/github.com\/ggerganov\/ggml. (2023). Commit: a5e4560."},{"key":"e_1_3_1_16_2","unstructured":"2023. gpt-fast. Retrieved December 23 2023 from https:\/\/github.com\/pytorch-labs\/gpt-fast. (2023). Commit: 8c8c463."},{"key":"e_1_3_1_17_2","unstructured":"2023. Graphcore. Retrieved from https:\/\/www.graphcore.ai\/posts\/dolly-2.0-open-source-language-model-with-chatgpt-like-interactivity. (2023)."},{"key":"e_1_3_1_18_2","unstructured":"2023. Graphcore PopTransformer. Retrieved November 25 2023 from https:\/\/github.com\/graphcore\/PopTransformer. (2023). 
Commit: 1314598."},{"key":"e_1_3_1_19_2","unstructured":"2023. Huggingface Text Generation Inference. Retrieved November 25 2023 from https:\/\/github.com\/huggingface\/text-generation-inference. (2023). Commit: 3c02262."},{"key":"e_1_3_1_20_2","unstructured":"2023. Intel Extension for Transformers. Retrieved December 23 2023 from https:\/\/github.com\/intel\/intel-extension-for-transformers. (2023). Commit: 37d4007."},{"key":"e_1_3_1_21_2","unstructured":"2023. InternLM LMDeploy. Retrieved November 25 2023 from https:\/\/github.com\/InternLM\/lmdeploy. (2023). Commit: c07f60f."},{"key":"e_1_3_1_22_2","unstructured":"2023. LightLLM. Retrieved November 25 2023 from https:\/\/github.com\/ModelTC\/lightllm. (2023). Commit: 84671a7."},{"key":"e_1_3_1_23_2","unstructured":"2023. Llama-v2-7b benchmark. Retrieved November 25 2023 from https:\/\/hamel.dev\/notes\/llm\/inference\/03_inference.html. (2023)."},{"key":"e_1_3_1_24_2","unstructured":"2023. NVIDIA cuDNN MultiHeadAttn. Retrieved November 25 2023 from https:\/\/docs.nvidia.com\/deeplearning\/cudnn\/api\/index.html##cudnnMultiHeadAttnForward. (2023)."},{"key":"e_1_3_1_25_2","unstructured":"2023. NVIDIA CUTLASS. Retrieved November 25 2023 from https:\/\/github.com\/NVIDIA\/cutlass. (2023). Commit: b5d8a5d."},{"key":"e_1_3_1_26_2","unstructured":"2023. NVIDIA TensorRT-LLM. Retrieved November 25 2023 from https:\/\/github.com\/NVIDIA\/TensorRT-LLM. (2023). Commit: 6837c81."},{"key":"e_1_3_1_27_2","unstructured":"2023. OpenLLM. Retrieved November 25 2023 from https:\/\/github.com\/bentoml\/OpenLLM. (2023). Commit: b4ea4b3."},{"key":"e_1_3_1_28_2","unstructured":"2023. RayLLM. Retrieved November 25 2023 from https:\/\/github.com\/ray-project\/ray-llm. (2023). Commit: fa3a766."},{"key":"e_1_3_1_29_2","unstructured":"2023. Sambanova. Retrieved from https:\/\/sambanova.ai\/press\/sambanova-unveils-new-chip-the-sn40l\/. (2023)."},{"key":"e_1_3_1_30_2","unstructured":"2023. vLLM. 
Retrieved November 25 2023 from https:\/\/github.com\/vllm-project\/vllm. (2023). Commit: 7c60044."},{"key":"e_1_3_1_31_2","unstructured":"2023. Xorbits Inference (Xinference). Retrieved November 25 2023 from https:\/\/github.com\/xorbitsai\/inference. (2023). Commit: 22732d8."},{"key":"e_1_3_1_32_2","unstructured":"Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya Florencia Leoni Aleman Diogo Almeida Janko Altenschmidt Sam Altman Shyamal Anadkat 2023. Gpt-4 technical report. arXiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774 (2023)."},{"key":"e_1_3_1_33_2","first-page":"117","volume-title":"Proceedings of the OSDI 2024","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In Proceedings of the OSDI 2024. 117\u2013134."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.298"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00073"},{"key":"e_1_3_1_36_2","unstructured":"Keivan Alizadeh Iman Mirzadeh Dmitry Belenko Karen Khatamifard Minsik Cho Carlo C. Del Mundo Mohammad Rastegari and Mehrdad Farajtabar. 2023. LLM in a flash: Efficient large language model inference with limited memory. arXiv:2312.11514. Retrieved from https:\/\/arxiv.org\/abs\/2312.11514 (2023)."},{"key":"e_1_3_1_37_2","doi-asserted-by":"crossref","unstructured":"Reza Yazdani Aminabadi Samyam Rajbhandari Minjia Zhang Ammar Ahmad Awan Cheng Li Du Li Elton Zheng Jeff Rasley Shaden Smith Olatunji Ruwase and Yuxiong He. 2022. Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale. arXiv:2207.00032. Retrieved from https:\/\/arxiv.org\/abs\/2207.00032. 
(2022).","DOI":"10.1109\/SC41404.2022.00051"},{"key":"e_1_3_1_38_2","unstructured":"Sotiris Anagnostidis Dario Pavllo Luca Biggio Lorenzo Noci Aurelien Lucchi and Thomas Hofmann. 2023. Dynamic context pruning for efficient and interpretable autoregressive transformers. arXiv:2305.15805. Retrieved from https:\/\/arxiv.org\/abs\/2305.15805. (2023)."},{"key":"e_1_3_1_39_2","volume-title":"Proceedings of the COLM 2024","author":"Ankner Zachary","year":"2024","unstructured":"Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. 2024. Hydra: Sequentially-dependent draft heads for medusa decoding. In Proceedings of the COLM 2024."},{"key":"e_1_3_1_40_2","unstructured":"Sangmin Bae Jongwoo Ko Hwanjun Song and Se-Young Yun. 2023. Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding. arXiv:2310.05424. Retrieved from https:\/\/arxiv.org\/abs\/2310.05424. (2023)."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.172"},{"key":"e_1_3_1_42_2","first-page":"499","volume-title":"Proceedings of the OSDI 2020","author":"Bai Zhihao","year":"2020","unstructured":"Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. 2020. PipeSwitch: Fast pipelined context switching for deep learning applications. In Proceedings of the OSDI 2020. 499\u2013514."},{"key":"e_1_3_1_43_2","unstructured":"Peter Belcak and Roger Wattenhofer. 2023. Exponentially faster language modelling. arXiv:2311.10770. Retrieved from https:\/\/arxiv.org\/abs\/2311.10770. (2023)."},{"key":"e_1_3_1_44_2","unstructured":"Iz Beltagy Matthew E. Peters and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150. Retrieved from https:\/\/arxiv.org\/abs\/2004.05150. 
(2020)."},{"key":"e_1_3_1_45_2","first-page":"1","article-title":"Demystifying parallel and distributed deep learning: An in-depth concurrency analysis","author":"Ben-Nun Tal","year":"2019","unstructured":"Tal Ben-Nun and Torsten Hoefler. 2019. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR) (2019), 1\u201343.","journal-title":"ACM Computing Surveys (CSUR)"},{"key":"e_1_3_1_46_2","first-page":"2206","volume-title":"Proceedings of the ICML","author":"Borgeaud Sebastian","year":"2022","unstructured":"Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2022. Improving language models by retrieving from trillions of tokens. In Proceedings of the ICML. 2206\u20132240."},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-demo.54"},{"key":"e_1_3_1_48_2","doi-asserted-by":"crossref","unstructured":"Alexander Borzunov Max Ryabinin Artem Chumachenko Dmitry Baranchuk Tim Dettmers Younes Belkada Pavel Samygin and Colin Raffel. 2023. Distributed inference and fine-tuning of large language models over the internet. arXiv:2312.08361. Retrieved from https:\/\/arxiv.org\/abs\/2312.08361. (2023).","DOI":"10.18653\/v1\/2023.acl-demo.54"},{"key":"e_1_3_1_49_2","unstructured":"Vukasin Bozic Danilo Dordevic Daniele Coppola and Joseph Thommes. 2023. Rethinking attention: Exploring shallow feed-forward neural networks as an alternative to attention layers in transformers. arXiv:2311.10642. Retrieved from https:\/\/arxiv.org\/abs\/2311.10642. 
(2023)."},{"key":"e_1_3_1_50_2","unstructured":"William Brandon Aniruddha Nrusimha Kevin Qian Zachary Ankner Tian Jin Zhiye Song and Jonathan Ragan-Kelley. 2023. Striped attention: Faster ring attention for causal transformers. arXiv:2311.09431. Retrieved from https:\/\/arxiv.org\/abs\/2311.09431. (2023)."},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1985.6312218"},{"key":"e_1_3_1_52_2","unstructured":"Tianle Cai Yuhong Li Zhengyang Geng Hongwu Peng and Tri Dao. 2023. Medusa: Simple framework for accelerating llm generation with multiple decoding heads. Retrieved November 25 2023 from https:\/\/github.com\/FasterDecoding\/Medusa. (2023). Commit: dd9c8a5."},{"key":"e_1_3_1_53_2","unstructured":"Rahul Chand Yashoteja Prabhu and Pratyush Kumar. 2023. DSFormer: Effective compression of text-transformers by dense-sparse weight factorization. arXiv:2312.13211. Retrieved from https:\/\/arxiv.org\/abs\/2312.13211. (2023)."},{"key":"e_1_3_1_54_2","unstructured":"Carol Chen. 2022. Transformer Inference Arithmetic. Retrieved November 25 2023 from https:\/\/kipp.ly\/blog\/transformer-inference-arithmetic\/. (2022)."},{"key":"e_1_3_1_55_2","unstructured":"Charlie Chen Sebastian Borgeaud Geoffrey Irving Jean-Baptiste Lespiau Laurent Sifre and John Jumper. 2023. Accelerating large language model decoding with speculative sampling. arXiv:2302.01318. Retrieved from https:\/\/arxiv.org\/abs\/2302.01318. (2023)."},{"key":"e_1_3_1_56_2","unstructured":"Lequn Chen Zihao Ye Yongji Wu Danyang Zhuo Luis Ceze and Arvind Krishnamurthy. 2023. Punica: Multi-tenant LoRA serving. arXiv:2310.18547. Retrieved from https:\/\/arxiv.org\/abs\/2310.18547. (2023)."},{"key":"e_1_3_1_57_2","unstructured":"Lingjiao Chen Matei Zaharia and James Zou. 2023. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv:2305.05176. Retrieved from https:\/\/arxiv.org\/abs\/2305.05176. 
(2023)."},{"key":"e_1_3_1_58_2","unstructured":"Mark Chen Jerry Tworek Heewoo Jun Qiming Yuan Henrique Ponde de Oliveira Pinto Jared Kaplan Harri Edwards Yuri Burda Nicholas Joseph Greg Brockman Alex Ray Raul Puri Gretchen Krueger Michael Petrov Heidy Khlaaf Girish Sastry Pamela Mishkin Brooke Chan Scott Gray Nick Ryder Mikhail Pavlov Alethea Power Lukasz Kaiser Mohammad Bavarian Clemens Winter Philippe Tillet Felipe Petroski Such Dave Cummings Matthias Plappert Fotios Chantzis Elizabeth Barnes Ariel Herbert-Voss William Hebgen Guss Alex Nichol Alex Paino Nikolas Tezak Jie Tang Igor Babuschkin Suchir Balaji Shantanu Jain William Saunders Christopher Hesse Andrew N. Carr Jan Leike Josh Achiam Vedant Misra Evan Morikawa Alec Radford Matthew Knight Miles Brundage Mira Murati Katie Mayer Peter Welinder Bob McGrew Dario Amodei Sam McCandlish Ilya Sutskever and Wojciech Zaremba. 2021. Evaluating large language models trained on code. arXiv:2107.03374. Retrieved from https:\/\/arxiv.org\/abs\/2107.03374. (2021)."},{"key":"e_1_3_1_59_2","first-page":"1","volume-title":"Proceedings of the HPCA 2021","author":"Chen Shiyang","year":"2021","unstructured":"Shiyang Chen, Shaoyi Huang, Santosh Pandey, Bingbing Li, Guang R. Gao, Long Zheng, Caiwen Ding, and Hang Liu. 2021. Et: Re-thinking self-attention for transformer models on gpus. In Proceedings of the HPCA 2021. 1\u201318."},{"key":"e_1_3_1_60_2","unstructured":"Shouyuan Chen Sherman Wong Liangjian Chen and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. arXiv:2306.15595. Retrieved from https:\/\/arxiv.org\/abs\/2306.15595. (2023)."},{"key":"e_1_3_1_61_2","unstructured":"Wenhu Chen Xueguang Ma Xinyi Wang and William W. Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv:2211.12588. Retrieved from https:\/\/arxiv.org\/abs\/2211.12588. 
(2022)."},{"key":"e_1_3_1_62_2","first-page":"129531","article-title":"Sequoia: Scalable and robust speculative decoding","author":"Chen Zhuoming","year":"2024","unstructured":"Zhuoming Chen, Avner May, Ruslan Svirschevski, Yu-Hsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. 2024. Sequoia: Scalable and robust speculative decoding. Proceedings of NeurIPS (2024), 129531\u2013129563.","journal-title":"Proceedings of NeurIPS"},{"key":"e_1_3_1_63_2","volume-title":"Proceedings of the ICLR 2024","author":"Chen Zhuoming","year":"2024","unstructured":"Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, and Beidi Chen. 2024. Magicpig: Lsh sampling for efficient llm generation. In Proceedings of the ICLR 2024."},{"key":"e_1_3_1_64_2","doi-asserted-by":"crossref","unstructured":"Alexis Chevalier Alexander Wettig Anirudh Ajith and Danqi Chen. 2023. Adapting language models to compress contexts. arXiv:2305.14788. Retrieved from https:\/\/arxiv.org\/abs\/2305.14788 (2023).","DOI":"10.18653\/v1\/2023.emnlp-main.232"},{"key":"e_1_3_1_65_2","unstructured":"Wei-Lin Chiang Zhuohan Li Zi Lin Ying Sheng Zhanghao Wu Hao Zhang Lianmin Zheng Siyuan Zhuang Yonghao Zhuang Joseph E. Gonzalez Ion Stoica and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. (2023). Retrieved from https:\/\/lmsys.org\/blog\/2023-03-30-vicuna\/"},{"key":"e_1_3_1_66_2","unstructured":"Rewon Child Scott Gray Alec Radford and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv:1904.10509. 
Retrieved from https:\/\/arxiv.org\/abs\/1904.10509 (2019)."},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/SCW63240.2024.00178"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC55918.2022.00018"},{"key":"e_1_3_1_69_2","unstructured":"Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra Adam Roberts Paul Barham Hyung Won Chung Charles Sutton Sebastian Gehrmann Parker Schuh Kensen Shi Sasha Tsvyashchenko Joshua Maynez Abhishek Rao Parker Barnes Yi Tay Noam Shazeer Vinodkumar Prabhakaran Emily Reif Nan Du Ben Hutchinson Reiner Pope James Bradbury Jacob Austin Michael Isard Guy Gur-Ari Pengcheng Yin Toju Duke Anselm Levskaya Sanjay Ghemawat Sunipa Dev Henryk Michalewski Xavier Garcia Vedant Misra Kevin Robinson Liam Fedus Denny Zhou Daphne Ippolito David Luan Hyeontaek Lim Barret Zoph Alexander Spiridonov Ryan Sepassi David Dohan Shivani Agrawal Mark Omernick Andrew M. Dai Thanumalayan Sankaranarayana Pillai Marie Pellat Aitor Lewkowycz Erica Moreira Rewon Child Oleksandr Polozov Katherine Lee Zongwei Zhou Xuezhi Wang Brennan Saeta Mark Diaz Orhan Firat Michele Catasta Jason Wei Kathy Meier-Hellstern Douglas Eck Jeff Dean Slav Petrov and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways. arXiv:2204.02311. Retrieved from https:\/\/arxiv.org\/abs\/2204.02311 (2022)."},{"key":"e_1_3_1_70_2","unstructured":"Jacob K. Christopher Brian R. Bartoldson Tal Ben-Nun Michael Cardei Bhavya Kailkhura and Ferdinando Fioretto. 2024. Speculative diffusion decoding: Accelerating language generation through diffusion. arXiv:2408.05636. Retrieved from https:\/\/arxiv.org\/abs\/2408.05636 (2024)."},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1223"},{"key":"e_1_3_1_72_2","unstructured":"R\u00f3bert Csord\u00e1s Piotr Pi\u0119kos and Kazuki Irie. 2023. SwitchHead: Accelerating transformers with mixture-of-experts attention. arXiv:2312.07987. 
Retrieved from https:\/\/arxiv.org\/abs\/2312.07987. (2023)."},{"key":"e_1_3_1_73_2","doi-asserted-by":"crossref","unstructured":"Fahim Dalvi Maram Hasanain Sabri Boughorbel Basel Mousi Samir Abdaljalil Nizi Nazar Ahmed Abdelali Shammur Absar Chowdhury Hamdy Mubarak Ahmed Ali Majd Hawasly Nadir Durrani and Firoj Alam. 2023. LLMeBench: A flexible framework for accelerating LLMs benchmarking. arXiv:2308.04945. Retrieved from https:\/\/arxiv.org\/abs\/2308.04945. (2023).","DOI":"10.18653\/v1\/2024.eacl-demo.23"},{"key":"e_1_3_1_74_2","unstructured":"Databricks. 2023. LLM Inference Performance Engineering: Best Practices. (2023). Retrieved November 25 2023 from https:\/\/www.databricks.com\/blog\/llm-inference-performance-engineering-best-practices"},{"key":"e_1_3_1_75_2","first-page":"933","volume-title":"Proceedings of the ICML","author":"Dauphin Yann N.","year":"2017","unstructured":"Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the ICML. 933\u2013941."},{"key":"e_1_3_1_76_2","unstructured":"DeciAI. 2023. DeciLM 6B. (2023). Retrieved from https:\/\/huggingface.co\/Deci\/DeciLM-6b"},{"key":"e_1_3_1_77_2","unstructured":"Luciano Del Corro Allie Del Giorno Sahaj Agarwal Bin Yu Ahmed Awadallah and Subhabrata Mukherjee. 2023. SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference. arXiv:2307.02628. Retrieved from https:\/\/arxiv.org\/abs\/2307.02628. (2023)."},{"key":"e_1_3_1_78_2","unstructured":"Tim Dettmers Mike Lewis Younes Belkada and Luke Zettlemoyer. 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv:2208.07339. Retrieved from https:\/\/arxiv.org\/abs\/2208.07339. (2022)."},{"key":"e_1_3_1_79_2","unstructured":"Tim Dettmers Artidoro Pagnoni Ari Holtzman and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. arXiv:2305.14314. Retrieved from https:\/\/arxiv.org\/abs\/2305.14314. 
(2023)."},{"key":"e_1_3_1_80_2","unstructured":"Tim Dettmers Ruslan Svirschevski Vage Egiazarian Denis Kuznedelev Elias Frantar Saleh Ashkboos Alexander Borzunov Torsten Hoefler and Dan Alistarh. 2023. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. arXiv:2306.03078. Retrieved from https:\/\/arxiv.org\/abs\/2306.03078. (2023)."},{"key":"e_1_3_1_81_2","volume-title":"Proceedings of the ICLR 2023","author":"Dettmers Tim","year":"2023","unstructured":"Tim Dettmers and Luke Zettlemoyer. 2023. The case for 4-bit precision: K-bit inference scaling laws. In Proceedings of the ICLR 2023."},{"key":"e_1_3_1_82_2","unstructured":"Nolan Dey Gurpreet Gosal Hemant Khachane William Marshall Ribhu Pathria Marvin Tom Joel Hestness and Zhiming (Charles) Chen. 2023. Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster. arXiv:2304.03208. Retrieved from https:\/\/arxiv.org\/abs\/2304.03208. (2023)."},{"key":"e_1_3_1_83_2","doi-asserted-by":"crossref","unstructured":"Jiayu Ding Shuming Ma Li Dong Xingxing Zhang Shaohan Huang Wenhui Wang and Furu Wei. 2023. Longnet: Scaling transformers to 1 000 000 000 tokens. arXiv:2307.02486. Retrieved from https:\/\/arxiv.org\/abs\/2307.02486. (2023).","DOI":"10.14218\/JCTH.2022.00006S"},{"key":"e_1_3_1_84_2","unstructured":"Juechu Dong Boyuan Feng Driss Guessous Yanbo Liang and Horace He. 2024. Flex attention: A programming model for generating optimized attention kernels. arXiv:2412.05496. Retrieved from https:\/\/arxiv.org\/abs\/2412.05496. (2024)."},{"key":"e_1_3_1_85_2","doi-asserted-by":"publisher","DOI":"10.1145\/3580305.3599572"},{"key":"e_1_3_1_86_2","unstructured":"Yixin Dong Charlie F. Ruan Yaxing Cai Ruihang Lai Ziyi Xu Yilong Zhao and Tianqi Chen. 2024. Xgrammar: Flexible and efficient structured generation engine for large language models. arXiv:2411.15100. Retrieved from https:\/\/arxiv.org\/abs\/2411.15100. 
(2024)."},{"key":"e_1_3_1_87_2","first-page":"1","article-title":"Improving computation and memory efficiency for real-world transformer inference on GPUs","author":"Du Jiangsu","year":"2023","unstructured":"Jiangsu Du, Jiazhi Jiang, Jiang Zheng, Hongbin Zhang, Dan Huang, and Yutong Lu. 2023. Improving computation and memory efficiency for real-world transformer inference on GPUs. ACM TACO (2023), 1\u201322.","journal-title":"ACM TACO"},{"key":"e_1_3_1_88_2","first-page":"5547","volume-title":"Proceedings of the ICML 2022","author":"Du Nan","year":"2022","unstructured":"Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, and Claire Cui. 2022. Glam: Efficient scaling of language models with mixture-of-experts. In Proceedings of the ICML 2022. 5547\u20135569."},{"key":"e_1_3_1_89_2","unstructured":"Murali Emani Sam Foreman Varuni Sastry Zhen Xie Siddhisanket Raskar William Arnold Rajeev Thakur Venkatram Vishwanath and Michael E. Papka. 2023. A comprehensive performance study of large language models on novel AI accelerators. arXiv:2310.04607. Retrieved from https:\/\/arxiv.org\/abs\/2310.04607. (2023)."},{"key":"e_1_3_1_90_2","unstructured":"Ahmad Faiz Sotaro Kaneda Ruhan Wang Rita Osi Parteek Sharma Fan Chen and Lei Jiang. 2023. LLMCarbon: Modeling the end-to-end carbon footprint of large language models. arXiv:2309.14393. Retrieved from https:\/\/arxiv.org\/abs\/2309.14393. (2023)."},{"key":"e_1_3_1_91_2","volume-title":"Proceedings of the ICLR 2019","author":"Fan Angela","year":"2019","unstructured":"Angela Fan, Edouard Grave, and Armand Joulin. 2019. Reducing transformer depth on demand with structured dropout. 
In Proceedings of the ICLR 2019."},{"key":"e_1_3_1_92_2","doi-asserted-by":"publisher","DOI":"10.1145\/3437801.3441578"},{"key":"e_1_3_1_93_2","first-page":"5232","article-title":"Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity","author":"Fedus William","year":"2022","unstructured":"William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research (2022), 5232\u20135270.","journal-title":"The Journal of Machine Learning Research"},{"key":"e_1_3_1_94_2","first-page":"721","article-title":"The CoRa tensor compiler: Compilation for ragged tensors with minimal padding","author":"Fegade Pratik","year":"2022","unstructured":"Pratik Fegade, Tianqi Chen, Phillip Gibbons, and Todd Mowry. 2022. The CoRa tensor compiler: Compilation for ragged tensors with minimal padding. Proceedings of Machine Learning and Systems (2022), 721\u2013747.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_95_2","unstructured":"Weizhi Fei Xueyan Niu Pingyi Zhou Lu Hou Bo Bai Lei Deng and Wei Han. 2023. Extending context window of large language models via semantic compression. arXiv:2312.09571. Retrieved from https:\/\/arxiv.org\/abs\/2312.09571. (2023)."},{"key":"e_1_3_1_96_2","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3576933"},{"key":"e_1_3_1_97_2","first-page":"10323","volume-title":"Proceedings of the ICML 2023","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. In Proceedings of the ICML 2023. 10323\u201310337."},{"key":"e_1_3_1_98_2","unstructured":"Elias Frantar Saleh Ashkboos Torsten Hoefler and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323. 
Retrieved from https:\/\/arxiv.org\/abs\/2210.17323 (2022)."},{"key":"e_1_3_1_99_2","volume-title":"Proceedings of the ICLR 2022","author":"Frantar Elias","year":"2022","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. OPTQ: Accurate quantization for generative pre-trained transformers. In Proceedings of the ICLR 2022."},{"key":"e_1_3_1_100_2","article-title":"Compiling machine learning programs via high-level tracing","author":"Frostig Roy","year":"2018","unstructured":"Roy Frostig, Matthew James Johnson, and Chris Leary. 2018. Compiling machine learning programs via high-level tracing. Systems for Machine Learning (2018).","journal-title":"Systems for Machine Learning"},{"key":"e_1_3_1_101_2","volume-title":"Proceedings of the ICLR 2022","author":"Fu Daniel Y.","year":"2022","unstructured":"Daniel Y. Fu, Tri Dao, Khaled Kamal Saab, Armin W. Thomas, Atri Rudra, and Christopher Re. 2022. Hungry hungry hippos: Towards language modeling with state space models. In Proceedings of the ICLR 2022."},{"key":"e_1_3_1_102_2","first-page":"14060","volume-title":"Proceedings of the ICML 2024","author":"Fu Yichao","year":"2024","unstructured":"Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. 2024. Break the sequential dependency of LLM inference using LOOKAHEAD DECODING. In Proceedings of the ICML 2024. 14060\u201314079."},{"key":"e_1_3_1_103_2","unstructured":"Yichao Fu Junda Chen Siqi Zhu Zheyu Fu Zhongdongming Dai Aurick Qiao and Hao Zhang. 2024. Efficiently serving LLM reasoning programs with certaindex. arXiv:2412.20993. Retrieved from https:\/\/arxiv.org\/abs\/2412.20993. (2024)."},{"key":"e_1_3_1_104_2","first-page":"135","volume-title":"Proceedings of the OSDI 2024","author":"Fu Yao","year":"2024","unstructured":"Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-latency serverless inference for large language models. In Proceedings of the OSDI 2024. 
135\u2013153."},{"key":"e_1_3_1_105_2","article-title":"MegaBlocks: Efficient sparse training with mixture-of-experts","author":"Gale Trevor","year":"2023","unstructured":"Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. 2023. MegaBlocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems (2023).","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_106_2","first-page":"111","volume-title":"Proceedings of the USENIX ATC 2024","author":"Gao Bin","year":"2024","unstructured":"Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-efficient large language model serving for multi-turn conversations with cachedattention. In Proceedings of the USENIX ATC 2024. 111\u2013126."},{"key":"e_1_3_1_107_2","doi-asserted-by":"publisher","DOI":"10.1145\/3689031.3696072"},{"key":"e_1_3_1_108_2","volume-title":"Proceedings of the WANT@ NeurIPS 2023","author":"Ge Suyu","year":"2023","unstructured":"Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. 2023. Model tells you what to discard: Adaptive KV cache compression for LLMs. In Proceedings of the WANT@ NeurIPS 2023."},{"key":"e_1_3_1_109_2","unstructured":"Tao Ge Jing Hu Xun Wang Si-Qing Chen and Furu Wei. 2023. In-context autoencoder for context compression in a large language model. arXiv:2307.06945. Retrieved from https:\/\/arxiv.org\/abs\/2307.06945. (2023)."},{"key":"e_1_3_1_110_2","unstructured":"Tao Ge Heming Xia Xin Sun Si-Qing Chen and Furu Wei. 2022. Lossless acceleration for Seq2seq generation with aggressive decoding. arXiv:2205.10350. Retrieved from https:\/\/arxiv.org\/abs\/2205.10350. (2022)."},{"key":"e_1_3_1_111_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1633"},{"key":"e_1_3_1_112_2","unstructured":"Marjan Ghazvininejad Omer Levy and Luke Zettlemoyer. 2020. Semi-autoregressive training improves mask-predict decoding. 
arXiv:2001.08785. Retrieved from https:\/\/arxiv.org\/abs\/2001.08785. (2020)."},{"key":"e_1_3_1_113_2","doi-asserted-by":"publisher","DOI":"10.1201\/9781003162810-13"},{"key":"e_1_3_1_114_2","unstructured":"In Gim Guojun Chen Seung-seob Lee Nikhil Sarda Anurag Khandelwal and Lin Zhong. 2023. Prompt cache: Modular attention reuse for low-latency inference. arXiv:2311.04934. Retrieved from https:\/\/arxiv.org\/abs\/2311.04934. (2023)."},{"key":"e_1_3_1_115_2","volume-title":"Proceedings of the ICML 2020","author":"Goyal Saurabh","year":"2020","unstructured":"Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma. 2020. PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In Proceedings of the ICML 2020."},{"key":"e_1_3_1_116_2","volume-title":"Proceedings of the COLM 2024","author":"Gu Albert","year":"2023","unstructured":"Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the COLM 2024."},{"key":"e_1_3_1_117_2","volume-title":"Proceedings of the ICLR 2021","author":"Gu Albert","year":"2021","unstructured":"Albert Gu, Karan Goel, and Christopher Re. 2021. Efficiently modeling long sequences with structured state spaces. In Proceedings of the ICLR 2021."},{"key":"e_1_3_1_118_2","volume-title":"Proceedings of the ICLR 2018","author":"Gu J","year":"2018","unstructured":"J Gu, J Bradbury, C Xiong, VOK Li, and R Socher. 2018. Non-autoregressive neural machine translation. In Proceedings of the ICLR 2018."},{"key":"e_1_3_1_119_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.11"},{"key":"e_1_3_1_120_2","volume-title":"Proceedings of the ICLR 2024","author":"Gu Yuxian","year":"2024","unstructured":"Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2024. MiniLLM: Knowledge distillation of large language models. 
In Proceedings of the ICLR 2024."},{"key":"e_1_3_1_121_2","first-page":"1041","volume-title":"Proceedings of the NSDI 2022","author":"Gunasekaran Jashwant Raj","year":"2022","unstructured":"Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Bikash Sharma, Mahmut Taylan Kandemir, and Chita R Das. 2022. Cocktail: A multidimensional optimization for model serving in cloud. In Proceedings of the NSDI 2022. 1041\u20131057."},{"key":"e_1_3_1_122_2","unstructured":"Daya Guo Dejian Yang Haowei Zhang Junxiao Song Ruoyu Zhang Runxin Xu Qihao Zhu Shirong Ma Peiyi Wang Xiao Bi 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv:2501.12948. Retrieved from https:\/\/arxiv.org\/abs\/2501.12948. (2025)."},{"key":"e_1_3_1_123_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33013723"},{"key":"e_1_3_1_124_2","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575698"},{"key":"e_1_3_1_125_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1133"},{"key":"e_1_3_1_126_2","unstructured":"Ankit Gupta and Jonathan Berant. 2020. Gmat: Global memory augmentation for transformers. arXiv:2006.03274. Retrieved from https:\/\/arxiv.org\/abs\/2006.03274. (2020)."},{"key":"e_1_3_1_127_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.sustainlp-1.5"},{"key":"e_1_3_1_128_2","doi-asserted-by":"publisher","DOI":"10.1145\/3487045"},{"key":"e_1_3_1_129_2","first-page":"539","volume-title":"Proceedings of the OSDI 2022","author":"Han Mingcong","year":"2022","unstructured":"Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In Proceedings of the OSDI 2022. 539\u2013558."},{"key":"e_1_3_1_130_2","unstructured":"Bobby He and Thomas Hofmann. 2023. Simplifying transformer blocks. arXiv:2311.01906. Retrieved from https:\/\/arxiv.org\/abs\/2311.01906. 
(2023)."},{"key":"e_1_3_1_131_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503221.3508418"},{"key":"e_1_3_1_132_2","unstructured":"Xuanli He Iman Keivanloo Yi Xu Xiang He Belinda Zeng Santosh Rajagopalan and Trishul Chilimbi. 2021. Magic pyramid: Accelerating inference with early exiting and token pruning. arXiv:2111.00230. Retrieved from https:\/\/arxiv.org\/abs\/2111.00230. (2021)."},{"key":"e_1_3_1_133_2","unstructured":"Zhenyu He Zexuan Zhong Tianle Cai Jason D. Lee and Di He. 2023. REST: Retrieval-based speculative decoding. arXiv:2311.08252. Retrieved from https:\/\/arxiv.org\/abs\/2311.08252. (2023)."},{"key":"e_1_3_1_134_2","unstructured":"Ke Hong Guohao Dai Jiaming Xu Qiuli Mao Xiuhong Li Jun Liu Kangdi Chen Hanyu Dong and Yu Wang. 2023. FlashDecoding++: Faster large language model inference on GPUs. arXiv:2311.01282. Retrieved from https:\/\/arxiv.org\/abs\/2311.01282. (2023)."},{"key":"e_1_3_1_135_2","unstructured":"Coleman Hooper Sehoon Kim Hiva Mohammadzadeh Hasan Genc Kurt Keutzer Amir Gholami and Sophia Shao. 2023. SPEED: Speculative pipelined execution for efficient decoding. arXiv:2310.12072. Retrieved from https:\/\/arxiv.org\/abs\/2310.12072. (2023)."},{"key":"e_1_3_1_136_2","unstructured":"Cunchen Hu Heyang Huang Liangliang Xu Xusheng Chen Jiang Xu Shuang Chen Hao Feng Chenxi Wang Sa Wang Yungang Bao Ninghui Sun and Yizhou Shan. 2024. Inference without interference: Disaggregate llm inference for mixed downstream workloads. arXiv:2401.11181. Retrieved from https:\/\/arxiv.org\/abs\/2401.11181. (2024)."},{"key":"e_1_3_1_137_2","unstructured":"Haiyang Huang Newsha Ardalani Anna Sun Liu Ke Hsien-Hsin S. Lee Anjali Sridhar Shruti Bhosale Carole-Jean Wu and Benjamin Lee. 2023. Towards MoE deployment: Mitigating inefficiencies in mixture-of-expert (MoE) inference. arXiv:2303.06182. Retrieved from https:\/\/arxiv.org\/abs\/2303.06182. 
(2023)."},{"key":"e_1_3_1_138_2","unstructured":"Kaiyu Huang Hao Wu Zhubo Shi Han Zou Minchen Yu and Qingjiang Shi. 2025. SpecServe: Efficient and SLO-aware large language model serving with adaptive speculative decoding. arXiv:2503.05096. Retrieved from https:\/\/arxiv.org\/abs\/2503.05096. (2025)."},{"key":"e_1_3_1_139_2","article-title":"Tutel: Adaptive mixture-of-experts at scale","author":"Hwang Changho","year":"2023","unstructured":"Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng, Fan Yang, Mao Yang, and Yongqiang Xiong. 2023. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems (2023).","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_140_2","unstructured":"R\u00e9gis Pierrard Ilyas Moutawwakil. 2023. LLM-Perf Leaderboard. Retrieved from https:\/\/huggingface.co\/spaces\/optimum\/llm-perf-leaderboard. (2023)."},{"key":"e_1_3_1_141_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3607102"},{"key":"e_1_3_1_142_2","volume-title":"Proceedings of the Workshop on Efficient Systems for Foundation Models@ ICML2023","author":"Isik Berivan","year":"2023","unstructured":"Berivan Isik, Hermann Kumbong, Wanyi Ning, Xiaozhe Yao, Sanmi Koyejo, and Ce Zhang. 2023. GPT-Zip: Deep compression of finetuned large language models. In Proceedings of the Workshop on Efficient Systems for Foundation Models@ ICML2023."},{"key":"e_1_3_1_143_2","unstructured":"Sam Ade Jacobs Masahiro Tanaka Chengming Zhang Minjia Zhang Shuaiwen Leon Song Samyam Rajbhandari and Yuxiong He. 2023. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv:2309.14509. Retrieved from https:\/\/arxiv.org\/abs\/2309.14509. (2023)."},{"key":"e_1_3_1_144_2","unstructured":"Ajay Jaiswal Zhe Gan Xianzhi Du Bowen Zhang Zhangyang Wang and Yinfei Yang. 2023. 
Compressing LLMs: The truth is rarely pure and never simple. arXiv:2310.01382. Retrieved from https:\/\/arxiv.org\/abs\/2310.01382. (2023)."},{"key":"e_1_3_1_145_2","volume-title":"Proceedings of the ICLR 2024","author":"Jang Doohyuk","year":"2024","unstructured":"Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung-Yub Kim, and Eunho Yang. 2024. LANTERN: Accelerating visual autoregressive models with relaxed speculative decoding. In Proceedings of the ICLR 2024."},{"key":"e_1_3_1_146_2","doi-asserted-by":"publisher","DOI":"10.1145\/3669940.3707220"},{"key":"e_1_3_1_147_2","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359630"},{"key":"e_1_3_1_148_2","first-page":"1","article-title":"Beyond data and model parallelism for deep neural networks.","author":"Jia Zhihao","year":"2019","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond data and model parallelism for deep neural networks. Proceedings of Machine Learning and Systems (2019), 1\u201313.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_149_2","unstructured":"Albert Q. Jiang Alexandre Sablayrolles Arthur Mensch Chris Bamford Devendra Singh Chaplot Diego de las Casas Florian Bressand Gianna Lengyel Guillaume Lample Lucile Saulnier L\u00e9lio Renard Lavaud Marie-Anne Lachaux Pierre Stock Teven Le Scao Thibaut Lavril Thomas Wang Timoth\u00e9e Lacroix and William El Sayed. 2023. Mistral 7B. arXiv:2310.06825. Retrieved from https:\/\/arxiv.org\/abs\/2310.06825. (2023)."},{"key":"e_1_3_1_150_2","doi-asserted-by":"crossref","unstructured":"Huiqiang Jiang Qianhui Wu Chin-Yew Lin Yuqing Yang and Lili Qiu. 2023. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv:2310.05736. Retrieved from https:\/\/arxiv.org\/abs\/2310.05736. 
(2023).","DOI":"10.18653\/v1\/2023.emnlp-main.825"},{"key":"e_1_3_1_151_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.91"},{"key":"e_1_3_1_152_2","unstructured":"Youhe Jiang Fangcheng Fu Xiaozhe Yao Guoliang He Xupeng Miao Ana Klimovic Bin Cui Binhang Yuan and Eiko Yoneki. 2025. Demystifying cost-efficiency in LLM serving over heterogeneous GPUs. arXiv:2502.00722. Retrieved from https:\/\/arxiv.org\/abs\/2502.00722. (2025)."},{"key":"e_1_3_1_153_2","unstructured":"Youhe Jiang Ran Yan Xiaozhe Yao Beidi Chen and Binhang Yuan. 2023. HexGen: Generative inference of foundation model over heterogeneous decentralized environment. arXiv:2311.11514. Retrieved from https:\/\/arxiv.org\/abs\/2311.11514. (2023)."},{"key":"e_1_3_1_154_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.372"},{"key":"e_1_3_1_155_2","unstructured":"Hongyi Jin Ruihang Lai Charlie F. Ruan Yingcheng Wang Todd C. Mowry Xupeng Miao Zhihao Jia and Tianqi Chen. 2024. A system for microserving of LLMs. arXiv:2412.12488. Retrieved from https:\/\/arxiv.org\/abs\/2412.12488. (2024)."},{"key":"e_1_3_1_156_2","unstructured":"Yunho Jin Chun-Feng Wu David Brooks and Gu-Yeon Wei. 2023. S \\(^3\\) : Increasing GPU utilization during generative inference for higher throughput. arXiv:2306.06000. Retrieved from https:\/\/arxiv.org\/abs\/2306.06000. (2023)."},{"key":"e_1_3_1_157_2","first-page":"214","article-title":"The promise and peril of generative AI","author":"Jo A","year":"2023","unstructured":"A Jo. 2023. The promise and peril of generative AI. Nature (2023), 214\u2013216.","journal-title":"Nature"},{"key":"e_1_3_1_158_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589350"},{"key":"e_1_3_1_159_2","volume-title":"Proceedings of the ICLR 2020","author":"Kasai Jungo","year":"2020","unstructured":"Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah Smith. 2020. Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation. 
In Proceedings of the ICLR 2020."},{"key":"e_1_3_1_160_2","doi-asserted-by":"publisher","DOI":"10.1145\/3497776.3517770"},{"key":"e_1_3_1_161_2","first-page":"5156","volume-title":"Proceedings of the ICML 2020","author":"Katharopoulos Angelos","year":"2020","unstructured":"Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Fran\u00e7ois Fleuret. 2020. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the ICML 2020. 5156\u20135165."},{"key":"e_1_3_1_162_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1027"},{"key":"e_1_3_1_163_2","unstructured":"Jeonghoon Kim Jung Hyun Lee Sungdong Kim Joonsuk Park Kang Min Yoo Se Jung Kwon and Dongsoo Lee. 2023. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. arXiv:2305.14152. Retrieved from https:\/\/arxiv.org\/abs\/2305.14152. (2023)."},{"key":"e_1_3_1_164_2","unstructured":"Sehoon Kim Coleman Hooper Amir Gholami Zhen Dong Xiuyu Li Sheng Shen Michael W. Mahoney and Kurt Keutzer. 2023. SqueezeLLM: Dense-and-sparse quantization. arXiv:2306.07629. Retrieved from https:\/\/arxiv.org\/abs\/2306.07629. (2023)."},{"key":"e_1_3_1_165_2","unstructured":"Sehoon Kim Coleman Hooper Thanakul Wattanawong Minwoo Kang Ruohan Yan Hasan Genc Grace Dinh Qijing Huang Kurt Keutzer Michael W. Mahoney Yakun Sophia Shao and Amir Gholami. 2023. Full stack optimization of transformer inference: a survey. arXiv:2302.14017. Retrieved from https:\/\/arxiv.org\/abs\/2302.14017. (2023)."},{"key":"e_1_3_1_166_2","unstructured":"Sehoon Kim Karttikeya Mangalam Jitendra Malik Michael W. Mahoney Amir Gholami and Kurt Keutzer. 2023. Big little transformer decoder. arXiv:2302.07863. Retrieved from https:\/\/arxiv.org\/abs\/2302.07863. (2023)."},{"key":"e_1_3_1_167_2","volume-title":"Proceedings of the ICLR 2019","author":"Kitaev Nikita","year":"2019","unstructured":"Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2019. 
Reformer: The efficient transformer. In Proceedings of the ICLR 2019."},{"key":"e_1_3_1_168_2","first-page":"4677","volume-title":"Proceedings of the COLING","author":"Kong Jun","year":"2022","unstructured":"Jun Kong, Jin Wang, Liang-Chih Yu, and Xuejie Zhang. 2022. Accelerating inference for pretrained language models by unified multi-perspective early exiting. In Proceedings of the COLING. 4677\u20134686."},{"key":"e_1_3_1_169_2","first-page":"341","article-title":"Reducing activation recomputation in large transformer models","author":"Korthikanti Vijay Anand","year":"2023","unstructured":"Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2023. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems (2023), 341\u2013353.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_170_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-emnlp.304"},{"key":"e_1_3_1_171_2","unstructured":"Eldar Kurtic Elias Frantar and Dan Alistarh. 2023. Ziplm: Hardware-aware structured pruning of language models. arXiv:2302.04089. Retrieved from https:\/\/arxiv.org\/abs\/2302.04089. (2023)."},{"key":"e_1_3_1_172_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_1_173_2","doi-asserted-by":"publisher","DOI":"10.1145\/3676641.3716249"},{"key":"e_1_3_1_174_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1149"},{"key":"e_1_3_1_175_2","first-page":"155","volume-title":"Proceedings of the OSDI 2024","author":"Lee Wonbeom","year":"2024","unstructured":"Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In Proceedings of the OSDI 2024. 
155\u2013172."},{"key":"e_1_3_1_176_2","unstructured":"Benjamin Lefaudeux Francisco Massa Diana Liskovich Wenhan Xiong Vittorio Caggiano Sean Naren Min Xu Jieru Hu Marta Tintore Susan Zhang et\u00a0al. 2022. xFormers: A modular and hackable Transformer modelling library. Retrieved November 25 2023 from https:\/\/github.com\/facebookresearch\/xformers. (2022). Commit: fbf349a."},{"key":"e_1_3_1_177_2","volume-title":"Proceedings of the ICLR 2020","author":"Lepikhin Dmitry","year":"2020","unstructured":"Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. In Proceedings of the ICLR 2020."},{"key":"e_1_3_1_178_2","first-page":"19274","volume-title":"Proceedings of the ICML 2023","author":"Leviathan Yaniv","year":"2023","unstructured":"Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In Proceedings of the ICML 2023. 19274\u201319286."},{"key":"e_1_3_1_179_2","first-page":"945","volume-title":"Proceedings of the USENIX ATC 2023","author":"Li Jiamin","year":"2023","unstructured":"Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating distributed MoE training and inference with Lina. In Proceedings of the USENIX ATC 2023. 945\u2013959."},{"key":"e_1_3_1_180_2","doi-asserted-by":"crossref","unstructured":"Lei Li Yankai Lin Deli Chen Shuhuai Ren Peng Li Jie Zhou and Xu Sun. 2020. Cascadebert: Accelerating inference of pre-trained language models via calibrated complete models cascade. arXiv:2012.14682. Retrieved from https:\/\/arxiv.org\/abs\/2012.14682. (2020).","DOI":"10.18653\/v1\/2021.findings-emnlp.43"},{"key":"e_1_3_1_181_2","unstructured":"Qingyuan Li Ran Meng Yiduo Li Bo Zhang Liang Li Yifan Lu Xiangxiang Chu Yerui Sun and Yuchen Xie. 2023. A speed odyssey for deployable quantization of LLMs. arXiv:2311.09550. 
Retrieved from https:\/\/arxiv.org\/abs\/2311.09550. (2023)."},{"key":"e_1_3_1_182_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.391"},{"key":"e_1_3_1_183_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i15.17572"},{"key":"e_1_3_1_184_2","first-page":"28935","volume-title":"Proceedings of the ICML 2024","author":"Li Yuhui","year":"2024","unstructured":"Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE: speculative sampling requires rethinking feature uncertainty. In Proceedings of the ICML 2024. 28935\u201328948."},{"key":"e_1_3_1_185_2","volume-title":"Proceedings of the ICML 2023","author":"Li Yixiao","year":"2023","unstructured":"Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu Chen, and Tuo Zhao. 2023. LoSparse: Structured compression of large language models based on low-rank and sparse approximation. In Proceedings of the ICML 2023."},{"key":"e_1_3_1_186_2","unstructured":"Zikun Li Zhuofu Chen Remi Delacourt Gabriele Oliaro Zeyu Wang Qinghan Chen Shuhuai Lin April Yang Zhihao Zhang Zhuoming Chen Sean Lai Xinhao Cheng Xupeng Miao and Zhihao Jia. 2025. AdaServe: SLO-customized LLM serving with fine-grained speculative decoding. arXiv:2501.12162. Retrieved from https:\/\/arxiv.org\/abs\/2501.12162. (2025)."},{"key":"e_1_3_1_187_2","first-page":"663","volume-title":"Proceedings of the OSDI 2023","author":"Li Zhuohan","year":"2023","unstructured":"Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In Proceedings of the OSDI 2023. 
663\u2013679."},{"key":"e_1_3_1_188_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.162"},{"key":"e_1_3_1_189_2","first-page":"929","volume-title":"Proceedings of the OSDI 2024","author":"Lin Chaofan","year":"2024","unstructured":"Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient serving of LLM-based applications with semantic variable. In Proceedings of the OSDI 2024. 929\u2013945."},{"key":"e_1_3_1_190_2","first-page":"87","article-title":"Awq: Activation-aware weight quantization for on-device llm compression and acceleration","volume":"6","author":"Lin Ji","year":"2024","unstructured":"Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems 6 (2024), 87\u2013100.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_191_2","unstructured":"Aixin Liu Bei Feng Bin Wang Bingxuan Wang Bo Liu Chenggang Zhao Chengqi Dengr Chong Ruan Damai Dai Daya Guo 2024. Deepseek-v2: A strong economical and efficient mixture-of-experts language model. arXiv:2405.04434. Retrieved from https:\/\/arxiv.org\/abs\/2405.04434. (2024)."},{"key":"e_1_3_1_192_2","first-page":"11946","article-title":"Kangaroo: Lossless self-speculative decoding for accelerating LLMs via double early exiting","author":"Liu Fangcheng","year":"2024","unstructured":"Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Duyu Tang, Kai Han, and Yunhe Wang. 2024. Kangaroo: Lossless self-speculative decoding for accelerating LLMs via double early exiting. Proceedings of NeurIPS (2024), 11946\u201311965.","journal-title":"Proceedings of NeurIPS"},{"key":"e_1_3_1_193_2","unstructured":"Hao Liu Matei Zaharia and Pieter Abbeel. 2023. 
Ring attention with blockwise transformers for near-infinite context. arXiv:2310.01889. Retrieved from https:\/\/arxiv.org\/abs\/2310.01889. (2023)."},{"key":"e_1_3_1_194_2","unstructured":"Jingyu Liu Beidi Chen and Ce Zhang. 2025. Speculative prefill: Turbocharging TTFT with lightweight and training-free token importance estimation. arXiv:2502.02789. Retrieved from https:\/\/arxiv.org\/abs\/2502.02789. (2025)."},{"key":"e_1_3_1_195_2","unstructured":"Jiachen Liu Jae-Won Chung Zhiyu Wu Fan Lai Myungjin Lee and Mosharaf Chowdhury. 2024. Andes: Defining and enhancing quality-of-experience in llm-based text streaming services. arXiv:2404.16283. Retrieved from https:\/\/arxiv.org\/abs\/2404.16283. (2024)."},{"key":"e_1_3_1_196_2","unstructured":"Nelson F. Liu Kevin Lin John Hewitt Ashwin Paranjape Michele Bevilacqua Fabio Petroni and Percy Liang. 2023. Lost in the middle: How language models use long contexts. arXiv:2307.03172. Retrieved from https:\/\/arxiv.org\/abs\/2307.03172. (2023)."},{"key":"e_1_3_1_197_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.537"},{"key":"e_1_3_1_198_2","unstructured":"Xiaoxuan Liu Lanxiang Hu Peter Bailis Ion Stoica Zhijie Deng Alvin Cheung and Hao Zhang. 2023. Online speculative decoding. arXiv:2310.07177. Retrieved from https:\/\/arxiv.org\/abs\/2310.07177. (2023)."},{"key":"e_1_3_1_199_2","volume-title":"Proceedings of the SIGCOMM 2024","author":"Liu Yuhan","year":"2024","unstructured":"Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Siddhant Ray, Qizheng Zhang, Ganesh Ananthanarayanan, and Junchen Jiang. 2024. CacheGen: Fast context loading for language model applications. In Proceedings of the SIGCOMM 2024."},{"key":"e_1_3_1_200_2","unstructured":"Zichang Liu Aditya Desai Fangshuo Liao Weitao Wang Victor Xie Zhaozhuo Xu Anastasios Kyrillidis and Anshumali Shrivastava. 2023. 
Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. arXiv:2305.17118. Retrieved from https:\/\/arxiv.org\/abs\/2305.17118. (2023)."},{"key":"e_1_3_1_201_2","unstructured":"Zechun Liu Barlas Oguz Changsheng Zhao Ernie Chang Pierre Stock Yashar Mehdad Yangyang Shi Raghuraman Krishnamoorthi and Vikas Chandra. 2023. LLM-QAT: Data-free quantization aware training for large language models. arXiv:2305.17888. Retrieved from https:\/\/arxiv.org\/abs\/2305.17888. (2023)."},{"key":"e_1_3_1_202_2","volume-title":"Proceedings of the ICML 2023","author":"Liu Zichang","year":"2023","unstructured":"Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, and Beidi Chen. 2023. Deja vu: Contextual sparsity for efficient llms at inference time. In Proceedings of the ICML 2023."},{"key":"e_1_3_1_203_2","article-title":"BumbleBee: Secure two-party inference framework for large transformers","author":"Lu Wen-jie","year":"2023","unstructured":"Wen-jie Lu, Zhicong Huang, Zhen Gu, Jingyu Li, Jian Liu, Kui Ren, Cheng Hong, Tao Wei, and WenGuang Chen. 2023. BumbleBee: Secure two-party inference framework for large transformers. Cryptology ePrint Archive (2023).","journal-title":"Cryptology ePrint Archive"},{"key":"e_1_3_1_204_2","unstructured":"Xinyin Ma Gongfan Fang and Xinchao Wang. 2023. LLM-Pruner: On the structural pruning of large language models. arXiv:2305.11627. Retrieved from https:\/\/arxiv.org\/abs\/2305.11627. (2023)."},{"key":"e_1_3_1_205_2","unstructured":"Ziming Mao Tian Xia Zhanghao Wu Wei-Lin Chiang Tyler Griggs Romil Bhardwaj Zongheng Yang Scott Shenker and Ion Stoica. 2024. Skyserve: Serving ai models across regions and clouds with spot instances. arXiv:2411.01438. Retrieved from https:\/\/arxiv.org\/abs\/2411.01438. 
(2024)."},{"key":"e_1_3_1_206_2","first-page":"1","article-title":"Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools","author":"Mayer Ruben","year":"2020","unstructured":"Ruben Mayer and Hans-Arno Jacobsen. 2020. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. ACM Computing Surveys (CSUR) (2020), 1\u201337.","journal-title":"ACM Computing Surveys (CSUR)"},{"key":"e_1_3_1_207_2","volume-title":"Proceedings of the ICLR 2022","author":"Mehta Harsh","year":"2022","unstructured":"Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. 2022. Long range language modeling via gated state spaces. In Proceedings of the ICLR 2022."},{"key":"e_1_3_1_208_2","doi-asserted-by":"publisher","DOI":"10.1145\/3669940.3707215"},{"key":"e_1_3_1_209_2","unstructured":"Zhiyu Mei Wei Fu Kaiwei Li Guangju Wang Huanchen Zhang and Yi Wu. 2024. Realhf: Optimized rlhf training for large language models through parameter reallocation. arXiv:2406.14088. Retrieved from https:\/\/arxiv.org\/abs\/2406.14088. (2024)."},{"key":"e_1_3_1_210_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626246.3654683"},{"key":"e_1_3_1_211_2","unstructured":"Xupeng Miao Gabriele Oliaro Xinhao Cheng Mengdi Wu Colin Unger and Zhihao Jia. 2024. FlexLLM: A system for co-serving large language model inference and parameter-efficient finetuning. arXiv:2402.18789. Retrieved from https:\/\/arxiv.org\/abs\/2402.18789. (2024)."},{"key":"e_1_3_1_212_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651335"},{"key":"e_1_3_1_213_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640411"},{"key":"e_1_3_1_214_2","first-page":"470","article-title":"Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism","author":"Miao Xupeng","year":"2023","unstructured":"Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2023. 
Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism. Proc. VLDB Endow. (2023), 470\u2013479.","journal-title":"Proc. VLDB Endow."},{"key":"e_1_3_1_215_2","article-title":"Are sixteen heads really better than one?","author":"Michel Paul","year":"2019","unstructured":"Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? Proc. of NeurIPS (2019).","journal-title":"Proc. of NeurIPS"},{"key":"e_1_3_1_216_2","unstructured":"Maxim Milakov and Natalia Gimelshein. 2018. Online normalizer calculation for softmax. arXiv:1805.02867. Retrieved from https:\/\/arxiv.org\/abs\/1805.02867. (2018)."},{"key":"e_1_3_1_217_2","unstructured":"Asit Mishra Jorge Albericio Latorre Jeff Pool Darko Stosic Dusan Stosic Ganesh Venkatesh Chong Yu and Paulius Micikevicius. 2021. Accelerating sparse deep neural networks. arXiv:2104.08378. Retrieved from https:\/\/arxiv.org\/abs\/2104.08378. (2021)."},{"key":"e_1_3_1_218_2","unstructured":"Ali Modarressi Hosein Mohebbi and Mohammad Taher Pilehvar. 2022. Adapler: Speeding up inference by adaptive length reduction. arXiv:2203.08991. Retrieved from https:\/\/arxiv.org\/abs\/2203.08991. (2022)."},{"key":"e_1_3_1_219_2","unstructured":"Amirkeivan Mohtashami and Martin Jaggi. 2023. Landmark attention: Random-access infinite context length for transformers. arXiv:2305.16300. Retrieved from https:\/\/arxiv.org\/abs\/2305.16300. (2023)."},{"key":"e_1_3_1_220_2","unstructured":"Giovanni Monea Armand Joulin and Edouard Grave. 2023. PaSS: Parallel speculative sampling. arXiv:2311.13581. Retrieved from https:\/\/arxiv.org\/abs\/2311.13581. (2023)."},{"key":"e_1_3_1_221_2","volume-title":"Proceedings of the NeurIPS 2023","author":"Mu Jesse","year":"2023","unstructured":"Jesse Mu, Xiang Lisa Li, and Noah Goodman. 2023. Learning to compress prompts with gist tokens. 
In Proceedings of the NeurIPS 2023."},{"key":"e_1_3_1_222_2","unstructured":"Dor Muhlgay Ori Ram Inbal Magar Yoav Levine Nir Ratner Yonatan Belinkov Omri Abend Kevin Leyton-Brown Amnon Shashua and Yoav Shoham. 2023. Generating benchmarks for factuality evaluation of language models. arXiv:2307.06908. Retrieved from https:\/\/arxiv.org\/abs\/2307.06908. (2023)."},{"key":"e_1_3_1_223_2","unstructured":"Kabir Nagrecha and Arun Kumar. 2023. Saturn: An optimized data system for large model deep learning workloads. arXiv:2309.01226. Retrieved from https:\/\/arxiv.org\/abs\/2309.01226. (2023)."},{"key":"e_1_3_1_224_2","first-page":"7937","volume-title":"Proceedings of the ICML 2021","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-efficient pipeline-parallel dnn training. In Proceedings of the ICML 2021. 7937\u20137947."},{"key":"e_1_3_1_225_2","volume-title":"Proceedings of the NeurIPS 2023","author":"Narayanan Deepak","year":"2023","unstructured":"Deepak Narayanan, Keshav Santhanam, Peter Henderson, Rishi Bommasani, Tony Lee, and Percy Liang. 2023. Cheaply estimating inference efficiency metrics for autoregressive transformer models. In Proceedings of the NeurIPS 2023."},{"key":"e_1_3_1_226_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613163"},{"key":"e_1_3_1_227_2","unstructured":"Xiaonan Nie Xupeng Miao Shijie Cao Lingxiao Ma Qibin Liu Jilong Xue Youshan Miao Yi Liu Zhi Yang and Bin Cui. 2021. Evomoe: An evolutional mixture-of-experts training framework via dense-to-sparse gate. arXiv:2112.14397. Retrieved from https:\/\/arxiv.org\/abs\/2112.14397. (2021)."},{"key":"e_1_3_1_228_2","doi-asserted-by":"publisher","DOI":"10.1145\/3588964"},{"key":"e_1_3_1_229_2","doi-asserted-by":"publisher","DOI":"10.1145\/3620665.3640383"},{"key":"e_1_3_1_230_2","unstructured":"Gabriele Oliaro Zhihao Jia Daniel Campos and Aurick Qiao. 2024. 
SuffixDecoding: A model-free approach to speeding up large language model inference. arXiv:2411.04975. Retrieved from https:\/\/arxiv.org\/abs\/2411.04975. (2024)."},{"key":"e_1_3_1_231_2","first-page":"2671","volume-title":"Proceedings of the ICML","author":"Oliva Junier B.","year":"2017","unstructured":"Junier B. Oliva, Barnab\u00e1s P\u00f3czos, and Jeff Schneider. 2017. The statistical recurrent unit. In Proceedings of the ICML. 2671\u20132680."},{"key":"e_1_3_1_232_2","unstructured":"Antonio Orvieto Samuel L. Smith Albert Gu Anushan Fernando Caglar Gulcehre Razvan Pascanu and Soham De. 2023. Resurrecting recurrent neural networks for long sequences. arXiv:2303.06349. Retrieved from https:\/\/arxiv.org\/abs\/2303.06349. (2023)."},{"key":"e_1_3_1_233_2","unstructured":"Charles Packer Vivian Fang Shishir G. Patil Kevin Lin Sarah Wooders and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as operating systems. arXiv:2310.08560. Retrieved from https:\/\/arxiv.org\/abs\/2310.08560. (2023)."},{"key":"e_1_3_1_234_2","unstructured":"Matteo Pagliardini Daniele Paliotta Martin Jaggi and Fran\u00e7ois Fleuret. 2023. Faster causal attention over large sequences through sparse flash attention. arXiv:2306.01160. Retrieved from https:\/\/arxiv.org\/abs\/2306.01160. (2023)."},{"key":"e_1_3_1_235_2","unstructured":"Rui Pan Zhuang Wang Zhen Jia Can Karakus Luca Zancato Tri Dao Yida Wang and Ravi Netravali. 2024. Marconi: Prefix caching for the era of hybrid llms. arXiv:2411.19379. Retrieved from https:\/\/arxiv.org\/abs\/2411.19379. (2024)."},{"key":"e_1_3_1_236_2","volume-title":"Proceedings of the ICLR 2022","author":"Park Gunho","year":"2022","unstructured":"Gunho Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, Dongsoo Lee, and Baeseong Park. 2022. LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models. 
In Proceedings of the ICLR 2022."},{"key":"e_1_3_1_237_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00019"},{"key":"e_1_3_1_238_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.936"},{"key":"e_1_3_1_239_2","unstructured":"Baolin Peng Chunyuan Li Pengcheng He Michel Galley and Jianfeng Gao. 2023. Instruction tuning with gpt-4. arXiv:2304.03277. Retrieved from https:\/\/arxiv.org\/abs\/2304.03277. (2023)."},{"key":"e_1_3_1_240_2","unstructured":"Huwan Peng Scott Davidson Richard Shi Shuaiwen Leon Song and Michael Taylor. 2023. Chiplet Cloud: Building AI supercomputers for serving large generative language models. arXiv:2307.02666. Retrieved from https:\/\/arxiv.org\/abs\/2307.02666. (2023)."},{"key":"e_1_3_1_241_2","article-title":"Efficiently scaling transformer inference","author":"Pope Reiner","year":"2023","unstructured":"Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems (2023).","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_242_2","doi-asserted-by":"publisher","DOI":"10.1145\/3669940.3707256"},{"key":"e_1_3_1_243_2","volume-title":"Proceedings of the ICLR 2021","author":"Press Ofir","year":"2021","unstructured":"Ofir Press, Noah Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation. In Proceedings of the ICLR 2021."},{"key":"e_1_3_1_244_2","unstructured":"Yifan Qiao Shu Anzai Shan Yu Haoran Ma Yang Wang Miryung Kim and Harry Xu. 2024. ConServe: Harvesting GPUs for low-latency and high-throughput large language model serving. arXiv:2410.01228. Retrieved from https:\/\/arxiv.org\/abs\/2410.01228. (2024)."},{"key":"e_1_3_1_245_2","unstructured":"Ruoyu Qin Zheming Li Weiran He Mingxing Zhang Yongwei Wu Weimin Zheng and Xinran Xu. 2024. 
Mooncake: A kvcache-centric disaggregated architecture for llm serving. arXiv:2407.00079. Retrieved from https:\/\/arxiv.org\/abs\/2407.00079. (2024)."},{"key":"e_1_3_1_246_2","unstructured":"Qualcomm. 2023. The future of AI is hybrid. Retrieved November 25 2023 from https:\/\/www.qualcomm.com\/content\/dam\/qcomm-martech\/dm-assets\/documents\/Whitepaper-The-future-of-AI-is-hybrid-Part-2-Qualcomm-is-uniquely-positioned-to-scale-hybrid-AI.pdf. (2023)."},{"key":"e_1_3_1_247_2","unstructured":"Markus N. Rabe and Charles Staats. 2021. Self-attention does not need O(\\(n^2\\)) memory. arXiv:2112.05682. Retrieved from https:\/\/arxiv.org\/abs\/2112.05682. (2021)."},{"key":"e_1_3_1_248_2","first-page":"18332","volume-title":"Proceedings of the ICML 2022","author":"Rajbhandari Samyam","year":"2022","unstructured":"Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In Proceedings of the ICML 2022. 18332\u201318346."},{"key":"e_1_3_1_249_2","first-page":"8821","volume-title":"Proceedings of the ICML 2021","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the ICML 2021. 8821\u20138831."},{"key":"e_1_3_1_250_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00045"},{"key":"e_1_3_1_251_2","first-page":"17555","article-title":"Hash layers for large sparse models","author":"Roller Stephen","year":"2021","unstructured":"Stephen Roller, Sainbayar Sukhbaatar, Jason Weston, and Arthur Szlam. 2021. Hash layers for large sparse models. 
Proceedings of NeurIPS (2021), 17555\u201317566.","journal-title":"Proceedings of NeurIPS"},{"key":"e_1_3_1_252_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00353"},{"key":"e_1_3_1_253_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2014-80"},{"key":"e_1_3_1_254_2","unstructured":"Adrian Sampson Tianqi Chen and Jared Roesch. 2022. Apache TVM Unity: A vision for the ML software and hardware ecosystem. (2022)."},{"key":"e_1_3_1_255_2","unstructured":"Victor Sanh Lysandre Debut Julien Chaumond and Thomas Wolf. 2019. DistilBERT a distilled version of BERT: smaller faster cheaper and lighter. arXiv:1910.01108. Retrieved from https:\/\/arxiv.org\/abs\/1910.01108. (2019)."},{"key":"e_1_3_1_256_2","first-page":"20378","article-title":"Movement pruning: Adaptive sparsity by fine-tuning","author":"Sanh Victor","year":"2020","unstructured":"Victor Sanh, Thomas Wolf, and Alexander Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning. Proc. of NeurIPS (2020), 20378\u201320389.","journal-title":"Proc. of NeurIPS"},{"key":"e_1_3_1_257_2","unstructured":"Michael Santacroce Zixin Wen Yelong Shen and Yuanzhi Li. 2023. What matters in the structured pruning of generative language models? arXiv:2302.03773. Retrieved from https:\/\/arxiv.org\/abs\/2302.03773. (2023)."},{"key":"e_1_3_1_258_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.689"},{"key":"e_1_3_1_259_2","unstructured":"Cicero Nogueira dos Santos James Lee-Thorp Isaac Noble Chung-Ching Chang and David Uthus. 2023. Memory augmented language models through mixture of word experts. arXiv:2311.10768. Retrieved from https:\/\/arxiv.org\/abs\/2311.10768. (2023)."},{"key":"e_1_3_1_260_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.406"},{"key":"e_1_3_1_261_2","unstructured":"Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv:1911.02150. Retrieved from https:\/\/arxiv.org\/abs\/1911.02150. 
(2019)."},{"key":"e_1_3_1_262_2","volume-title":"Proceedings of the ICLR 2017","author":"Shazeer Noam","year":"2017","unstructured":"Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proceedings of the ICLR 2017."},{"key":"e_1_3_1_263_2","unstructured":"Haihao Shen Hanwen Chang Bo Dong Yu Luo and Hengyu Meng. 2023. Efficient LLM inference on CPUs. arXiv:2311.00502. Retrieved from https:\/\/arxiv.org\/abs\/2311.00502. (2023)."},{"key":"e_1_3_1_264_2","unstructured":"Guangming Sheng Chi Zhang Zilingfeng Ye Xibin Wu Wang Zhang Ru Zhang Yanghua Peng Haibin Lin and Chuan Wu. 2024. Hybridflow: A flexible and efficient rlhf framework. arXiv:2409.19256. Retrieved from https:\/\/arxiv.org\/abs\/2409.19256. (2024)."},{"key":"e_1_3_1_265_2","unstructured":"Ying Sheng Shiyi Cao Dacheng Li Coleman Hooper Nicholas Lee Shuo Yang Christopher Chou Banghua Zhu Lianmin Zheng Kurt Keutzer Joseph E. Gonzalez and Ion Stoica. 2023. S-LoRA: Serving thousands of concurrent LoRA adapters. arXiv:2311.03285. Retrieved from https:\/\/arxiv.org\/abs\/2311.03285. (2023)."},{"key":"e_1_3_1_266_2","first-page":"965","volume-title":"Proceedings of the OSDI 2024","author":"Sheng Ying","year":"2024","unstructured":"Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, and Ion Stoica. 2024. Fairness in serving large language models. In Proceedings of the OSDI 2024. 965\u2013988."},{"key":"e_1_3_1_267_2","first-page":"31094","volume-title":"Proceedings of the ICML 2023","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-throughput generative inference of large language models with a single GPU. In Proceedings of the ICML 2023. 
31094\u201331116."},{"key":"e_1_3_1_268_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-2091"},{"key":"e_1_3_1_269_2","first-page":"701","volume-title":"Proceedings of the OSDI 2023","author":"Shi Yining","year":"2023","unstructured":"Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. 2023. Welder: Scheduling deep learning memory access via tile-graph. In Proceedings of the OSDI 2023. 701\u2013718."},{"key":"e_1_3_1_270_2","unstructured":"Mohammad Shoeybi Mostofa Patwary Raul Puri Patrick LeGresley Jared Casper and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv:1909.08053. Retrieved from https:\/\/arxiv.org\/abs\/1909.08053. (2019)."},{"key":"e_1_3_1_271_2","unstructured":"Yixin Song Zeyu Mi Haotong Xie and Haibo Chen. 2023. PowerInfer: Fast large language model serving with a consumer-grade GPU. arXiv:2312.12456. Retrieved from https:\/\/arxiv.org\/abs\/2312.12456. (2023)."},{"key":"e_1_3_1_272_2","unstructured":"Benjamin Spector and Chris Re. 2023. Accelerating llm inference with staged speculative decoding. arXiv:2308.04623. Retrieved from https:\/\/arxiv.org\/abs\/2308.04623. (2023)."},{"key":"e_1_3_1_273_2","article-title":"Blockwise parallel decoding for deep autoregressive models","author":"Stern Mitchell","year":"2018","unstructured":"Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. Proc. of NeurIPS (2018).","journal-title":"Proc. of NeurIPS"},{"key":"e_1_3_1_274_2","unstructured":"Jianlin Su Yu Lu Shengfeng Pan Ahmed Murtadha Bo Wen and Yunfeng Liu. 2021. Roformer: Enhanced transformer with rotary position embedding. arXiv:2104.09864. Retrieved from https:\/\/arxiv.org\/abs\/2104.09864. 
(2021)."},{"key":"e_1_3_1_275_2","first-page":"173","volume-title":"Proceedings of the OSDI 2024","author":"Sun Biao","year":"2024","unstructured":"Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic scheduling for large language model serving. In Proceedings of the OSDI 2024. 173\u2013191."},{"key":"e_1_3_1_276_2","volume-title":"Proceedings of the COLM 2024","author":"Sun Hanshi","year":"2024","unstructured":"Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen. 2024. TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. In Proceedings of the COLM 2024."},{"key":"e_1_3_1_277_2","unstructured":"Mingjie Sun Zhuang Liu Anna Bair and J. Zico Kolter. 2023. A simple and effective pruning approach for large language models. arXiv:2306.11695. Retrieved from https:\/\/arxiv.org\/abs\/2306.11695. (2023)."},{"key":"e_1_3_1_278_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1441"},{"key":"e_1_3_1_279_2","unstructured":"Tianxiang Sun Xiangyang Liu Wei Zhu Zhichao Geng Lingling Wu Yilong He Yuan Ni Guotong Xie Xuanjing Huang and Xipeng Qiu. 2022. A simple hash-based early exiting approach for language understanding and generation. arXiv:2203.01670. Retrieved from https:\/\/arxiv.org\/abs\/2203.01670. (2022)."},{"key":"e_1_3_1_280_2","unstructured":"Yutao Sun Li Dong Shaohan Huang Shuming Ma Yuqing Xia Jilong Xue Jianyong Wang and Furu Wei. 2023. Retentive network: A successor to transformer for large language models. arXiv:2307.08621. Retrieved from https:\/\/arxiv.org\/abs\/2307.08621. (2023)."},{"key":"e_1_3_1_281_2","volume-title":"Proceedings of the Workshop on Efficient Systems for Foundation Models@ ICML2023","author":"Sun Ziteng","year":"2023","unstructured":"Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu, Michael Riley, and Sanjiv Kumar. 2023. Spectr: Fast speculative decoding via optimal transport. 
In Proceedings of the Workshop on Efficient Systems for Foundation Models@ ICML2023."},{"key":"e_1_3_1_282_2","unstructured":"Zhenheng Tang Yuxin Wang Xin He Longteng Zhang Xinglin Pan Qiang Wang Rongfei Zeng Kaiyong Zhao Shaohuai Shi Bingsheng He and Xiaowen Chu. 2023. FusionAI: Decentralized training and deploying LLMs with massive consumer-level GPUs. arXiv:2309.01172. Retrieved from https:\/\/arxiv.org\/abs\/2309.01172. (2023)."},{"key":"e_1_3_1_283_2","unstructured":"Rohan Taori Ishaan Gulrajani Tianyi Zhang Yann Dubois Xuechen Li Carlos Guestrin Percy Liang and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. (2023)."},{"key":"e_1_3_1_284_2","volume-title":"Proceedings of the ICML 2020","author":"Tay Yi","year":"2020","unstructured":"Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020. Sparse sinkhorn attention. In Proceedings of the ICML 2020."},{"key":"e_1_3_1_285_2","first-page":"109:1\u2013109:28","article-title":"Efficient transformers: A survey","author":"Tay Yi","year":"2023","unstructured":"Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2023. Efficient transformers: A survey. ACM Computing Surveys (2023), 109:1\u2013109:28.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_1_286_2","unstructured":"MLC team. 2023. MLC-LLM. (2023). Retrieved November 25 2023 from https:\/\/github.com\/mlc-ai\/mlc-llm. Commit: 3358029."},{"key":"e_1_3_1_287_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2016.7900006"},{"key":"e_1_3_1_288_2","doi-asserted-by":"publisher","DOI":"10.1145\/3315508.3329973"},{"key":"e_1_3_1_289_2","first-page":"24261","article-title":"Mlp-mixer: An all-mlp architecture for vision","author":"Tolstikhin Ilya O.","year":"2021","unstructured":"Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. 2021. 
Mlp-mixer: An all-mlp architecture for vision. Proceedings of NeurIPS (2021), 24261\u201324272.","journal-title":"Proceedings of NeurIPS"},{"key":"e_1_3_1_290_2","unstructured":"Alexander Tornede Difan Deng Theresa Eimer Joseph Giovanelli Aditya Mohan Tim Ruhkopf Sarah Segel Daphne Theodorakopoulos Tanja Tornede Henning Wachsmuth and Marius Lindauer. 2023. AutoML in the age of large language models: Current challenges future opportunities and risks. arXiv:2306.08107. Retrieved from https:\/\/arxiv.org\/abs\/2306.08107. (2023)."},{"key":"e_1_3_1_291_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. Retrieved from https:\/\/arxiv.org\/abs\/2307.09288. (2023)."},{"key":"e_1_3_1_292_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00577"},{"key":"e_1_3_1_293_2","unstructured":"Francisco Massa Grigory Sizov Tri Dao Daniel Haziza. 2023. Flash-Decoding for long-context inference. (2023). 
Retrieved from https:\/\/pytorch.org\/blog\/flash-decoding\/"},{"key":"e_1_3_1_294_2","first-page":"267","volume-title":"Proceedings of the OSDI 2022","author":"Unger Colin","year":"2022","unstructured":"Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken. 2022. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In Proceedings of the OSDI 2022. 267\u2013284."},{"key":"e_1_3_1_295_2","unstructured":"Tim Valicenti Justice Vidal and Ritik Patnaik. 2023. Mini-GPTs: Efficient large language models through contextual pruning. arXiv:2312.12682. Retrieved from https:\/\/arxiv.org\/abs\/2312.12682. (2023)."},{"key":"e_1_3_1_296_2","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1096-9128(199704)9:4<255::AID-CPE250>3.0.CO;2-2"},{"key":"e_1_3_1_297_2","volume-title":"Proceedings of the NeurIPS 2017","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the NeurIPS 2017."},{"key":"e_1_3_1_298_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00735"},{"key":"e_1_3_1_299_2","unstructured":"Sinong Wang Belinda Z. Li Madian Khabsa Han Fang and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv:2006.04768. Retrieved from https:\/\/arxiv.org\/abs\/2006.04768. (2020)."},{"key":"e_1_3_1_300_2","first-page":"5776","article-title":"Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers","author":"Wang Wenhui","year":"2020","unstructured":"Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. 
Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Proc. of NeurIPS (2020), 5776\u20135788.","journal-title":"Proc. of NeurIPS"},{"key":"e_1_3_1_301_2","unstructured":"Xiaohui Wang Ying Xiong Yang Wei Mingxuan Wang and Lei Li. 2020. LightSeq: A high performance inference library for transformers. arXiv:2010.13887. Retrieved from https:\/\/arxiv.org\/abs\/2010.13887. (2020)."},{"key":"e_1_3_1_302_2","doi-asserted-by":"publisher","DOI":"10.1145\/3552326.3587438"},{"key":"e_1_3_1_303_2","unstructured":"Yuxin Wang Yuhan Chen Zeyu Li Xueze Kang Zhenheng Tang Xin He Rui Guo Xin Wang Qiang Wang Amelie Chi Zhou Yuchu Fang Yeju Zhou Yang Zheng and Xiaowen Chu. 2024. BurstGPT: A real-world workload dataset to optimize LLM serving systems. arXiv:2401.17644. Retrieved from https:\/\/arxiv.org\/abs\/2401.17644. (2024)."},{"key":"e_1_3_1_304_2","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et\u00a0al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Proceedings of NeurIPS (2022), 24824\u201324837.","journal-title":"Proceedings of NeurIPS"},{"key":"e_1_3_1_305_2","first-page":"945","volume-title":"Proceedings of the NSDI 2022","author":"Weng Qizhen","year":"2022","unstructured":"Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding. 2022. MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters. In Proceedings of the NSDI 2022. 945\u2013960."},{"key":"e_1_3_1_306_2","unstructured":"BigScience Workshop Teven Le Scao Angela Fan Christopher Akiki Ellie Pavlick Suzana Ili\u0107 Daniel Hesslow Roman Castagn\u00e9 Alexandra Sasha Luccioni Fran\u00e7ois Yvon. 2022. 
Bloom: A 176b-parameter open-access multilingual language model. arXiv:2211.05100. Retrieved from https:\/\/arxiv.org\/abs\/2211.05100. (2022)."},{"key":"e_1_3_1_307_2","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695948"},{"key":"e_1_3_1_308_2","unstructured":"Bingyang Wu Yinmin Zhong Zili Zhang Gang Huang Xuanzhe Liu and Xin Jin. 2023. Fast distributed inference serving for large language models. arXiv:2305.05920. Retrieved from https:\/\/arxiv.org\/abs\/2305.05920. (2023)."},{"key":"e_1_3_1_309_2","first-page":"911","volume-title":"Proceedings of the OSDI 2024","author":"Wu Bingyang","year":"2024","unstructured":"Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving. In Proceedings of the OSDI 2024. 911\u2013927."},{"key":"e_1_3_1_310_2","first-page":"795","volume-title":"Proceedings of the 6th Conference on Machine Translation","author":"Wu Kaixin","year":"2021","unstructured":"Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans high-performance inference toolkit for WMT2021 efficiency task. In Proceedings of the 6th Conference on Machine Translation. 795\u2013798."},{"key":"e_1_3_1_311_2","first-page":"5109","volume-title":"Proceedings of the COLING 2022","author":"Wu Kaixin","year":"2022","unstructured":"Kaixin Wu, Yue Zhang, Bojie Hu, and Tong Zhang. 2022. Speeding up transformer decoding via an attention refinement network. In Proceedings of the COLING 2022. 5109\u20135118."},{"key":"e_1_3_1_312_2","unstructured":"Mengdi Wu Xinhao Cheng Shengyu Liu Chunan Shi Jianan Ji Kit Ao Praveen Velliengiri Xupeng Miao Oded Padon and Zhihao Jia. 2024. Mirage: A multi-level superoptimizer for tensor programs. arXiv:2405.05751. Retrieved from https:\/\/arxiv.org\/abs\/2405.05751. 
(2024)."},{"key":"e_1_3_1_313_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579990.3583093"},{"key":"e_1_3_1_314_2","unstructured":"Qingyun Wu Gagan Bansal Jieyu Zhang Yiran Wu Shaokun Zhang Erkang Zhu Beibin Li Li Jiang Xiaoyun Zhang and Chi Wang. 2023. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv:2308.08155. Retrieved from https:\/\/arxiv.org\/abs\/2308.08155. (2023)."},{"key":"e_1_3_1_315_2","unstructured":"Xiaoxia Wu Cheng Li Reza Yazdani Aminabadi Zhewei Yao and Yuxiong He. 2023. Understanding INT4 quantization for transformer models: Latency speedup composability and failure cases. arXiv:2301.12017. Retrieved from https:\/\/arxiv.org\/abs\/2301.12017. (2023)."},{"key":"e_1_3_1_316_2","unstructured":"Haojun Xia Zhen Zheng Yuchao Li Donglin Zhuang Zhongzhu Zhou Xiafei Qiu Yong Li Wei Lin and Shuaiwen Leon Song. 2023. Flash-LLM: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv:2309.10285. Retrieved from https:\/\/arxiv.org\/abs\/2309.10285. (2023)."},{"key":"e_1_3_1_317_2","unstructured":"Guangxuan Xiao Ji Lin Mickael Seznec Julien Demouth and Song Han. 2022. Smoothquant: Accurate and efficient post-training quantization for large language models. arXiv:2211.10438. Retrieved from https:\/\/arxiv.org\/abs\/2211.10438. (2022)."},{"key":"e_1_3_1_318_2","unstructured":"Guangxuan Xiao Yuandong Tian Beidi Chen Song Han and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv:2309.17453. Retrieved from https:\/\/arxiv.org\/abs\/2309.17453. (2023)."},{"key":"e_1_3_1_319_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2019\/735"},{"key":"e_1_3_1_320_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3277122"},{"key":"e_1_3_1_321_2","unstructured":"Zhiqiang Xie Hao Kang Ying Sheng Tushar Krishna Kayvon Fatahalian and Christos Kozyrakis. 2024. 
AI metropolis: Scaling large language model-based multi-agent simulation with out-of-order execution. arXiv:2411.03519. Retrieved from https:\/\/arxiv.org\/abs\/2411.03519. (2024)."},{"key":"e_1_3_1_322_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.204"},{"key":"e_1_3_1_323_2","unstructured":"Can Xu Qingfeng Sun Kai Zheng Xiubo Geng Pu Zhao Jiazhan Feng Chongyang Tao and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions. arXiv:2304.12244. Retrieved from https:\/\/arxiv.org\/abs\/2304.12244. (2023)."},{"key":"e_1_3_1_324_2","unstructured":"Daliang Xu Wangsong Yin Xin Jin Ying Zhang Shiyun Wei Mengwei Xu and Xuanzhe Liu. 2023. LLMCad: Fast and scalable on-device large language model inference. arXiv:2309.04255. Retrieved from https:\/\/arxiv.org\/abs\/2309.04255. (2023)."},{"key":"e_1_3_1_325_2","unstructured":"Peng Xu Wei Ping Xianchao Wu Lawrence McAfee Chen Zhu Zihan Liu Sandeep Subramanian Evelina Bakhturina Mohammad Shoeybi and Bryan Catanzaro. 2023. Retrieval meets Long Context Large Language Models. arXiv:2310.03025. Retrieved from https:\/\/arxiv.org\/abs\/2310.03025. (2023)."},{"key":"e_1_3_1_326_2","unstructured":"Zhaozhuo Xu Zirui Liu Beidi Chen Yuxin Tang Jue Wang Kaixiong Zhou Xia Hu and Anshumali Shrivastava. 2023. Compress then prompt: Improving accuracy-efficiency trade-off of LLM inference with transferable prompt. arXiv:2305.11186. Retrieved from https:\/\/arxiv.org\/abs\/2305.11186. 
(2023)."},{"key":"e_1_3_1_327_2","unstructured":"Aiyuan Yang Bin Xiao Bingning Wang Borong Zhang Chao Yin Chenxu Lv Da Pan Dian Wang Dong Yan Fan Yang Ce Bian Fei Deng Feng Wang Feng Liu Guangwei Ai Guosheng Dong Haizhou Zhao Hang Xu Haoze Sun Hongda Zhang Hui Liu Jiaming Ji Jian Xie JunTao Dai Kun Fang Lei Su Liang Song Lifeng Liu Liyun Ru Luyao Ma Mang Wang Mickel Liu MingAn Lin Nuolan Nie Peidong Guo Ruiyang Sun Tao Zhang Tianpeng Li Tianyu Li Wei Cheng Weipeng Chen Xiangrong Zeng Xiaochuan Wang Xiaoxi Chen Xin Men Xin Yu Xuehai Pan Yanjun Shen Yiding Wang Yiyu Li Youxin Jiang Yuchen Gao Yupeng Zhang Zenan Zhou and Zhiying Wu. 2023. Baichuan 2: Open large-scale language models. arXiv:2309.10305. Retrieved from https:\/\/arxiv.org\/abs\/2309.10305. (2023)."},{"key":"e_1_3_1_328_2","unstructured":"Amy Yang Jingyi Yang Aya Ibrahim Xinfeng Xie Bangsheng Tang Grigory Sizov Jeremy Reizenstein Jongsoo Park and Jianyu Huang. 2024. Context parallelism for scalable million-token inference. arXiv:2411.01783. Retrieved from https:\/\/arxiv.org\/abs\/2411.01783. (2024)."},{"key":"e_1_3_1_329_2","unstructured":"Nan Yang Tao Ge Liang Wang Binxing Jiao Daxin Jiang Linjun Yang Rangan Majumder and Furu Wei. 2023. Inference with reference: Lossless acceleration of large language models. arXiv:2304.04487. Retrieved from https:\/\/arxiv.org\/abs\/2304.04487. (2023)."},{"key":"e_1_3_1_330_2","unstructured":"Penghui Yang Cunxiao Du Fengzhuo Zhang Haonan Wang Tianyu Pang Chao Du and Bo An. 2025. LongSpec: Long-context speculative decoding with efficient drafting and verification. arXiv:2502.17421. Retrieved from https:\/\/arxiv.org\/abs\/2502.17421. (2025)."},{"key":"e_1_3_1_331_2","unstructured":"Seongjun Yang Gibbeum Lee Jaewoong Cho Dimitris Papailiopoulos and Kangwook Lee. 2023. Predictive pipelined decoding: A compute-latency trade-off for exact LLM decoding. arXiv:2307.05908. Retrieved from https:\/\/arxiv.org\/abs\/2307.05908. 
(2023)."},{"key":"e_1_3_1_332_2","unstructured":"Zhewei Yao Cheng Li Xiaoxia Wu Stephen Youn and Yuxiong He. 2023. A comprehensive study on post-training quantization for large language models. arXiv:2303.08302. Retrieved from https:\/\/arxiv.org\/abs\/2303.08302. (2023)."},{"key":"e_1_3_1_333_2","first-page":"27168","article-title":"Zeroquant: Efficient and affordable post-training quantization for large-scale transformers","author":"Yao Zhewei","year":"2022","unstructured":"Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Proceedings of NeurIPS (2022), 27168\u201327183.","journal-title":"Proceedings of NeurIPS"},{"key":"e_1_3_1_334_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.463"},{"key":"e_1_3_1_335_2","unstructured":"Zihao Ye Lequn Chen Ruihang Lai Wuwei Lin Yineng Zhang Stephanie Wang Tianqi Chen Baris Kasikci Vinod Grover Arvind Krishnamurthy and Luis Ceze. 2025. Flashinfer: Efficient and customizable attention engine for llm inference serving. arXiv:2501.01005. Retrieved from https:\/\/arxiv.org\/abs\/2501.01005. (2025)."},{"key":"e_1_3_1_336_2","doi-asserted-by":"publisher","DOI":"10.1145\/3582016.3582047"},{"key":"e_1_3_1_337_2","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN54540.2023.10191067"},{"key":"e_1_3_1_338_2","first-page":"521","volume-title":"Proceedings of the OSDI 2022","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for transformer-based generative models. In Proceedings of the OSDI 2022. 521\u2013538."},{"key":"e_1_3_1_339_2","unstructured":"Lingfan Yu Jinkun Lin and Jinyang Li. 2023. Stateful large language model serving with pensieve. arXiv:2312.05516. Retrieved from https:\/\/arxiv.org\/abs\/2312.05516. 
(2023)."},{"key":"e_1_3_1_340_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01055"},{"key":"e_1_3_1_341_2","unstructured":"Zhihang Yuan Lin Niu Jiawei Liu Wenyu Liu Xinggang Wang Yuzhang Shang Guangyu Sun Qiang Wu Jiaxiang Wu and Bingzhe Wu. 2023. RPTQ: Reorder-based Post-training quantization for large language models. arXiv:2304.01089. Retrieved from https:\/\/arxiv.org\/abs\/2304.01089. (2023)."},{"key":"e_1_3_1_342_2","volume-title":"Proceedings of the ICLR 2024","author":"Yue Murong","year":"2024","unstructured":"Murong Yue, Jie Zhao, Min Zhang, Du Liang, and Ziyu Yao. 2024. Large language model cascades with mixture of thoughts representations for cost-efficient reasoning. In Proceedings of the ICLR 2024."},{"key":"e_1_3_1_343_2","first-page":"17283","article-title":"Big bird: Transformers for longer sequences","author":"Zaheer Manzil","year":"2020","unstructured":"Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big bird: Transformers for longer sequences. Proceedings of NeurIPS (2020), 17283\u201317297.","journal-title":"Proceedings of NeurIPS"},{"key":"e_1_3_1_344_2","unstructured":"Aohan Zeng Xiao Liu Zhengxiao Du Zihan Wang Hanyu Lai Ming Ding Zhuoyi Yang Yifan Xu Wendi Zheng Xiao Xia Weng Lam Tam Zixuan Ma Yufei Xue Jidong Zhai Wenguang Chen Peng Zhang Yuxiao Dong and Jie Tang. 2022. Glm-130b: An open bilingual pre-trained model. arXiv:2210.02414. Retrieved from https:\/\/arxiv.org\/abs\/2210.02414. (2022)."},{"key":"e_1_3_1_345_2","unstructured":"Dewen Zeng Nan Du Tao Wang Yuanzhong Xu Tao Lei Zhifeng Chen and Claire Cui. 2023. Learning to skip for language modeling. arXiv:2311.15436. Retrieved from https:\/\/arxiv.org\/abs\/2311.15436. (2023)."},{"key":"e_1_3_1_346_2","unstructured":"Shuangfei Zhai Walter Talbott Nitish Srivastava Chen Huang Hanlin Goh Ruixiang Zhang and Josh Susskind. 2021. 
An attention free transformer. arXiv:2105.14103. Retrieved from https:\/\/arxiv.org\/abs\/2105.14103. (2021)."},{"key":"e_1_3_1_347_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS54959.2023.00042"},{"key":"e_1_3_1_348_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.iwslt-1.47"},{"key":"e_1_3_1_349_2","first-page":"1049","volume-title":"Proceedings of the USENIX ATC 2019","author":"Zhang Chengliang","year":"2019","unstructured":"Chengliang Zhang, Minchen Yu, Wei Wang, and Feng Yan. 2019. MArk: Exploiting cloud services for cost-effective, SLO-aware machine learning inference serving. In Proceedings of the USENIX ATC 2019. 1049\u20131062."},{"key":"e_1_3_1_350_2","unstructured":"Hailin Zhang Xiaodong Ji Yilin Chen Fangcheng Fu Xupeng Miao Xiaonan Nie Weipeng Chen and Bin Cui. 2024. Pqcache: Product quantization-based kvcache for long context llm inference. arXiv:2407.12820. Retrieved from https:\/\/arxiv.org\/abs\/2407.12820. (2024)."},{"key":"e_1_3_1_351_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.607"},{"key":"e_1_3_1_352_2","unstructured":"Longteng Zhang Xiang Liu Zeyu Li Xinglin Pan Peijie Dong Ruibo Fan Rui Guo Xin Wang Qiong Luo Shaohuai Shi and Xiaowen Chu. 2023. Dissecting the runtime performance of the training fine-tuning and inference of large language models. arXiv:2311.03687. Retrieved from https:\/\/arxiv.org\/abs\/2311.03687. (2023)."},{"key":"e_1_3_1_353_2","unstructured":"Mengke Zhang Tianxing He Tianle Wang Fatemehsadat Mireshghallah Binyi Chen Hao Wang and Yulia Tsvetkov. 2023. LatticeGen: A cooperative framework which hides generated text in a lattice for privacy-aware generation on cloud. arXiv:2309.17157. Retrieved from https:\/\/arxiv.org\/abs\/2309.17157. 
(2023)."},{"key":"e_1_3_1_354_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.183"},{"key":"e_1_3_1_355_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi Victoria Lin Todor Mihaylov Myle Ott Sam Shleifer Kurt Shuster Daniel Simig Punit Singh Koura Anjali Sridhar Tianlu Wang and Luke Zettlemoyer. 2022. Opt: Open pre-trained transformer language models. arXiv:2205.01068. Retrieved from https:\/\/arxiv.org\/abs\/2205.01068. (2022)."},{"key":"e_1_3_1_356_2","unstructured":"Zhenyu Zhang Ying Sheng Tianyi Zhou Tianlong Chen Lianmin Zheng Ruisi Cai Zhao Song Yuandong Tian Christopher R\u00e9 Clark Barrett Zhangyang Wang and Beidi Chen. 2023. H\\(_2\\)O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv:2306.14048. Retrieved from https:\/\/arxiv.org\/abs\/2306.14048. (2023)."},{"key":"e_1_3_1_357_2","volume-title":"Proceedings of the ICML 2024","author":"Zhang Zhihao","year":"2024","unstructured":"Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, and Zhihao Jia. 2024. Accelerating iterative retrieval-augmented language model serving with speculation. In Proceedings of the ICML 2024."},{"key":"e_1_3_1_358_2","unstructured":"Chenggang Zhao Shangyan Zhou Liyue Zhang Chengqi Deng Zhean Xu Yuxuan Liu Kuai Yu Jiashi Li and Liang Zhao. 2025. DeepEP: An efficient expert-parallel communication library. Retrieved from https:\/\/github.com\/deepseek-ai\/DeepEP. (2025)."},{"key":"e_1_3_1_359_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.742"},{"key":"e_1_3_1_360_2","first-page":"196","article-title":"Atom: Low-bit quantization for efficient and accurate llm serving","volume":"6","author":"Zhao Yilong","year":"2024","unstructured":"Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 
2024. Atom: Low-bit quantization for efficient and accurate llm serving. Proceedings of Machine Learning and Systems 6 (2024), 196\u2013209.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_1_361_2","first-page":"559","volume-title":"Proceedings of the OSDI 2022","author":"Zheng Lianmin","year":"2022","unstructured":"Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating inter-and Intra-Operator parallelism for distributed deep learning. In Proceedings of the OSDI 2022. 559\u2013578."},{"key":"e_1_3_1_362_2","first-page":"739","volume-title":"Proceedings of the OSDI 2023","author":"Zheng Liyan","year":"2023","unstructured":"Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shuhong Huang, Xupeng Miao, Shizhi Tang, Kezhao Huang, and Zhihao Jia. 2023. EINNET: Optimizing tensor programs with derivation-based transformations. In Proceedings of the OSDI 2023. 739\u2013755."},{"key":"e_1_3_1_363_2","unstructured":"Lianmin Zheng Liangsheng Yin Zhiqiang Xie Jeff Huang Chuyue Sun Cody Hao Yu Shiyi Cao Christos Kozyrakis Ion Stoica Joseph E. Gonzalez Clark Barrett and Ying Sheng. 2023. Efficiently programming large language models using SGLang. (2023)."},{"key":"e_1_3_1_364_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613139"},{"key":"e_1_3_1_365_2","first-page":"193","volume-title":"Proceedings of the OSDI 2024","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the OSDI 2024. 193\u2013210."},{"key":"e_1_3_1_366_2","unstructured":"Yinmin Zhong Zili Zhang Bingyang Wu Shengyu Liu Yukun Chen Changyi Wan Hanpeng Hu Lei Xia Ranchen Ming Yibo Zhu and Xin Jin. 
2024. Rlhfuse: Efficient rlhf training for large language models with inter-and intra-stage fusion. arXiv:2409.13221. Retrieved from https:\/\/arxiv.org\/abs\/2409.13221. (2024)."},{"key":"e_1_3_1_367_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i12.17325"},{"key":"e_1_3_1_368_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA53966.2022.00082"},{"key":"e_1_3_1_369_2","first-page":"18330","article-title":"Bert loses patience: Fast and robust inference with early exit","author":"Zhou Wangchunshu","year":"2020","unstructured":"Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. 2020. Bert loses patience: Fast and robust inference with early exit. Proceedings of NeurIPS (2020), 18330\u201318341.","journal-title":"Proceedings of NeurIPS"},{"key":"e_1_3_1_370_2","first-page":"7103","article-title":"Mixture-of-experts with expert choice routing","author":"Zhou Yanqi","year":"2022","unstructured":"Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Quoc V. Le, James Laudon, and Zhifeng Chen. 2022. Mixture-of-experts with expert choice routing. Proceedings of NeurIPS (2022), 7103\u20137114.","journal-title":"Proceedings of NeurIPS"},{"key":"e_1_3_1_371_2","unstructured":"Yongchao Zhou Kaifeng Lyu Ankit Singh Rawat Aditya Krishna Menon Afshin Rostamizadeh Sanjiv Kumar Jean-Fran\u00e7ois Kagy and Rishabh Agarwal. 2023. DistillSpec: Improving speculative decoding via knowledge distillation. arXiv:2310.08461. Retrieved from https:\/\/arxiv.org\/abs\/2310.08461. (2023)."},{"key":"e_1_3_1_372_2","first-page":"489","volume-title":"Proceedings of the USENIX ATC 2022","author":"Zhou Zhe","year":"2022","unstructured":"Zhe Zhou, Xuechao Wei, Jiejing Zhang, and Guangyu Sun. 2022. PetS: A unified framework for parameter-efficient transformers serving. In Proceedings of the USENIX ATC 2022. 489\u2013504."},{"key":"e_1_3_1_373_2","unstructured":"Banghua Zhu Ying Sheng Lianmin Zheng Clark Barrett Michael I. 
Jordan and Jiantao Jiao. 2023. On optimal caching and model multiplexing for large model inference. arXiv:2306.02003. Retrieved from https:\/\/arxiv.org\/abs\/2306.02003. (2023)."},{"key":"e_1_3_1_374_2","unstructured":"Deyao Zhu Jun Chen Xiaoqian Shen Xiang Li and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592. Retrieved from https:\/\/arxiv.org\/abs\/2304.10592. (2023)."},{"key":"e_1_3_1_375_2","unstructured":"Xunyu Zhu Jian Li Yong Liu Can Ma and Weiping Wang. 2023. A survey on model compression for large language models. arXiv:2308.07633. Retrieved from https:\/\/arxiv.org\/abs\/2308.07633. (2023)."},{"key":"e_1_3_1_376_2","article-title":"Falcon LLM: A new frontier in natural language processing","author":"ZXhang Yoshua X.","year":"2023","unstructured":"Yoshua X. ZXhang, Yann M. Haxo, and Ying X. Mat. 2023. Falcon LLM: A new frontier in natural language processing. AC Investment Research Journal (2023).","journal-title":"AC Investment Research Journal"}],"container-title":["ACM Computing 
Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3754448","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T22:53:49Z","timestamp":1757026429000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3754448"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,4]]},"references-count":375,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3754448"],"URL":"https:\/\/doi.org\/10.1145\/3754448","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,4]]},"assertion":[{"value":"2023-12-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}