{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,14]],"date-time":"2026-07-14T02:50:28Z","timestamp":1783997428432,"version":"3.55.0"},"reference-count":28,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGOPS Oper. Syst. Rev."],"published-print":{"date-parts":[[2025,8,4]]},"abstract":"<jats:p>Large Language Model (LLM) inference serving faces a fundamental challenge due to the distinct characteristics of its two phases: compute-intensive prefill and memory-intensive decode. Existing scheduling strategies often prioritize one phase over the other, leading to a difficult tradeoff between system throughput and request latency. Prefill-prioritizing schedulers improve throughput but introduce significant latency jitter (generation stalls) by interfering with ongoing decodes. Conversely, decode-prioritizing schedulers maintain low latency but underutilize GPU resources, resulting in low throughput. This paper revisits the technique of chunked prefills, demonstrating its efficacy in mitigating this tradeoff. By splitting large prefill computations into smaller, manageable chunks and interleaving them with decode operations using stall-free batching, we can leverage the compute slack inherent in the decode phase. 
This approach significantly improves serving capacity under strict latency constraints, minimizes generation stalls, and reduces pipeline bubbles in distributed deployments, enabling efficient and responsive inference.<\/jats:p>","DOI":"10.1145\/3759441.3759444","type":"journal-article","created":{"date-parts":[[2025,8,6]],"date-time":"2025-08-06T14:43:44Z","timestamp":1754491424000},"page":"9-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Efficient LLM Inference via Chunked Prefills"],"prefix":"10.1145","volume":"59","author":[{"given":"Amey","family":"Agrawal","sequence":"first","affiliation":[{"name":"Georgia Institute of Technology, GA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Nitin","family":"Kedia","sequence":"additional","affiliation":[{"name":"Microsoft Research India, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ashish","family":"Panwar","sequence":"additional","affiliation":[{"name":"Microsoft Research India, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jayashree","family":"Mohan","sequence":"additional","affiliation":[{"name":"Microsoft Research India, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Nipun","family":"Kwatra","sequence":"additional","affiliation":[{"name":"Microsoft Research India, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bhargav S.","family":"Gulavani","sequence":"additional","affiliation":[{"name":"Microsoft Research India, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Alexey","family":"Tumanov","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology, GA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ramachandran","family":"Ramjee","sequence":"additional","affiliation":[{"name":"Microsoft Research India, 
India"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,8,6]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"arxiv.org e-print archive. https:\/\/arxiv.org\/."},{"key":"e_1_2_1_2_1","unstructured":"Chatgpt. https:\/\/chat.openai.com."},{"key":"e_1_2_1_3_1","unstructured":"Faster Transformer. https:\/\/github.com\/NVIDIA\/FasterTransformer."},{"key":"e_1_2_1_4_1","unstructured":"Google duet ai. https:\/\/workspace.google.com\/solutions\/ai\/."},{"key":"e_1_2_1_5_1","unstructured":"Microsoft copilot. https:\/\/www.microsoft.com\/en-us\/microsoft-copilot."},{"key":"e_1_2_1_6_1","unstructured":"Yi series of large language models trained from scratch by developers at 01.AI. https:\/\/huggingface.co\/01-ai\/Yi-34B-200K."},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of The Seventh Annual Conference on Machine Learning and Systems, 2024","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S Gulavani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for llm inference. Proceedings of The Seventh Annual Conference on Machine Learning and Systems, 2024, Santa Clara, 2024."},{"key":"e_1_2_1_8_1","volume-title":"OSDI","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. OSDI, 2024."},{"key":"e_1_2_1_9_1","volume-title":"Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills","author":"Agrawal Amey","year":"2023","unstructured":"Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 
Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills, 2023."},{"key":"e_1_2_1_10_1","volume-title":"Medha: Efficiently serving multi-million context length llm inference requests without approximations. arXiv preprint arXiv:2409.17264","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Haoran Qiu, Junda Chen, Inigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, and Esha Choukse. Medha: Efficiently serving multi-million context length llm inference requests without approximations. arXiv preprint arXiv:2409.17264, 2024."},{"key":"e_1_2_1_11_1","volume-title":"Gqa: Training generalized multi-query transformer models from multi-head checkpoints","author":"Ainslie Joshua","year":"2023","unstructured":"Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr\u00f3n, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023."},{"key":"e_1_2_1_12_1","volume-title":"The falcon series of open language models","author":"Almazrouei Ebtesam","year":"2023","unstructured":"Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, M\u00e9rouane Debbah, \u00c9tienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The falcon series of open language models, 2023."},{"key":"e_1_2_1_13_1","volume-title":"Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. 
Advances in neural information processing systems, 33:1877-1901, 2020."},{"key":"e_1_2_1_14_1","unstructured":"Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra Adam Roberts Paul Barham Hyung Won Chung Charles Sutton Sebastian Gehrmann Parker Schuh Kensen Shi Sasha Tsvyashchenko Joshua Maynez Abhishek Rao Parker Barnes Yi Tay Noam Shazeer Vinodkumar Prabhakaran Emily Reif Nan Du Ben Hutchinson Reiner Pope James Bradbury Jacob Austin Michael Isard Guy Gur-Ari Pengcheng Yin Toju Duke Anselm Levskaya Sanjay Ghemawat Sunipa Dev Henryk Michalewski Xavier Garcia Vedant Misra Kevin Robinson Liam Fedus Denny Zhou Daphne Ippolito David Luan Hyeontaek Lim Barret Zoph Alexander Spiridonov Ryan Sepassi David Dohan Shivani Agrawal Mark Omernick Andrew M. Dai Thanumalayan Sankaranarayana Pillai Marie Pellat Aitor Lewkowycz Erica Moreira Rewon Child Oleksandr Polozov Katherine Lee Zongwei Zhou Xuezhi Wang Brennan Saeta Mark Diaz Orhan Firat Michele Catasta Jason Wei Kathy Meier-Hellstern Douglas Eck Jeff Dean Slav Petrov and Noah Fiedel. Palm: Scaling language modeling with pathways. CoRR abs\/2204.02311 2022."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2097"},{"key":"e_1_2_1_16_1","volume-title":"et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019."},{"key":"e_1_2_1_17_1","volume-title":"Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. 
arXiv preprint arXiv:2310.06825","author":"Jiang Albert Q","year":"2023","unstructured":"Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023."},{"key":"e_1_2_1_18_1","volume-title":"Scaling laws for neural language models. CoRR, abs\/2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. CoRR, abs\/2001.08361, 2020."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_2_1_20_1","volume-title":"GPT-4 technical report. CoRR, abs\/2303.08774","author":"AI.","year":"2023","unstructured":"OpenAI. GPT-4 technical report. CoRR, abs\/2303.08774, 2023."},{"key":"e_1_2_1_21_1","volume-title":"Splitwise: Efficient generative llm inference using phase splitting","author":"Patel Pratyush","year":"2023","unstructured":"Pratyush Patel, Esha Choukse, Chaojie Zhang, \u00cd\u00f1igo Goiri, Aashaka Shah, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting, 2023."},{"key":"e_1_2_1_22_1","volume-title":"Fast transformer decoding: One write-head is all you need","author":"Shazeer Noam","year":"2019","unstructured":"Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019."},{"key":"e_1_2_1_23_1","volume-title":"Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. 
arXiv preprint arXiv:1909.08053, 2019."},{"key":"e_1_2_1_24_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models 2023."},{"key":"e_1_2_1_25_1","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017."},{"key":"e_1_2_1_26_1","volume-title":"Openchat: Advancing open-source language models with mixed-quality data","author":"Wang Guan","year":"2023","unstructured":"Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. 
Openchat: Advancing open-source language models with mixed-quality data, 2023."},{"key":"e_1_2_1_27_1","first-page":"2022","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022, 2022.","journal-title":"Trans. Mach. Learn. Res."},{"key":"e_1_2_1_28_1","first-page":"521","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521-538, Carlsbad, CA, July 2022. USENIX Association. 
"}],"container-title":["ACM SIGOPS Operating Systems Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3759441.3759444","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,7]],"date-time":"2025-08-07T19:50:34Z","timestamp":1754596234000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3759441.3759444"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,4]]},"references-count":28,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,8,4]]}},"alternative-id":["10.1145\/3759441.3759444"],"URL":"https:\/\/doi.org\/10.1145\/3759441.3759444","relation":{},"ISSN":["0163-5980"],"issn-type":[{"value":"0163-5980","type":"print"}],"subject":[],"published":{"date-parts":[[2025,8,4]]},"assertion":[{"value":"2025-08-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}