{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T07:16:02Z","timestamp":1779174962238,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":61,"publisher":"ACM","license":[{"start":{"date-parts":[[2025,3,30]],"date-time":"2025-03-30T00:00:00Z","timestamp":1743292800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100006374","name":"Nederlandse Organisatie voor Wetenschappelijk Onderzoek","doi-asserted-by":"publisher","award":["OCENW.KLEIN.561"],"award-info":[{"award-number":["OCENW.KLEIN.561"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,3,30]]},"DOI":"10.1145\/3719330.3721230","type":"proceedings-article","created":{"date-parts":[[2025,3,22]],"date-time":"2025-03-22T06:19:33Z","timestamp":1742624373000},"page":"23-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["An I\/O Characterizing Study of Offloading LLM Models and KV Caches to NVMe SSD"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1466-0002","authenticated-orcid":false,"given":"Zebin","family":"Ren","sequence":"first","affiliation":[{"name":"Vrije Universiteit Amsterdam, Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-7530-4438","authenticated-orcid":false,"given":"Krijn","family":"Doekemeijer","sequence":"additional","affiliation":[{"name":"Vrije Universiteit Amsterdam, Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9158-6849","authenticated-orcid":false,"given":"Tiziano","family":"De Matteis","sequence":"additional","affiliation":[{"name":"Vrije Universiteit Amsterdam, Amsterdam, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7060-2742","authenticated-orcid":false,"given":"Christian","family":"Pinto","sequence":"additional","affiliation":[{"name":"IBM Research Europe, Dublin, Ireland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8089-866X","authenticated-orcid":false,"given":"Radu","family":"Stoica","sequence":"additional","affiliation":[{"name":"IBM Research Europe, Zurich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3586-7168","authenticated-orcid":false,"given":"Animesh","family":"Trivedi","sequence":"additional","affiliation":[{"name":"IBM Research Europe, Zurich, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,3,30]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Accessed: 2025-02-26. DeepSpeed DeepNVMe. https:\/\/www.deepspeed.ai\/tutorials\/deepnvme\/."},{"key":"e_1_3_2_1_2_1","unstructured":"Accessed: 2025-02-26. FMInference\/FlexLLMGen. https:\/\/github.com\/FMInference\/FlexLLMGen."},{"key":"e_1_3_2_1_3_1","unstructured":"Accessed: 2025-02-26. GitHub DeepSpeed. https:\/\/github.com\/microsoft\/DeepSpeed."},{"key":"e_1_3_2_1_4_1","unstructured":"Accessed: 2025-02-26. HuggingFace Accelerate. https:\/\/huggingface.co\/docs\/accelerate\/index."},{"key":"e_1_3_2_1_5_1","unstructured":"Accessed: 2025-02-26. HuggingFace: OPT model. https:\/\/huggingface.co\/docs\/transformers\/en\/model_doc\/opt."},{"key":"e_1_3_2_1_6_1","unstructured":"Accessed: 2025-02-26. IBM Granite 3.1: Powerful Performance Longer Context New Embedding Models and More. https:\/\/www.ibm.com\/new\/announcements\/ibm-granite-3-1-powerful-performance-long-context-and-more."},{"key":"e_1_3_2_1_7_1","unstructured":"Accessed: 2025-02-26. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. https:\/\/ai.meta.com\/blog\/meta-llama-3\/?utm_source=chatgpt.com."},{"key":"e_1_3_2_1_8_1","unstructured":"Accessed: 2025-02-26. Mistral: Model Weights. https:\/\/docs.mistral.ai\/getting-started\/models\/weights\/."},{"key":"e_1_3_2_1_9_1","unstructured":"Accessed: 2025-02-26. Samsung 990 pro. https:\/\/semiconductor.samsung.com\/consumer-storage\/internal-ssd\/990-pro\/."},{"key":"e_1_3_2_1_10_1","unstructured":"Accessed: 2025-02-26. Welcome to FIO's documentation! https:\/\/fio.readthedocs.io\/en\/latest\/."},{"key":"e_1_3_2_1_11_1","unstructured":"Accessed: 2025-02-26. ZeRO-Inference: 20X Faster Inference Through Weight Quantization and KV Cache Offloading. https:\/\/github.com\/microsoft\/DeepSpeedExamples\/blob\/master\/inference\/huggingface\/zero_inference\/README.md."},{"key":"e_1_3_2_1_12_1","volume-title":"Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024","author":"Agrawal Amey","year":"2024","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 117--134. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/agrawal"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2024.ACL-LONG.678"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41404.2022.00051"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","unstructured":"Rohan Anil Sebastian Borgeaud Yonghui Wu Jean-Baptiste Alayrac Jiahui Yu Radu Soricut Johan Schalkwyk Andrew M. Dai Anja Hauth Katie Millican David Silver Slav Petrov Melvin Johnson Ioannis Antonoglou Julian Schrittwieser Amelia Glaese Jilin Chen Emily Pitler Timothy P. Lillicrap Angeliki Lazaridou Orhan Firat James Molloy Michael Isard Paul Ronald Barham Tom Hennigan Benjamin Lee Fabio Viola Malcolm Reynolds Yuanzhong Xu Ryan Doherty Eli Collins Clemens Meyer Eliza Rutherford Erica Moreira Kareem Ayoub Megha Goel George Tucker Enrique Piqueras Maxim Krikun Iain Barr Nikolay Savinov Ivo Danihelka Becca Roelofs Ana\u00efs White Anders Andreassen Tamara von Glehn Lakshman Yagati Mehran Kazemi Lucas Gonzalez Misha Khalman Jakub Sygnowski and et al. 2023. Gemini: A Family of Highly Capable Multimodal Models. CoRR abs\/2312.11805 (2023). https:\/\/doi.org\/10.48550\/ARXIV.2312.11805 arXiv:2312.11805","DOI":"10.48550\/ARXIV.2312.11805"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2405.03146"},{"key":"e_1_3_2_1_17_1","volume-title":"FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. In 19th USENIX Conference on File and Storage Technologies, FAST 2021, February 23-25","author":"Bae Jonghyun","year":"2021","unstructured":"Jonghyun Bae, Jongsung Lee, Yunho Jin, Sam Son, Shine Kim, Hakbeom Jang, Tae Jun Ham, and Jae W. Lee. 2021. FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. In 19th USENIX Conference on File and Storage Technologies, FAST 2021, February 23-25, 2021, Marcos K. Aguilera and Gala Yadgar (Eds.). USENIX Association, 387--401. https:\/\/www.usenix.org\/conference\/fast21\/presentation\/bae"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3643757"},{"key":"e_1_3_2_1_19_1","volume-title":"ZNS: Avoiding the Block Interface Tax for Flash-based SSDs. In 2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Bj\u00f8rling Matias","year":"2021","unstructured":"Matias Bj\u00f8rling, Abutalib Aghayev, Hans Holmberg, Aravind Ramesh, Damien Le Moal, Gregory R. Ganger, and George Amvrosiadis. 2021. ZNS: Avoiding the Block Interface Tax for Flash-based SSDs. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 689--703. https:\/\/www.usenix.org\/conference\/atc21\/presentation\/bjorling"},{"key":"e_1_3_2_1_20_1","volume-title":"Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020","author":"Brown Tom B.","year":"2020","unstructured":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https:\/\/proceedings.neurips.cc\/paper\/2020\/hash\/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2409.13761arXiv:2409.13761"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.SYSARC.2023.102990"},{"key":"e_1_3_2_1_23_1","volume-title":"Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022","author":"Dettmers Tim","year":"2022","unstructured":"Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (Eds.). http:\/\/papers.nips.cc\/paper_files\/paper\/2022\/hash\/c3ba4962c05c49636d4c6206a97e9c8a-Abstract-Conference.html"},{"key":"e_1_3_2_1_24_1","volume-title":"SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. In The Twelfth International Conference on Learning Representations, ICLR 2024","author":"Dettmers Tim","year":"2024","unstructured":"Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. 2024. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.https:\/\/openreview.net\/forum?id=Q1u25ahSuy"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3688351.3689160"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER52292.2023.00018"},{"key":"e_1_3_2_1_27_1","volume-title":"OPTQ: Accurate Quantization for Generative Pre-trained Transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023","author":"Frantar Elias","year":"2023","unstructured":"Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. OPTQ: Accurate Quantization for Generative Pre-trained Transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https:\/\/openreview.net\/forum?id=tcbBPnfwxS"},{"key":"e_1_3_2_1_28_1","volume-title":"ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024","author":"Fu Yao","year":"2024","unstructured":"Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 135--153. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/fu"},{"key":"e_1_3_2_1_29_1","volume-title":"Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA) (USENIX ATC'24). USENIX Association, USA, Article 7, 16 pages.","author":"Gao Bin","year":"2025","unstructured":"Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2025. Cost-efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. In Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA) (USENIX ATC'24). USENIX Association, USA, Article 7, 16 pages."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2401.08671"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2401.11181"},{"key":"e_1_3_2_1_32_1","volume-title":"The Twelfth International Conference on Learning Representations, ICLR 2024","author":"Jimenez Carlos E.","year":"2024","unstructured":"Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https:\/\/openreview.net\/forum?id=VTF8yNQM66"},{"key":"e_1_3_2_1_33_1","volume-title":"22nd USENIX Conference on File and Storage Technologies (FAST 24)","author":"Joshi Kanchan","year":"2024","unstructured":"Kanchan Joshi, Anuj Gupta, Javier Gonzalez, Ankit Kumar, Krishna Kanth Reddy, Arun George, Simon Lund, and Jens Axboe. 2024. I\/O Passthru: Upstreaming a Flexible and Efficient I\/O Path in Linux. In 22nd USENIX Conference on File and Storage Technologies (FAST 24). USENIX Association, Santa Clara, CA, 107--121. https:\/\/www.usenix.org\/conference\/fast24\/presentation\/joshi"},{"key":"e_1_3_2_1_34_1","volume-title":"Scaling Laws for Neural Language Models. CoRR abs\/2001.08361","author":"Kaplan Jared","year":"2020","unstructured":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR abs\/2001.08361 (2020). arXiv:2001.08361 https:\/\/arxiv.org\/abs\/2001.08361"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA56546.2023.10071024"},{"key":"e_1_3_2_1_36_1","volume-title":"Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs. In 19th USENIX Conference on File and Storage Technologies, FAST 2021, February 23-25","author":"Kim Shine","year":"2021","unstructured":"Shine Kim, Yunho Jin, Gina Sohn, Jonghyun Bae, Tae Jun Ham, and Jae W. Lee. 2021. Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs. In 19th USENIX Conference on File and Storage Technologies, FAST 2021, February 23-25, 2021, Marcos K. Aguilera and Gala Yadgar (Eds.). USENIX Association, 371--385. https:\/\/www.usenix.org\/conference\/fast21\/presentation\/kim"},{"key":"e_1_3_2_1_37_1","volume-title":"NVMeVirt: A Versatile Software-defined Virtual NVMe Device. In 21st USENIX Conference on File and Storage Technologies (FAST 23)","author":"Kim Sang-Hoon","year":"2023","unstructured":"Sang-Hoon Kim, Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, and Jin-Soo Kim. 2023. NVMeVirt: A Versatile Software-defined Virtual NVMe Device. In 21st USENIX Conference on File and Storage Technologies (FAST 23). USENIX Association, Santa Clara, CA, 379--394. https:\/\/www.usenix.org\/conference\/fast23\/presentation\/kim-sang-hoon"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_39_1","volume-title":"InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024","author":"Lee Wonbeom","year":"2024","unstructured":"Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 155--172. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/lee"},{"key":"e_1_3_2_1_40_1","unstructured":"Haoyang Li Yiming Li Anxin Tian Tianhao Tang Zhanchao Xu Xuejia Chen Nicole Hu Wei Dong Qing Li and Lei Chen. 2025. A Survey on Large Language Model Acceleration based on KV Cache Management. arXiv:2412.19442 [cs.AI] https:\/\/arxiv.org\/abs\/2412.19442"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534056.3534940"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2404.02060"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651890.3672274"},{"key":"e_1_3_2_1_44_1","unstructured":"Microsoft. 2025. NVMeVirt: A Versatile Software-defined Virtual NVMe Device. https:\/\/github.com\/snu-csl\/nvmevirt."},{"key":"e_1_3_2_1_45_1","unstructured":"OpenAI. 2023. GPT-4 Technical Report. CoRR abs\/2303.08774 (2023). https:\/\/doi.org\/10.48550\/ARXIV.2303.08774 arXiv:2303.08774"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2409.04992"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA59077.2024.00019"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2407.00079"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_2_1_50_1","volume-title":"Proceedings of the 2021 USENIX Annual Technical Conference, USENIX ATC 2021","author":"Ren Jie","year":"2021","unstructured":"Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. In Proceedings of the 2021 USENIX Annual Technical Conference, USENIX ATC 2021, July 14-16, 2021, Irina Calciu and Geoff Kuenning (Eds.). USENIX Association, 551--564. https:\/\/www.usenix.org\/conference\/atc21\/presentation\/ren-jie"},{"key":"e_1_3_2_1_51_1","volume-title":"International Conference on Machine Learning, ICML 2023","volume":"31116","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 31094--31116. https:\/\/proceedings.mlr.press\/v202\/sheng23a.html"},{"key":"e_1_3_2_1_52_1","volume-title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR abs\/1909.08053","author":"Shoeybi Mohammad","year":"2019","unstructured":"Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR abs\/1909.08053 (2019). arXiv:1909.08053 http:\/\/arxiv.org\/abs\/1909.08053"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695964"},{"key":"e_1_3_2_1_54_1","volume-title":"Faulttolerant Generative LLM Serving. In Forty-first International Conference on Machine Learning, ICML 2024","author":"Strati Foteini","year":"2024","unstructured":"Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, and Ana Klimovic. 2024. D\u00e9j\u00e0Vu: KV-cache Streaming for Fast, Faulttolerant Generative LLM Serving. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net. https:\/\/openreview.net\/forum?id=AbGbGZFYOD"},{"key":"e_1_3_2_1_55_1","volume-title":"16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019","author":"Subramanya Suhas Jayaram","year":"2019","unstructured":"Suhas Jayaram Subramanya, Harsha Vardhan Simhadri, Srajan Garg, Anil Kag, and Venkatesh Balasubramanian. 2019. BLAS-on-flash: An Efficient Alternative for Large Scale ML Training and Inference?. In 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019, Boston, MA, February 26-28, 2019, Jay R. Lorch and Minlan Yu (Eds.). USENIX Association, 469--484. https:\/\/www.usenix.org\/conference\/nsdi19\/presentation\/subramanya"},{"key":"e_1_3_2_1_56_1","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3694715.3695948"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLOUD55607.2022.00041"},{"key":"e_1_3_2_1_59_1","volume-title":"CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion. arXiv preprint arXiv:2405.16444","author":"Yao Jiayi","year":"2024","unstructured":"Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2024. CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion. arXiv preprint arXiv:2405.16444 (2024)."},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2205.01068"},{"key":"e_1_3_2_1_61_1","volume-title":"DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024","author":"Zhong Yinmin","year":"2024","unstructured":"Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024, Santa Clara, CA, USA, July 10-12, 2024, Ada Gavrilovska and Douglas B. Terry (Eds.). USENIX Association, 193--210. https:\/\/www.usenix.org\/conference\/osdi24\/presentation\/zhong-yinmin"}],"event":{"name":"EuroSys '25: Twentieth European Conference on Computer Systems","location":"Rotterdam Netherlands","acronym":"EuroSys '25","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"]},"container-title":["Proceedings of the 5th Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3719330.3721230","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3719330.3721230","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T20:33:09Z","timestamp":1755981189000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3719330.3721230"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,30]]},"references-count":61,"alternative-id":["10.1145\/3719330.3721230","10.1145\/3719330"],"URL":"https:\/\/doi.org\/10.1145\/3719330.3721230","relation":{},"subject":[],"published":{"date-parts":[[2025,3,30]]},"assertion":[{"value":"2025-03-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}