{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T08:39:44Z","timestamp":1766219984933,"version":"3.48.0"},"publisher-location":"New York, NY, USA","reference-count":31,"publisher":"ACM","funder":[{"name":"Young Scientists Fund of the National Natural Science Foundation of China","award":["62402333"],"award-info":[{"award-number":["62402333"]}]},{"name":"The Chinese Academy of Sciences \uff08CAS\uff09\u201cLight of West China\u201dProgram","award":["xbzg-zdsys-202410"],"award-info":[{"award-number":["xbzg-zdsys-202410"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,9,8]]},"DOI":"10.1145\/3754598.3754605","type":"proceedings-article","created":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T08:34:32Z","timestamp":1766219672000},"page":"11-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["MixLoRA: An Efficient Multi-Tenant Framework for Concurrently Serving Diverse LoRA Models in Large Language Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-5089-2082","authenticated-orcid":false,"given":"Ronghuai","family":"Chen","sequence":"first","affiliation":[{"name":"Tianjin University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2416-4547","authenticated-orcid":false,"given":"Ce","family":"Yu","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4349-8748","authenticated-orcid":false,"given":"Hao","family":"Fu","sequence":"additional","affiliation":[{"name":"National Supercomputing Center in Tianjin, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-6793-4225","authenticated-orcid":false,"given":"Xiaoteng","family":"Hu","sequence":"additional","affiliation":[{"name":"Tianjin University, 
Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3783-2228","authenticated-orcid":false,"given":"Bin","family":"Yang","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}]}],"member":"320","published-online":{"date-parts":[[2025,12,20]]},"reference":[{"key":"e_1_3_3_1_2_2","volume-title":"Lorax","year":"2024","unstructured":"2024. Lorax. https:\/\/github.com\/predibase\/lorax?tab=readme-ov-file#-features"},{"key":"e_1_3_3_1_3_2","unstructured":"Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)."},{"key":"e_1_3_3_1_4_2","unstructured":"Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2023. Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547 [cs.DC]"},{"key":"e_1_3_3_1_5_2","unstructured":"NVIDIA Corporation. 2023. Nsight Systems. https:\/\/developer.nvidia.com\/nsight-systems. Accessed: 2025-01-23."},{"key":"e_1_3_3_1_6_2","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_3_1_7_2","unstructured":"DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL]"},{"key":"e_1_3_3_1_8_2","unstructured":"Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv:2305.14314 (2023)."},{"key":"e_1_3_3_1_9_2","unstructured":"Jacob Devlin. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018)."},{"key":"e_1_3_3_1_10_2","unstructured":"In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems 6 (2024) 325\u2013338."},{"key":"e_1_3_3_1_11_2","unstructured":"Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. 2024. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv:2406.12793 (2024)."},{"key":"e_1_3_3_1_12_2","unstructured":"Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Yuhan Dong, Yu Wang, et al. 2024. FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics. Proceedings of Machine Learning and Systems 6 (2024) 148\u2013161."},{"key":"e_1_3_3_1_13_2","first-page":"2790","volume-title":"International conference on machine learning","author":"Houlsby Neil","year":"2019","unstructured":"Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International conference on machine learning. PMLR, 2790\u20132799."},{"key":"e_1_3_3_1_14_2","doi-asserted-by":"publisher","unstructured":"Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] (June 2021).","DOI":"10.48550\/arXiv.2106.09685"},{"key":"e_1_3_3_1_15_2","unstructured":"Jiawei Hu, Hong Jia, Mahbub Hassan, Lina Yao, Brano Kusy, and Wen Hu. 2024. LightLLM: A Versatile Large Language Model for Predictive Light Sensing. arXiv:2411.15211 (2024)."},{"key":"e_1_3_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_3_1_17_2","first-page":"155","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Lee Wonbeom","year":"2024","unstructured":"Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155\u2013172."},{"key":"e_1_3_3_1_18_2","unstructured":"Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv:2101.00190 (2021)."},{"key":"e_1_3_3_1_19_2","doi-asserted-by":"publisher","unstructured":"Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv:2306.02707 [cs.CL] (June 2023).","DOI":"10.48550\/arXiv.2306.02707"},{"key":"e_1_3_3_1_20_2","unstructured":"NVIDIA. 2024. NVIDIA Multi-Instance GPU. https:\/\/www.nvidia.com\/en-us\/technologies\/multi-instance-gpu\/"},{"key":"e_1_3_3_1_21_2","unstructured":"NVIDIA. 2025. cuBLAS Documentation. https:\/\/docs.nvidia.com\/cuda\/cublas\/index.html. Accessed: 2025-04-21."},{"key":"e_1_3_3_1_22_2","unstructured":"OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 (2023)."},{"key":"e_1_3_3_1_23_2","first-page":"27730","volume-title":"Advances in Neural Information Processing Systems","volume":"35","author":"Ouyang Long","year":"2022","unstructured":"Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 27730\u201327744. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/file\/b1efde53be364a73914f58805a001731-Paper-Conference.pdf"},{"key":"e_1_3_3_1_24_2","unstructured":"A. Paszke. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv:1912.01703 (2019)."},{"key":"e_1_3_3_1_25_2","unstructured":"Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems 5 (2023) 606\u2013624."},{"key":"e_1_3_3_1_26_2","unstructured":"A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9."},{"key":"e_1_3_3_1_27_2","unstructured":"Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. 2023. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv:2311.03285 (2023)."},{"key":"e_1_3_3_1_28_2","volume-title":"18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)","author":"Sun Biao","year":"2024","unstructured":"Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)."},{"key":"e_1_3_3_1_29_2","doi-asserted-by":"publisher","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL] (Feb. 2023).","DOI":"10.48550\/arXiv.2302.13971"},{"key":"e_1_3_3_1_30_2","doi-asserted-by":"publisher","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv:1706.03762 [cs.CL] (June 2017).","DOI":"10.48550\/arXiv.1706.03762"},{"key":"e_1_3_3_1_31_2","unstructured":"Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068 (2022)."},{"key":"e_1_3_3_1_32_2","doi-asserted-by":"crossref","unstructured":"Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, et al. 2024. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems 37 (2024) 62557\u201362583.","DOI":"10.52202\/079017-2000"}],"event":{"name":"ICPP '25: 54th International Conference on Parallel Processing","location":"San Diego CA USA","acronym":"ICPP '25"},"container-title":["Proceedings of the 54th International Conference on Parallel Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3754598.3754605","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T08:36:31Z","timestamp":1766219791000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3754598.3754605"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,8]]},"references-count":31,"alternative-id":["10.1145\/3754598.3754605","10.1145\/3754598"],"URL":"https:\/\/doi.org\/10.1145\/3754598.3754605","relation":{},"subject":[],"published":{"date-parts":[[2025,9,8]]},"assertion":[{"value":"2025-12-20","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}