{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,6]],"date-time":"2026-01-06T13:30:31Z","timestamp":1767706231409,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":102,"publisher":"ACM","license":[{"start":{"date-parts":[[2025,3,30]],"date-time":"2025-03-30T00:00:00Z","timestamp":1743292800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"NSFC","award":["No.62402280, No.62272261, No.62272223"],"award-info":[{"award-number":["No.62402280, No.62272261, No.62272223"]}]},{"name":"Carbon Neutrality and Energy System Transformation"},{"name":"the National Key R&D Program of China","award":["2023YFB4502400"],"award-info":[{"award-number":["2023YFB4502400"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,3,30]]},"DOI":"10.1145\/3689031.3717472","type":"proceedings-article","created":{"date-parts":[[2025,3,26]],"date-time":"2025-03-26T06:25:20Z","timestamp":1742970320000},"page":"261-277","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Empower Vision Applications with LoRA LMM"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-6058-5950","authenticated-orcid":false,"given":"Liang","family":"Mi","sequence":"first","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing University and AI Industry Research (AIR), Tsinghua University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9545-3322","authenticated-orcid":false,"given":"Weijun","family":"Wang","sequence":"additional","affiliation":[{"name":"Institute for AI Industry Research (AIR), Tsinghua 
University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9984-2858","authenticated-orcid":false,"given":"Wenming","family":"Tu","sequence":"additional","affiliation":[{"name":"Institute for AI Industry Research (AIR), Tsinghua University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-2383-1096","authenticated-orcid":false,"given":"Qingfeng","family":"He","sequence":"additional","affiliation":[{"name":"Institute for AI Industry Research (AIR), Tsinghua University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-2889-2266","authenticated-orcid":false,"given":"Rui","family":"Kong","sequence":"additional","affiliation":[{"name":"Institute for AI Industry Research (AIR), Tsinghua University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-1208-5917","authenticated-orcid":false,"given":"Xinyu","family":"Fang","sequence":"additional","affiliation":[{"name":"Institute for AI Industry Research (AIR), Tsinghua University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-1590-9353","authenticated-orcid":false,"given":"Yazhu","family":"Dong","sequence":"additional","affiliation":[{"name":"Institute for AI Industry Research (AIR), Tsinghua University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-6908-9183","authenticated-orcid":false,"given":"Yikang","family":"Zhang","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1591-2526","authenticated-orcid":false,"given":"Yuanchun","family":"Li","sequence":"additional","affiliation":[{"name":"Institute for AI Industry Research (AIR), Tsinghua University, Shanghai AI Laboratory and Beijing Academy of Artificial Intelligence (BAAI)"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5764-960X","authenticated-orcid":false,"given":"Meng","family":"Li","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing 
University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0545-8187","authenticated-orcid":false,"given":"Haipeng","family":"Dai","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6934-1685","authenticated-orcid":false,"given":"Guihai","family":"Chen","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology, Nanjing University"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7352-8955","authenticated-orcid":false,"given":"Yunxin","family":"Liu","sequence":"additional","affiliation":[{"name":"Institute for AI Industry Research (AIR), Tsinghua University and Shanghai AI Laboratory"}]}],"member":"320","published-online":{"date-parts":[[2025,3,30]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Lightllm. https:\/\/github.com\/ModelTC\/lightllm."},{"key":"e_1_3_2_1_2_1","unstructured":"NVIDIA A100. https:\/\/www.nvidia.com\/en-us\/data-center\/a100\/."},{"key":"e_1_3_2_1_3_1","unstructured":"Pytorch torch.einsum. https:\/\/pytorch.org\/docs\/stable\/generated\/torch.einsum.html."},{"key":"e_1_3_2_1_4_1","unstructured":"rpyc. https:\/\/github.com\/tomerfiliba-org\/rpyc."},{"key":"e_1_3_2_1_5_1","unstructured":"setuptools. https:\/\/github.com\/pypa\/setuptools."},{"key":"e_1_3_2_1_6_1","volume-title":"Official website of the python platform. https:\/\/pytorch.org\/","author":"Pytorch","year":"2021","unstructured":"Pytorch: Official website of the python platform. https:\/\/pytorch.org\/, 2021."},{"key":"e_1_3_2_1_7_1","unstructured":"12 Practical Large Language Model (LLM) Applications - Techopedia. https:\/\/www.techopedia.com\/12-practical-large-language-model-llm-applications (Accessed on 09\/18\/2024)."},{"key":"e_1_3_2_1_8_1","unstructured":"7 top large language model use cases and applications. 
https:\/\/www.projectpro.io\/article\/large-language-model-use-cases-and-applications\/887 (Accessed on 09\/18\/2024)."},{"key":"e_1_3_2_1_9_1","unstructured":"Airbus Aircraft dataset. https:\/\/www.kaggle.com\/datasets\/airbusgeo\/airbus-aircrafts-sample-dataset\/data (Accessed on 09\/18\/2024)."},{"key":"e_1_3_2_1_10_1","unstructured":"Applications of large language models - indata labs. https:\/\/indatalabs.com\/blog\/large-language-model-apps (Accessed on 09\/18\/2024)."},{"key":"e_1_3_2_1_11_1","unstructured":"Claude3.5-Sonnet. https:\/\/www.anthropic.com\/news\/claude-3-5-sonnet (Accessed on 09\/18\/2024)."},{"key":"e_1_3_2_1_12_1","unstructured":"OpenAI GPT-4o. https:\/\/openai.com\/index\/hello-gpt-4o\/ (Accessed on 09\/18\/2024)."},{"key":"e_1_3_2_1_13_1","unstructured":"Real-world use cases for large language models (llms). https:\/\/cellstrat.medium.com\/real-world-use-cases-for-large-language-models-llms-d71c3a577bf2 (Accessed on 09\/18\/2024)."},{"key":"e_1_3_2_1_14_1","unstructured":"Azure LLM inference trace 2023. https:\/\/github.com\/Azure\/AzurePublicDataset\/blob\/master\/AzureLLMInferenceDataset2023.md (Accessed on 10\/07\/2024)."},{"key":"e_1_3_2_1_15_1","unstructured":"NVIDIA CUTLASS 3.5.1. https:\/\/github.com\/NVIDIA\/cutlass (Accessed on 10\/07\/2024)."},{"key":"e_1_3_2_1_16_1","unstructured":"NVIDIA Tensor Core. https:\/\/www.nvidia.cn\/data-center\/tensor-cores\/ (Accessed on 10\/07\/2024)."},{"key":"e_1_3_2_1_17_1","volume-title":"Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774","author":"Achiam Josh","year":"2023","unstructured":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. 
arXiv preprint arXiv:2303.08774, 2023."},{"key":"e_1_3_2_1_18_1","volume-title":"Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369","author":"Agrawal Amey","year":"2023","unstructured":"Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023."},{"key":"e_1_3_2_1_19_1","volume-title":"Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond","author":"Bai Jinze","year":"2023","unstructured":"Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023."},{"key":"e_1_3_2_1_20_1","volume-title":"Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966","author":"Bai Jinze","year":"2023","unstructured":"Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023."},{"key":"e_1_3_2_1_21_1","first-page":"499","volume-title":"USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Bai Zhihao","year":"2020","unstructured":"Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. PipeSwitch: Fast pipelined context switching for deep learning applications. 
In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 499--514, 2020."},{"key":"e_1_3_2_1_22_1","first-page":"2206","volume-title":"International conference on machine learning (ICML)","author":"Borgeaud Sebastian","year":"2022","unstructured":"Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning (ICML), pages 2206--2240, 2022."},{"key":"e_1_3_2_1_23_1","volume-title":"One-for-all: Generalized lora for parameter-efficient finetuning. arXiv preprint arXiv:2306.07967","author":"Chavan Arnav","year":"2023","unstructured":"Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, and Zhiqiang Shen. One-for-all: Generalized lora for parameter-efficient finetuning. arXiv preprint arXiv:2306.07967, 2023."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i16.29728"},{"key":"e_1_3_2_1_25_1","volume-title":"Proceedings of Machine Learning and Systems (MLSys)","author":"Chen Lequn","year":"2024","unstructured":"Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. Punica: Multi-tenant lora serving. In Proceedings of Machine Learning and Systems (MLSys), 2024."},{"key":"e_1_3_2_1_26_1","volume-title":"Sharegpt4v: Improving large multi-modal models with better captions","author":"Chen Lin","year":"2023","unstructured":"Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions, 2023."},{"key":"e_1_3_2_1_27_1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Chen Qi","year":"2021","unstructured":"Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. 
Spann: Highly-efficient billion-scale approximate nearest neighborhood search. Advances in Neural Information Processing Systems (NeurIPS), 2021."},{"issue":"240","key":"e_1_3_2_1_28_1","first-page":"1","article-title":"Palm: Scaling language modeling with pathways","volume":"24","author":"Chowdhery Aakanksha","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1--113, 2023.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.350"},{"key":"e_1_3_2_1_30_1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Dettmers Tim","year":"2024","unstructured":"Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. In Advances in Neural Information Processing Systems (NeurIPS), 2024."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3387514.3405887"},{"key":"e_1_3_2_1_32_1","first-page":"54","volume-title":"The General Theory of Relativity","author":"Einstein Albert","year":"1922","unstructured":"Albert Einstein. The General Theory of Relativity, pages 54--75. Springer Netherlands, Dordrecht, 1922."},{"key":"e_1_3_2_1_33_1","volume-title":"Roboflow integration, TensorFlow export, OpenCV DNN support","author":"Glenn Jocher","year":"2021","unstructured":"Glenn Jocher et. al. ultralytics\/yolov5: v6.0 - YOLOv5n 'Nano' models, Roboflow integration, TensorFlow export, OpenCV DNN support, October 2021."},{"key":"e_1_3_2_1_34_1","volume-title":"Mixture-of-loras: An efficient multitask tuning for large language models. 
arXiv preprint arXiv:2403.03432","author":"Feng Wenfeng","year":"2024","unstructured":"Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Yu Han, and Hao Wang. Mixture-of-loras: An efficient multitask tuning for large language models. arXiv preprint arXiv:2403.03432, 2024."},{"key":"e_1_3_2_1_35_1","volume-title":"Mini-internvl: A flexible-transfer pocket multimodal model with 5 arXiv preprint arXiv:2410.16261","author":"Gao Zhangwei","year":"2024","unstructured":"Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, and Wenhai Wang. Mini-internvl: A flexible-transfer pocket multimodal model with 5 arXiv preprint arXiv:2410.16261, 2024."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-7091-2748-3_8"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_3_2_1_39_1","unstructured":"Connor Holmes Masahiro Tanaka Michael Wyatt Ammar Ahmad Awan Jeff Rasley Samyam Rajbhandari Reza Yazdani Aminabadi Heyang Qin Arash Bakhtiari Lev Kurilenko et al. Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference. arXiv preprint arXiv:2401.08671 2024."},{"key":"e_1_3_2_1_40_1","volume-title":"Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685","author":"Hu Edward J","year":"2021","unstructured":"Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021."},{"key":"e_1_3_2_1_41_1","volume-title":"pybind11 --- seamless operability between c++11 and python","author":"Jakob Wenzel","year":"2016","unstructured":"Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. pybind11 --- seamless operability between c++11 and python, 2016. 
https:\/\/github.com\/pybind\/pybind11."},{"key":"e_1_3_2_1_42_1","volume-title":"Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825","author":"Jiang Albert Q","year":"2023","unstructured":"Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023."},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230543.3230574"},{"key":"e_1_3_2_1_44_1","volume-title":"Ragcache: Efficient knowledge caching for retrieval-augmented generation. arXiv preprint arXiv:2404.12457","author":"Jin Chao","year":"2024","unstructured":"Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. Ragcache: Efficient knowledge caching for retrieval-augmented generation. arXiv preprint arXiv:2404.12457, 2024."},{"key":"e_1_3_2_1_45_1","volume-title":"Blazeit: Optimizing declarative aggregation and limit queries for neural network-based video analytics. VLDB Endowment, 13(4)","author":"Kang Daniel","year":"2020","unstructured":"Daniel Kang, Peter Bailis, and Matei Zaharia. Blazeit: Optimizing declarative aggregation and limit queries for neural network-based video analytics. VLDB Endowment, 13(4), 2020."},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137664"},{"key":"e_1_3_2_1_47_1","volume-title":"CIDR","author":"Kang Daniel","year":"2022","unstructured":"Daniel Kang, Francisco Romero, Peter D Bailis, Christos Kozyrakis, and Matei Zaharia. Viva: An end-to-end system for interactive video analytics. 
In CIDR, 2022."},{"key":"e_1_3_2_1_48_1","first-page":"787","volume-title":"ACL conference on empirical methods in natural language processing (EMNLP)","author":"Kazemzadeh Sahar","year":"2014","unstructured":"Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In ACL conference on empirical methods in natural language processing (EMNLP), pages 787--798, 2014."},{"key":"e_1_3_2_1_49_1","first-page":"917","volume-title":"USENIX Symposium on Networked Systems Design and Implementation (NSDI)","author":"Khani Mehrdad","year":"2023","unstructured":"Mehrdad Khani, Ganesh Ananthanarayanan, Kevin Hsieh, Junchen Jiang, Ravi Netravali, Yuanchao Shu, Mohammad Alizadeh, and Victor Bahl. RECL: Responsive resource-efficient continuous learning for video analytics. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 917--932, 2023."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_1_52_1","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Lewis Patrick","year":"2020","unstructured":"Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\u00fcttler, Mike Lewis, Wen-tau Yih, Tim Rockt\u00e4schel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems (NeurIPS), 2020."},{"key":"e_1_3_2_1_53_1","first-page":"19730","volume-title":"International conference on machine learning (ICML)","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning (ICML), pages 19730--19742. 
PMLR, 2023."},{"key":"e_1_3_2_1_54_1","first-page":"4582","volume-title":"Proceedings of Annual Meeting of the Association for Computational Linguistics and the Joint Conference on Natural Language Processing","author":"Li Xiang Lisa","year":"2021","unstructured":"Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of Annual Meeting of the Association for Computational Linguistics and the Joint Conference on Natural Language Processing, pages 4582--4597, 2021."},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_8"},{"key":"e_1_3_2_1_56_1","volume-title":"Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459","author":"Li Yuanchun","year":"2024","unstructured":"Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459, 2024."},{"key":"e_1_3_2_1_57_1","first-page":"359","volume-title":"Guoqing Harry Xu, and Ravi Netravali. Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics","author":"Li Yuanqi","year":"2020","unstructured":"Yuanqi Li, Arthi Padmanabhan, Pengzhan Zhao, Yufei Wang, Guoqing Harry Xu, and Ravi Netravali. Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics. In ACM Special Interest Group on Data Communication (SIGCOMM), pages 359--376, 2020."},{"key":"e_1_3_2_1_58_1","volume-title":"Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems (NeurIPS), 35:1950--1965","author":"Liu Haokun","year":"2022","unstructured":"Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 
Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems (NeurIPS), 35:1950--1965, 2022."},{"key":"e_1_3_2_1_59_1","volume-title":"Improved baselines with visual instruction tuning","author":"Liu Haotian","year":"2023","unstructured":"Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023."},{"key":"e_1_3_2_1_60_1","volume-title":"Conference and Workshop on Neural Information Processing Systems (NeurIPS)","author":"Liu Haotian","year":"2024","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2024."},{"key":"e_1_3_2_1_61_1","volume-title":"Peft: State-of-the-art parameter-efficient fine-tuning methods. https:\/\/github.com\/huggingface\/peft","author":"Mangrulkar Sourab","year":"2022","unstructured":"Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. https:\/\/github.com\/huggingface\/peft, 2022."},{"key":"e_1_3_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3620666.3651335"},{"key":"e_1_3_2_1_63_1","unstructured":"Thomas Politzer. Vision is our dominant sense. https:\/\/www.brainline.org\/article\/vision-our-dominant-sense (Accessed on 09\/18\/2024)."},{"key":"e_1_3_2_1_64_1","volume-title":"Mooncake: A kvcache-centric disaggregated architecture for llm serving. arXiv preprint arxiv:2407.00079","author":"Qin Ruoyu","year":"2024","unstructured":"Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A kvcache-centric disaggregated architecture for llm serving. 
arXiv preprint arxiv:2407.00079, 2024."},{"key":"e_1_3_2_1_65_1","first-page":"8748","volume-title":"International conference on machine learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748--8763. PMLR, 2021."},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00605"},{"key":"e_1_3_2_1_67_1","volume-title":"USENIX Symposium on Networked Systems Design and Implementation (NSDI)","author":"Romil Bhardwaj","year":"2022","unstructured":"Bhardwaj Romil, Xia Zhengxu, Ananthanarayanan Ganesh, Jiang Junchen, Shu Yuanchao, Karianakis Nikolaos, Hsieh Kevin, Bahl Paramvir, and Stoica Ion. Ekya: Continuous learning of video analytics models on edge compute servers. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2022."},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359658"},{"key":"e_1_3_2_1_69_1","volume-title":"Proceedings of Machine Learning and Systems (MLSys)","author":"Sheng Ying","year":"2024","unstructured":"Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. S-lora: Serving thousands of concurrent lora adapters. In Proceedings of Machine Learning and Systems (MLSys), 2024."},{"key":"e_1_3_2_1_70_1","first-page":"31094","volume-title":"International Conference on Machine Learning (ICML)","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. 
In International Conference on Machine Learning (ICML), pages 31094--31116, 2023."},{"key":"e_1_3_2_1_71_1","volume-title":"17th USENIX Symposium on Operating Systems Design and Implementation (OSDI'23)","author":"Shi Yining","year":"2023","unstructured":"Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. Welder: Scheduling deep learning memory access via tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI'23), July 2023."},{"key":"e_1_3_2_1_72_1","volume-title":"Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402","author":"Soomro Khurram","year":"2012","unstructured":"Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012."},{"key":"e_1_3_2_1_73_1","volume-title":"Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild","author":"Soomro Khurram","year":"2012","unstructured":"Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012."},{"key":"e_1_3_2_1_74_1","volume-title":"Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805","author":"Team Gemini","year":"2023","unstructured":"Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. 
arXiv preprint arXiv:2312.11805, 2023."},{"key":"e_1_3_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/3315508.3329973"},{"key":"e_1_3_2_1_76_1","volume-title":"Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training","author":"Tong Zhan","year":"2022","unstructured":"Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022."},{"key":"e_1_3_2_1_77_1","volume-title":"Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023."},{"key":"e_1_3_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.557"},{"key":"e_1_3_2_1_79_1","volume-title":"Region-based content enhancement for efficient video analytics at the edge. arXiv preprint arXiv:2407.16990","author":"Wang Weijun","year":"2024","unstructured":"Weijun Wang, Liang Mi, Shaowei Cen, Haipeng Dai, Yuanchun Li, Xiaoming Fu, and Yunxin Liu. Region-based content enhancement for efficient video analytics at the edge. arXiv preprint arXiv:2407.16990, 2024."},{"key":"e_1_3_2_1_80_1","first-page":"543","volume-title":"Proceedings of the ACM International Conference on Mobile Computing and Networking (MobiCom)","author":"Wen Hao","year":"2024","unstructured":"Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. Autodroid: Llm-powered task automation in android. 
In Proceedings of the ACM International Conference on Mobile Computing and Networking (MobiCom), pages 543--557, 2024."},{"key":"e_1_3_2_1_81_1","volume-title":"Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface's transformers: State-of-the-art natural language processing","author":"Wolf Thomas","year":"2020","unstructured":"Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R\u00e9mi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface's transformers: State-of-the-art natural language processing, 2020."},{"key":"e_1_3_2_1_82_1","first-page":"911","volume-title":"USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Wu Bingyang","year":"2024","unstructured":"Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 911--927, 2024."},{"key":"e_1_3_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2017.2685945"},{"key":"e_1_3_2_1_84_1","first-page":"148","volume-title":"IEEE\/ACM Symposium on Edge Computing (SEC)","author":"Xiao Zhujun","year":"2021","unstructured":"Zhujun Xiao, Zhengxu Xia, Haitao Zheng, Ben Y Zhao, and Junchen Jiang. Towards performance clarity of edge video analytics. In IEEE\/ACM Symposium on Edge Computing (SEC), pages 148--164, 2021."},{"key":"e_1_3_2_1_85_1","first-page":"204","volume-title":"Bolt: Bridging the gap between auto-tuners and hardware-native performance","author":"Xing Jiarong","year":"2022","unstructured":"Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, and Yibo Zhu. Bolt: Bridging the gap between auto-tuners and hardware-native performance. In D. 
Marculescu, Y. Chi, and C. Wu, editors, Proceedings of Machine Learning and Systems, volume 4, pages 204--216, 2022."},{"key":"e_1_3_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01471"},{"key":"e_1_3_2_1_87_1","volume-title":"Cacheblend: Fast large language model serving with cached knowledge fusion. arXiv preprint arXiv:2405.16444","author":"Yao Jiayi","year":"2024","unstructured":"Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. Cacheblend: Fast large language model serving with cached knowledge fusion. arXiv preprint arXiv:2405.16444, 2024."},{"key":"e_1_3_2_1_88_1","first-page":"1","volume-title":"ACM International Conference on Mobile Computing and Networking (MobiCom)","author":"Yi Juheon","year":"2020","unstructured":"Juheon Yi, Sunghyun Choi, and Youngki Lee. EagleEye: wearable camera-based person identification in crowded urban spaces. In ACM International Conference on Mobile Computing and Networking (MobiCom), pages 1--14, April 2020."},{"key":"e_1_3_2_1_89_1","first-page":"521","volume-title":"USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Yu Gyeong-In","year":"2022","unstructured":"Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for {Transformer-Based} generative models. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 521--538, 2022."},{"key":"e_1_3_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46475-6_5"},{"key":"e_1_3_2_1_91_1","volume-title":"Machine Learning and Systems (MLSys)","author":"Yu Shan","year":"2024","unstructured":"Shan Yu, Zhenting Zhu, Yu Chen, Hanchen Xu, Pengzhan Zhao, Yang Wang, Arthi Padmanabhan, Hugo Latapie, and Harry Xu. Vqpy: An object-oriented approach to modern video analytics. 
In Machine Learning and Systems (MLSys), 2024."},{"key":"e_1_3_2_1_92_1","volume-title":"Packetgame: Multi-stream packet gating for concurrent video inference at scale","author":"Yuan Mu","year":"2023","unstructured":"Mu Yuan, Lan Zhang, Xuanke You, and Xiang-Yang Li. Packetgame: Multi-stream packet gating for concurrent video inference at scale. In ACM Special Interest Group on Data Communication (SIGCOMM), 2023."},{"key":"e_1_3_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM53939.2023.10228933"},{"key":"e_1_3_2_1_94_1","volume-title":"Byeongho Heo, Dongyoon Han, and Jinhyung Kim. Videomix: Rethinking data augmentation for video classification","author":"Yun Sangdoo","year":"2020","unstructured":"Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, and Jinhyung Kim. Videomix: Rethinking data augmentation for video classification, 2020."},{"key":"e_1_3_2_1_95_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230543.3230554"},{"key":"e_1_3_2_1_96_1","volume-title":"International Conference on Learning Representations (ICLR)","author":"Zhang Qingru","year":"2023","unstructured":"Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In International Conference on Learning Representations (ICLR), 2023."},{"key":"e_1_3_2_1_97_1","volume-title":"Siren's song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219","author":"Zhang Yue","year":"2023","unstructured":"Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren's song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023."},{"key":"e_1_3_2_1_98_1","volume-title":"Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Efficiently programming large language models using sglang. 
arXiv preprint arXiv:2312.07104","author":"Zheng Lianmin","year":"2023","unstructured":"Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Efficiently programming large language models using sglang. arXiv preprint arXiv:2312.07104, 2023."},{"key":"e_1_3_2_1_99_1","first-page":"489","volume-title":"USENIX Annual Technical Conference (ATC)","author":"Zhou Zhe","year":"2022","unstructured":"Zhe Zhou, Xuechao Wei, Jiejing Zhang, and Guangyu Sun. Pet: A unified framework for Parameter-Efficient transformers serving. In USENIX Annual Technical Conference (ATC), pages 489--504, 2022."},{"key":"e_1_3_2_1_100_1","volume-title":"Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592","author":"Zhu Deyao","year":"2023","unstructured":"Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023."},{"key":"e_1_3_2_1_101_1","first-page":"233","volume-title":"16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)","author":"Zhu Hongyu","year":"2022","unstructured":"Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. ROLLER: Fast and efficient tensor compilation for deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 233--248, Carlsbad, CA, July 2022. USENIX Association."},{"key":"e_1_3_2_1_102_1","volume-title":"Vision mamba: Efficient visual representation learning with bidirectional state space model","author":"Zhu Lianghui","year":"2024","unstructured":"Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. 
Vision mamba: Efficient visual representation learning with bidirectional state space model, 2024."}],"event":{"name":"EuroSys '25: Twentieth European Conference on Computer Systems","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"],"location":"Rotterdam Netherlands","acronym":"EuroSys '25"},"container-title":["Proceedings of the Twentieth European Conference on Computer Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3689031.3717472","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3689031.3717472","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T11:22:52Z","timestamp":1755775372000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3689031.3717472"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,30]]},"references-count":102,"alternative-id":["10.1145\/3689031.3717472","10.1145\/3689031"],"URL":"https:\/\/doi.org\/10.1145\/3689031.3717472","relation":{},"subject":[],"published":{"date-parts":[[2025,3,30]]},"assertion":[{"value":"2025-03-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}