{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T15:32:11Z","timestamp":1773588731130,"version":"3.50.1"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2026,11,11]],"date-time":"2026-11-11T00:00:00Z","timestamp":1794355200000},"content-version":"vor","delay-in-days":246,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"crossref","award":["JP22K11989, JP23K11063, and JP24K14910"],"award-info":[{"award-number":["JP22K11989, JP23K11063, and JP24K14910"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100002241","name":"Japan Science and Technology Agency","doi-asserted-by":"crossref","award":["JPMJPR21P3, JPMJAP2344, and JPMJSP2153"],"award-info":[{"award-number":["JPMJPR21P3, JPMJAP2344, and JPMJSP2153"]}],"id":[{"id":"10.13039\/501100002241","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Auton. Adapt. Syst."],"published-print":{"date-parts":[[2026,3,31]]},"abstract":"<jats:p>Multimodal Large Language Models (MLLMs) integrate multimodal encoders with Large Language Models (LLMs) to overcome the limitations of text-only models. Traditional LLMs are deployed on high-performance cloud servers, but MLLMs, which process multimodal data, face high transmission latency and privacy risks when tasks are offloaded to the cloud. Intelligent edge computing is a promising solution for supporting such latency-sensitive and privacy-sensitive tasks. However, the heterogeneity of edge environments makes efficient MLLM inference challenging. 
In this work, we enhance MLLM inference efficiency in heterogeneous edge environments by decoupling MLLM into LLM and multimodal encoders, deploying the LLM on high-performance devices and the multimodal encoders on lower-capability devices. Additionally, we observe that processing MLLM tasks in edge environments involves numerous configuration parameters that impact inference speed and energy consumption in an unknown and possibly time-varying fashion. To address this challenge, we present an adaptive scheduling algorithm that assigns parameters to tasks for minimizing energy consumption while meeting maximum latency constraints. The results of extensive experimental trials demonstrate that the proposed approach consistently outperforms existing state-of-the-art methods, achieving significant improvements in both latency reduction and energy efficiency.<\/jats:p>","DOI":"10.1145\/3774908","type":"journal-article","created":{"date-parts":[[2025,11,11]],"date-time":"2025-11-11T14:45:34Z","timestamp":1762872334000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Adaptive Scheduling of Multimodal Large Language Model in Intelligent Edge Computing"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9378-1269","authenticated-orcid":false,"given":"Xingyu","family":"Yuan","sequence":"first","affiliation":[{"name":"Department of Sciences and Informatics, Muroran Institute of Technology, Muroran, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4325-6631","authenticated-orcid":false,"given":"He","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Sciences and Informatics, Muroran Institute of Technology, Muroran, 
Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2788-3451","authenticated-orcid":false,"given":"Mianxiong","family":"Dong","sequence":"additional","affiliation":[{"name":"Department of Sciences and Informatics, Muroran Institute of Technology, Muroran, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3382-1652","authenticated-orcid":false,"given":"Kaoru","family":"Ota","sequence":"additional","affiliation":[{"name":"Department of Sciences and Informatics, Muroran Institute of Technology, Muroran, Japan and Tohoku University, Sendai, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,3,10]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.2196\/59505"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1013689704352"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3634750"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2991734"},{"key":"e_1_3_1_6_2","first-page":"844","volume-title":"Proceedings of the 34th International Conference on Machine Learning (ICML \u201917)","author":"Chowdhury Sayak Ray","year":"2017","unstructured":"Sayak Ray Chowdhury and Aditya Gopalan. 2017. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning (ICML \u201917). PMLR, 844\u2013853."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACVW60836.2024.00106"},{"key":"e_1_3_1_8_2","unstructured":"Xianzhe Dong Tongxuan Liu Yuting Zeng Liangyu Liu Yang Liu Siyu Wu Yu Wu Hailong Yang Ke Zhang and Jing Li. 2025. HydraInfer: Hybrid disaggregated scheduling for multimodal large language model serving. arXiv:2505.12658. 
Retrieved from https:\/\/arxiv.org\/abs\/2505.12658"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM42981.2021.9488704"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664200"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","unstructured":"Muhammad Usman Hadi Rizwan Qureshi Abbas Shah Muhammad Irfan Anas Zafar Muhammad Bilal Shaikh Naveed Akhtar Jia Wu Seyedali Mirjalili et al. 2023. A survey on large language models: Applications challenges limitations and practical usage. TechRxiv. DOI: 10.36227\/techrxiv.23589741.v3","DOI":"10.36227\/techrxiv.23589741.v3"},{"key":"e_1_3_1_12_2","doi-asserted-by":"crossref","unstructured":"Yijun Hao Shusen Yang Fang Li Yifan Zhang Shibo Wang and Xuebin Ren. 2024. EdgeTimer: Adaptive multi-timescale scheduling in mobile edge computing with deep reinforcement learning. arXiv:2406.07342. Retrieved from https:\/\/arxiv.org\/abs\/2406.07342","DOI":"10.1109\/INFOCOM52122.2024.10621305"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_14_2","unstructured":"Wenxuan Huang Zijie Zhai Yunhang Shen Shaoshen Cao Fei Zhao Xiangfeng Xu Zheyu Ye and Shaohui Lin. 2024. Dynamic-LLaVA: Efficient multimodal large language models via dynamic vision-language context sparsification. arXiv:2412.00876. Retrieved from https:\/\/arxiv.org\/abs\/2412.00876"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.comcom.2020.01.004"},{"key":"e_1_3_1_16_2","article-title":"ImageNet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 
25.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/MNET.2018.1700202"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCC.2018.2871118"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3686803"},{"key":"e_1_3_1_20_2","unstructured":"Kevin Y. Li Sachin Goyal Joao D. Semedo and J. Zico Kolter. 2024. Inference optimal VLMs need only one visual token but larger models. arXiv:2411.03312. Retrieved from https:\/\/arxiv.org\/abs\/2411.03312"},{"key":"e_1_3_1_21_2","unstructured":"Zhihang Lin Mingbao Lin Luxi Lin and Rongrong Ji. 2024. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. arXiv:2405.05803. Retrieved from https:\/\/arxiv.org\/abs\/2405.05803"},{"key":"e_1_3_1_22_2","first-page":"34892","article-title":"Visual instruction tuning","volume":"36","author":"Liu Haotian","year":"2023","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 36, 34892\u201334916.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72643-9_4"},{"key":"e_1_3_1_25_2","unstructured":"Zhenyu Ning Jieru Zhao Qihao Jin Wenchao Ding and Minyi Guo. 2024. Inf-MLLM: Efficient streaming inference of multimodal large language models on a single GPU. arXiv:2409.09086. Retrieved from https:\/\/arxiv.org\/abs\/2409.09086"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3605552"},{"key":"e_1_3_1_27_2","unstructured":"Haoran Qiu Anish Biswas Zihan Zhao Jayashree Mohan Alind Khare Esha Choukse \u00cd\u00f1igo Goiri Zeyu Zhang Haiying Shen Chetan Bansal et al. 
2025. ModServe: Scalable and resource-efficient large multimodal model serving. arXiv:2502.00937. Retrieved from https:\/\/arxiv.org\/abs\/2502.00937"},{"key":"e_1_3_1_28_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_1_29_2","first-page":"2980","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Ross T.-Y. L. P. G.","year":"2017","unstructured":"T.-Y. L. P. G. Ross and G. K. H. P. Doll\u00e1r. 2017. Focal loss for dense object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2980\u20132988."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1142\/S0129065704001899"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3603269.3604830"},{"key":"e_1_3_1_32_2","first-page":"997","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Sui Yanan","year":"2015","unstructured":"Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. 2015. Safe exploration for optimization with Gaussian processes. In Proceedings of the International Conference on Machine Learning. PMLR, 997\u20131005."},{"key":"e_1_3_1_33_2","first-page":"4781","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Sui Yanan","year":"2018","unstructured":"Yanan Sui, Vincent Zhuang, Joel Burdick, and Yisong Yue. 2018. Stagewise safe Bayesian optimization with Gaussian processes. In Proceedings of the International Conference on Machine Learning. 
PMLR, 4781\u20134789."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVT.2019.2895593"},{"key":"e_1_3_1_35_2","volume-title":"Reinforcement Learning: An Introduction","author":"Sutton Richard S.","year":"2018","unstructured":"Richard S. Sutton. 2018. Reinforcement Learning: An Introduction. A Bradford Book, MIT Press."},{"key":"e_1_3_1_36_2","unstructured":"Yunlong Tang Jing Bi Siting Xu Luchuan Song Susan Liang Teng Wang Daoan Zhang Jie An Jingyang Lin Rongyi Zhu et al. 2023. Video understanding with large language models: A survey. arXiv:2312.17432. Retrieved from https:\/\/arxiv.org\/abs\/2312.17432"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/25.3-4.285"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.3390\/smartcities7050094"},{"key":"e_1_3_1_39_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 30.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_40_2","unstructured":"Jiaqi Wang Hanqi Jiang Yiheng Liu Chong Ma Xu Zhang Yi Pan Mengyuan Liu Peiran Gu Sichen Xia Wenjun Li et al. 2024. A comprehensive review of multimodal large language models: Performance and challenges across different tasks. arXiv:2408.01319. 
Retrieved from https:\/\/arxiv.org\/abs\/2408.01319"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData59044.2023.10386743"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS60910.2024.00093"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/GLOBECOM54140.2023.10436771"},{"key":"e_1_3_1_44_2","doi-asserted-by":"crossref","unstructured":"Biwei Yan Kun Li Minghui Xu Yueyan Dong Yue Zhang Zhaochun Ren and Xiuzhen Cheng. 2024. On protecting the data privacy of large language models (LLMs): A survey. arXiv:2403.05156. Retrieved from https:\/\/arxiv.org\/abs\/2403.05156","DOI":"10.1109\/ICMC60390.2024.00008"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52734.2025.01325"},{"key":"e_1_3_1_46_2","unstructured":"Shukang Yin Chaoyou Fu Sirui Zhao Ke Li Xing Sun Tong Xu and Enhong Chen. 2023. A survey on multimodal large language models. arXiv:2306.13549. Retrieved from https:\/\/arxiv.org\/abs\/2306.13549"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2016.2597169"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/INFOCOM52122.2024.10621104"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASID.2018.8693202"}],"container-title":["ACM Transactions on Autonomous and Adaptive 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3774908","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3774908","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T14:10:52Z","timestamp":1773583852000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3774908"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,10]]},"references-count":48,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,3,31]]}},"alternative-id":["10.1145\/3774908"],"URL":"https:\/\/doi.org\/10.1145\/3774908","relation":{},"ISSN":["1556-4665","1556-4703"],"issn-type":[{"value":"1556-4665","type":"print"},{"value":"1556-4703","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,10]]},"assertion":[{"value":"2024-11-05","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-28","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-03-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}