{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T05:10:52Z","timestamp":1779167452534,"version":"3.51.4"},"reference-count":40,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T00:00:00Z","timestamp":1777852800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Funding project","award":["2025-JCJQ-JJ-0710"],"award-info":[{"award-number":["2025-JCJQ-JJ-0710"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>Vision\u2013language models (VLMs) are increasingly deployed in resource-constrained environments, yet efficient fine-tuning remains challenging because post-training quantization often degrades the effectiveness of low-rank adaptation. This paper revisits that mismatch in the context of MobileVLM 1.7B and presents QuantFT-VL, an initialization strategy applied after quantization that aligns the quantized weights with LoRA. The key idea is to initialize LoRA using a low-rank approximation of the quantization residual instead of the default zero-initialization used in QLoRA-style pipelines. After quantizing a pretrained weight matrix W into Q, we compute the residual W \u2212 Q and use truncated singular value decomposition to initialize the LoRA factors (A and B) so that the starting adapted weight Q + AB^T better matches the full-precision model. This residual-aware initialization reduces the discrepancy introduced by quantization and leads to faster and more stable optimization. Experiments on six standard VLM benchmarks show that QuantFT-VL consistently improves over QLoRA and, in the best setting, recovers performance close to or better than that of full-precision LoRA. 
On two RTX 3090 GPUs, QuantFT-VL improves the average benchmark score by 3.27 percentage points over QLoRA while preserving the memory and speed advantages of quantized fine-tuning.<\/jats:p>","DOI":"10.3390\/a19050364","type":"journal-article","created":{"date-parts":[[2026,5,5]],"date-time":"2026-05-05T07:57:29Z","timestamp":1777967849000},"page":"364","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["QuantFT-VL: Harmonizing Quantization and LoRA for Efficient Mobile Vision\u2013Language Model Fine-Tuning"],"prefix":"10.3390","volume":"19","author":[{"given":"Fangyuan","family":"Jin","sequence":"first","affiliation":[{"name":"Science and Technology on Underwater Test and Control Laboratory, Dalian 116023, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hui","family":"Lin","sequence":"additional","affiliation":[{"name":"Science and Technology on Underwater Test and Control Laboratory, Dalian 116023, China"},{"name":"Marine Engineering College, Dalian Maritime University, Dalian 116026, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lu","family":"Zhang","sequence":"additional","affiliation":[{"name":"Science and Technology on Underwater Test and Control Laboratory, Dalian 116023, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3638-2309","authenticated-orcid":false,"given":"Yiwei","family":"Chen","sequence":"additional","affiliation":[{"name":"Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou 215163, China"},{"name":"School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Suzhou 215163, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,5,4]]},"reference":[{"key":"ref_1","unstructured":"Radford, A., Kim, J.W., Hallacy, 
C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the ICML, Virtual."},{"key":"ref_2","unstructured":"Li, J., Li, D., Xiong, C., and Hoi, S. (2022, January 17\u201323). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. Proceedings of the ICML, Baltimore, MD, USA."},{"key":"ref_3","unstructured":"Li, J., Li, D., Savarese, S., and Hoi, S. (2023, January 23\u201329). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Proceedings of the ICML, Honolulu, HI, USA."},{"key":"ref_4","unstructured":"Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. (2023). MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv."},{"key":"ref_5","unstructured":"Liu, H., Li, C., Wu, Q., and Lee, Y.J. (2023, January 10\u201316). Visual instruction tuning. Proceedings of the NeurIPS, New Orleans, LA, USA."},{"key":"ref_6","unstructured":"Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., and Zhao, R. (2023). Shikra: Unleashing multimodal LLM\u2019s referential dialogue magic. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Chen, Z., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., and Lin, D. (2023). ShareGPT4V: Improving large multi-modal models with better captions. arXiv.","DOI":"10.1007\/978-3-031-72643-9_22"},{"key":"ref_8","unstructured":"Chu, X., Qiao, L., Lin, X., Xu, S., Yang, Y., Hu, Y., Wei, F., Zhang, X., Zhang, B., and Wei, X. (2023). MobileVLM: A fast, strong, and open vision language assistant for mobile devices. arXiv."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E.P., Bing, L., Xu, X., Poria, S., and Lee, R.K.W. (2023, January 6\u201310). 
LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models. Proceedings of the EMNLP, Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.319"},{"key":"ref_10","unstructured":"Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022, January 25\u201329). LoRA: Low-rank adaptation of large language models. Proceedings of the ICLR, Online."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023, January 10\u201316). QLoRA: Efficient finetuning of quantized LLMs. Proceedings of the NeurIPS, New Orleans, LA, USA.","DOI":"10.52202\/075280-0441"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. (2023, January 10\u201316). InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Proceedings of the NeurIPS, New Orleans, LA, USA.","DOI":"10.52202\/075280-2142"},{"key":"ref_13","unstructured":"Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., and Shi, Y. (2023). mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv."},{"key":"ref_14","unstructured":"Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., and Wei, F. (2023). KOSMOS-2: Grounding multilingual large language models to the world. arXiv."},{"key":"ref_15","unstructured":"Chu, X., Qiao, L., Zhang, X., Xu, S., Wei, F., Yang, Y., Sun, X., Hu, Y., Lin, X., and Zhang, B. (2024). MobileVLM V2: Faster and stronger baseline for vision language model. arXiv."},{"key":"ref_16","unstructured":"Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., and Ge, W. (2024). Qwen2-VL: Enhancing vision-language model\u2019s perception of the world at any resolution. 
arXiv."},{"key":"ref_17","unstructured":"Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., and Liu, Z. (2024). LLaVA-OneVision: Easy visual task transfer. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Rang, M., Bi, Z., Liu, C., Tang, Y., Han, K., and Wang, Y. (2025). Eve: Efficient multimodal vision language models with elastic visual experts. arXiv.","DOI":"10.1609\/aaai.v39i7.32718"},{"key":"ref_19","unstructured":"Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., and Tang, J. (2025). Qwen2.5-VL Technical Report. arXiv."},{"key":"ref_20","unstructured":"Kimi Team (2025). Kimi-VL Technical Report. arXiv."},{"key":"ref_21","unstructured":"Guo, D., Wu, F., Zhu, F., Leng, F., Shi, G., Chen, H., Fan, H., Wang, J., Jiang, J., and Wang, J. (2025). Seed1.5-VL Technical Report. arXiv."},{"key":"ref_22","unstructured":"GLM-V Team (2025). GLM-4.1V-Thinking and GLM-4.5V: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv."},{"key":"ref_23","unstructured":"Jin, Z., Song, X., Wang, N., Liu, Y., Li, C., Li, X., Wang, R., Li, Z., Qi, Q., and Cheng, L. (2025). AndesVL Technical Report: An efficient mobile-side multimodal large language model. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Li, X.L., and Liang, P. (2021, January 1\u20136). Prefix-tuning: Optimizing continuous prompts for generation. Proceedings of the ACL-IJCNLP, Online.","DOI":"10.18653\/v1\/2021.acl-long.353"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Lester, B., Al-Rfou, R., and Constant, N. (2021, January 7\u201311). The power of scale for parameter-efficient prompt tuning. Proceedings of the EMNLP, Punta Cana, Dominican Republic.","DOI":"10.18653\/v1\/2021.emnlp-main.243"},{"key":"ref_26","unstructured":"Xu, Y., Xie, L., Gu, X., Chen, X., Chang, H., Zhang, H., Chen, Z., Zhang, X., and Tian, Q. (2024, January 7\u201311). 
QA-LoRA: Quantization-aware low-rank adaptation of large language models. Proceedings of the ICLR, Vienna, Austria."},{"key":"ref_27","unstructured":"Li, Y., Yu, Y., Liang, C., He, P., Karampatziakis, N., Chen, W., and Zhao, T. (2024, January 7\u201311). LoftQ: LoRA-fine-tuning-aware quantization for large language models. Proceedings of the ICLR, Vienna, Austria."},{"key":"ref_28","unstructured":"Qin, H., Yu, Y., Liang, C., He, P., Karampatziakis, P., Chen, W., and Zhao, T. (2024, January 21\u201327). Accurate LoRA-finetuning quantization of LLMs via information retention. Proceedings of the ICML, Vienna, Austria."},{"key":"ref_29","unstructured":"Jeon, H., Kim, Y., and Kim, J.-J. (August, January 27). L4Q: Parameter efficient quantization-aware fine-tuning on large language models. Proceedings of the ACL, Vienna, Austria."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yu, J., Zhou, S., Yang, D., Wang, S., Li, S., Hu, X., Xu, C., Xu, Z., Shu, C., and Yuan, Z. (2025). MQuant: Unleashing the inference potential of multimodal large language models via full static quantization. arXiv.","DOI":"10.1145\/3746027.3755433"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Su, Z., Shen, W., Li, L., Chen, Z., Wei, H., Yu, H., and Yuan, K. (2025). AKVQ-VL: Attention-aware KV cache adaptive 2-bit quantization for vision-language models. arXiv.","DOI":"10.1109\/ICME59968.2025.11209367"},{"key":"ref_32","unstructured":"Xue, Y., Huang, Y., Shao, J., Zhu, L., Zhang, C., Li, X., and Zhang, J. (2025). VLMQ: Token saliency-driven post-training quantization for vision-language models. arXiv."},{"key":"ref_33","unstructured":"Das, G., La, V., Lau, E., Shrivastava, A., and Gwilliam, M. (2026). Towards understanding best practices for quantization of vision-language models. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Meng, F., Wang, Z., and Zhang, M. (2024). 
PiSSA: Principal singular values and singular vectors adaptation of large language models. arXiv.","DOI":"10.52202\/079017-3846"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Hudson, D.A., and Manning, C.D. (2019, January 15\u201320). GQA: A new dataset for real-world visual reasoning and compositional question answering. Proceedings of the CVPR, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00686"},{"key":"ref_36","unstructured":"Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., and Kalyan, A. (December, January 28). Learn to explain: Multimodal reasoning via thought chains for science question answering. Proceedings of the NeurIPS, New Orleans, LA, USA."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. (2019, January 15\u201320). Towards VQA models that can read. Proceedings of the CVPR, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00851"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., and Wen, J.R. (2023, January 6\u201310). Evaluating object hallucination in large vision-language models. Proceedings of the EMNLP, Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.20"},{"key":"ref_39","unstructured":"Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., and Sun, X. (2023). MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., and Liu, Z. (2023). MMBench: Is your multi-modal model an all-around player?. 
arXiv.","DOI":"10.1007\/978-3-031-72658-3_13"}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/19\/5\/364\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T04:29:51Z","timestamp":1779164991000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/19\/5\/364"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,5,4]]},"references-count":40,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2026,5]]}},"alternative-id":["a19050364"],"URL":"https:\/\/doi.org\/10.3390\/a19050364","relation":{},"ISSN":["1999-4893"],"issn-type":[{"value":"1999-4893","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,5,4]]}}}