{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T06:27:43Z","timestamp":1763447263730,"version":"3.45.0"},"reference-count":63,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T00:00:00Z","timestamp":1763424000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Artif. Intell."],"abstract":"<jats:p>Recent advances in large vision-language models (LVLMs) have transformed visual recognition research by enabling multimodal integration of images, text, and videos. This fusion supports a deeper and more context-aware understanding of visual environments. However, the application of LVLMs to multitask visual recognition in real-world construction scenarios remains underexplored. In this study, we present a resource-efficient framework for fine-tuning LVLMs tailored to autonomous excavator operations, with a focus on robust detection of humans and obstacles, as well as classification of weather conditions on consumer-grade hardware. By leveraging Quantized Low-Rank Adaptation (QLoRA) in conjunction with the Unsloth framework, our method substantially reduces memory consumption and accelerates fine-tuning compared with conventional approaches. We comprehensively evaluate a domain-specific excavator-vision dataset using five open-source LVLMs. These include Llama-3.2-Vision, Qwen2-VL, Qwen2.5-VL, LLaVA-1.6, and Gemma 3. Each model is fine-tuned on 1,000 annotated frames and tested on 2000 images. Experimental results demonstrate significant improvements in both object detection and weather classification, with Qwen2-VL-7B achieving an mAP@50 of 88.03%, mAP@[0.50:0.95] of 74.20%, accuracy of 84.54%, and F1 score of 78.83%. Our fine-tuned Qwen2-VL-7B model not only detects humans and obstacles robustly but also classifies weather accurately. These results illustrate the feasibility of deploying LVLM-based multimodal AI agents for safety monitoring, pose estimation, activity tracking, and strategic planning in autonomous excavator operations.<\/jats:p>","DOI":"10.3389\/frai.2025.1681277","type":"journal-article","created":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T06:23:29Z","timestamp":1763447009000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Resource-efficient fine-tuning of large vision-language models for multimodal perception in autonomous excavators"],"prefix":"10.3389","volume":"8","author":[{"given":"Hung Viet","family":"Nguyen","sequence":"first","affiliation":[]},{"given":"Hyojin","family":"Park","sequence":"additional","affiliation":[]},{"given":"Namhyun","family":"Yoo","sequence":"additional","affiliation":[]},{"given":"Jinhong","family":"Yang","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,11,18]]},"reference":[{"key":"ref1","year":"2024"},{"key":"ref2","first-page":"99","article-title":"Preliminary study: use of large generative artificial intelligence models in integrated Project Management","author":"Aramali","year":"2024"},{"key":"ref3","doi-asserted-by":"publisher","first-page":"104440","DOI":"10.1016\/j.autcon.2022.104440","article-title":"Artificial intelligence and smart vision for building and construction 4.0: machine and deep learning methods and applications","volume":"141","author":"Baduge","year":"2022","journal-title":"Autom. Constr."},{"key":"ref4","doi-asserted-by":"publisher","first-page":"13923","DOI":"10.48550\/arXiv.2502.13923","article-title":"Qwen2.5-VL technical report","volume":"2025","author":"Bai","year":"2025","journal-title":"arXiv"},{"key":"ref5","doi-asserted-by":"publisher","first-page":"103075","DOI":"10.1016\/j.aei.2024.103075","article-title":"Automatic identification of integrated construction elements using open-set object detection based on image and text modality fusion","volume":"64","author":"Cai","year":"2025","journal-title":"Adv. Eng. Inform."},{"key":"ref6","doi-asserted-by":"publisher","first-page":"105158","DOI":"10.1016\/j.autcon.2023.105158","article-title":"Augmented reality, deep learning and vision-language query system for construction worker safety","volume":"157","author":"Chen","year":"2024","journal-title":"Autom. Constr."},{"key":"ref7","doi-asserted-by":"publisher","first-page":"104702","DOI":"10.1016\/j.autcon.2022.104702","article-title":"Automatic vision-based calculation of excavator earthmoving productivity using zero-shot learning activity recognition","volume":"146","author":"Chen","year":"2023","journal-title":"Autom. Constr."},{"key":"ref8","first-page":"10088","article-title":"QLoRA: efficient Finetuning of quantized LLMs","volume-title":"Advances in neural information processing systems","author":"Dettmers","year":"2023"},{"key":"ref9","doi-asserted-by":"publisher","first-page":"3255","DOI":"10.3390\/buildings14103255","article-title":"Effectiveness of generative AI for post-earthquake damage assessment","volume":"14","author":"Est\u00eav\u00e3o","year":"2024","journal-title":"Buildings"},{"key":"ref10","doi-asserted-by":"publisher","first-page":"226","DOI":"10.1080\/01446193.2024.2415676","article-title":"Application of large language models to intelligently analyze long construction contract texts","volume":"43","author":"Gao","year":"2025","journal-title":"Constr. Manag. Econ."},{"key":"ref11","doi-asserted-by":"publisher","first-page":"105470","DOI":"10.1016\/j.autcon.2024.105470","article-title":"Zero-shot monitoring of construction workers\u2019 personal protective equipment based on image captioning","volume":"164","author":"Gil","year":"2024","journal-title":"Autom. Constr."},{"key":"ref12","year":"2025"},{"key":"ref13","doi-asserted-by":"publisher","first-page":"21783","DOI":"10.48550\/arXiv.2407.21783","article-title":"The llama 3 herd of models","volume":"2024","author":"Grattafiori","year":"2024","journal-title":"arXiv"},{"key":"ref14","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1111\/mice.13310","article-title":"Integrated vision language and foundation model for automated estimation of building lowest floor elevation","volume":"40","author":"Ho","year":"2025","journal-title":"Comput. Aided Civ. Inf. Eng."},{"key":"ref15","author":"Hsu","year":"2024"},{"key":"ref16","doi-asserted-by":"publisher","first-page":"5068","DOI":"10.3390\/app14125068","article-title":"From large language models to large multimodal models: a literature review","volume":"14","author":"Huang","year":"2024","journal-title":"Appl. Sci."},{"key":"ref17","doi-asserted-by":"publisher","first-page":"103076","DOI":"10.1016\/j.aei.2024.103076","article-title":"Hybrid large language model approach for prompt and sensitive defect management: a comparative analysis of hybrid, non-hybrid, and GraphRAG approaches","volume":"64","author":"Jeon","year":"2025","journal-title":"Adv. Eng. Inform."},{"key":"ref18","doi-asserted-by":"publisher","first-page":"17981","DOI":"10.48550\/arXiv.2401.17981","article-title":"From training-free to adaptive: empirical insights into MLLMs\u2019 understanding of detection information","volume":"2024","author":"Jiao","year":"2024","journal-title":"arXiv"},{"key":"ref19","doi-asserted-by":"publisher","first-page":"105483","DOI":"10.1016\/j.autcon.2024.105483","article-title":"Visualsitediary: a detector-free vision-language transformer model for captioning photologs for daily construction reporting and image retrievals","volume":"165","author":"Jung","year":"2024","journal-title":"Autom. Constr."},{"key":"ref20","doi-asserted-by":"publisher","first-page":"345","DOI":"10.1111\/mice.13086","article-title":"Improving visual question answering for bridge inspection by pre-training with external data of image\u2013text pairs","volume":"39","author":"Kunlamai","year":"2024","journal-title":"Comput. Aided Civ. Inf. Eng."},{"key":"ref21","author":"Kwon","year":"2023"},{"key":"ref22","doi-asserted-by":"publisher","first-page":"557","DOI":"10.48550\/arXiv.2504.00557","article-title":"Efficient LLaMA-3.2-vision by trimming cross-attended visual features","volume":"2025","author":"Lee","year":"2025","journal-title":"arXiv"},{"key":"ref23","first-page":"740","article-title":"Microsoft COCO","volume-title":"Common objects in context., in computer vision \u2013 ECCV 2014","author":"Lin","year":"2014"},{"key":"ref24","year":"2025"},{"key":"ref25","author":"Liu","year":"2025"},{"key":"ref26","doi-asserted-by":"publisher","first-page":"105891","DOI":"10.1016\/j.autcon.2024.105891","article-title":"Automated legal consulting in construction procurement using metaheuristically optimized large language models","volume":"170","author":"Liu","year":"2025","journal-title":"Autom. Constr."},{"key":"ref27","year":"2025"},{"key":"ref28","author":"Luo","year":"2023"},{"key":"ref29","doi-asserted-by":"publisher","first-page":"103312","DOI":"10.1016\/j.autcon.2020.103312","article-title":"On-site autonomous construction robots: towards unsupervised building","volume":"119","author":"Melenbrink","year":"2020","journal-title":"Autom. Constr."},{"key":"ref30","year":""},{"key":"ref31","year":""},{"key":"ref32","doi-asserted-by":"publisher","first-page":"103940","DOI":"10.1016\/j.autcon.2021.103940","article-title":"Computer vision applications in construction: current state, opportunities & challenges","volume":"132","author":"Paneru","year":"2021","journal-title":"Autom. Constr."},{"key":"ref33","doi-asserted-by":"publisher","first-page":"142824","DOI":"10.1016\/j.jclepro.2024.142824","article-title":"Large language models for life cycle assessments: opportunities, challenges, and risks","volume":"466","author":"Preuss","year":"2024","journal-title":"J. Clean. Prod."},{"key":"ref34","doi-asserted-by":"publisher","first-page":"124601","DOI":"10.1016\/j.eswa.2024.124601","article-title":"Autorepo: a general framework for multimodal LLM-based automated construction reporting","volume":"255","author":"Pu","year":"","journal-title":"Expert Syst. Appl."},{"key":"ref35","author":"Pu","year":""},{"key":"ref36","doi-asserted-by":"publisher","first-page":"106103","DOI":"10.1016\/j.autcon.2025.106103","article-title":"Large language model-empowered paradigm for automated geotechnical site planning and geological characterization","volume":"173","author":"Qian","year":"2025","journal-title":"Autom. Constr."},{"key":"ref37","year":"2024"},{"key":"ref38","year":"2025"},{"key":"ref39","doi-asserted-by":"crossref","first-page":"299","DOI":"10.1007\/978-3-031-78447-7_20","article-title":"Enhancing object detection by leveraging large language models for contextual knowledge","volume-title":"Pattern recognition","author":"Rouhi","year":"2025"},{"key":"ref40","doi-asserted-by":"publisher","first-page":"11285","DOI":"10.48550\/arXiv.2411.11285","article-title":"Zero-shot automatic annotation and instance segmentation using LLM-generated datasets: eliminating field imaging and manual annotation for deep learning model development","volume":"2025","author":"Sapkota","year":"","journal-title":"arXiv"},{"key":"ref41","doi-asserted-by":"publisher","first-page":"18505","DOI":"10.48550\/arXiv.2502.18505","article-title":"Comprehensive analysis of transparency and accessibility of ChatGPT, DeepSeek, and other SoTA large language models","volume":"2025","author":"Sapkota","year":"","journal-title":"arXiv"},{"key":"ref42","doi-asserted-by":"publisher","first-page":"18648","DOI":"10.48550\/arXiv.2501.18648","article-title":"Multimodal large language models for image, text, and speech data augmentation: a survey","volume":"2025","author":"Sapkota","year":"","journal-title":"arXiv"},{"key":"ref43","author":"Tang","year":"2024"},{"key":"ref44","doi-asserted-by":"publisher","first-page":"19786","DOI":"10.48550\/arXiv.2503.19786","article-title":"Gemma 3 technical report","volume":"2025","author":"Team","year":"2025","journal-title":"arXiv"},{"key":"ref45","doi-asserted-by":"publisher","first-page":"225","DOI":"10.1016\/j.aei.2015.03.006","article-title":"Status quo and open challenges in vision-based sensing and tracking of temporary resources on infrastructure construction sites","volume":"29","author":"Teizer","year":"2015","journal-title":"Adv. Eng. Inform."},{"key":"ref46","doi-asserted-by":"publisher","first-page":"105863","DOI":"10.1016\/j.autcon.2024.105863","article-title":"Construction safety inspection with contrastive language-image pre-training (CLIP) image captioning and attention","volume":"169","author":"Tsai","year":"2025","journal-title":"Autom. Constr."},{"key":"ref47","year":"2024"},{"key":"ref48","year":""},{"key":"ref49","year":""},{"key":"ref50","doi-asserted-by":"publisher","first-page":"12191","DOI":"10.48550\/arXiv.2409.12191","article-title":"Qwen2-VL: enhancing vision-language model\u2019s perception of the world at any resolution","volume":"2024","author":"Wang","year":"2024","journal-title":"arXiv"},{"key":"ref51","doi-asserted-by":"publisher","first-page":"105995","DOI":"10.1016\/j.autcon.2025.105995","article-title":"Crack image classification and information extraction in steel bridges using multimodal large language models","volume":"171","author":"Wang","year":"2025","journal-title":"Autom. Constr."},{"key":"ref52","author":"Wen","year":"2024"},{"key":"ref53","doi-asserted-by":"publisher","first-page":"104082","DOI":"10.1016\/j.compind.2024.104082","article-title":"Construction contract risk identification based on knowledge-augmented language models","author":"Wong","year":"2024","journal-title":"Comput. Ind."},{"key":"ref54","doi-asserted-by":"publisher","first-page":"103158","DOI":"10.1016\/j.aei.2025.103158","article-title":"Retrieval augmented generation-driven information retrieval and question answering in construction management","volume":"65","author":"Wu","year":"2025","journal-title":"Adv. Eng. Inform."},{"key":"ref55","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1016\/j.neucom.2020.01.085","article-title":"Recent advances in deep learning for object detection","volume":"396","author":"Wu","year":"2020","journal-title":"Neurocomputing"},{"key":"ref56","doi-asserted-by":"publisher","first-page":"105874","DOI":"10.1016\/j.autcon.2024.105874","article-title":"Automated daily report generation from construction videos using ChatGPT and computer vision","volume":"168","author":"Xiao","year":"2024","journal-title":"Autom. Constr."},{"key":"ref57","doi-asserted-by":"publisher","first-page":"105880","DOI":"10.1016\/j.autcon.2024.105880","article-title":"Automated physics-based modeling of construction equipment through data fusion","volume":"168","author":"Xu","year":"2024","journal-title":"Autom. Constr."},{"key":"ref58","doi-asserted-by":"publisher","first-page":"105565","DOI":"10.1016\/j.autcon.2024.105565","article-title":"Enhancing cyber risk identification in the construction industry using language models","volume":"165","author":"Yao","year":"2024","journal-title":"Autom. Constr."},{"key":"ref59","doi-asserted-by":"publisher","first-page":"1536","DOI":"10.1111\/mice.12954","article-title":"Prompt engineering for zero-shot and few-shot defect detection and classification using a visual-language pretrained model","volume":"38","author":"Yong","year":"2023","journal-title":"Comput. Aided Civ. Inf. Eng."},{"key":"ref60","doi-asserted-by":"publisher","first-page":"04024022","DOI":"10.1061\/JCCEE5.CPENG-5744","article-title":"Explainable image captioning to identify ergonomic problems and solutions for construction workers","volume":"38","author":"Yong","year":"2024","journal-title":"J. Comput. Civ. Eng."},{"key":"ref61","doi-asserted-by":"publisher","first-page":"825","DOI":"10.1007\/s11263-024-02214-4","article-title":"Contextual object detection with multimodal large language models","volume":"133","author":"Zang","year":"2025","journal-title":"Int. J. Comput. Vis."},{"key":"ref62","doi-asserted-by":"publisher","first-page":"105067","DOI":"10.1016\/j.autcon.2023.105067","article-title":"Dynamic prompt-based virtual assistant framework for BIM information search","volume":"155","author":"Zheng","year":"2023","journal-title":"Autom. Constr."},{"key":"ref63","doi-asserted-by":"publisher","first-page":"103142","DOI":"10.1016\/j.aei.2025.103142","article-title":"Augmenting general-purpose large-language models with domain-specific multimodal knowledge graph for question-answering in construction project management","volume":"65","author":"Zhou","year":"2025","journal-title":"Adv. Eng. Inform."}],"container-title":["Frontiers in Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1681277\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T06:23:32Z","timestamp":1763447012000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1681277\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,18]]},"references-count":63,"alternative-id":["10.3389\/frai.2025.1681277"],"URL":"https:\/\/doi.org\/10.3389\/frai.2025.1681277","relation":{},"ISSN":["2624-8212"],"issn-type":[{"value":"2624-8212","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,18]]},"article-number":"1681277"}}