{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,29]],"date-time":"2025-12-29T11:46:38Z","timestamp":1767008798096,"version":"build-2065373602"},"reference-count":28,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2025,11,9]],"date-time":"2025-11-09T00:00:00Z","timestamp":1762646400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"U.S. Department of Transportation, Office of the Assistant Secretary for Research and Technology (OST-R), University Transportation Centers Program","award":["69A3552348304"],"award-info":[{"award-number":["69A3552348304"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we introduce a novel structured prompting and multi-agent collaborative knowledge distillation framework that enables automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large vision\u2013language models (VLMs): GPT-4o and o3-mini, using a structured Chain-of-Thought (CoT) strategy to produce rich, multiperspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a much smaller student VLM. The resulting compact 3B-scale model, named VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of understanding low-resolution traffic videos and generating semantically faithful, risk-aware captions. Despite its significantly reduced parameter count, VISTA achieves strong performance across established captioning metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) when benchmarked against its teacher models. This demonstrates that effective knowledge distillation and structured role-aware supervision can empower lightweight VLMs to capture complex reasoning capabilities. The compact architecture of VISTA facilitates efficient deployment on edge devices, enabling real-time risk monitoring without requiring extensive infrastructure upgrades.<\/jats:p>","DOI":"10.3390\/computers14110490","type":"journal-article","created":{"date-parts":[[2025,11,10]],"date-time":"2025-11-10T13:51:08Z","timestamp":1762782668000},"page":"490","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Structured Prompting and Collaborative Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-2000-2361","authenticated-orcid":false,"given":"Yunxiang","family":"Yang","sequence":"first","affiliation":[{"name":"Smart Mobility and Infrastructure Laboratory, College of Engineering, University of Georgia, Athens, GA 30602, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-5989-558X","authenticated-orcid":false,"given":"Ningning","family":"Xu","sequence":"additional","affiliation":[{"name":"Smart Mobility and Infrastructure Laboratory, College of Engineering, University of Georgia, Athens, GA 30602, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4823-6322","authenticated-orcid":false,"given":"Jidong J.","family":"Yang","sequence":"additional","affiliation":[{"name":"Smart Mobility and Infrastructure Laboratory, College of Engineering, University of Georgia, Athens, GA 30602, USA"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,9]]},"reference":[{"key":"ref_1","unstructured":"Rivera, J., Lin, K., and Adeli, E. (March, January 26). Scenario Understanding of Traffic Scenes Through Large Visual Language Models. Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA."},{"key":"ref_2","unstructured":"Zhang, Y., Liu, L., Zhang, H., Wang, X., and Li, M. (2024). Semantic Understanding of Traffic Scenes with Large Vision-Language Models. arXiv."},{"key":"ref_3","unstructured":"Zheng, O., Abdel-Aty, M., Wang, D., Wang, Z., and Ding, S. (2023). ChatGPT Is on the Horizon: Could a Large Language Model Be All We Need for Intelligent Transportation?. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"244","DOI":"10.1016\/j.aap.2014.06.017","article-title":"A review of the effect of traffic and weather characteristics on road safety","volume":"72","author":"Theofilatos","year":"2014","journal-title":"Accid. Anal. Prev."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"5625","DOI":"10.1109\/TPAMI.2024.3369699","article-title":"Vision-Language Models for Vision Tasks: A Survey","volume":"46","author":"Zhang","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Li, Z., Wu, X., Du, H., Nghiem, H., and Shi, G. (2025). Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey. arXiv.","DOI":"10.32388\/GXR68Q"},{"key":"ref_7","unstructured":"Xu, H., Jin, L., Wang, X., Wang, L., and Liu, C. (2024). A Survey on Multi-Agent Foundation Models: Progress and Challenges. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Yang, C., Zhu, Y., Lu, W., Wang, Y., Chen, Q., Gao, C., Yan, B., and Chen, Y. (2024). Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application. ACM Trans. Intell. Syst. Technol.","DOI":"10.1145\/3699518"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"19472","DOI":"10.52202\/079017-0614","article-title":"ShareGPT4Video: Improving Video Understanding and Generation with Better Captions","volume":"37","author":"Chen","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Cao, X., Zhou, T., Ma, Y., Ye, W., Cui, C., Tang, K., Cao, Z., Liang, K., Wang, Z., and Rehg, J.M. (2024, January 16\u201322). MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.02061"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Lohner, A., Compagno, F., Francis, J., and Oltramari, A. (2024, January 22\u201323). Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding. Proceedings of the 2024 IEEE International Automated Vehicle Validation Conference (IAVVC), Pittsburgh, PA, USA.","DOI":"10.1109\/IAVVC63304.2024.10786395"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"508","DOI":"10.3390\/automation5040029","article-title":"Leveraging Multimodal Large Language Models (MLLMs) for Enhanced Object Detection and Scene Understanding in Thermal Images for Autonomous Driving Systems","volume":"5","author":"Ashqar","year":"2024","journal-title":"Automation"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Shriram, S., Perisetla, S., Keskar, A., Krishnaswamy, H., Westerhof Bossen, T.E., M\u00f8gelmose, A., and Greer, R. (2025, January 17\u201321). Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety. Proceedings of the 2025 IEEE 21st International Conference on Automation Science and Engineering (CASE), Los Angeles, CA, USA.","DOI":"10.1109\/CASE58245.2025.11163861"},{"key":"ref_14","unstructured":"Kugo, N., Li, X., Li, Z., Gupta, A., Khatua, A., Jain, N., Patel, C., Kyuragi, Y., Ishii, Y., and Tanabiki, M. (2025). VideoMultiAgents: A Multi-Agent Framework for Video Question Answering. arXiv."},{"key":"ref_15","unstructured":"Jiang, B., Zhuang, Z., Shivakumar, S.S., Roth, D., and Taylor, C.J. (2024). Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"92166","DOI":"10.1109\/ACCESS.2022.3202526","article-title":"A Multimodal Framework for Video Caption Generation","volume":"10","author":"Bhooshan","year":"2022","journal-title":"IEEE Access"},{"key":"ref_17","unstructured":"Yang, Y. (2025, August 08). Vision-Informed Safety and Transportation Assessment (VISTA). Available online: https:\/\/github.com\/winstonyang117\/Vision-informed-Safety-and-Transportation-Assessment."},{"key":"ref_18","unstructured":"OpenAI (2023). GPT-4 Technical Report, OpenAI. Technical Report."},{"key":"ref_19","first-page":"24824","article-title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","volume":"35","author":"Wei","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_20","unstructured":"OpenAI (2025, November 04). ChatGPT (o3-mini). 31 January 2025. Available online: https:\/\/openai.com\/index\/openai-o3-mini\/."},{"key":"ref_21","unstructured":"Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., and Tang, J. (2025). Qwen2.5-VL Technical Report. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 6\u201312). BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL), Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_23","unstructured":"Banerjee, S., and Lavie, A. (2005, January 23). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Association for Computational Linguistics, Ann Arbor, MI, USA."},{"key":"ref_24","unstructured":"Lin, C.Y. (2004, January 25\u201326). ROUGE: A Package for Automatic Evaluation of Summaries. Proceedings of the Workshop on Text Summarization Branches Out (WAS), Barcelona, Spain."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015, January 7\u201312). CIDEr: Consensus-Based Image Description Evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_26","unstructured":"Deng, C., Li, Y., Jiang, H., Li, W., Zhang, Y., Zhao, H., and Zhou, P. (2024, January 17\u201318). CityLLaVA: Efficient Fine-Tuning for Vision-Language Models in City Scenario. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1007\/s10462-025-11236-4","article-title":"Parameter-efficient fine-tuning in large language models: A survey of methodologies","volume":"58","author":"Wang","year":"2025","journal-title":"Artif. Intell. Rev."},{"key":"ref_28","unstructured":"Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W. (2022, January 25\u201329). LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of the Tenth International Conference on Learning Representations (ICLR), Virtual Conference."}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/11\/490\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,10]],"date-time":"2025-11-10T14:09:55Z","timestamp":1762783795000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/11\/490"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,9]]},"references-count":28,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2025,11]]}},"alternative-id":["computers14110490"],"URL":"https:\/\/doi.org\/10.3390\/computers14110490","relation":{},"ISSN":["2073-431X"],"issn-type":[{"type":"electronic","value":"2073-431X"}],"subject":[],"published":{"date-parts":[[2025,11,9]]}}}