{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:37:58Z","timestamp":1760060278122,"version":"build-2065373602"},"reference-count":71,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2025,8,14]],"date-time":"2025-08-14T00:00:00Z","timestamp":1755129600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62306329","2023JJ40676","2024-JCJQ-QT-034"],"award-info":[{"award-number":["62306329","2023JJ40676","2024-JCJQ-QT-034"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Hunan Provincial Natural Science Foundation of China","award":["62306329","2023JJ40676","2024-JCJQ-QT-034"],"award-info":[{"award-number":["62306329","2023JJ40676","2024-JCJQ-QT-034"]}]},{"DOI":"10.13039\/100010097","name":"China Association for Science and Technology","doi-asserted-by":"publisher","award":["62306329","2023JJ40676","2024-JCJQ-QT-034"],"award-info":[{"award-number":["62306329","2023JJ40676","2024-JCJQ-QT-034"]}],"id":[{"id":"10.13039\/100010097","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Scene Knowledge-guided Visual Grounding (SK-VG) is a multi-modal detection task built upon conventional visual grounding (VG) for human\u2013computer interaction scenarios. It utilizes an additional passage of scene knowledge apart from the image and context-dependent textual query for referred object localization. 
Due to the inherent difficulty in directly establishing correlations between the given query and the image without leveraging scene knowledge, this task imposes significant demands on a multi-step knowledge reasoning process to achieve accurate grounding. Off-the-shelf VG models underperform under such a setting due to the requirement of detailed description in the query and a lack of knowledge inference based on implicit narratives of the visual scene. Recent Vision\u2013Language Models (VLMs) exhibit improved cross-modal reasoning capabilities. However, their monolithic architectures, particularly in lightweight implementations, struggle to maintain coherent reasoning chains across sequential logical deductions, leading to error accumulation in knowledge integration and object localization. To address the above-mentioned challenges, we propose SplitGround\u2014a collaborative framework that strategically decomposes complex reasoning processes by fusing the input query and image with knowledge through two auxiliary modules. Specifically, it implements an Agentic Annotation Workflow (AAW) for explicit image annotation and a Synonymous Conversion Mechanism (SCM) for semantic query transformation. This hierarchical decomposition enables VLMs to focus on essential reasoning steps while offloading auxiliary cognitive tasks to specialized modules, effectively splitting long reasoning chains into manageable subtasks with reduced complexity. Comprehensive evaluations on the SK-VG benchmark demonstrate the significant advancements of our method. 
Remarkably, SplitGround attains an accuracy improvement of 15.71% on the hard split of the test set over the previous training-required SOTA, using only a compact VLM backbone without fine-tuning, which provides new insights for knowledge-intensive visual grounding tasks.<\/jats:p>","DOI":"10.3390\/bdcc9080209","type":"journal-article","created":{"date-parts":[[2025,8,14]],"date-time":"2025-08-14T14:51:46Z","timestamp":1755183106000},"page":"209","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["SplitGround: Long-Chain Reasoning Split via Modular Multi-Expert Collaboration for Training-Free Scene Knowledge-Guided Visual Grounding"],"prefix":"10.3390","volume":"9","author":[{"given":"Xilong","family":"Qin","sequence":"first","affiliation":[{"name":"College of Systems Engineering, National University of Defense Technology, Changsha 410073, China"}]},{"given":"Yue","family":"Hu","sequence":"additional","affiliation":[{"name":"College of Systems Engineering, National University of Defense Technology, Changsha 410073, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0467-3830","authenticated-orcid":false,"given":"Wansen","family":"Wu","sequence":"additional","affiliation":[{"name":"Navy Submarine Academy, Qingdao 266000, China"}]},{"given":"Xinmeng","family":"Li","sequence":"additional","affiliation":[{"name":"Hunan Institute of Advanced Technology, Changsha 410205, China"}]},{"given":"Quanjun","family":"Yin","sequence":"additional","affiliation":[{"name":"College of Systems Engineering, National University of Defense Technology, Changsha 410073, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,8,14]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Chen, Z., Zhang, R., Song, Y., Wan, X., and Li, G. (2023, January 17\u201324). Advancing Visual Grounding With Scene Knowledge: Benchmark and Method. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01444"},{"key":"ref_2","first-page":"38","article-title":"Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection","volume":"Volume 15105","author":"Leonardis","year":"2024","journal-title":"Proceedings of the Computer Vision-ECCV 2024-18th European Conference"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Yan, B., Jiang, Y., Wu, J., Wang, D., Luo, P., Yuan, Z., and Lu, H. (2023, January 17\u201324). Universal Instance Perception as Object Discovery and Retrieval. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01471"},{"key":"ref_4","unstructured":"Wang, P., Yang, A., Men, R., Lin, J., Bai, S., and Li, Z. (2022). OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. arXiv."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Ma, Z., Gao, X., Shakiah, S., Gao, Q., and Chai, J. (2024, January 16\u201322). Groundhog Grounding Large Language Models to Holistic Segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.01349"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"101688","DOI":"10.1016\/j.aei.2022.101688","article-title":"Detection and location of unsafe behaviour in digital images: A visual grounding approach","volume":"53","author":"Liu","year":"2022","journal-title":"Adv. Eng. 
Inform."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"103075","DOI":"10.1016\/j.aei.2024.103075","article-title":"Automatic identification of integrated construction elements using open-set object detection based on image and text modality fusion","volume":"64","author":"Cai","year":"2025","journal-title":"Adv. Eng. Inform."},{"key":"ref_8","unstructured":"Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., and Kivlichan, I. (2024). GPT-4o System Card. arXiv."},{"key":"ref_9","unstructured":"Liu, H., Li, C., Li, Y., and Lee, Y.J. (2023, January 17\u201324). Improved Baselines with Visual Instruction Tuning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada."},{"key":"ref_10","unstructured":"Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. (2023). Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv."},{"key":"ref_11","unstructured":"Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., and Ge, W. (2024). Qwen2-VL: Enhancing Vision-Language Model\u2019s Perception of the World at Any Resolution. arXiv."},{"key":"ref_12","unstructured":"Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., and Tang, J. (2025). Qwen2.5-VL Technical Report. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Dorkenwald, M., Barazani, N., Snoek, C.G.M., and Asano, Y.M. (2024). PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs. arXiv.","DOI":"10.1109\/CVPR52733.2024.01286"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1145\/3703155","article-title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","volume":"43","author":"Huang","year":"2025","journal-title":"ACM Trans. Inf. 
Syst."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"103246","DOI":"10.1016\/j.aei.2025.103246","article-title":"An integrated approach for automatic safety inspection in construction: Domain knowledge with multimodal large language model","volume":"65","author":"Wang","year":"2025","journal-title":"Adv. Eng. Inform."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"684","DOI":"10.1109\/TPAMI.2019.2911066","article-title":"Learning to Compose and Reason with Language Tree Structures for Visual Grounding","volume":"44","author":"Hong","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Liu, X., Wang, Z., Shao, J., Wang, X., and Li, H. (2019, January 15\u201319). Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00205"},{"key":"ref_18","unstructured":"Liu, D., Zhang, H., Zha, Z.J., and Feng, W. (November, January 27). Learning to Assemble Neural Module Tree Networks for Visual Grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Bajaj, M., Wang, L., and Sigal, L. (November, January 27). G3raphGround: Graph-Based Language Grounding. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00438"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Yang, S., Li, G., and Yu, Y. (November, January 27). Dynamic Graph Attention for Referring Expression Comprehension. 
Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00474"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Yang, S., Li, G., and Yu, Y. (2020, January 13\u201319). Graph-Structured Referring Expression Reasoning in the Wild. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA. Computer Vision Foundation.","DOI":"10.1109\/CVPR42600.2020.00997"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Huang, B., Lian, D., Luo, W., and Gao, S. (2021, January 19\u201325). Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual.","DOI":"10.1109\/CVPR46437.2021.01661"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Luo, G., Zhou, Y., Sun, X., Cao, L., Wu, C., Deng, C., and Ji, R. (2020, January 13\u201319). Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA. Computer Vision Foundation.","DOI":"10.1109\/CVPR42600.2020.01005"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., and Luo, J. (2019, January 27\u201328). A Fast and Accurate One-Stage Approach to Visual Grounding. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00478"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Yang, Z., Chen, T., Wang, L., and Luo, J. (2020, January 23\u201328). Improving One-stage Visual Grounding by Recursive Sub-query Construction. 
Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58568-6_23"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Deng, J., Yang, Z., Chen, T., Zhou, W., and Li, H. (2021, January 11\u201317). TransVG: End-to-End Visual Grounding with Transformers. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Virtual.","DOI":"10.1109\/ICCV48922.2021.00179"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"134","DOI":"10.1109\/TNNLS.2021.3090426","article-title":"A Real-Time Global Inference Network for One-Stage Referring Expression Comprehension","volume":"34","author":"Zhou","year":"2023","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_28","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very Deep Convolutional Networks for Large-Scale Image Recognition. Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-Term Memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Chen, L., Ma, W., Xiao, J., Zhang, H., and Chang, S. (2021, January 2\u20139). Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding. Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual.","DOI":"10.1609\/aaai.v35i2.16188"},{"key":"ref_31","unstructured":"Redmon, J., and Farhadi, A. (2018). YOLOv3: An Incremental Improvement. 
arXiv."},{"key":"ref_32","first-page":"121670","article-title":"SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion","volume":"37","author":"Dai","year":"2024","journal-title":"Proc. Adv. Neural Inf. Process. Syst."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Yang, Z., Gan, Z., Wang, J., Hu, X., Ahmed, F., Liu, Z., Lu, Y., and Wang, L. (2022, January 23\u201327). UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling. Proceedings of the ECCV, 2022: 17th European Conference, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-20059-5_30"},{"key":"ref_34","unstructured":"Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L.H., Dai, X., Wang, L., Yuan, L., Hwang, J.N., and Gao, J. (2022). GLIPv2: Unifying Localization and Vision-Language Understanding. arXiv."},{"key":"ref_35","unstructured":"Kang, W., Qu, M., Wei, Y., and Yan, Y. (2024). ACTRESS: Active Retraining for Semi-supervised Visual Grounding. arXiv."},{"key":"ref_36","unstructured":"Kang, W., Zhou, L., Wu, J., Sun, C., and Yan, Y. (2024). Visual Grounding with Attention-Driven Constraint Balancing. arXiv."},{"key":"ref_37","first-page":"546","article-title":"SiRi: A Simple Selective Retraining Mechanism for Transformer-Based Visual Grounding","volume":"Volume 13695","author":"Avidan","year":"2022","journal-title":"Proceedings of the Computer Vision\u2014ECCV 2022-17th European Conference"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Yang, L., Xu, Y., Yuan, C., Liu, W., Li, B., and Hu, W. (2022, January 18\u201324). Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00928"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Ye, J., Tian, J., Yan, M., Yang, X., Wang, X., Zhang, J., He, L., and Lin, X. (2022, January 18\u201324). 
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01506"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"13636","DOI":"10.1109\/TPAMI.2023.3296823","article-title":"TransVG++: End-to-End Visual Grounding With Language Conditioned Vision Transformer","volume":"45","author":"Deng","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Vedaldi, A., Bischof, H., Brox, T., and Frahm, J.M. (2020, January 23\u201328). End-to-End Object Detection with Transformers. Proceedings of the Computer Vision-ECCV, Glasgow, UK.","DOI":"10.1007\/978-3-030-58583-9"},{"key":"ref_42","unstructured":"Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., and Shum, H. (2023, January 1\u20135). DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda."},{"key":"ref_43","unstructured":"Ren, T., Jiang, Q., Liu, S., Zeng, Z., Liu, W., Gao, H., Huang, H., Ma, Z., Jiang, X., and Chen, Y. (arXiv, 2024). Grounding DINO 1.5: Advance the \"Edge\" of Open-Set Object Detection, arXiv."},{"key":"ref_44","unstructured":"Ren, T., Chen, Y., Jiang, Q., Zeng, Z., Xiong, Y., Liu, W., Ma, Z., Shen, J., Gao, Y., and Jiang, X. (2024). DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding. 
arXiv."},{"key":"ref_45","first-page":"8748","article-title":"Learning Transferable Visual Models From Natural Language Supervision","volume":"Volume 139","author":"Meila","year":"2021","journal-title":"Proceedings of the 38th International Conference on Machine Learning (ICML 2021)"},{"key":"ref_46","unstructured":"Cai, J., Kankanhalli, M.S., Prabhakaran, B., Boll, S., Subramanian, R., Zheng, L., Singh, V.K., C\u00e9sar, P., Xie, L., and Xu, D. (November, January 28). HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding. Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Kim, S., Kang, M., Kim, D., Park, J., and Kwak, S. (2024, January 4\u20137). Extending CLIP\u2019s Image-Text Alignment to Referring Image Segmentation. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico.","DOI":"10.18653\/v1\/2024.naacl-long.258"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"3469","DOI":"10.1109\/TMM.2023.3311646","article-title":"SgVA-CLIP: Semantic-Guided Visual Adapting of Vision-Language Models for Few-Shot Image Classification","volume":"26","author":"Peng","year":"2024","journal-title":"IEEE Trans. Multimed."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., and Liu, T. (2022, January 18\u201324). CRIS: CLIP-Driven Referring Image Segmentation. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01139"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., and Houlsby, N. (2022). Simple Open-Vocabulary Object Detection with Vision Transformers. 
arXiv.","DOI":"10.1007\/978-3-031-20080-9_42"},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"4334","DOI":"10.1109\/TMM.2023.3321501","article-title":"CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding","volume":"26","author":"Xiao","year":"2023","journal-title":"IEEE Trans. Multimed."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Jin, L., Luo, G., Zhou, Y., Sun, X., Jiang, G., Shu, A., and Ji, R. (2023, January 17\u201324). RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension. Proceedings of the 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00263"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Sun, J., Luo, G., Zhou, Y., Sun, X., Jiang, G., Wang, Z., and Ji, R. (2023, January 11\u201315). RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension. Proceedings of the 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR52729.2023.01835"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Minderer, M., Gritsenko, A., and Houlsby, N. (arXiv, 2023). Scaling Open-Vocabulary Object Detection, arXiv.","DOI":"10.1007\/978-3-031-20080-9_42"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., and Gao, J. (2022, January 18\u201324). Grounded Language-Image Pre-training. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01069"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Xiao, B., Wu, H., Xu, W., Dai, X., Hu, H., Lu, Y., Zeng, M., Liu, C., and Yuan, L. (2024, January 16\u201322). Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. 
Proceedings of the 2024 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.00461"},{"key":"ref_57","unstructured":"Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is All you Need. Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"103478","DOI":"10.1016\/j.aei.2025.103478","article-title":"Tailored vision-language framework for automated hazard identification and report generation in construction sites","volume":"66","author":"Chen","year":"2025","journal-title":"Adv. Eng. Inform."},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"103208","DOI":"10.1016\/j.aei.2025.103208","article-title":"FD-LLM: Large language model for fault diagnosis of complex equipment","volume":"65","author":"Lin","year":"2025","journal-title":"Adv. Eng. Inform."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Zhang, H., Li, H., Li, F., Ren, T., Zou, X., Liu, S., Huang, S., Gao, J., Zhang, L., and Li, C. (October, January 29). LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models. Proceedings of the Computer Vision\u2014ECCV 2024, Milan, Italy.","DOI":"10.1007\/978-3-031-72775-7_2"},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., and Khan, F.S. (2024, January 16\u201322). GLaMM: Pixel Grounding Large Multimodal Model. Proceedings of the 2024 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.01236"},{"key":"ref_62","unstructured":"Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., and Song, X. (2024, January 9\u201315). CogVLM: Visual Expert for Pretrained Language Models. 
Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., and Ding, M. (2024, January 16\u201322). CogAgent: A Visual Language Model for GUI Agents. Proceedings of the 2024 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.01354"},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Yang, J., Chen, X., Qian, S., Madaan, N., Iyengar, M., Fouhey, D.F., and Chai, J. (2024, January 13\u201317). LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent. Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2024, Yokohama, Japan.","DOI":"10.1109\/ICRA57147.2024.10610443"},{"key":"ref_65","unstructured":"Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., and Liu, J. (2024, January 7\u20139). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations. Proceedings of the Conference on Language Models (COLM), Philadelphia, PA, USA."},{"key":"ref_66","first-page":"68539","article-title":"Toolformer: Language Models Can Teach Themselves to Use Tools","volume":"Volume 36","author":"Oh","year":"2023","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"ref_67","first-page":"38154","article-title":"HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face","volume":"Volume 36","author":"Oh","year":"2023","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"ref_68","unstructured":"Zhao, H., Ge, W., and Cong Chen, Y. (2024). LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding. 
arXiv."},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Li, R., Li, S., Kong, L., Yang, X., and Liang, J. (2025, January 14\u201315). SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR52734.2025.00351"},{"key":"ref_70","doi-asserted-by":"crossref","unstructured":"Shahriar, S., Lund, B.D., Mannuru, N.R., Arshad, M.A., Hayawi, K., Bevara, R.V.K., Mannuru, A., and Batool, L. (2024). Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency. Appl. Sci., 14.","DOI":"10.20944\/preprints202406.1635.v1"},{"key":"ref_71","doi-asserted-by":"crossref","unstructured":"Kamath, A., Singh, M., LeCun, Y., Misra, I., Synnaeve, G., and Carion, N. (2021). MDETR\u2013Modulated Detection for End-to-End Multi-Modal Understanding. arXiv.","DOI":"10.1109\/ICCV48922.2021.00180"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/8\/209\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:27:36Z","timestamp":1760034456000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/8\/209"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,14]]},"references-count":71,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2025,8]]}},"alternative-id":["bdcc9080209"],"URL":"https:\/\/doi.org\/10.3390\/bdcc9080209","relation":{},"ISSN":["2504-2289"],"issn-type":[{"type":"electronic","value":"2504-2289"}],"subject":[],"published":{"date-parts":[[2025,8,14]]}}}