{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:38:01Z","timestamp":1760146681358,"version":"build-2065373602"},"reference-count":21,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2024,11,29]],"date-time":"2024-11-29T00:00:00Z","timestamp":1732838400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Currently, closed-set object detection models represented by YOLO are widely deployed in the industrial field. However, such closed-set models lack sufficient tuning ability for easily confused objects in complex detection scenarios. Open-set object detection models such as GroundingDINO expand the detection range to a certain extent, but they still have a gap in detection accuracy compared with closed-set detection models and cannot meet the requirements for high-precision detection in practical applications. In addition, existing detection technologies are also insufficient in interpretability, making it difficult to clearly show users the basis and process of judgment of detection results, causing users to have doubts about the trust and application of detection results. Based on the above deficiencies, we propose a new object detection algorithm based on multi-modal large language models that significantly improves the detection effect of closed-set object detection models for more difficult boundary tasks while ensuring detection accuracy, thereby achieving a semi-open set object detection algorithm. It has significant improvements in accuracy and interpretability under the verification of seven common traffic and safety production scenarios.<\/jats:p>","DOI":"10.3390\/bdcc8120175","type":"journal-article","created":{"date-parts":[[2024,12,2]],"date-time":"2024-12-02T03:50:57Z","timestamp":1733111457000},"page":"175","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Semi-Open Set Object Detection Algorithm Leveraged by Multi-Modal Large Language Models"],"prefix":"10.3390","volume":"8","author":[{"given":"Kewei","family":"Wu","sequence":"first","affiliation":[{"name":"School of Artificial Intelligence, Beijing University of Posts and Telecommunications, 10 Xitucheng Rd, Beijing 100876, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yiran","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Beijing University of Posts and Telecommunications, 10 Xitucheng Rd, Beijing 100876, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-3533-0148","authenticated-orcid":false,"given":"Xiaogang","family":"He","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Beijing University of Posts and Telecommunications, 10 Xitucheng Rd, Beijing 100876, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jinyu","family":"Yan","sequence":"additional","affiliation":[{"name":"Beijing Zhuoshizhitong Technology Co., Ltd., Beijing 100096, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yang","family":"Guo","sequence":"additional","affiliation":[{"name":"Beijing Zhuoshizhitong Technology Co., Ltd., Beijing 100096, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhuqing","family":"Jiang","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Beijing University of Posts and Telecommunications, 10 Xitucheng Rd, Beijing 100876, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xing","family":"Zhang","sequence":"additional","affiliation":[{"name":"China Resources Digital Co., Ltd., Beijing 518049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wei","family":"Wang","sequence":"additional","affiliation":[{"name":"China Resources Digital Co., Ltd., Beijing 518049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yongping","family":"Xiong","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Beijing University of Posts and Telecommunications, 10 Xitucheng Rd, Beijing 100876, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Aidong","family":"Men","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Beijing University of Posts and Telecommunications, 10 Xitucheng Rd, Beijing 100876, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Li","family":"Xiao","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Beijing University of Posts and Telecommunications, 10 Xitucheng Rd, Beijing 100876, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2024,11,29]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1109\/JPROC.2023.3238524","article-title":"Object detection in 20 years: A survey","volume":"111","author":"Zou","year":"2023","journal-title":"Proc. IEEE"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"261","DOI":"10.1007\/s11263-019-01247-4","article-title":"Deep learning for generic object detection: A survey","volume":"128","author":"Liu","year":"2020","journal-title":"Int. J. Comput. Vis."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"3212","DOI":"10.1109\/TNNLS.2018.2876865","article-title":"Object detection with deep learning: A review","volume":"30","author":"Zhao","year":"2019","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Dhamija, A., Gunther, M., Ventura, J., and Boult, T. (2020, January 2\u20135). The overlooked elephant of object detection: Open set. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, The Westin Snowmass Resort, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093355"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1007\/s13748-019-00203-0","article-title":"Convolutional neural network: A review of models, methodologies and applications to object detection","volume":"9","author":"Dhillon","year":"2020","journal-title":"Prog. Artif. Intell."},{"key":"ref_6","first-page":"886","article-title":"Histograms of oriented gradients for human detection","volume":"2","author":"Navneet","year":"2005","journal-title":"Int. Conf. Comput. Vis. Pattern Recognit."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1627","DOI":"10.1109\/TPAMI.2009.167","article-title":"Object detection with discriminatively trained part-based models","volume":"32","author":"Felzenszwalb","year":"2009","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"39","author":"Ren","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016, January 11\u201314). Ssd: Single shot multibox detector. Proceedings of the Computer Vision\u2013ECCV 2016: 14th European Conference, Amsterdam, The Netherlands. Proceedings, Part I 14.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"154","DOI":"10.1007\/s11263-013-0620-5","article-title":"Selective search for object recognition","volume":"104","author":"Uijlings","year":"2013","journal-title":"Int. J. Comput. Vis."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., and Zhu, J. (2023). Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv.","DOI":"10.1007\/978-3-031-72970-6_3"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., and Liu, S. (2024, January 17\u201318). Recognize anything: A strong image tagging model. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPRW63382.2024.00179"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E. (2023). A survey on multimodal large language models. arXiv.","DOI":"10.1093\/nsr\/nwae403"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"481","DOI":"10.1016\/j.future.2019.08.026","article-title":"CO-STAR: A collaborative prediction service for short-term trends on continuous spatio-temporal data","volume":"102","author":"Ding","year":"2020","journal-title":"Future Gener. Comput. Syst."},{"key":"ref_18","unstructured":"Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Pfeiffer, J., R\u00fcckl\u00e9, A., Poth, C., Kamath, A., Vuli\u0107, I., Ruder, S., Cho, K., and Gurevych, I. (2020). Adapterhub: A framework for adapting transformers. arXiv.","DOI":"10.18653\/v1\/2020.emnlp-demos.7"},{"key":"ref_20","unstructured":"Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., and Zhao, T. (2023). AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Sohan, M., Sai Ram, T., Reddy, R., and Venkata, C. (2024). A review on yolov8 and its advancements. Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Springer.","DOI":"10.1007\/978-981-99-7962-2_39"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/12\/175\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:42:46Z","timestamp":1760114566000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/12\/175"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,29]]},"references-count":21,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["bdcc8120175"],"URL":"https:\/\/doi.org\/10.3390\/bdcc8120175","relation":{},"ISSN":["2504-2289"],"issn-type":[{"type":"electronic","value":"2504-2289"}],"subject":[],"published":{"date-parts":[[2024,11,29]]}}}