{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T22:44:01Z","timestamp":1765233841461,"version":"build-2065373602"},"reference-count":45,"publisher":"MDPI AG","issue":"17","license":[{"start":{"date-parts":[[2024,9,6]],"date-time":"2024-09-06T00:00:00Z","timestamp":1725580800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Chinese Ministry of Transportation In Service Trunk Highway Infrastructure and Safety Emergency Digitization Project","award":["2023-26","23-04X"],"award-info":[{"award-number":["2023-26","23-04X"]}]},{"name":"Transportation Research Project of Department of Transport of Shaanxi Province","award":["2023-26","23-04X"],"award-info":[{"award-number":["2023-26","23-04X"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>As a fundamental element of the transportation system, traffic signs are widely used to guide traffic behaviors. In recent years, drones have emerged as an important tool for monitoring the conditions of traffic signs. However, the existing image processing technique is heavily reliant on image annotations. It is time consuming to build a high-quality dataset with diverse training images and human annotations. In this paper, we introduce the utilization of Vision\u2013language Models (VLMs) in the traffic sign detection task. Without the need for discrete image labels, the rapid deployment is fulfilled by the multi-modal learning and large-scale pretrained networks. First, we compile a keyword dictionary to explain traffic signs. The Chinese national standard is used to suggest the shape and color information. Our program conducts Bootstrapping Language-image Pretraining v2 (BLIPv2) to translate representative images into text descriptions. Second, a Contrastive Language-image Pretraining (CLIP) framework is applied to characterize not only drone images but also text descriptions. Our method utilizes the pretrained encoder network to create visual features and word embeddings. Third, the category of each traffic sign is predicted according to the similarity between drone images and keywords. Cosine distance and softmax function are performed to calculate the class probability distribution. To evaluate the performance, we apply the proposed method in a practical application. The drone images captured from Guyuan, China, are employed to record the conditions of traffic signs. Further experiments include two widely used public datasets. The calculation results indicate that our vision\u2013language model-based method has an acceptable prediction accuracy and low training cost.<\/jats:p>","DOI":"10.3390\/s24175800","type":"journal-article","created":{"date-parts":[[2024,9,6]],"date-time":"2024-09-06T06:18:35Z","timestamp":1725603515000},"page":"5800","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["A Vision\u2013Language Model-Based Traffic Sign Detection Method for High-Resolution Drone Images: A Case Study in Guyuan, China"],"prefix":"10.3390","volume":"24","author":[{"given":"Jianqun","family":"Yao","sequence":"first","affiliation":[{"name":"CCCC Infrastructure Maintenance Group Co., Ltd., Beijing 100011, China"}]},{"given":"Jinming","family":"Li","sequence":"additional","affiliation":[{"name":"CCCC Infrastructure Maintenance Group Co., Ltd., Beijing 100011, China"}]},{"given":"Yuxuan","family":"Li","sequence":"additional","affiliation":[{"name":"CCCC Infrastructure Maintenance Group Co., Ltd., Beijing 100011, China"}]},{"given":"Mingzhu","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Transportation Engineering, Chang\u2019an University, Xi\u2019an 710064, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0115-1674","authenticated-orcid":false,"given":"Chen","family":"Zuo","sequence":"additional","affiliation":[{"name":"School of Transportation Engineering, Chang\u2019an University, Xi\u2019an 710064, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3147-7341","authenticated-orcid":false,"given":"Shi","family":"Dong","sequence":"additional","affiliation":[{"name":"School of Transportation Engineering, Chang\u2019an University, Xi\u2019an 710064, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3718-0511","authenticated-orcid":false,"given":"Zhe","family":"Dai","sequence":"additional","affiliation":[{"name":"School of Transportation Engineering, Chang\u2019an University, Xi\u2019an 710064, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,9,6]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Canese, L., Cardarilli, G.C., Di Nunzio, L., Fazzolari, R., Famil Ghadakchi, H., Re, M., and Span\u00f2, S. (2022). Sensing and Detection of Traffic Signs Using CNNs: An Assessment on Their Performance. Sensors, 22.","DOI":"10.3390\/s22228830"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Sanyal, B., Mohapatra, R.K., and Dash, R. (2020, January 10\u201312). Traffic Sign Recognition: A Survey. Proceedings of the 2020 International Conference on Artificial Intelligence and Signal Processing (AISP), Amaravati, India.","DOI":"10.1109\/AISP48273.2020.9072976"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Lim, X.R., Lee, C.P., Lim, K.M., Ong, T.S., Alqahtani, A., and Ali, M. (2023). Recent Advances in Traffic Sign Recognition: Approaches and Datasets. Sensors, 23.","DOI":"10.3390\/s23104674"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"4277","DOI":"10.1109\/TVT.2022.3144358","article-title":"DroneSegNet: Robust Aerial Semantic Segmentation for UAV-Based IoT Applications","volume":"71","author":"Chakravarthy","year":"2022","journal-title":"IEEE Trans. Veh. Technol."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"102388","DOI":"10.1016\/j.aei.2024.102388","article-title":"From Global Challenges to Local Solutions: A Review of Cross-country Collaborations and Winning Strategies in Road Damage Detection","volume":"60","author":"Arya","year":"2024","journal-title":"Adv. Eng. Inform."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Du, J., Zhang, R., Gao, R., Nan, L., and Bao, Y. (2024). RSDNet: A New Multiscale Rail Surface Defect Detection Model. Sensors, 24.","DOI":"10.3390\/s24113579"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"2807","DOI":"10.1007\/s00521-017-2887-x","article-title":"Traffic sign recognition based on color, shape, and pictogram classification using support vector machines","volume":"30","author":"Madani","year":"2018","journal-title":"Neural Comput. Appl."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Kerim, A., and Efe, M.\u00d6. (2021, January 13\u201316). Recognition of Traffic Signs with Artificial Neural Networks: A Novel Dataset and Algorithm. Proceedings of the 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju Island, Republic of Korea.","DOI":"10.1109\/ICAIIC51459.2021.9415238"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Soni, D., Chaurasiya, R.K., and Agrawal, S. (2019, January 20\u201322). Improving the Classification Accuracy of Accurate Traffic Sign Detection and Recognition System Using HOG and LBP Features and PCA-Based Dimension Reduction. Proceedings of the International Conference on Sustainable Computing in Science, Technology and Management (SUSCOM), Amity University Rajasthan, Jaipur, India.","DOI":"10.2139\/ssrn.3358756"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Namyang, N., and Phimoltares, S. (2020, January 21\u201322). Thai traffic sign classification and recognition system based on histogram of gradients, color layout descriptor, and normalized correlation coefficient. Proceedings of the 2020-5th International Conference on Information Technology (InCIT), Chonburi, Thailand.","DOI":"10.1109\/InCIT50588.2020.9310778"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"03014","DOI":"10.1051\/shsconf\/202214403014","article-title":"Research on the Optimal Machine Learning Classifier for Traffic Signs","volume":"Volume 144","author":"Wang","year":"2022","journal-title":"Proceedings of the SHS Web of Conferences"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"975","DOI":"10.1109\/TITS.2018.2843815","article-title":"Real-time traffic sign recognition based on efficient CNNs in the wild","volume":"20","author":"Li","year":"2018","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_13","first-page":"165","article-title":"Traffic sign classification comparison between various convolution neural network models","volume":"12","author":"Sokipriala","year":"2021","journal-title":"Int. J. Sci. Eng. Res."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"17779","DOI":"10.1007\/s11042-022-12163-0","article-title":"Traffic sign recognition based on deep learning","volume":"81","author":"Zhu","year":"2022","journal-title":"Multimed. Tools Appl."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You Only Look Once: Unified, Real-Time Object Detection. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Li, X., and Geng, S. (2023, January 7\u20138). Improved traffic sign detection for YOLOv5s. Proceedings of the IEEE 4th International Conference on Computer Engineering and Application, Hangzhou, China.","DOI":"10.1109\/ICCEA58433.2023.10135461"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"16632","DOI":"10.1109\/TITS.2022.3170354","article-title":"Traffic Sign Detection and Recognition in Multiimages Using a Fusion Model With YOLO and VGG Network","volume":"23","author":"Yu","year":"2022","journal-title":"Trans. Intell. Transp. Syst."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"142","DOI":"10.1109\/TPAMI.2015.2437384","article-title":"Region-Based Convolutional Networks for Accurate Object Detection and Segmentation","volume":"38","author":"Girshick","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"39","author":"Ren","year":"2017","journal-title":"Trans. Pattern Anal. Mach. Intell."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"29742","DOI":"10.1109\/ACCESS.2020.2972338","article-title":"A cascaded R-CNN with multiscale attention and imbalanced samples for traffic sign detection","volume":"8","author":"Zhang","year":"2020","journal-title":"IEEE Access"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"369","DOI":"10.1007\/s12243-019-00731-9","article-title":"Lightweight deep network for traffic sign classification","volume":"75","author":"Zhang","year":"2020","journal-title":"Ann. Telecommun."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Triki, N., Karray, M., and Ksantini, M. (2023). A Real-Time Traffic Sign Recognition Method Using a New Attention-Based Deep Convolutional Neural Network for Smart Vehicles. Appl. Sci., 13.","DOI":"10.3390\/app13084793"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1155","DOI":"10.1007\/s11554-022-01252-w","article-title":"Real-time traffic sign detection based on multiscale attention and spatial information aggregator","volume":"19","author":"Zhang","year":"2022","journal-title":"J. Real-Time Image Process."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020, January 23\u201328). End-to-end object detection with transformers. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_25","unstructured":"Zhang, J., Huang, J., Jin, S., and Lu, S. (2023). Vision-Language Models for Vision Tasks: A Survey. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Jaiswal, A., Ramesh Babu, A., Zaki Zadeh, M., Banerjee, D., and Makedon, F. (2021). A Survey on Contrastive Self-Supervised Learning. Technologies, 9.","DOI":"10.3390\/technologies9010002"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Gui, J., Chen, T., Zhang, J., Cao, Q., Sun, Z., Luo, H., and Tao, D. (2023). A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends. arXiv.","DOI":"10.1109\/TPAMI.2024.3415112"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"200","DOI":"10.1145\/3505244","article-title":"Transformers in Vision: A Survey","volume":"54","author":"Khan","year":"2022","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_29","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021). An Image is Worth 16\u00d716 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2017, January 21\u201326). YOLO9000: Better, faster, stronger. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_32","unstructured":"Redmon, J., and Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv."},{"key":"ref_33","unstructured":"Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y.M. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv."},{"key":"ref_34","unstructured":"Ultralytics (2020, November 01). YOLOv5. Available online: https:\/\/github.com\/ultralytics\/yolov5."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y.M. (2023, January 17\u201324). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Proceedings of the 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00721"},{"key":"ref_36","unstructured":"Jocher, G., Chaurasia, A., and Qiu, J. (2024, June 20). Ultralytics YOLOv8. Available online: https:\/\/github.com\/ultralytics\/ultralytics."},{"key":"ref_37","unstructured":"(2022). Traffic Signs (Standard No. GB 5768-2022)."},{"key":"ref_38","unstructured":"Li, J., Li, D., Xiong, C., and Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv."},{"key":"ref_39","unstructured":"Li, J., Li, D., Savarese, S., and Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models. arXiv."},{"key":"ref_40","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual Event."},{"key":"ref_41","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention Is All You Need. arXiv."},{"key":"ref_42","first-page":"23","article-title":"CCTSDB 2021: A more comprehensive traffic sign detection benchmark","volume":"12","author":"Zhang","year":"2022","journal-title":"Hum.-Centric Comput. Inf. Sci."},{"key":"ref_43","unstructured":"Zhu, Z., Liang, D., Zhang, S., Huang, X., Li, B., and Hu, S. (July, January 26). Traffic-sign detection and classification in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1437","DOI":"10.1109\/TETCI.2024.3349464","article-title":"A robust real-time anchor-free traffic sign detector with one-level feature","volume":"8","author":"Zhang","year":"2024","journal-title":"IEEE Trans. Emerg. Top. Comput. Intell."},{"key":"ref_45","first-page":"1922","article-title":"FCOS: A simple and strong anchor free object detector","volume":"44","author":"Tian","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/17\/5800\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T15:49:46Z","timestamp":1760111386000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/17\/5800"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,6]]},"references-count":45,"journal-issue":{"issue":"17","published-online":{"date-parts":[[2024,9]]}},"alternative-id":["s24175800"],"URL":"https:\/\/doi.org\/10.3390\/s24175800","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2024,9,6]]}}}