{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T06:37:08Z","timestamp":1775111828353,"version":"3.50.1"},"reference-count":53,"publisher":"MDPI AG","issue":"15","license":[{"start":{"date-parts":[[2023,7,29]],"date-time":"2023-07-29T00:00:00Z","timestamp":1690588800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Natural Science Foundation of China","award":["41404022"],"award-info":[{"award-number":["41404022"]}]},{"name":"National Natural Science Foundation of China","award":["2021-JCJQ-JJ-0871"],"award-info":[{"award-number":["2021-JCJQ-JJ-0871"]}]},{"name":"National Key Basic Research Strengthen Foundation of China","award":["41404022"],"award-info":[{"award-number":["41404022"]}]},{"name":"National Key Basic Research Strengthen Foundation of China","award":["2021-JCJQ-JJ-0871"],"award-info":[{"award-number":["2021-JCJQ-JJ-0871"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>The detection of infrared vehicle targets by UAVs poses significant challenges in the presence of complex ground backgrounds, high target density, and a large proportion of small targets, which result in high false alarm rates. To alleviate these deficiencies, a novel YOLOv7-based, multi-scale target detection method for infrared vehicle targets is proposed, which is termed YOLO-ViT. Firstly, within the YOLOv7-based framework, the lightweight MobileViT network is incorporated as the feature extraction backbone network to fully extract the local and global features of the object and reduce the complexity of the model. 
Secondly, an innovative C3-PANet neural network structure is delicately designed, which adopts the CARAFE upsampling method to utilize the semantic information in the feature map and improve the model\u2019s recognition accuracy of the target region. In conjunction with the C3 structure, the receptive field will be increased to enhance the network\u2019s accuracy in recognizing small targets and model generalization ability. Finally, the K-means++ clustering method is utilized to optimize the anchor box size, leading to the design of anchor boxes better suited for detecting small infrared targets from UAVs, thereby improving detection efficiency. The present article showcases experimental findings attained through the use of the HIT-UAV public dataset. The results demonstrate that the enhanced YOLO-ViT approach, in comparison to the original method, achieves a reduction in the number of parameters by 49.9% and floating-point operations by 67.9%. Furthermore, the mean average precision (mAP) exhibits an improvement of 0.9% over the existing algorithm, reaching a value of 94.5%, which validates the effectiveness of the method for UAV infrared vehicle target detection.<\/jats:p>","DOI":"10.3390\/rs15153778","type":"journal-article","created":{"date-parts":[[2023,7,31]],"date-time":"2023-07-31T01:48:50Z","timestamp":1690768130000},"page":"3778","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":67,"title":["YOLO-ViT-Based Method for Unmanned Aerial Vehicle Infrared Vehicle Target Detection"],"prefix":"10.3390","volume":"15","author":[{"given":"Xiaofeng","family":"Zhao","sequence":"first","affiliation":[{"name":"Xi\u2019an Research Institute of High-Tech, Xi\u2019an 710025, China"}]},{"given":"Yuting","family":"Xia","sequence":"additional","affiliation":[{"name":"Xi\u2019an Research Institute of High-Tech, Xi\u2019an 710025, 
China"}]},{"given":"Wenwen","family":"Zhang","sequence":"additional","affiliation":[{"name":"Xi\u2019an Research Institute of High-Tech, Xi\u2019an 710025, China"}]},{"given":"Chao","family":"Zheng","sequence":"additional","affiliation":[{"name":"Xi\u2019an Research Institute of High-Tech, Xi\u2019an 710025, China"}]},{"given":"Zhili","family":"Zhang","sequence":"additional","affiliation":[{"name":"Xi\u2019an Research Institute of High-Tech, Xi\u2019an 710025, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,7,29]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1109\/MGRS.2021.3115137","article-title":"Deep Learning for Unmanned Aerial Vehicle-Based Object Detection and Tracking: A survey","volume":"10","author":"Wu","year":"2022","journal-title":"Geosci. Remote Sens."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Bai, H., and Chen, T. (2023). Special Vehicle Detection from UAV Perspective via YOLO-GNS Based Deep Learning Network. Drones, 7.","DOI":"10.3390\/drones7020117"},{"key":"ref_3","first-page":"1828848","article-title":"YOLOv5-Based Vehicle Detection Method for High-Resolution UAV Images","volume":"2022","author":"Chen","year":"2022","journal-title":"Mob. Inf. Syst."},{"key":"ref_4","first-page":"e6726","article-title":"SI-EDTL: Swarm intelligence ensemble deep transfer learning for multiple vehicle detection in UAV images","volume":"34","author":"Shokouhifar","year":"2021","journal-title":"Concurr. Comput. Pract. Exp."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"8614117","DOI":"10.1155\/2023\/8614117","article-title":"Multi-UAV Search and Rescue with Enhanced A\u2217 Algorithm Path Planning in 3D Environment","volume":"2023","author":"Du","year":"2023","journal-title":"Int. J. Aerosp. 
Eng."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"553","DOI":"10.3233\/IDT-190138","article-title":"Design of search and rescue system using autonomous Multi-UAVs","volume":"14","author":"Choutri","year":"2021","journal-title":"Intell. Decis. Technol."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Patel, T., Guo, B.H., van der Walt, J.D., and Zou, Y. (2022). Effective Motion Sensors and Deep Learning Techniques for Unmanned Ground Vehicle (UGV)-Based Automated Pavement Layer Change Detection in Road Construction. Buildings, 13.","DOI":"10.3390\/buildings13010005"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1464","DOI":"10.3390\/rs15051464","article-title":"Local Convergence Index-Based Infrared Small Target Detection against Complex Scenes","volume":"15","author":"Cao","year":"2023","journal-title":"Remote Sens."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1016\/j.isprsjprs.2021.08.002","article-title":"Multi-scale adversarial network for vehicle detection in UAV imagery","volume":"180","author":"Zhang","year":"2021","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"102152","DOI":"10.1016\/j.sysarc.2021.102152","article-title":"A Survey of Deep Learning Techniques for Vehicle Detection from UAV Images","volume":"117","author":"Srivastava","year":"2021","journal-title":"J. Syst. Archit."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"6047","DOI":"10.1109\/TNNLS.2021.3080276","article-title":"Vehicle Detection From UAV Imagery With Deep Learning: A Review","volume":"33","author":"Bouguettaya","year":"2021","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Gao, P., Tian, T., Zhao, T., and Li, L. (2022). GF-Detection: Fusion with GAN of Infrared and Visible Images for Vehicle Detection at Nighttime. 
Remote Sens., 14.","DOI":"10.3390\/rs14122771"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Fan, Y., Qiu, Q., Hou, S., Li, Y., Xie, J., Qin, M., and Chu, F. (2022). Application of Improved YOLOv5 in Aerial Photographing Infrared Vehicle Detection. Electronics, 11.","DOI":"10.3390\/electronics11152344"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"33","DOI":"10.34768\/amcs-2023-0003","article-title":"Infrared Small\u2013Target Detection under a Complex Background Based on a Local Gradient Contrast Method","volume":"33","author":"Yang","year":"2023","journal-title":"Int. J. Appl. Math. Comput. Sci."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Lin, T., Maire, M., and Belongie, S. (2014, January 6\u201312). Microsoft COCO: Common Objects in Context. Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015, January 7\u201313). Fast R-CNN. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_19","unstructured":"Liu, S., Ma, Z., and Chen, B. (2021). 
Artificial Intelligence in China, Springer."},{"key":"ref_20","unstructured":"Wei, L., Dragomir, A., Dumitru, E., and Szegedy, C. (2016). SSD: Single Shot MultiBox Detector, Springer."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You Only Look Once: Unified, Real-Time Object Detection. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_22","unstructured":"Redmon, J., and Farhadi, A. (2018). YOLOv3: An Incremental Improvement. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Wang, C.Y., Bochkovskiy, A., and Liao, H.Y.M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv.","DOI":"10.1109\/UV56588.2022.10185474"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"7894","DOI":"10.1049\/joe.2019.0710","article-title":"Small vehicles detection based on UAV","volume":"2019","author":"Chen","year":"2019","journal-title":"J. Eng."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Benjdira, B., Khursheed, T., Koubaa, A., Ammar, A., and Ouni, K. (2019, January 5\u20137). Car Detection using Unmanned Aerial Vehicles: Comparison between Faster R-CNN and YOLOv3. Proceedings of the 2019 1st International Conference on Unmanned Vehicle Systems-Oman (UVS), Muscat, Oman.","DOI":"10.1109\/UVS.2019.8658300"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Qiu, M., Huang, L., and Tang, B.H. (2022). ASFF-YOLOv5: Multielement Detection Method for Road Traffic in UAV Images Based on Multiscale Feature Fusion. 
Remote Sens., 14.","DOI":"10.3390\/rs14143498"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"2152008","DOI":"10.1142\/S021800142152008X","article-title":"CAFFNet: Channel Attention and Feature Fusion Network for Multi-target Traffic Sign Detection","volume":"35","author":"Liu","year":"2021","journal-title":"Int. J. Pattern Recognit. Artif. Intell."},{"key":"ref_28","unstructured":"Liu, Y. (2020). Dense Multiscale Feature Fusion Pyramid Networks for Object Detection in UAV-Captured Images. arXiv."},{"key":"ref_29","unstructured":"Zhu, P.F., Wen, L., Bian, X., Ling, H., and Hu, Q. (2018). Vision Meets Drones: A Challenge. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"92","DOI":"10.1049\/ipr2.12331","article-title":"Road infrared target detection with I-YOLO","volume":"16","author":"Sun","year":"2021","journal-title":"IET Image Process."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Tang, T., Zhou, S., Deng, Z., Zou, H., and Lei, L. (2017). Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining. Sensors, 17.","DOI":"10.3390\/s17020336"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Zhao, Q., Liu, B., Lyu, S., and Wang, C. (2023). TPH-YOLOv5++: Boosting Object Detection on Drone-Captured Scenarios with Cross-Layer Asymmetric Transformer. Remote Sens., 15.","DOI":"10.3390\/rs15061687"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Zuo, Z., Tong, X., Wei, J., Su, S., Wu, P., Guo, R., and Sun, B. (2022). AFFPN: Attention Fusion Feature Pyramid Network for Small Infrared Target Detection. Remote Sens., 14.","DOI":"10.3390\/rs14143412"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Yao, S., Zhu, Q., Zhang, T., Cui, W., and Yan, P. (2022). Infrared Image Small-Target Detection Based on Improved FCOS and Spatio-Temporal Features. 
Electronics, 11.","DOI":"10.3390\/electronics11060933"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Zhang, M., Li, B., Wang, T., and Bai, H. (2023). CHFNet: Curvature Half-Level Fusion Network for Single-Frame Infrared Small Target Detection. Remote Sens., 15.","DOI":"10.3390\/rs15061573"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"141861","DOI":"10.1109\/ACCESS.2021.3120870","article-title":"YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection","volume":"9","author":"Li","year":"2021","journal-title":"IEEE Access"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Dai, Y., Wu, Y., Zhou, F., and Barnard, K. (2021, January 3\u20138). Asymmetric Contextual Modulation for Infrared Small Target Detection. Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA.","DOI":"10.1109\/WACV48630.2021.00099"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Zhang, M., Zhang, R., Yang, Y., Bai, H., Zhang, J., and Guo, J. (2022, January 19\u201324). ISNet: Shape Matters for Infrared Small Target Detection. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00095"},{"key":"ref_39","unstructured":"Sutskever, I., Vinyals, O., and Le, Q.V. (2014). Sequence to Sequence Learning with Neural Networks. Adv. Neural Inf. Process. Syst., 3104\u20133112."},{"key":"ref_40","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Wang, W., Xie, E., Li, X., Fan, D., Song, K., Liang, D., Lu, T., and Shao, L. (2021, January 10\u201317). Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. 
Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z., EH Tay, F., Feng, J., and Yan, S. (2021). Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. arXiv.","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"ref_43","unstructured":"Vedaldi, A., Bischof, H., Brox, T., and Frahm, J.M. Lecture Notes in Computer Science, Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23\u201328 August 2020, Springer."},{"key":"ref_44","unstructured":"Liu, F., Gao, C., Chen, F., Meng, D., Zuo, W., and Gao, X. (2021). Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds. arXiv."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Chen, G., Wang, W., and Tan, S. (2022). IRSTFormer: A Hierarchical Vision Transformer for Infrared Small Target Detection. Remote Sens., 14.","DOI":"10.3390\/rs14143258"},{"key":"ref_46","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J\u00e9gou, H. (2020). Training data-efficient image transformers & distillation through attention. arXiv."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Rao, Y., Liu, Z., Zhao, W., Zhou, J., and Lu, J. (2022). Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks. arXiv.","DOI":"10.1109\/TPAMI.2023.3263826"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1038\/s41597-023-02066-6","article-title":"HIT-UAV: A high-altitude infrared thermal dataset for Unmanned Aerial Vehicle-based object detection","volume":"10","author":"Suo","year":"2023","journal-title":"Sci. Data"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Suo, J., Wang, T., Zhang, X., Chen, H., Zhou, W., and Shi, W. (2022). 
HIT-UAV: A High-altitude Infrared Thermal Dataset for Unmanned Aerial Vehicles. arXiv.","DOI":"10.1038\/s41597-023-02066-6"},{"key":"ref_50","unstructured":"Mehta, S., and Rastegari, M. (2021). MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L. (2018, January 18\u201322). MobileNetV2: Inverted Residuals and Linear Bottlenecks. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00474"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Wang, J., Chen, K., Xu, R., Liu, Z., Loy, C., and Lin, D. (November, January 27). CARAFE: Content-Aware ReAssembly of FEatures. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00310"},{"key":"ref_53","unstructured":"Arthur, D., and Vassilvitskii, S. (2007, January 7\u20139). K-Means++: The Advantages of Careful Seeding. 
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, LA, USA."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/15\/3778\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:22:15Z","timestamp":1760127735000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/15\/3778"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,29]]},"references-count":53,"journal-issue":{"issue":"15","published-online":{"date-parts":[[2023,8]]}},"alternative-id":["rs15153778"],"URL":"https:\/\/doi.org\/10.3390\/rs15153778","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,29]]}}}