{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T18:30:12Z","timestamp":1775068212635,"version":"3.50.1"},"reference-count":53,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2023,8,24]],"date-time":"2023-08-24T00:00:00Z","timestamp":1692835200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61905240"],"award-info":[{"award-number":["61905240"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Siamese-based trackers have been widely used in object tracking. However, aerial remote tracking suffers from various challenges such as scale variation, viewpoint change, background clutter and occlusion, while most existing Siamese trackers are limited to single-scale and local features, making it difficult to achieve accurate aerial tracking. We propose the global multi-scale optimization and prediction head attentional Siamese network to solve this problem and improve aerial tracking performance. Firstly, a transformer-based multi-scale and global feature encoder (TMGFE) is proposed to obtain global multi-scale optimization of features. Then, the prediction head attentional module (PHAM) is proposed to add context information to the prediction head by adaptively adjusting the spatial position and channel contribution of the response map. Benefiting from these two components, the proposed tracker solves these challenges of aerial remote sensing tracking to some extent and improves tracking performance. 
Additionally, we conduct ablation experiments on aerial tracking benchmarks, including UAV123, UAV20L, UAV123@10fps and DTB70, to verify the effectiveness of the proposed network. We also compare our tracker with several state-of-the-art (SOTA) trackers on these four benchmarks to verify its superior performance. It runs at 40.8 fps on an RTX 3060 Ti GPU.<\/jats:p>","DOI":"10.3390\/sym15091629","type":"journal-article","created":{"date-parts":[[2023,8,24]],"date-time":"2023-08-24T10:09:53Z","timestamp":1692871793000},"page":"1629","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Global Multi-Scale Optimization and Prediction Head Attentional Siamese Network for Aerial Tracking"],"prefix":"10.3390","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3353-6878","authenticated-orcid":false,"given":"Qiqi","family":"Chen","sequence":"first","affiliation":[{"name":"Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China"},{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"}]},{"given":"Jinghong","family":"Liu","sequence":"additional","affiliation":[{"name":"Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China"}]},{"given":"Xuan","family":"Wang","sequence":"additional","affiliation":[{"name":"Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China"}]},{"given":"Yujia","family":"Zuo","sequence":"additional","affiliation":[{"name":"Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China"}]},{"given":"Chenglong","family":"Liu","sequence":"additional","affiliation":[{"name":"Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, 
China"}]}],"member":"1968","published-online":{"date-parts":[[2023,8,24]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Bai, Y., Song, Y., Zhao, Y., Zhou, Y., Wu, X., He, Y., Zhang, Z., Yang, X., and Hao, Q. (2022). Occlusion and Deformation Handling Visual Tracking for UAV via Attention-Based Mask Generative Network. Remote Sens., 14.","DOI":"10.3390\/rs14194756"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Cao, J., Song, C., Song, S., Xiao, F., Zhang, X., Liu, Z., and Ang, M.H. (2021). Robust Object Tracking Algorithm for Autonomous Vehicles in Complex Scenes. Remote Sens., 13.","DOI":"10.3390\/rs13163234"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Sun, L., Yang, Z., Zhang, J., Fu, Z., and He, Z. (2022). Visual Object Tracking for Unmanned Aerial Vehicles Based on the Template-Driven Siamese Network. Remote Sens., 14.","DOI":"10.3390\/rs14071584"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"3","DOI":"10.37188\/lam.2023.001","article-title":"Automated optical inspection of FAST\u2019s reflector surface using drones and computer vision","volume":"4","author":"Li","year":"2023","journal-title":"Light Adv. Manuf."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Su, Y., Liu, J., Xu, F., Zhang, X., and Zuo, Y. (2021). A Novel Anti-Drift Visual Object Tracking Algorithm Based on Sparse Response and Adaptive Spatial-Temporal Context-Aware. Remote Sens., 13.","DOI":"10.3390\/rs13224672"},{"key":"ref_6","unstructured":"Bhat, G., Danelljan, M., Gool, L.V., and Timofte, R. (November, January 27). Learning discriminative model prediction for tracking. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., and Torr, P.H. Fully-convolutional Siamese networks for object tracking. 
Proceedings of the Computer Vision\u2013ECCV 2016 Workshops, Amsterdam, The Netherlands. Proceedings, Part II 14.","DOI":"10.1007\/978-3-319-48881-3_56"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., and Hu, W. (2018, January 8\u201314). Distractor-aware Siamese networks for visual object tracking. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01240-3_7"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Liu, Y., Wang, X., Li, B., and Hu, W. (2021, January 11\u201317). Learn to match: Automatic matching network design for visual tracking. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.01309"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Fan, H., and Ling, H. (2019, January 15\u201320). Siamese cascaded region proposal networks for real-time visual tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00814"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Han, W., Dong, X., Khan, F.S., Shao, L., and Shen, J. (2021, January 20\u201325). Learning to fuse asymmetric feature maps in Siamese trackers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01630"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Wang, N., Zhou, W., Wang, J., and Li, H. (2021, January 20\u201325). Transformer meets tracker: Exploiting temporal context for robust visual tracking. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00162"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"2068","DOI":"10.1007\/s10489-022-03502-7","article-title":"Graph attention information fusion for Siamese adaptive attention tracking","volume":"53","author":"Wei","year":"2023","journal-title":"Appl. Intell."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1038\/s41377-022-00714-x","article-title":"Deep learning in optical metrology: A review","volume":"11","author":"Zuo","year":"2022","journal-title":"Light Sci. Appl."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Tang, C., Wang, X., Bai, Y., Wu, Z., Zhang, J., and Huang, Y. (2023). Learning Spatial-Frequency Transformer for Visual Object Tracking. IEEE Trans. Circuits Syst. Video Technol.","DOI":"10.1109\/TCSVT.2023.3249468"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1038\/s41377-022-00743-6","article-title":"Spectral imaging with deep learning","volume":"11","author":"Huang","year":"2022","journal-title":"Light Sci. Appl."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Li, B., Yan, J., Wu, W., Zhu, Z., and Hu, X. (2018, January 18\u201323). High performance visual tracking with Siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00935"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Fu, C., Cao, Z., Li, Y., Ye, J., and Feng, C. (June, January 30). Siamese anchor proposal network for high-speed aerial tracking. Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi\u2019an, China.","DOI":"10.1109\/ICRA48506.2021.9560756"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., and Yan, J. (2019, January 15\u201320). 
Siamrpn++: Evolution of Siamese visual tracking with very deep networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00441"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Guo, D., Wang, J., Cui, Y., Wang, Z., and Chen, S. (2020, January 13\u201319). SiamCAR: Siamese fully convolutional classification and regression for visual tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00630"},{"key":"ref_21","unstructured":"Yao, L., Zuo, H., Zheng, G., Fu, C., and Pan, J. (2023). SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Fu, C., Lu, K., Zheng, G., Ye, J., Cao, Z., Li, B., and Lu, G. (2023). Siamese object tracking for unmanned aerial vehicle: A review and comprehensive analysis. Artif. Intell. Rev., 1\u201361.","DOI":"10.1007\/s10462-023-10558-5"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"3812","DOI":"10.1093\/nar\/gkg509","article-title":"SIFT: Predicting amino acid changes that affect protein function","volume":"31","author":"Ng","year":"2003","journal-title":"Nucleic Acids Res."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016, January 27\u201330). Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.308"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Lou, A., and Loew, M. (2021, January 19\u201322). Cfpnet: Channel-wise feature pyramid for real-time semantic segmentation. 
Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA.","DOI":"10.1109\/ICIP42928.2021.9506485"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"3349","DOI":"10.1109\/TPAMI.2020.2983686","article-title":"Deep high-resolution representation learning for visual recognition","volume":"43","author":"Wang","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_27","first-page":"5158","article-title":"SiamBAN: Target-aware tracking with Siamese box adaptive network","volume":"45","author":"Chen","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_28","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_29","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_30","unstructured":"Mehta, S., and Rastegari, M. (2021). Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Fu, Z., Fu, Z., Liu, Q., Cai, W., and Wang, Y. (2022). SparseTT: Visual tracking with sparse transformers. arXiv.","DOI":"10.24963\/ijcai.2022\/127"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Xing, D., Evangeliou, N., Tsoukalas, A., and Tzes, A. (2022, January 3\u20138). Siamese transformer pyramid networks for real-time UAV tracking. 
Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV51458.2022.00196"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Cao, Z., Fu, C., Ye, J., Li, B., and Li, Y. (2021, January 10\u201317). Hift: Hierarchical feature transformer for aerial tracking. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.01517"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Press, O., and Wolf, L. (2016). Using the output embedding to improve language models. arXiv.","DOI":"10.18653\/v1\/E17-2025"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201322). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.-Y., and Kweon, I.S. (2018, January 8\u201314). Cbam: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Yu, Y., Xiong, Y., Huang, W., and Scott, M.R. (2020, January 13\u201319). Deformable Siamese attention networks for visual object tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00676"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Cao, Z., Fu, C., Ye, J., Li, B., and Li, Y. (October, January 27). SiamAPN++: Siamese attentional aggregation network for real-time UAV tracking. 
Proceedings of the 2021 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic.","DOI":"10.1109\/IROS51168.2021.9636309"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Mueller, M., Smith, N., and Ghanem, B. (2016, January 11\u201314). A benchmark and simulator for uav tracking. Proceedings of the Computer Vision\u2013ECCV 2016: 14th European Conference, Amsterdam, The Netherlands. Proceedings, Part I 14.","DOI":"10.1007\/978-3-319-46448-0_27"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Li, S., and Yeung, D.-Y. (2017, January 4\u20139). Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.","DOI":"10.1609\/aaai.v31i1.11205"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1562","DOI":"10.1109\/TPAMI.2019.2957464","article-title":"Got-10k: A large high-diversity benchmark for generic object tracking in the wild","volume":"43","author":"Huang","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014, January 6\u201312). Microsoft coco: Common objects in context. Proceedings of the Computer Vision\u2013ECCV 2014: 13th European Conference, Zurich, Switzerland. Proceedings, Part V 13.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., and Ling, H. (2019, January 15\u201320). Lasot: A high-quality benchmark for large-scale single object tracking. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00552"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"Imagenet large scale visual recognition challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Peng, H., Fu, J., Li, B., and Hu, W. (2020, January 23\u201328). Ocean: Object-aware anchor-free tracking. Proceedings of the Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK. Proceedings, Part XXI 16.","DOI":"10.1007\/978-3-030-58589-1_46"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Xu, Y., Wang, Z., Li, Z., Yuan, Y., and Yu, G. (2020, January 7\u201312). Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i07.6944"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Zhang, Z., and Peng, H. (2019, January 15\u201320). Deeper and wider Siamese networks for real-time visual tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00472"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Zheng, G., Fu, C., Ye, J., Li, B., Lu, G., and Pan, J. (2022, January 23\u201327). Siamese Object Tracking for Vision-Based UAM Approaching with Pairwise Scale-Channel Attention. Proceedings of the 2022 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan.","DOI":"10.1109\/IROS47612.2022.9982189"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Bhat, G., Shahbaz Khan, F., and Felsberg, M. (2017, January 21\u201326). Eco: Efficient convolution operators for tracking. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.733"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Cao, Z., Huang, Z., Pan, L., Zhang, S., Liu, Z., and Fu, C. (2022, January 18\u201324). TCTrack: Temporal contexts for aerial tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01438"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Yao, L., Fu, C., Li, S., Zheng, G., and Ye, J. (2023). SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking. arXiv.","DOI":"10.1109\/ICRA48891.2023.10161487"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Li, Y., Fu, C., Ding, F., Huang, Z., and Lu, G. (2020, January 13\u201319). AutoTrack: Towards high-performance visual tracking for UAV with automatic spatio-temporal regularization. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01194"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Sosnovik, I., Moskalev, A., and Smeulders, A.W. (2021, January 3\u20138). Scale equivariance improves Siamese tracking. 
Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV48630.2021.00281"}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/15\/9\/1629\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:37:53Z","timestamp":1760128673000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/15\/9\/1629"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,24]]},"references-count":53,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2023,9]]}},"alternative-id":["sym15091629"],"URL":"https:\/\/doi.org\/10.3390\/sym15091629","relation":{},"ISSN":["2073-8994"],"issn-type":[{"value":"2073-8994","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,8,24]]}}}