{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:06:57Z","timestamp":1760231217919,"version":"build-2065373602"},"reference-count":48,"publisher":"MDPI AG","issue":"17","license":[{"start":{"date-parts":[[2022,8,31]],"date-time":"2022-08-31T00:00:00Z","timestamp":1661904000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Zhejiang University-Shandong (Linyi) Modern Agricultural Research Institute Service\u2019s Local Economic Development Project (open project)","award":["ZDNY\u20142021\u2014FWLY02016","ZR2019MA030","61402212"],"award-info":[{"award-number":["ZDNY\u20142021\u2014FWLY02016","ZR2019MA030","61402212"]}]},{"name":"Natural Science Foundation of Shandong Province","award":["ZDNY\u20142021\u2014FWLY02016","ZR2019MA030","61402212"],"award-info":[{"award-number":["ZDNY\u20142021\u2014FWLY02016","ZR2019MA030","61402212"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China (NSFC)","doi-asserted-by":"publisher","award":["ZDNY\u20142021\u2014FWLY02016","ZR2019MA030","61402212"],"award-info":[{"award-number":["ZDNY\u20142021\u2014FWLY02016","ZR2019MA030","61402212"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Recently, the transformer model has progressed from the field of visual classification to target tracking. Its primary method replaces the cross-correlation operation in the Siamese tracker. The backbone of the network is still a convolutional neural network (CNN). However, the existing transformer-based tracker simply deforms the features extracted by the CNN into patches and feeds them into the transformer encoder. 
Each patch contains a single element of the spatial dimension of the extracted features and inputs into the transformer structure to use cross-attention instead of cross-correlation operations. This paper proposes a reconstruction patch strategy which combines the extracted features with multiple elements of the spatial dimension into a new patch. The reconstruction operation has the following advantages: (1) the correlation between adjacent elements combines well, and the features extracted by the CNN are usable for classification and regression; (2) using the performer operation reduces the amount of network computation and the dimension of the patch sent to the transformer, thereby sharply reducing the network parameters and improving the model-tracking speed.<\/jats:p>","DOI":"10.3390\/s22176558","type":"journal-article","created":{"date-parts":[[2022,9,1]],"date-time":"2022-09-01T03:55:38Z","timestamp":1662004538000},"page":"6558","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["A Robust Visual Tracking Method Based on Reconstruction Patch Transformer Tracking"],"prefix":"10.3390","volume":"22","author":[{"given":"Hui","family":"Chen","sequence":"first","affiliation":[{"name":"College of Information Science and Engineering, Linyi University, Linyi 276000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhenhai","family":"Wang","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Linyi University, Linyi 276000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hongyu","family":"Tian","sequence":"additional","affiliation":[{"name":"School of Physics and Electronic Engineering, Linyi University, Linyi 276005, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lutao","family":"Yuan","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Linyi 
University, Linyi 276000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xing","family":"Wang","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Linyi University, Linyi 276000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Peng","family":"Leng","sequence":"additional","affiliation":[{"name":"Shandong (Linyi) Modern Agricultural Research Institute, Zhejiang University, Linyi 276000, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,8,31]]},"reference":[{"key":"ref_1","unstructured":"Zhang, T., Ghanem, B., and Liu, S. (2012, January 16\u201321). Robust Visual Tracking Via Multi-Task Sparse Learning. Proceedings of the 2012 IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, USA."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Robinson, A., Shahbaz Khan, F., and Felsberg, M. (2016, January 8\u201316). Beyond correlation filters: Learning continuous convolution operators for visual tracking. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46454-1_29"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1109\/TPAMI.2014.2345390","article-title":"High-speed tracking with kernelized correlation filters","volume":"37","author":"Henriques","year":"2014","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., and Torr, P.H. (2016, January 27\u201330). Staple: Complementary learners for real-time tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.156"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Hager, G., Shahbaz Khan, F., and Felsberg, M. 
(2016, January 27\u201330). Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.159"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"2777","DOI":"10.1109\/TIP.2018.2813161","article-title":"Robust visual tracking revisited: From correlation filter to template matching","volume":"27","author":"Liu","year":"2018","journal-title":"IEEE Trans. Image Processing"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Lukezic, A., Vojir, T., \u010cehovin Zajc, L., Matas, J., and Kristan, M. (2017, January 21\u201326). Discriminative correlation filter with channel and spatial reliability. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.515"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Kiani Galoogahi, H., Fagg, A., and Lucey, S. (2017, January 21\u201326). Learning background-aware correlation filters for visual tracking. Proceedings of the IEEE International Conference on Computer Vision, Honolulu, HI, USA.","DOI":"10.1109\/ICCV.2017.129"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., and Torr, P.H. (2017, January 21\u201326). Fully-convolutional siamese networks for object tracking. Proceedings of the European Conference on Computer Vision, Honolulu, HI, USA.","DOI":"10.1007\/978-3-319-48881-3_56"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Bhat, G., Shahbaz Khan, F., and Felsberg, M. (2016, January 8\u201316). Eco: Efficient convolution operators for tracking. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Amsterdam, The Netherlands.","DOI":"10.1109\/CVPR.2017.733"},{"key":"ref_11","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30, MIT Press."},{"key":"ref_12","unstructured":"Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. (2020, January 12\u201318). Generative pretraining from pixels. Proceedings of the International Conference on Machine Learning, Vienna, Austria."},{"key":"ref_13","unstructured":"Chen, Y., Kalantidis, Y., Li, J., Yan, S., and Feng, J. (2018). A^ 2-nets: Double attention networks. Advances in Neural Information Processing Systems 31, MIT Press."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020, January 23\u201328). End-to-end object detection with transformers. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., and Lu, H. (2021, January 20\u201325). Transformer tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00803"},{"key":"ref_16","unstructured":"Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., and Kaiser, L. (2020). Rethinking attention with performers. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Li, B., Yan, J., Wu, W., Zhu, Z., and Hu, X. (2018, January 18\u201323). High performance visual tracking with siamese region proposal network. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00935"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_19","first-page":"84","article-title":"Imagenet classification with deep convolutional neural networks","volume":"60","author":"Krizhevsky","year":"2012","journal-title":"NIPS"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., and Yan, J. (2019, January 16\u201317). Siamrpn++: Evolution of siamese visual tracking with very deep networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00441"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Xu, Y., Wang, Z., Li, Z., Yuan, Y., and Yu, G. (2020, January 7\u201312). Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i07.6944"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Chen, Z., Zhong, B., Li, G., Zhang, S., and Ji, R. (2020, January 13\u201319). Siamese box adaptive network for visual tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00670"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Guo, D., Wang, J., Cui, Y., Wang, Z., and Chen, S. (2019, January 16\u201317). SiamCAR: Siamese fully convolutional classification and regression for visual tracking. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR42600.2020.00630"},{"key":"ref_24","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_25","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.-H., Tay, F.E., Feng, J., and Yan, S. (2021, January 10\u201317). Tokens-to-token vit: Training vision transformers from scratch on imagenet. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 10\u201317). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Wang, N., Zhou, W., Wang, J., and Li, H. (2021, January 20\u201325). Transformer meets tracker: Exploiting temporal context for robust visual tracking. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00162"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1834","DOI":"10.1109\/TPAMI.2014.2388226","article-title":"Object tracking benchmark","volume":"37","author":"Wu","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_31","unstructured":"Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., \u010cehovin Zajc, L., Vojir, T., Bhat, G., Lukezic, A., and Eldesokey, A. (2018, January 8\u201314). The sixth visual object tracking vot2018 challenge results. Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany."},{"key":"ref_32","unstructured":"Kristan, M., Matas, J., Leonardis, A., Felsberg, M., Pflugfelder, R., K\u00e4m\u00e4r\u00e4inen, J.K., and Fern\u00e1ndez, G. (2021, January 10\u201317). The ninth visual object tracking vot2021 challenge results. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada."},{"key":"ref_33","unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., and Antiga, L. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, MIT Press."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"Imagenet large scale visual recognition challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014, January 6\u201312). Microsoft coco: Common objects in context. 
Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., and Ling, H. (2019, January 16\u201317). Lasot: A high-quality benchmark for large-scale single object tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00552"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"1562","DOI":"10.1109\/TPAMI.2019.2957464","article-title":"Got-10k: A large high-diversity benchmark for generic object tracking in the wild","volume":"43","author":"Huang","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Muller, M., Bibi, A., Giancola, S., Alsubaihi, S., and Ghanem, B. (2018, January 8\u201314). Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01246-5_19"},{"key":"ref_39","unstructured":"Loshchilov, I., and Hutter, F. (2017). Decoupled weight decay regularization. arXiv."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., and Hu, W. (2018, January 8\u201314). Distractor-aware siamese networks for visual object tracking. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01240-3_7"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Peng, H., Fu, J., Li, B., and Hu, W. (2020, January 23\u201328). Ocean: Object-aware anchor-free tracking. 
Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58589-1_46"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Zhang, Z., and Peng, H. (2019, January 16\u201317). Deeper and wider siamese networks for real-time visual tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00472"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Xie, F., Yang, W., Zhang, K., Liu, B., Wang, G., and Zuo, W. (2021, January 10\u201317). Learning spatio-appearance memory network for high-performance visual tracking. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00302"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Bhat, G., Danelljan, M., Gool, L.V., and Timofte, R. (2020, January 23\u201328). Know your surroundings: Exploiting scene information for object tracking. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58592-1_13"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Lukezic, A., Matas, J., and Kristan, M. (2020, January 13\u201319). D3s-a discriminative single shot segmentation tracker. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00716"},{"key":"ref_46","unstructured":"Bhat, G., Danelljan, M., Gool, L.V., and Timofte, R. (November, January 27). Learning discriminative model prediction for tracking. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Gool, L.V., and Timofte, R. (2020, January 13\u201319). Probabilistic regression for visual tracking. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00721"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Bhat, G., Khan, F.S., and Felsberg, M. (2019, January 16\u201317). ATOM: Accurate tracking by overlap maximization. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00479"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/17\/6558\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:20:45Z","timestamp":1760142045000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/17\/6558"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,31]]},"references-count":48,"journal-issue":{"issue":"17","published-online":{"date-parts":[[2022,9]]}},"alternative-id":["s22176558"],"URL":"https:\/\/doi.org\/10.3390\/s22176558","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2022,8,31]]}}}