{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T02:54:20Z","timestamp":1760151260633,"version":"build-2065373602"},"reference-count":45,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2022,3,11]],"date-time":"2022-03-11T00:00:00Z","timestamp":1646956800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Natural Science Foundation of Xinjiang Uygur Autonomous Region","award":["2019D01C033"],"award-info":[{"award-number":["2019D01C033"]}]},{"name":"Tianshan Innovation Team of Xinjiang Uygur Autonomous Region","award":["2020D14044"],"award-info":[{"award-number":["2020D14044"]}]},{"name":"autonomous region graduate scientific research innovation project","award":["XJ2020G062"],"award-info":[{"award-number":["XJ2020G062"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Recently, in the field of visual object tracking, visual object tracking algorithms combined with visual object segmentation have achieved impressive results while using mask to label targets in the VOT2020 dataset. Most of the trackers get the object mask by increasing the resolution through multiple upsampling modules and gradually get the mask by summing with the features in the backbone network. However, this refinement pathway does not fully consider the spatial information of the backbone features, and therefore, the segmentation results are not perfect. In this paper, the cross-stage and cross-resolution (CSCR) module is proposed for optimizing the segmentation effect. This module makes full use of the semantic information of high-level features and the spatial information of low-level features, and fuses them by skip connections to achieve a very accurate segmentation effect. 
Experiments were conducted on the VOT dataset, and the experimental results outperformed other excellent trackers and verified the effectiveness of the algorithm in this paper.<\/jats:p>","DOI":"10.3390\/info13030147","type":"journal-article","created":{"date-parts":[[2022,3,11]],"date-time":"2022-03-11T01:40:23Z","timestamp":1646962823000},"page":"147","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["An Accurate Refinement Pathway for Visual Tracking"],"prefix":"10.3390","volume":"13","author":[{"given":"Liang","family":"Xu","sequence":"first","affiliation":[{"name":"School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8363-8832","authenticated-orcid":false,"given":"Shuli","family":"Cheng","sequence":"additional","affiliation":[{"name":"School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China"},{"name":"College of Mathematics and System Science, Xinjiang University, Urumqi 830046, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0210-2273","authenticated-orcid":false,"given":"Liejun","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,3,11]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., K\u00e4m\u00e4r\u00e4inen, J.K., Danelljan, M., Zajc, L.C., Luke\u017eic, A., and Drbohlav, O. (2020). The eighth visual object tracking VOT2020 challenge results. 
European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-030-68238-5_39"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., and Sorkine-Hornung, A. (2017, January 21\u201326). Learning video object segmentation from static images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.372"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"3995","DOI":"10.1109\/TIP.2021.3068644","article-title":"Exploring Rich and Efficient Spatial Temporal Interactions for Real-Time Video Salient Object Detection","volume":"30","author":"Chen","year":"2021","journal-title":"IEEE Trans. Image Process."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1949","DOI":"10.1109\/TIP.2021.3049959","article-title":"Bilateral Attention Network for RGB-D Salient Object Detection","volume":"30","author":"Zhang","year":"2021","journal-title":"IEEE Trans. Image Process."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"2315","DOI":"10.1109\/TCSVT.2020.3023080","article-title":"A Plug-and-Play Scheme to Adapt Image Saliency Deep Model for Video Data","volume":"31","author":"Li","year":"2021","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3391743","article-title":"Video object segmentation and tracking: A survey","volume":"11","author":"Yao","year":"2020","journal-title":"ACM Trans. Intell. Syst. Technol. (TIST)"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1812","DOI":"10.1109\/TIP.2020.3045630","article-title":"Exploring the Effects of Blur and Deblurring to Visual Object Tracking","volume":"30","author":"Guo","year":"2021","journal-title":"IEEE Trans. 
Image Process."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"2885","DOI":"10.1016\/j.patcog.2015.01.025","article-title":"Real-time and robust object tracking in video via low-rank coherency analysis in feature space","volume":"48","author":"Chen","year":"2015","journal-title":"Pattern Recognit. J. Pattern Recognit. Soc."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Wang, N., Zhou, W., Wang, J., and Li, H. (2021, January 20\u201325). Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. Proceedings of the 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00162"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Li, B., Yan, J., Wu, W., Zhu, Z., and Hu, X. (2018, January 18\u201323). High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00935"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., and Hu, W. (2018, January 8\u201314). Distractor-aware siamese networks for visual object tracking. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01240-3_7"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., and Yan, J. (2019, January 15\u201319). Siamrpn++: Evolution of siamese visual tracking with very deep networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00441"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Son, J., Jung, I., Park, K., and Han, B. (2015, January 7\u201313). Tracking-by-segmentation with online gradient boosting decision tree. 
Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA.","DOI":"10.1109\/ICCV.2015.350"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Yeo, D., Son, J., Han, B., and Hee Han, J. (2017, January 21\u201326). Superpixel-based tracking-by-segmentation using markov chains. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.62"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"1687","DOI":"10.1109\/TCSVT.2018.2848358","article-title":"Semantics-aware visual object tracking","volume":"29","author":"Yao","year":"2018","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., and Liang, J. (2018). Unet++: A nested u-net architecture for medical image segmentation. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer.","DOI":"10.1007\/978-3-030-00889-5_1"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wang, Q., Zhang, L., Bertinetto, L., Hu, W., and Torr, P.H. (2019, January 15\u201320). Fast online object tracking and segmentation: A unifying approach. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00142"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Chen, X., Li, Z., Yuan, Y., Yu, G., Shen, J., and Qi, D. (2020, January 13\u201319). State-Aware Tracker for Real-Time Video Object Segmentation. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00940"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Yu, Y., Xiong, Y., Huang, W., and Scott, M.R. (2020, January 13\u201319). Deformable Siamese attention networks for visual object tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00676"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Hua, Y., Song, T., Xue, Z., Ma, R., Robertson, N., and Guan, H. (2018, January 22\u201326). Tracking-assisted Weakly Supervised Online Visual Object Segmentation in Unconstrained Videos. Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Korea.","DOI":"10.1145\/3240508.3240638"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 21\u201326). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_23","unstructured":"Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2017). Visualizing the loss landscape of neural nets. arXiv."},{"key":"ref_24","unstructured":"Orhan, A.E., and Pitkow, X. (2017). Skip connections eliminate singularities. arXiv."},{"key":"ref_25","unstructured":"Chen, R.T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. (2018). Neural ordinary differential equations. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s40304-017-0103-z","article-title":"A proposal on machine learning via dynamical systems","volume":"5","author":"Weinan","year":"2017","journal-title":"Commun. Math. Stat."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., and Pal, C. (2016). 
The importance of skip connections in biomedical image segmentation. Deep Learning and Data Labeling for Medical Applications, Springer.","DOI":"10.1007\/978-3-319-46976-8_19"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"J\u00e9gou, S., Drozdzal, M., Vazquez, D., Romero, A., and Bengio, Y. (2017, January 21\u201326). The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA.","DOI":"10.1109\/CVPRW.2017.156"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017, January 21\u201326). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Lukezic, A., Matas, J., and Kristan, M. (2020, January 13\u201319). D3S-A discriminative single shot segmentation tracker. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00716"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Bhat, G., Khan, F.S., and Felsberg, M. (2019, January 13\u201319). Atom: Accurate tracking by overlap maximization. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR.2019.00479"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Hu, Y.T., Huang, J.B., and Schwing, A.G. (2018, January 8\u201314). Videomatch: Matching based video object segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01237-3_4"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., and Torr, P.H. (2016). 
Fully-convolutional siamese networks for object tracking. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-48881-3_56"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Zhang, Z., and Peng, H. (2019, January 13\u201319). Deeper and wider siamese networks for real-time visual tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR.2019.00472"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Bhat, G., Danelljan, M., Gool, L.V., and Timofte, R. (2019, January 27\u201328). Learning discriminative model prediction for tracking. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00628"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Zhang, Z., and Peng, H. (2020). Ocean: Object-aware anchor-free tracking. arXiv.","DOI":"10.1007\/978-3-030-58589-1_46"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Ma, Z., Wang, L., Zhang, H., Lu, W., and Yin, J. (2020, January 23\u201328). RPT: Learning Point Set Representation for Siamese Visual Tracking. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-68238-5_43"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"1562","DOI":"10.1109\/TPAMI.2019.2957464","article-title":"GOT-10k: A large high-diversity benchmark for generic object tracking in the wild","volume":"43","author":"Huang","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., and Ling, H. (2019, January 15\u201320). LaSOT: A high-quality benchmark for large-scale single object tracking. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00552"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Wang, L., Qi, J., Wang, D., Feng, M., and Lu, H. (2018, January 8\u201314). Structured siamese network for real-time visual tracking. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01240-3_22"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Guo, Q., Feng, W., Zhou, C., Huang, R., Wan, L., and Wang, S. (2017, January 22\u201329). Learning dynamic siamese network for visual object tracking. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.196"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Perazzi, F., Pont-Tuset, J., McWilliams, B., van Gool, L., Gross, M., and Sorkine-Hornung, A. (2016, January 27\u201330). A benchmark dataset and evaluation methodology for video object segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.85"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taix\u00e9, L., Cremers, D., and Van Gool, L. (2017, January 21\u201326). One-shot video object segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.565"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Cheng, J., Tsai, Y.H., Hung, W.C., Wang, S., and Yang, M.H. (2018, January 18\u201323). Fast and accurate online video object segmentation via tracking parts. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00774"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Oh, S.W., Lee, J.Y., Sunkavalli, K., and Kim, S.J. (2018, January 18\u201323). Fast video object segmentation by reference-guided mask propagation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00770"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/13\/3\/147\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:34:43Z","timestamp":1760135683000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/13\/3\/147"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,11]]},"references-count":45,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2022,3]]}},"alternative-id":["info13030147"],"URL":"https:\/\/doi.org\/10.3390\/info13030147","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2022,3,11]]}}}