{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:16:17Z","timestamp":1760145377346,"version":"build-2065373602"},"reference-count":45,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T00:00:00Z","timestamp":1721347200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Research on Cloud Edge Collaborative Intelligent Image Analysis Technology","award":["J2023003"],"award-info":[{"award-number":["J2023003"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>The existing Siamese trackers have achieved increasingly successful results in visual object tracking. However, the interactive fusion among multi-layer similarity maps after cross-correlation has not been fully studied in previous Siamese network-based methods. To address this issue, we propose a novel Siamese network for visual object tracking, named SiamSMN, which consists of a feature extraction network, a multi-scale fusion module, and a prediction head. First, the feature extraction network is used to extract the features of the template image and the search image, which is calculated by a depth-wise cross-correlation operation to produce multiple similarity feature maps. Second, we propose an effective multi-scale fusion module that can extract global context information for object search and learn the interdependencies between multi-level similarity maps. In addition, to further improve tracking accuracy, we design a learnable prediction head module to generate a boundary point for each side based on the coarse bounding box, which can solve the problem of inconsistent classification and regression during the tracking. 
Extensive experiments on four public benchmarks demonstrate that the proposed tracker has a competitive performance among other state-of-the-art trackers.<\/jats:p>","DOI":"10.3390\/info15070418","type":"journal-article","created":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T09:14:53Z","timestamp":1721380493000},"page":"418","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["SiamSMN: Siamese Cross-Modality Fusion Network for Object Tracking"],"prefix":"10.3390","volume":"15","author":[{"given":"Shuo","family":"Han","sequence":"first","affiliation":[{"name":"Nanjing Power Supply Branch, State Grid Jiangsu Electric Power Co., Ltd., Nanjing 210024, China"}]},{"given":"Lisha","family":"Gao","sequence":"additional","affiliation":[{"name":"Nanjing Power Supply Branch, State Grid Jiangsu Electric Power Co., Ltd., Nanjing 210024, China"}]},{"given":"Yue","family":"Wu","sequence":"additional","affiliation":[{"name":"Nanjing Power Supply Branch, State Grid Jiangsu Electric Power Co., Ltd., Nanjing 210024, China"}]},{"given":"Tian","family":"Wei","sequence":"additional","affiliation":[{"name":"Nanjing Power Supply Branch, State Grid Jiangsu Electric Power Co., Ltd., Nanjing 210024, China"}]},{"given":"Manyu","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computer and Cyberspace Security, Nanjing University of Information Science and Technology, Nanjing 210044, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7615-9160","authenticated-orcid":false,"given":"Xu","family":"Cheng","sequence":"additional","affiliation":[{"name":"School of Computer and Cyberspace Security, Nanjing University of Information Science and Technology, Nanjing 210044, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,7,19]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Reddy, K.R., Priya, K.H., and Neelima, N. (2015, January 12\u201314). 
Object Detection and Tracking\u2014A Survey. Proceedings of the 2015 International Conference on Computational Intelligence and Communication Networks (CICN), Jabalpur, India.","DOI":"10.1109\/CICN.2015.317"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Xing, J., Ai, H., and Lao, S. (2010, January 23\u201326). Multiple human tracking based on multi-view upper-body detection and discriminative learning. Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey.","DOI":"10.1109\/ICPR.2010.420"},{"key":"ref_3","unstructured":"Liu, L., Xing, J., Ai, H., and Ruan, X. (2012, January 11\u201315). Hand posture recognition using finger geometric feature. Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Zhang, G., and Vela, P.A. (2015, January 7\u201312). Good features to track for visual slam. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298743"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Bolme, D.S., Beveridge, J.R., Draper, B.A., and Lui, Y.M. (2010, January 13\u201318). Visual object tracking using adaptive correlation filters. Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA.","DOI":"10.1109\/CVPR.2010.5539960"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Henriques Jo\u00e3o, F., Rui, C., Pedro, M., and Jorge, B. (2012, January 7\u201313). Exploiting the circulant structure of tracking-by-detection with kernels. Proceedings of the European Conference on Computer Vision, Florence, Italy.","DOI":"10.1007\/978-3-642-33765-9_50"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., and Torr, P.H. (2017, January 21\u201326). 
End-to-end representation learning for correlation filter based tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.531"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Danelljan, M., H\u00e4ger, G., Khan, F., and Felsberg, M. (2014, January 1\u20135). Accurate scale estimation for robust visual tracking. Proceedings of the British Machine Vision Conference, Nottingham, UK.","DOI":"10.5244\/C.28.65"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"671","DOI":"10.1007\/s11263-017-1061-3","article-title":"Discriminative correlation filter tracker with channel and spatial reliability","volume":"126","author":"Zajc","year":"2018","journal-title":"Int. J. Comput. Vis."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Hager, G., Shahbaz Khan, F., and Felsberg, M. (2015, January 7\u201312). Learning spatially regularized correlation filters for visual tracking. Proceedings of the IEEE International Conference on Computer Vision, Boston, MA, USA.","DOI":"10.1109\/ICCV.2015.490"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Robinson, A., Shahbaz Khan, F., and Felsberg, M. (2016, January 11\u201314). Beyond correlation filters: Learning continuous convolution operators for visual tracking. Proceedings of the Computer Vision\u2013ECCV 2016: 14th European Conference, Amsterdam, The Netherlands. Proceedings, Part V 14.","DOI":"10.1007\/978-3-319-46454-1_29"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Bhat, G., Shahbaz Khan, F., and Felsberg, M. (2017, January 21\u201326). Eco: Efficient convolution operators for tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.733"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Wang, L., Ouyang, W., Wang, X., and Lu, H. (2015, January 7\u201313). 
Visual tracking with fully convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.357"},{"key":"ref_14","unstructured":"Nam, H., and Han, B. (July, January 26). Learning multi-domain convolutional neural networks for visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_15","unstructured":"Wang, L., Ouyang, W., Wang, X., and Lu, H. (July, January 26). Stct: Sequentially training convolutional networks for visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1016\/j.neunet.2019.12.024","article-title":"Attention-guided CNN for image denoising","volume":"124","author":"Tian","year":"2020","journal-title":"Neural Netw."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., and Torr, P.H. (15\u201316, January 8\u201310). Fully-convolutional siamese networks for object tracking. Proceedings of the Computer Vision\u2013ECCV 2016 Workshops, Amsterdam, The Netherlands. Proceedings, Part II 14.","DOI":"10.1007\/978-3-319-48881-3_56"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1109\/MSPEC.1967.5217220","article-title":"The fast Fourier transform","volume":"4","author":"Brigham","year":"1967","journal-title":"IEEE Spectr."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Li, B., Yan, J., Wu, W., Zhu, Z., and Hu, X. (2018, January 18\u201322). High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00935"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., and Yan, J.S. 
(2019, January 16\u201320). Evolution of siamese visual tracking with very deep networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00441"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Wang, Q., Zhang, L., Bertinetto, L., Hu, W., and Torr, P.H. (2019, January 16\u201320). Fast online object tracking and segmentation: A unifying approach. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00142"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Chen, Z., Zhong, B., Li, G., Zhang, S., and Ji, R. (2020, January 14\u201319). Siamese box adaptive network for visual tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00670"},{"key":"ref_23","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 30), Long Beach, CA, USA."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., and Hu, W. (2018, January 8\u201314). Distractor-aware siamese networks for visual object tracking. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01240-3_7"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wang, Q., Teng, Z., Xing, J., Gao, J., Hu, W., and Maybank, S. (2018, January 18\u201322). Learning attentions: Residual attentional siamese network for high performance online visual tracking. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00510"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Zhang, Z., and Peng, H. (2019, January 16\u201320). Deeper and wider siamese networks for real-time visual tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00472"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Xu, Y., Wang, Z., Li, Z., Yuan, Y., and Yu, G. (2020, January 7\u201312). Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i07.6944"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"Imagenet large scale visual recognition challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_29","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yu, Y., Xiong, Y., Huang, W., and Scott, M.R. (2020, January 14\u201319). Deformable siamese attention networks for visual object tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00676"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Du, F., Liu, P., Zhao, W., and Tang, X. (2020, January 14\u201319). Correlation-guided attention for corner detection based visual tracking. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00687"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., and Lu, H. (2021, January 19\u201325). Transformer tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Online.","DOI":"10.1109\/CVPR46437.2021.00803"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014, January 6\u201312). Microsoft coco: Common objects in context. Proceedings of the Computer Vision\u2013ECCV 2014: 13th European Conference, Zurich, Switzerland. Proceedings, Part V 13.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., and Ling, H. (2019, January 16\u201320). Lasot: A high-quality benchmark for large-scale single object tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00552"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Wu, Y., Lim, J., and Yang, M.H. (2013, January 23\u201327). Online object tracking: A benchmark. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA.","DOI":"10.1109\/CVPR.2013.312"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Peng, H., Fu, J., Li, B., and Hu, W. (2020, January 23\u201328). Ocean: Object-aware anchor-free tracking. Proceedings of the Computer Vision\u2014ECCV 2020: 16th European Conference, Glasgow, UK. Proceedings, Part XXI 16.","DOI":"10.1007\/978-3-030-58589-1_46"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Mueller, M., Smith, N., and Ghanem, B. (2016, January 11\u201314). 
A benchmark and simulator for uav tracking. Proceedings of the Computer Vision\u2014ECCV 2016: 14th European Conference, Amsterdam, The Netherlands. Proceedings, Part I 14.","DOI":"10.1007\/978-3-319-46448-0_27"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Guo, D., Shao, Y., Cui, Y., Wang, Z., Zhang, L., and Shen, C. (2021, January 19\u201325). Graph attention tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Online.","DOI":"10.1109\/CVPR46437.2021.00942"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Guo, D., Wang, J., Cui, Y., Wang, Z., and Chen, S. (2020, January 14\u201319). SiamCAR: Siamese fully convolutional classification and regression for visual tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00630"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Bhat, G., Khan, F.S., and Felsberg, M. (2019, January 16\u201320). Atom: Accurate tracking by overlap maximization. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00479"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1562","DOI":"10.1109\/TPAMI.2019.2957464","article-title":"Got-10k: A large high-diversity benchmark for generic object tracking in the wild","volume":"43","author":"Huang","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Danelljan, M., Gool, L.V., and Timofte, R. (2020, January 14\u201319). Probabilistic regression for visual tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00721"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Lukezic, A., Matas, J., and Kristan, M. (2020, January 14\u201319). 
D3s-a discriminative single shot segmentation tracker. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00716"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Zheng, L., Tang, M., Chen, Y., Wang, J., and Lu, H. (2020, January 23\u201328). Learning feature embeddings for discriminant model based tracking. Proceedings of the Computer Vision\u2014ECCV 2020: 16th European Conference, Glasgow, UK. Proceedings, Part XV 16.","DOI":"10.1007\/978-3-030-58555-6_45"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"7527","DOI":"10.1109\/TCYB.2020.3043520","article-title":"SiamATL: Online update of siamese tracking network via attentional transfer learning","volume":"52","author":"Huang","year":"2021","journal-title":"IEEE Trans. Cybern."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/15\/7\/418\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T15:19:34Z","timestamp":1760109574000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/15\/7\/418"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,19]]},"references-count":45,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2024,7]]}},"alternative-id":["info15070418"],"URL":"https:\/\/doi.org\/10.3390\/info15070418","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2024,7,19]]}}}