{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,11]],"date-time":"2026-02-11T13:35:41Z","timestamp":1770816941660,"version":"3.50.1"},"reference-count":84,"publisher":"Association for Computing Machinery (ACM)","issue":"12","license":[{"start":{"date-parts":[[2024,11,25]],"date-time":"2024-11-25T00:00:00Z","timestamp":1732492800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62172417, 62272461, 62101555, 62106268, and 62472424"],"award-info":[{"award-number":["62172417, 62272461, 62101555, 62106268, and 62472424"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Xuzhou Key Research and Development Program","award":["KC22287"],"award-info":[{"award-number":["KC22287"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,12,31]]},"abstract":"<jats:p>\n            Supervised RGBT (SRGBT) tracking requires annotations that are both expensive and time-consuming to produce. Therefore, Self-Supervised RGBT (SSRGBT) tracking methods have become increasingly important. Straightforward SSRGBT tracking methods use pseudo-labels for tracking, but inaccurate pseudo-labels can lead to object drift, which severely degrades tracking performance. This article proposes a self-supervised RGBT object tracking method (S2OTFormer) to bridge the gap between methods supervised with pseudo-labels and those supervised with ground-truth labels. Firstly, to provide more robust appearance features for motion cues, we introduce a multi-modality hierarchical transformer (MHT) module for feature fusion. 
This module allocates weights to both modalities and strengthens the expressive capability of the MHT module through multiple nonlinear layers, fully exploiting the complementary information of the two modalities. Secondly, to address motion blur caused by camera motion and inaccurate appearance information caused by pseudo-labels, we introduce a motion-aware mechanism (MAM). The MAM extracts average motion vectors from the features of the preceding search frames and constructs a consistency loss with the motion vectors of the current search frame features. The inter-frame motion vectors of objects are obtained by reusing the inter-frame attention map to predict coordinate positions. Finally, to further reduce the effect of inaccurate pseudo-labels, we propose an Attention-Based Multi-Scale Enhancement Module. By introducing cross-attention, this module overcomes the receptive-field limitations of traditional CNN tracking heads, enabling more precise and accurate object tracking. We demonstrate the effectiveness of S2OTFormer on four large-scale public datasets through extensive comparisons and numerous ablation experiments. 
The source code is available at\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"url\" xlink:href=\"https:\/\/github.com\/LiShenglana\/S2OTFormer\">https:\/\/github.com\/LiShenglana\/S2OTFormer<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3698399","type":"journal-article","created":{"date-parts":[[2024,10,3]],"date-time":"2024-10-03T16:04:18Z","timestamp":1727971458000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Motion-Aware Self-Supervised RGBT Tracking with Multi-Modality Hierarchical Transformers"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9245-5915","authenticated-orcid":false,"given":"Shenglan","family":"Li","sequence":"first","affiliation":[{"name":"School of Computer Sciences and Technology, China University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2734-915X","authenticated-orcid":false,"given":"Rui","family":"Yao","sequence":"additional","affiliation":[{"name":"School of Computer Sciences and Technology, China University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6207-0299","authenticated-orcid":false,"given":"Yong","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of Computer Sciences and Technology, China University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5418-9879","authenticated-orcid":false,"given":"Hancheng","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Computer Sciences and Technology, China 
University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3564-5090","authenticated-orcid":false,"given":"Jiaqi","family":"Zhao","sequence":"additional","affiliation":[{"name":"School of Computer Sciences and Technology, China University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9383-8384","authenticated-orcid":false,"given":"Zhiwen","family":"Shao","sequence":"additional","affiliation":[{"name":"School of Computer Sciences and Technology, China University of Mining and Technology, Xuzhou, China and Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7690-8547","authenticated-orcid":false,"given":"Abdulmotaleb","family":"El Saddik","sequence":"additional","affiliation":[{"name":"School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada"}]}],"member":"320","published-online":{"date-parts":[[2024,11,25]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.5555\/3540261.3541310"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i2.27852"},{"key":"e_1_3_1_4_2","first-page":"3718","volume-title":"Proceedings of the 25th IEEE International Conference on Image Processing (ICIP)","author":"Cen Miaobin","year":"2018","unstructured":"Miaobin Cen and Cheolkon Jung. 2018. Fully convolutional siamese fusion networks for object tracking. In Proceedings of the 25th IEEE International Conference on Image Processing (ICIP). 
IEEE, 3718\u20133722."},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01400"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00803"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2003.1195991"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00479"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.733"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2024.3393298"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2022\/127"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00017"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/78.978396"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2021.12.104"},{"key":"e_1_3_1_15_2","first-page":"213","volume-title":"Proceedings of the 13th Asian Conference on Computer Vision","author":"Hazirbas Caner","year":"2017","unstructured":"Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers. 2017. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In Proceedings of the 13th Asian Conference on Computer Vision. Springer, 213\u2013228."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2014.2345390"},{"key":"e_1_3_1_18_2","unstructured":"Xiaojun Hou Jiazheng Xing Yijie Qian Yaowei Guo Shuo Xin Junhao Chen Kai Tang Mengmeng Wang Zhengkai Jiang Liang Liu et al. 2024. Sdstrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking. 
arXiv:2403.16002."},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01310"},{"issue":"4","key":"e_1_3_1_21_2","doi-asserted-by":"crossref","first-page":"948","DOI":"10.1109\/TSP.2015.2493985","article-title":"The accurate continuous-discrete extended Kalman filter for radar tracking","volume":"64","author":"Kulikov Gennady Yu","year":"2015","unstructured":"Gennady Yu Kulikov and Maria V. Kulikova. 2015. The accurate continuous-discrete extended Kalman filter for radar tracking. IEEE Transactions on Signal Processing 64, 4 (2015), 948\u2013958.","journal-title":"IEEE Transactions on Signal Processing"},{"issue":"4","key":"e_1_3_1_22_2","doi-asserted-by":"crossref","first-page":"625","DOI":"10.1109\/TPAMI.2013.170","article-title":"A geometric particle filter for template-based visual tracking","volume":"36","author":"Kwon Junghyun","year":"2013","unstructured":"Junghyun Kwon, Hee Seok Lee, Frank C. Park, and Kyoung Mu Lee. 2013. A geometric particle filter for template-based visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 4 (2013), 625\u2013643.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00651"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2023.12.018"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3652583.3658001"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00935"},{"key":"e_1_3_1_27_2","first-page":"12","article-title":"Learning collaborative sparse representation for grayscale-thermal tracking","volume":"25","author":"Li Chenglong","year":"2016","unstructured":"Chenglong Li, Hui Cheng, Shiyi Hu, Xiaobai Liu, Jin Tang, and Liang Lin. 2016. 
Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Transactions on Image Processing 25, 12 (2016), 5743\u20135756.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2019.106977"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58542-6_14"},{"key":"e_1_3_1_30_2","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops","author":"Li Chenglong","year":"2019","unstructured":"Chenglong Li, Andong Lu, AiHua Zheng, Zhengzheng Tu, and Jin Tang. 2019. Multi-adapter RGBT tracking. In Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops."},{"key":"e_1_3_1_31_2","doi-asserted-by":"crossref","unstructured":"Chenglong Li Zhiqiang Xiang Jin Tang Bin Luo and Futian Wang. 2021. RGBT tracking via noise-robust cross-modal ranking. IEEE Transactions on Neural Networks and Learning Systems 33 9 (2021) 5019\u20135031.","DOI":"10.1109\/TNNLS.2021.3067107"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3130533"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123289"},{"key":"e_1_3_1_34_2","first-page":"808","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV)","author":"Li Chenglong","year":"2018","unstructured":"Chenglong Li, Chengli Zhu, Yan Huang, Jin Tang, and Liang Wang. 2018. Cross-modal ranking with soft consistency and noisy labels for robust RGB-T tracking. In Proceedings of the European Conference on Computer Vision (ECCV), 808\u2013823."},{"issue":"10","key":"e_1_3_1_35_2","first-page":"2913","article-title":"Learning local-global multi-graph descriptors for RGB-T object tracking","volume":"29","author":"Li Chenglong","year":"2018","unstructured":"Chenglong Li, Chengli Zhu, Jian Zhang, Bin Luo, Xiaohao Wu, and Jin Tang. 2018. 
Learning local-global multi-graph descriptors for RGB-T object tracking. IEEE Transactions on Circuits and Systems for Video Technology 29, 10 (2018), 2913\u20132926.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_36_2","first-page":"1","article-title":"Unsupervised RGB-T object tracking with attentional multi-modal feature fusion","author":"Li Shenglan","year":"2023","unstructured":"Shenglan Li, Rui Yao, Yong Zhou, Hancheng Zhu, Bing Liu, Jiaqi Zhao, and Zhiwen Shao. 2023. Unsupervised RGB-T object tracking with attentional multi-modal feature fusion. Multimedia Tools and Applications (2023), 1\u201319.","journal-title":"Multimedia Tools and Applications"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2018.12.011"},{"key":"e_1_3_1_38_2","unstructured":"Xin Li Wenjie Pei Zikun Zhou Zhenyu He Huchuan Lu and Ming-Hsuan Yang. 2021. Crop-transform-paste: Self-supervised learning for visual tracking. arXiv:2106.10900. Retrieved from https:\/\/arxiv.org\/abs\/2106.10900v1"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2008.73"},{"key":"e_1_3_1_40_2","unstructured":"Yunfeng Li Bo Wang Ye Li Zhiwen Yu and Liang Wang. 2024. Transformer-based RGB-T tracking with channel and spatial feature fusion. arXiv:2405.03177."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-011-4536-9"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3056725"},{"key":"e_1_3_1_43_2","first-page":"6489","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Liu Liang","year":"2020","unstructured":"Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, and Feiyue Huang. 2020. Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 6489\u20136498."},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3140929"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2022.3157594"},{"key":"e_1_3_1_46_2","unstructured":"Andong Lu Wanyu Wang Chenglong Li Jin Tang and Bin Luo. 2024. AFter: Attention-based fusion router for RGBT tracking. arXiv:2405.02717. Retrieved from https:\/\/arxiv.org\/abs\/2405.02717"},{"key":"e_1_3_1_47_2","unstructured":"Yang Luo Xiqing Guo Hui Feng and Lei Ao. 2023. RGB-T tracking via multi-modal mutual prompt learning. arXiv:2308.16386. Retrieved from https:\/\/arxiv.org\/abs\/2308.16386"},{"key":"e_1_3_1_48_2","unstructured":"Yang Luo Xiqing Guo and Hao Li. 2024. From two stream to one stream: Efficient RGB-T tracking via mutual prompt learning and knowledge distillation. arXiv:2403.16834. Retrieved from https:\/\/arxiv.org\/abs\/2403.16834"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.465"},{"issue":"4","key":"e_1_3_1_50_2","doi-asserted-by":"crossref","first-page":"3822","DOI":"10.1109\/TITS.2022.3229830","article-title":"Dynamic fusion network for RGBT tracking","volume":"24","author":"Peng Jingchao","year":"2022","unstructured":"Jingchao Peng, Haitao Zhao, and Zhengwei Hu. 2022. Dynamic fusion network for RGBT tracking. IEEE Transactions on Intelligent Transportation Systems 24, 4 (2022), 3822\u20133832.","journal-title":"IEEE Transactions on Intelligent Transportation Systems"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_1_52_2","unstructured":"Niels Ole Salscheider. 2021. Object tracking by detection with visual and motion cues. arXiv:2101.07549. Retrieved from https:\/\/arxiv.org\/abs\/2101.07549"},{"key":"e_1_3_1_53_2","unstructured":"Dengdi Sun Yajie Pan Andong Lu Chenglong Li and Bin Luo. 2024. 
Transformer RGBT tracking with spatio-temporal multimodal tokens. arXiv:2401.01674. Retrieved from https:\/\/arxiv.org\/abs\/2401.01674"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3234340"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i6.28325"},{"key":"e_1_3_1_56_2","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems 30 (2017)."},{"key":"e_1_3_1_57_2","first-page":"7064","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Wang Chaoqun","year":"2020","unstructured":"Chaoqun Wang, Chunyan Xu, Zhen Cui, Ling Zhou, Tong Zhang, Xiaoya Zhang, and Jian Yang. 2020. Cross-modal pattern-propagation for RGB-T tracking. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 7064\u20137073."},{"key":"e_1_3_1_58_2","unstructured":"Hongyu Wang Xiaotao Liu Yifan Li Meng Sun Dian Yuan and Jing Liu. 2024. Temporal adaptive RGBT tracking with modality prompt. arXiv:2401.01244. Retrieved from https:\/\/arxiv.org\/abs\/2401.01244"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00140"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3174341"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jvcir.2006.03.004"},{"key":"e_1_3_1_62_2","first-page":"1","volume-title":"Proceedings of the 14th International Conference on Information Fusion","author":"Wu Yi","year":"2011","unstructured":"Yi Wu, Erik Blasch, Genshe Chen, Li Bai, and Haibin Ling. 2011. 
Multiple source data fusion via sparse representation for robust visual tracking. In Proceedings of the 14th International Conference on Information Fusion. IEEE, 1\u20138."},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01020"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i3.20187"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2021.3055362"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2023.121577"},{"key":"e_1_3_1_67_2","first-page":"341","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Ye Botao","year":"2022","unstructured":"Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. 2022. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision. Springer, 341\u2013357."},{"key":"e_1_3_1_68_2","first-page":"1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Ye Junjie","year":"2022","unstructured":"Junjie Ye, Changhong Fu, Guangze Zheng, Danda Pani Paudel, and Guang Chen. 2022. Unsupervised domain adaptation for nighttime aerial tracking. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1\u201310."},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.3037518"},{"issue":"3","key":"e_1_3_1_70_2","first-page":"1224","article-title":"Aligned spatial-temporal memory network for thermal infrared target tracking","volume":"70","author":"Yuan Di","year":"2022","unstructured":"Di Yuan, Xiu Shu, Qiao Liu, and Zhenyu He. 2022. Aligned spatial-temporal memory network for thermal infrared target tracking. 
IEEE Transactions on Circuits and Systems II: Express Briefs 70, 3 (2022), 1224\u20131228.","journal-title":"IEEE Transactions on Circuits and Systems II: Express Briefs"},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00550"},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.3390\/s20020393"},{"key":"e_1_3_1_73_2","doi-asserted-by":"crossref","first-page":"188","DOI":"10.1007\/978-3-319-10599-4_13","volume-title":"Computer Vision\u2013ECCV 2014: Proceedings of the 13th European Conference on Computer Vision (ECCV \u201914)","author":"Zhang Jianming","year":"2014","unstructured":"Jianming Zhang, Shugao Ma, and Stan Sclaroff. 2014. MEEM: Robust tracking via multiple experts using entropy minimization. In Computer Vision\u2013ECCV 2014: Proceedings of the 13th European Conference on Computer Vision (ECCV \u201914). Springer, 188\u2013203."},{"key":"e_1_3_1_74_2","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops","author":"Zhang Lichao","year":"2019","unstructured":"Lichao Zhang, Martin Danelljan, Abel Gonzalez-Garcia, Joost Van De Weijer, and Fahad Shahbaz Khan. 2019. Multi-modal fusion for end-to-end RGB-T tracking. In Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops."},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3060862"},{"key":"e_1_3_1_76_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00868"},{"key":"e_1_3_1_77_2","first-page":"5404","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang Tianlu","year":"2023","unstructured":"Tianlu Zhang, Hongyuan Guo, Qiang Jiao, Qiang Zhang, and Jungong Han. 2023. Efficient RGB-T tracking via cross-modality distillation. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 5404\u20135413."},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3072207"},{"key":"e_1_3_1_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00472"},{"key":"e_1_3_1_80_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00572"},{"key":"e_1_3_1_81_2","first-page":"13546","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Zheng Jilai","year":"2021","unstructured":"Jilai Zheng, Chao Ma, Houwen Peng, and Xiaokang Yang. 2021. Learning to track objects from unlabeled videos. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 13546\u201313555."},{"key":"e_1_3_1_82_2","unstructured":"Wenzhang Zhou Longyin Wen Libo Zhang Dawei Du Tiejian Luo and Yanjun Wu. 2019. SiamMan: Siamese motion-aware network for visual tracking. arXiv:1912.05515. Retrieved from https:\/\/arxiv.org\/abs\/1912.05515"},{"key":"e_1_3_1_83_2","first-page":"9516","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhu Jiawen","year":"2023","unstructured":"Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, and Huchuan Lu. 2023. Visual prompt multi-modal tracking. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 9516\u20139526."},{"key":"e_1_3_1_84_2","unstructured":"Y. Zhu C. Li Y. Lu L. Lin B. Luo and J. Tang. 2018. FANet: Quality-aware feature aggregation network for Robust RGB-T tracking. arXiv:1811.09855. 
Retrieved from https:\/\/arxiv.org\/abs\/1811.09855"},{"key":"e_1_3_1_85_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350928"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3698399","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3698399","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:58:34Z","timestamp":1750294714000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3698399"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,25]]},"references-count":84,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2024,12,31]]}},"alternative-id":["10.1145\/3698399"],"URL":"https:\/\/doi.org\/10.1145\/3698399","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,25]]},"assertion":[{"value":"2024-02-18","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-09-24","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-25","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}