{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T18:07:39Z","timestamp":1775326059569,"version":"3.50.1"},"reference-count":70,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2025,5,22]],"date-time":"2025-05-22T00:00:00Z","timestamp":1747872000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62472092, 62172089, 62106045"],"award-info":[{"award-number":["62472092, 62172089, 62106045"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"crossref","award":["BK20241751"],"award-info":[{"award-number":["BK20241751"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Jiangsu Provincial Key Laboratory of Computer Networking Technology"},{"name":"Jiangsu Provincial Key Laboratory of Network and Information Security","award":["BM2003201"],"award-info":[{"award-number":["BM2003201"]}]},{"name":"Key Laboratory of Computer Network and Information Integration of Ministry of Education of China","award":["93K-9"],"award-info":[{"award-number":["93K-9"]}]},{"name":"Nanjing Purple Mountain Laboratories"},{"name":"Big Data Computing Center of Southeast University"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,5,31]]},"abstract":"<jats:p>Single object tracking aims to locate one specific target in video sequences, given its initial state. 
Classical trackers rely solely on visual cues, restricting their ability to handle challenges such as appearance variations, ambiguity, and distractions. Hence, Vision-Language Tracking (VLT) has emerged as a promising approach, incorporating language descriptions to directly provide high-level semantics and enhance tracking performance. However, current Vision-Language (VL) trackers have not fully exploited the power of multi-modal learning, as they suffer from limitations such as heavily relying on off-the-shelf backbones for feature extraction, ineffective asynchronous fusion designs, and the absence of VL-related loss functions for optimizing multi-modal representation. Consequently, we present a novel tracker that progressively explores target-centric semantics for VLT. Specifically, we propose the first Synchronous Learning Backbone (SLB) for VLT, which consists of two novel modules: the Target Enhance Module (TEM) and the Semantic-Aware Module (SAM). Together, these modules ensure that multi-modal feature extraction and interaction proceed at the same pace, enabling the tracker to synchronously perceive target-related semantics from both visual and textual modalities. Moreover, we devise the dense matching loss to further strengthen multi-modal representation learning. 
Extensive experiments on VLT datasets demonstrate the superiority and effectiveness of our methods.<\/jats:p>","DOI":"10.1145\/3726529","type":"journal-article","created":{"date-parts":[[2025,3,28]],"date-time":"2025-03-28T20:19:31Z","timestamp":1743193171000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7268-7815","authenticated-orcid":false,"given":"Jiawei","family":"Ge","sequence":"first","affiliation":[{"name":"School of Cyber Science and Engineering, Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2448-6717","authenticated-orcid":false,"given":"Jiuxin","family":"Cao","sequence":"additional","affiliation":[{"name":"School of Cyber Science and Engineering, Southeast University, Nanjing, China and Purple Mountain Laboratories, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-5395-8007","authenticated-orcid":false,"given":"Xiangmei","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Cyber Science and Engineering, Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7676-2843","authenticated-orcid":false,"given":"Xuelin","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Cyber Science and Engineering, Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2634-7283","authenticated-orcid":false,"given":"Weijia","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Cyber Science and Engineering, Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-3865-2434","authenticated-orcid":false,"given":"Chang","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Cyber Science and Engineering, 
Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6735-7667","authenticated-orcid":false,"given":"Kun","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Cyber Science and Engineering, Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5209-9063","authenticated-orcid":false,"given":"Bo","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Southeast University, Nanjing, China and Purple Mountain Laboratories, Nanjing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,5,22]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"crossref","unstructured":"Yifan Bai Zeyang Zhao Yihong Gong and Xing Wei. 2023. ARTrackV2: Prompting autoregressive tracker where to look and how to describe. arXiv:2312.17133. Retrieved from https:\/\/arxiv.org\/abs\/2312.17133","DOI":"10.1109\/CVPR52733.2024.01802"},{"key":"e_1_3_2_3_2","first-page":"1401","volume-title":"Proceedings of the CVPR","author":"Bertinetto Luca","year":"2016","unstructured":"Luca Bertinetto, Jack Valmadre, Stuart Golodetz, Ondrej Miksik, and Philip H. S. Torr. 2016. Staple: Complementary learners for real-time tracking. In Proceedings of the CVPR, 1401\u20131409."},{"key":"e_1_3_2_4_2","first-page":"483","volume-title":"Proceedings of the ECCV","author":"Bhat Goutam","year":"2018","unstructured":"Goutam Bhat, Joakim Johnander, Martin Danelljan, Fahad Shahbaz Khan, and Michael Felsberg. 2018. Unveiling the power of deep tracking. In Proceedings of the ECCV, 483\u2013498."},{"key":"e_1_3_2_5_2","first-page":"8126","volume-title":"Proceedings of the CVPR","author":"Chen Xin","year":"2021","unstructured":"Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. 2021. Transformer tracking. 
In Proceedings of the CVPR, 8126\u20138135."},{"key":"e_1_3_2_6_2","first-page":"13608","volume-title":"Proceedings of the CVPR","author":"Cui Yutao","year":"2022","unstructured":"Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. 2022. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the CVPR, 13608\u201313618."},{"key":"e_1_3_2_7_2","first-page":"58736","article-title":"Mixformerv2: Efficient fully transformer tracking","volume":"36","author":"Cui Yutao","year":"2023","unstructured":"Yutao Cui, Tianhui Song, Gangshan Wu, and Limin Wang. 2023. Mixformerv2: Efficient fully transformer tracking. In Advances in Neural Information Processing Systems, Vol. 36, 58736\u201358751.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_8_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.5555\/3322706.3361996"},{"key":"e_1_3_2_10_2","first-page":"5374","volume-title":"Proceedings of the CVPR","author":"Fan Heng","year":"2019","unstructured":"Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. 2019. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the CVPR, 5374\u20135383."},{"key":"e_1_3_2_11_2","first-page":"700","volume-title":"Proceedings of the WACV","author":"Feng Qi","year":"2020","unstructured":"Qi Feng, Vitaly Ablavsky, Qinxun Bai, Guorong Li, and Stan Sclaroff. 2020. Real-time visual object tracking with natural language description. 
In Proceedings of the WACV, 700\u2013709."},{"key":"e_1_3_2_12_2","first-page":"5851","volume-title":"Proceedings of the CVPR","author":"Feng Qi","year":"2021","unstructured":"Qi Feng, Vitaly Ablavsky, Qinxun Bai, and Stan Sclaroff. 2021. Siamese natural language tracker: Tracking by natural language descriptions with Siamese trackers. In Proceedings of the CVPR, 5851\u20135860."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3309665"},{"key":"e_1_3_2_14_2","first-page":"146","volume-title":"Proceedings of the ECCV","author":"Gao Shenyuan","year":"2022","unstructured":"Shenyuan Gao, Chunluan Zhou, Chao Ma, Xinggang Wang, and Junsong Yuan. 2022. Aiatrack: Attention in attention for transformer visual tracking. In Proceedings of the ECCV. Springer, 146\u2013164."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3680657"},{"key":"e_1_3_2_16_2","first-page":"4446","article-title":"Divert more attention to vision-language tracking","volume":"35","author":"Guo Mingzhe","year":"2022","unstructured":"Mingzhe Guo, Zhipeng Zhang, Heng Fan, and Liping Jing. 2022. Divert more attention to vision-language tracking. In Advances in Neural Information Processing Systems, Vol. 35, 4446\u20134460.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_17_2","unstructured":"Mingzhe Guo Zhipeng Zhang Heng Fan Liping Jing Yilin Lyu Bing Li and Weiming Hu. 2022. Learning target-aware representation for visual tracking via informative interactions. arXiv:2201.02526. Retrieved from https:\/\/arxiv.org\/abs\/2201.02526"},{"key":"e_1_3_2_18_2","first-page":"19079","volume-title":"Proceedings of the CVPR","author":"Hong Lingyi","year":"2024","unstructured":"Lingyi Hong, Shilin Yan, Renrui Zhang, Wanyun Li, Xinyu Zhou, Pinxue Guo, Kaixun Jiang, Yiting Chen, Jinglun Li, Zhaoyu Chen, et al. 2024. Onetracker: Unifying visual object tracking with foundation models and efficient tuning. 
In Proceedings of the CVPR, 19079\u201319091."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2024.3433415"},{"key":"e_1_3_2_20_2","first-page":"7132","volume-title":"Proceedings of the CVPR","author":"Hu Jie","year":"2018","unstructured":"Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the CVPR, 7132\u20137141."},{"key":"e_1_3_2_21_2","first-page":"12519","volume-title":"Proceedings of the AAAI","author":"Hu Kun","year":"2024","unstructured":"Kun Hu, Wenjing Yang, Wanrong Huang, Xianchen Zhou, Mingyu Cao, Jing Ren, and Huibin Tan. 2024. Sequential fusion based multi-granularity consistency for space-time transformer tracking. In Proceedings of the AAAI, 12519\u201312527."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2957464"},{"issue":"5","key":"e_1_3_2_23_2","first-page":"6552","article-title":"Visual object tracking with discriminative filters and Siamese networks: A survey and outlook","volume":"45","author":"Javed Sajid","year":"2022","unstructured":"Sajid Javed, Martin Danelljan, Fahad Shahbaz Khan, Muhammad Haris Khan, Michael Felsberg, and Jiri Matas. 2022. Visual object tracking with discriminative filters and Siamese networks: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 45, 5 (2022), 6552\u20136574.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"issue":"12","key":"e_1_3_2_24_2","doi-asserted-by":"crossref","first-page":"9900","DOI":"10.1109\/TNNLS.2022.3161969","article-title":"Robust rgb-t tracking via graph attention-based bilinear pooling","volume":"34","author":"Kang Bin","year":"2022","unstructured":"Bin Kang, Dong Liang, Junxi Mei, Xiaoyang Tan, Quan Zhou, and Dengyin Zhang. 2022. Robust rgb-t tracking via graph attention-based bilinear pooling. IEEE Trans. Neural Netw. Learn. Syst. 34, 12 (2022), 9900\u20139911.","journal-title":"IEEE Trans. Neural Netw. Learn. 
Syst"},{"key":"e_1_3_2_25_2","first-page":"19113","volume-title":"Proceedings of the CVPR","author":"Uzair Khattak Muhammad","year":"2023","unstructured":"Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. 2023. Maple: Multi-modal prompt learning. In Proceedings of the CVPR, 19113\u201319122."},{"key":"e_1_3_2_26_2","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_2_27_2","first-page":"50959","article-title":"ZoomTrack: Target-aware non-uniform resizing for efficient visual tracking","volume":"36","author":"Kou Yutong","year":"2023","unstructured":"Yutong Kou, Jin Gao, Bing Li, Gang Wang, Weiming Hu, Yizheng Wang, and Liang Li. 2023. ZoomTrack: Target-aware non-uniform resizing for efficient visual tracking. In Advances in Neural Information Processing Systems Vol. 36, 50959\u201350977.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_28_2","first-page":"12888","volume-title":"Proceedings of the ICML","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the ICML. PMLR, 12888\u201312900."},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2024.3402436"},{"key":"e_1_3_2_30_2","first-page":"498","volume-title":"Proceedings of the ECCV","author":"Li Siyuan","year":"2022","unstructured":"Siyuan Li, Martin Danelljan, Henghui Ding, Thomas E. Huang, and Fisher Yu. 2022. Tracking every thing in the wild. In Proceedings of the ECCV. Springer, 498\u2013515."},{"key":"e_1_3_2_31_2","first-page":"11632","volume-title":"Proceedings of the CVPR","author":"Li Xiang","year":"2021","unstructured":"Xiang Li, Wenhai Wang, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. 2021. 
Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection. In Proceedings of the CVPR, 11632\u201311641."},{"key":"e_1_3_2_32_2","first-page":"4931","volume-title":"Proceedings of the CVPR","author":"Li Yihao","year":"2022","unstructured":"Yihao Li, Jun Yu, Zhongpeng Cai, and Yuwen Pan. 2022. Cross-modal target retrieval for tracking by natural language. In Proceedings of the CVPR, 4931\u20134940."},{"key":"e_1_3_2_33_2","first-page":"6495","volume-title":"Proceedings of the CVPR","author":"Li Zhenyang","year":"2017","unstructured":"Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. 2017. Tracking by natural language specification. In Proceedings of the CVPR, 6495\u20136503."},{"key":"e_1_3_2_34_2","first-page":"300","volume-title":"Proceedings of the ECCV","author":"Lin Liting","year":"2025","unstructured":"Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, and Haibin Ling. 2025. Tracking meets lora: Faster training, larger model, stronger performance. In Proceedings of the ECCV. Springer, 300\u2013318."},{"key":"e_1_3_2_35_2","first-page":"16743","volume-title":"Proceedings of NeurIPS","volume":"35","author":"Lin Liting","year":"2022","unstructured":"Liting Lin, Heng Fan, Zhipeng Zhang, Yong Xu, and Haibin Ling. 2022. Swintrack: A simple and strong baseline for transformer tracking. In Proceedings of NeurIPS, Vol. 35, 16743\u201316754."},{"key":"e_1_3_2_36_2","first-page":"740","volume-title":"Proceedings of the ECCV","author":"Lin Tsung-Yi","year":"2014","unstructured":"Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the ECCV, Part V. 
Springer, 740\u2013755."},{"key":"e_1_3_2_37_2","first-page":"10012","volume-title":"Proceedings of the IEEE\/CVF","author":"Liu Ze","year":"2021","unstructured":"Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE\/CVF, 10012\u201310022."},{"key":"e_1_3_2_38_2","first-page":"1948","volume-title":"Proceedings of the ACM MM","author":"Ma Ding","year":"2021","unstructured":"Ding Ma and Xiangqian Wu. 2021. Capsule-based object tracking with natural language specification. In Proceedings of the ACM MM, 1948\u20131956."},{"key":"e_1_3_2_39_2","first-page":"14012","volume-title":"Proceedings of the ICCV","author":"Ma Ding","year":"2023","unstructured":"Ding Ma and Xiangqian Wu. 2023. Tracking by natural language specification with long short-term context decoupling. In Proceedings of the ICCV, 14012\u201314021."},{"key":"e_1_3_2_40_2","first-page":"8781","volume-title":"Proceedings of the CVPR","author":"Ma Fan","year":"2022","unstructured":"Fan Ma, Mike Zheng Shou, Linchao Zhu, Haoqi Fan, Yilei Xu, Yi Yang, and Zhicheng Yan. 2022. Unified transformer tracker for object tracking. In Proceedings of the CVPR, 8781\u20138790."},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/TITS.2020.3046478"},{"key":"e_1_3_2_42_2","first-page":"8731","volume-title":"Proceedings of the CVPR","author":"Mayer Christoph","year":"2022","unstructured":"Christoph Mayer, Martin Danelljan, Goutam Bhat, Matthieu Paul, Danda Pani Paudel, Fisher Yu, and Luc Van Gool. 2022. Transforming model prediction for tracking. In Proceedings of the CVPR, 8731\u20138740."},{"key":"e_1_3_2_43_2","first-page":"300","volume-title":"Proceedings of the ECCV","author":"Muller Matthias","year":"2018","unstructured":"Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. 2018. 
Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the ECCV, 300\u2013317."},{"key":"e_1_3_2_44_2","first-page":"8748","volume-title":"Proceedings of the ICML","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the ICML. PMLR, 8748\u20138763."},{"key":"e_1_3_2_45_2","first-page":"19208","volume-title":"Proceedings of the CVPR","author":"Shao Yanyan","year":"2024","unstructured":"Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, and Jiming Chen. 2024. Context-aware integration of language and visual references for natural language tracking. In Proceedings of the CVPR, 19208\u201319217."},{"issue":"3","key":"e_1_3_2_46_2","doi-asserted-by":"crossref","first-page":"244","DOI":"10.1111\/1467-9280.03439","article-title":"Learning to recognize objects","volume":"14","author":"Smith Linda B.","year":"2003","unstructured":"Linda B. Smith. 2003. Learning to recognize objects. Psychol. Sci. 14, 3 (2003), 244\u2013250.","journal-title":"Psychol. Sci"},{"key":"e_1_3_2_47_2","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems Vol. 30, 5998\u20136008.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_48_2","first-page":"568","volume-title":"Proceedings of the IEEE\/CVF","author":"Wang Wenhai","year":"2021","unstructured":"Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. 
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE\/CVF, 568\u2013578."},{"key":"e_1_3_2_49_2","first-page":"13763","volume-title":"Proceedings of the CVPR","author":"Wang Xiao","year":"2021","unstructured":"Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. 2021. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In Proceedings of the CVPR, 13763\u201313773."},{"key":"e_1_3_2_50_2","first-page":"9697","volume-title":"Proceedings of the CVPR","author":"Wei Xing","year":"2023","unstructured":"Xing Wei, Yifan Bai, Yongchao Zheng, Dahu Shi, and Yihong Gong. 2023. Autoregressive Visual Tracking. In Proceedings of the CVPR, 9697\u20139706."},{"key":"e_1_3_2_51_2","first-page":"22","volume-title":"Proceedings of the ICCV","author":"Wu Haiping","year":"2021","unstructured":"Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. 2021. Cvt: Introducing convolutions to vision transformers. In Proceedings of the ICCV, 22\u201331."},{"key":"e_1_3_2_52_2","first-page":"19113","volume-title":"Proceedings of the CVPR","author":"Xie Fei","year":"2024","unstructured":"Fei Xie, Zhongdao Wang, and Chao Ma. 2024. DiffusionTrack: Point set diffusion model for visual object tracking. In Proceedings of the CVPR, 19113\u201319124."},{"key":"e_1_3_2_53_2","first-page":"19300","volume-title":"Proceedings of the CVPR","author":"Xie Jinxia","year":"2024","unstructured":"Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, and Rongrong Ji. 2024. Autoregressive queries for adaptive tracking with spatio-temporal transformers. 
In Proceedings of the CVPR, 19300\u201319309."},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3275156"},{"key":"e_1_3_2_55_2","first-page":"733","volume-title":"Proceedings of the ECCV","author":"Yan Bin","year":"2022","unstructured":"Bin Yan, Yi Jiang, Peize Sun, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu. 2022. Towards grand unification of object tracking. In Proceedings of the ECCV. Springer, 733\u2013751."},{"key":"e_1_3_2_56_2","first-page":"15325","volume-title":"Proceedings of the CVPR","author":"Yan Bin","year":"2023","unstructured":"Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, and Huchuan Lu. 2023. Universal instance perception as object discovery and retrieval. In Proceedings of the CVPR, 15325\u201315336."},{"key":"e_1_3_2_57_2","first-page":"10448","volume-title":"Proceedings of the ICCV","author":"Yan Bin","year":"2021","unstructured":"Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. 2021. Learning spatio-temporal transformer for visual tracking. In Proceedings of the ICCV, 10448\u201310457."},{"key":"e_1_3_2_58_2","first-page":"5289","volume-title":"Proceedings of the CVPR","author":"Yan Bin","year":"2021","unstructured":"Bin Yan, Xinyu Zhang, Dong Wang, Huchuan Lu, and Xiaoyun Yang. 2021. Alpha-refine: Boosting tracking performance by precise bounding box estimation. In Proceedings of the CVPR, 5289\u20135298."},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2021.3074239"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2020.3038720"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIM.2024.3462973"},{"key":"e_1_3_2_62_2","first-page":"5552","volume-title":"Proceedings of the ACM MM","author":"Zhang Chunhui","year":"2023","unstructured":"Chunhui Zhang, Xin Sun, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, and Yanfeng Wang. 2023. All in one: Exploring unified vision-language tracking with multi-modal alignment. 
In Proceedings of the ACM MM, 5552\u20135561."},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2024.3395352"},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIM.2024.3436098"},{"key":"e_1_3_2_65_2","first-page":"5404","volume-title":"Proceedings of the CVPR","author":"Zhang Tianlu","year":"2023","unstructured":"Tianlu Zhang, Hongyuan Guo, Qiang Jiao, Qiang Zhang, and Jungong Han. 2023. Efficient RGB-T tracking via cross-modality distillation. In Proceedings of the CVPR, 5404\u20135413."},{"key":"e_1_3_2_66_2","first-page":"13339","volume-title":"Proceedings of the ICCV","author":"Zhang Zhipeng","year":"2021","unstructured":"Zhipeng Zhang, Yihao Liu, Xiao Wang, Bing Li, and Weiming Hu. 2021. Learn to match: Automatic matching network design for visual tracking. In Proceedings of the ICCV, 13339\u201313348."},{"key":"e_1_3_2_67_2","first-page":"76","volume-title":"Proceedings of the ECCV","author":"Zhao Zelin","year":"2022","unstructured":"Zelin Zhao, Ze Wu, Yueqing Zhuang, Boxun Li, and Jiaya Jia. 2022. Tracking objects as pixel-wise distributions. In Proceedings of the ECCV. Springer, 76\u201394."},{"key":"e_1_3_2_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3301933"},{"key":"e_1_3_2_69_2","doi-asserted-by":"crossref","unstructured":"Li Zhou Zikun Zhou Kaige Mao and Zhenyu He. 2023. Joint visual grounding and tracking with natural language specification. arXiv:2303.12027. Retrieved from https:\/\/arxiv.org\/abs\/2303.12027","DOI":"10.1109\/CVPR52729.2023.02217"},{"key":"e_1_3_2_70_2","first-page":"1473","volume-title":"Proceedings of the IEEE\/CVF","author":"Zhu Xuelin","year":"2023","unstructured":"Xuelin Zhu, Jian Liu, Weijia Liu, Jiawei Ge, Bo Liu, and Jiuxin Cao. 2023. Scene-aware label graph learning for multi-label image classification. 
In Proceedings of the IEEE\/CVF, 1473\u20131482."},{"key":"e_1_3_2_71_2","first-page":"22045","volume-title":"Proceedings of the ICCV","author":"Zhu Zhiyu","year":"2023","unstructured":"Zhiyu Zhu, Junhui Hou, and Dapeng Oliver Wu. 2023. Cross-modal orthogonal high-rank augmentation for rgb-event transformer-trackers. In Proceedings of the ICCV, 22045\u201322055."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3726529","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3726529","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:56:42Z","timestamp":1750298202000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3726529"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,22]]},"references-count":70,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,5,31]]}},"alternative-id":["10.1145\/3726529"],"URL":"https:\/\/doi.org\/10.1145\/3726529","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,22]]},"assertion":[{"value":"2024-08-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-21","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}