{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:25:32Z","timestamp":1760142332052,"version":"build-2065373602"},"reference-count":80,"publisher":"Springer Science and Business Media LLC","issue":"10","license":[{"start":{"date-parts":[[2025,6,23]],"date-time":"2025-06-23T00:00:00Z","timestamp":1750636800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,6,23]],"date-time":"2025-06-23T00:00:00Z","timestamp":1750636800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["U23A20384","62176041","62402084"],"award-info":[{"award-number":["U23A20384","62176041","62402084"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012131","name":"Department of Science and Technology of Liaoning Province","doi-asserted-by":"publisher","award":["2024JH2\/102600040"],"award-info":[{"award-number":["2024JH2\/102600040"]}],"id":[{"id":"10.13039\/501100012131","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002858","name":"China Postdoctoral Science Foundation","doi-asserted-by":"publisher","award":["2024M750319"],"award-info":[{"award-number":["2024M750319"]}],"id":[{"id":"10.13039\/501100002858","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2025,10]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. 
However, their practicality is limited on resource-constrained devices because of their slow processing speeds. To address this challenge, we present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices. The core innovation of HiT lies in its Bridge Module, which connects lightweight transformers to the tracking framework, enhancing feature representation quality. Additionally, we introduce a dual-image position encoding approach to effectively encode spatial information. HiT achieves an impressive speed of 61 frames per second (fps) on the NVIDIA Jetson AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark, outperforming all previous efficient trackers. Building on HiT, we propose DyHiT, an efficient dynamic tracker that flexibly adapts to scene complexity by selecting routes with varying computational requirements. DyHiT uses search area features extracted by the backbone network and inputs them into an efficient dynamic router to classify tracking scenarios. Based on the classification, DyHiT applies a divide-and-conquer strategy, selecting appropriate routes to achieve a superior trade-off between accuracy and speed. The fastest version of DyHiT achieves 111 fps on NVIDIA Jetson AGX while maintaining an AUC of 62.4% on LaSOT. Furthermore, we introduce a training-free acceleration method based on the dynamic routing architecture of DyHiT. This method significantly improves the execution speed of various high-performance trackers without sacrificing accuracy. 
For instance, our acceleration method enables the state-of-the-art tracker SeqTrack-B256 to achieve a <jats:inline-formula>\n              <jats:alternatives>\n                <jats:tex-math>$$2.68\\times $$<\/jats:tex-math>\n                <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>2.68<\/mml:mn>\n                    <mml:mo>\u00d7<\/mml:mo>\n                  <\/mml:mrow>\n                <\/mml:math>\n              <\/jats:alternatives>\n            <\/jats:inline-formula> speedup on an NVIDIA GeForce RTX 2080 Ti GPU while maintaining the same AUC of 69.9% on LaSOT. Codes, models, and results are available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/kangben258\/HiT\" ext-link-type=\"uri\">https:\/\/github.com\/kangben258\/HiT<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s11263-025-02500-9","type":"journal-article","created":{"date-parts":[[2025,6,23]],"date-time":"2025-06-23T01:15:28Z","timestamp":1750641328000},"page":"6689-6711","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual 
Tracking"],"prefix":"10.1007","volume":"133","author":[{"given":"Ben","family":"Kang","sequence":"first","affiliation":[]},{"given":"Xin","family":"Chen","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3924-5537","authenticated-orcid":false,"given":"Jie","family":"Zhao","sequence":"additional","affiliation":[]},{"given":"Chunjuan","family":"Bo","sequence":"additional","affiliation":[]},{"given":"Dong","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Huchuan","family":"Lu","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,6,23]]},"reference":[{"key":"2500_CR1","doi-asserted-by":"crossref","unstructured":"Bertinetto, Luca, Valmadre, Jack, Henriques, Jo\u00e3o\u00a0F, Vedaldi, Andrea, & Torr, Philip H\u00a0S. (2016). Fully-Convolutional Siamese Networks for Object Tracking. In ECCV, pages 850\u2013865.","DOI":"10.1007\/978-3-319-48881-3_56"},{"key":"2500_CR2","doi-asserted-by":"crossref","unstructured":"Bhat, Goutam, Danelljan, Martin, Gool, Luc\u00a0Van, & Timofte, Radu. (2019). Learning Discriminative Model Prediction for Tracking. In ICCV, pages 6181\u20136190.","DOI":"10.1109\/ICCV.2019.00628"},{"key":"2500_CR3","doi-asserted-by":"crossref","unstructured":"Blatter, Philippe, Kanakis, Menelaos, Danelljan, Martin, & Van\u00a0Gool, Luc. (2023). Efficient Visual Tracking with Exemplar Transformers. In WACV, pages 1571\u20131581.","DOI":"10.1109\/WACV56688.2023.00162"},{"key":"2500_CR4","doi-asserted-by":"crossref","unstructured":"Borsuk, Vasyl, Vei, Roman, Kupyn, Orest, Martyniuk, Tetiana, Krashenyi, Igor, & Matas, Ji\u0159i. (2022). FEAR: Fast, Efficient, Accurate and Robust Visual Tracker. In ECCV, pages 644\u2013663.","DOI":"10.1007\/978-3-031-20047-2_37"},{"key":"2500_CR5","doi-asserted-by":"crossref","unstructured":"Chen, Boyu, Li, Peixia, Bai, Lei, Qiao, Lei, Shen, Qiuhong,\u00a0Li, Bo, Gan, Weihao, Wu, Wei, & Ouyang, Wanli. (2022). 
Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking. In ECCV, pages 375\u2013392.","DOI":"10.1007\/978-3-031-20047-2_22"},{"key":"2500_CR6","doi-asserted-by":"crossref","unstructured":"Chen, Xin, Kang, Ben, Geng, Wanting, Zhu, Jiawen,\u00a0Liu, Yi, Wang, Dong, & Lu, Huchuan. (2025). Sutrack: Towards simple and unified single object tracking. In AAAI, pages 2239\u20132247.","DOI":"10.1609\/aaai.v39i2.32223"},{"key":"2500_CR7","doi-asserted-by":"crossref","unstructured":"Chen, Xin, Kang, Ben, Wang, Dong, Li, Dongdong, & Lu, Huchuan. (2022). Efficient Visual Tracking via Hierarchical Cross-Attention Transformer. In ECCVW, pages 461\u2013477.","DOI":"10.1007\/978-3-031-25085-9_26"},{"key":"2500_CR8","doi-asserted-by":"crossref","unstructured":"Chen, Xin, Peng, Houwen, Wang, Dong, Lu, Huchuan, & Hu, Han. (2023). Seqtrack: Sequence to sequence learning for visual object tracking. In CVPR, pages 14572\u201314581.","DOI":"10.1109\/CVPR52729.2023.01400"},{"key":"2500_CR9","unstructured":"Chen, Xin, Yan, Bin, Zhu, Jiawen, Wang, Dong, Yang, Xiaoyun, & Lu, Huchuan. (2021). Transformer Tracking. In CVPR, pages 8126\u20138135,"},{"key":"2500_CR10","doi-asserted-by":"crossref","unstructured":"Chen, Zedu, Zhong, Bineng, Li, Guorong, Zhang, Shengping, & Ji, Rongrong. (2020). Siamese Box Adaptive Network for Visual Tracking. In CVPR, pages 6667\u20136676,","DOI":"10.1109\/CVPR42600.2020.00670"},{"issue":"7","key":"2500_CR11","first-page":"8507","volume":"45","author":"Xin Chen","year":"2022","unstructured":"Chen, Xin, Yan, Bin, Zhu, Jiawen, Huchuan, Lu., Ruan, Xiang, & Wang, Dong. (2022). High-performance transformer tracking. IEEE TPAMI, 45(7), 8507\u20138523.","journal-title":"IEEE TPAMI"},{"key":"2500_CR12","doi-asserted-by":"crossref","unstructured":"Cui, Yutao, Jiang, Cheng, Wang, Limin, & Wu, Gangshan. (2022). Mixformer: End-to-End Tracking with Iterative Mixed Attention. 
In CVPR, pages 13598\u201313608.","DOI":"10.1109\/CVPR52688.2022.01324"},{"key":"2500_CR13","unstructured":"Cui, Yutao, Song, Tianhui, Wu, Gangshan, & Wang, Limin. (2024). Mixformerv2: Efficient fully transformer tracking. NIPS, 36,"},{"key":"2500_CR14","doi-asserted-by":"crossref","unstructured":"Dai, Kenan, Wang, Dong, Lu, Huchuan, Sun, Chong, & Li, Jianhua. (2019). Visual tracking via adaptive spatially-regularized correlation filters. In CVPR, pages 4670\u20134679.","DOI":"10.1109\/CVPR.2019.00480"},{"key":"2500_CR15","doi-asserted-by":"crossref","unstructured":"Danelljan, Martin, Bhat, Goutam, Khan, Fahad\u00a0Shahbaz, & Felsberg, Michael. (2019). ATOM: Accurate Tracking by Overlap Maximization. In CVPR, pages 4660\u20134669.","DOI":"10.1109\/CVPR.2019.00479"},{"key":"2500_CR16","doi-asserted-by":"crossref","unstructured":"Danelljan, Martin, Bhat,Goutam, Shahbaz\u00a0Khan, Fahad, & Felsberg, Michael. (2017). ECO: Efficient Convolution Operators for Tracking. In CVPR, pages 6931\u20136939,","DOI":"10.1109\/CVPR.2017.733"},{"key":"2500_CR17","doi-asserted-by":"crossref","unstructured":"Danelljan, Martin, Gool, Luc\u00a0Van, & Timofte, Radu. (2020). Probabilistic Regression for Visual Tracking. In CVPR, pages 7181\u20137190,","DOI":"10.1109\/CVPR42600.2020.00721"},{"key":"2500_CR18","unstructured":"Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, et\u00a0al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR,"},{"key":"2500_CR19","doi-asserted-by":"crossref","unstructured":"Fan, Heng, Lin, Liting, Yang, Fan, Chu, Peng, Deng,Ge, Yu, Sijia, Bai, Hexin, Xu, Yong, Liao, Chunyuan, & Ling, Haibin. (2019). LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. 
In CVPR, pages 5374\u20135383,","DOI":"10.1109\/CVPR.2019.00552"},{"issue":"2","key":"2500_CR20","doi-asserted-by":"publisher","first-page":"439","DOI":"10.1007\/s11263-020-01387-y","volume":"129","author":"Heng Fan","year":"2021","unstructured":"Fan, Heng, Bai, Hexin, Lin, Liting, Yang, Fan, Chu, Peng, Deng, Ge., Sijia, Yu., Huang, Mingzhen, Liu, Juehuan, Yong, Xu., et al. (2021). LaSOT: A High-quality Large-scale Single Object Tracking Benchmark. IJCV, 129(2), 439\u2013461.","journal-title":"IJCV"},{"key":"2500_CR21","unstructured":"Fei, X.,\u00a0Wankou, Y.,\u00a0Bo, L.,\u00a0Kaihua, Z.,\u00a0Wanli, X., &\u00a0Wangmeng, Z. (2020). Discriminative segmentation tracking using dual memory banks. arXiv: 2009.09669 v1,"},{"key":"2500_CR22","doi-asserted-by":"crossref","unstructured":"Figurnov, Michael, Collins, Maxwell\u00a0D, Zhu, Yukun,\u00a0Zhang, Li, Huang, Jonathan, Vetrov, Dmitry, & Salakhutdinov, Ruslan. (2017). Spatially adaptive computation time for residual networks. In CVPR, pages 1039\u20131048,","DOI":"10.1109\/CVPR.2017.194"},{"key":"2500_CR23","doi-asserted-by":"crossref","unstructured":"Gao, Shenyuan, Zhou, Chunluan, Ma, Chao, Wang, Xinggang, & Yuan, Junsong. (2022). AiATrack: Attention in attention for transformer visual tracking. In ECCV, pages 146\u2013164,","DOI":"10.1007\/978-3-031-20047-2_9"},{"key":"2500_CR24","doi-asserted-by":"crossref","unstructured":"Graham, Benjamin, El-Nouby, Alaaeldin, Touvron, Hugo, Stock, Pierre, Joulin, Armand, J\u00e9gou, Herv\u00e9, & Douze, Matthijs. (2021). LeViT: a Vision Transformer in ConvNet\u2019s Clothing for Faster Inference. In ICCV, pages 12239\u201312249,","DOI":"10.1109\/ICCV48922.2021.01204"},{"key":"2500_CR25","doi-asserted-by":"crossref","unstructured":"Guo, Dongyan, Wang, Jun, Cui, Ying, Wang, Zhenhua, & Chen, Shengyong. (2020). SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking. 
In CVPR, pages 6268\u20136276,","DOI":"10.1109\/CVPR42600.2020.00630"},{"key":"2500_CR26","doi-asserted-by":"crossref","unstructured":"Wu, Haiping, Xiao, Bin, Codella, Noel, Liu, Mengchen, Dai, Xiyang, Yuan, Lu, & Zhang, Lei. (2021). CvT: Introducing Convolutions to Vision Transformers. In ICCV, pages 22\u201331.","DOI":"10.1109\/ICCV48922.2021.00009"},{"key":"2500_CR27","first-page":"36845","volume":"35","author":"Yizeng Han","year":"2022","unstructured":"Han, Yizeng, Yuan, Zhihang, Yifan, Pu., Xue, Chenhao, Song, Shiji, Sun, Guangyu, & Huang, Gao. (2022). Latency-aware spatial-wise dynamic networks. NIPS, 35, 36845\u201336857.","journal-title":"NIPS"},{"key":"2500_CR28","doi-asserted-by":"crossref","unstructured":"He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, & Sun, Jian. (2016). Deep Residual Learning for Image Recognition. In CVPR, pages 770\u2013778,","DOI":"10.1109\/CVPR.2016.90"},{"key":"2500_CR29","doi-asserted-by":"crossref","unstructured":"Huang, Chen, Lucey, Simon, & Ramanan, Deva. (2017). Learning policies for adaptive tracking with deep feature cascades. In ICCV, pages 105\u2013114,","DOI":"10.1109\/ICCV.2017.21"},{"key":"2500_CR30","unstructured":"Huang, Gao, Chen, Danlu, Li, Tianhong, Wu, Felix, van\u00a0der Maaten, Laurens, & Weinberger, Kilian. (2018). Multi-scale dense networks for resource efficient image classification. In ICLR,"},{"issue":"5","key":"2500_CR31","doi-asserted-by":"publisher","first-page":"1562","DOI":"10.1109\/TPAMI.2019.2957464","volume":"43","author":"Lianghua Huang","year":"2021","unstructured":"Huang, Lianghua, Zhao, Xin, & Huang, Kaiqi. (2021). Got-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE TPAMI, 43(5), 1562\u20131577.","journal-title":"IEEE TPAMI"},{"key":"2500_CR32","doi-asserted-by":"crossref","unstructured":"Kang, Ben, Chen, Xin, Lai, Simiao, Liu, Yang,\u00a0Liu, Yi, & Wang, Dong. (2025). Exploring enhanced contextual information for video-level object tracking. 
In AAAI, pages 4194\u20134202,","DOI":"10.1609\/aaai.v39i4.32440"},{"key":"2500_CR33","doi-asserted-by":"crossref","unstructured":"Kang, Ben, Chen, Xin, Wang, Dong, Peng, Houwen, & Lu, Huchuan. (2023). Exploring lightweight hierarchical vision transformers for efficient visual tracking. In ICCV, pages 9612\u20139621,","DOI":"10.1109\/ICCV51070.2023.00881"},{"key":"2500_CR34","doi-asserted-by":"crossref","unstructured":"Kiani\u00a0Galoogahi, Hamed, Fagg, Ashton, Huang, Chen, Ramanan, Deva, & Lucey, Simon. (2017). Need for Speed: A Benchmark for Higher Frame Rate Object Tracking. In ICCV, pages 1134\u20131143,","DOI":"10.1109\/ICCV.2017.128"},{"key":"2500_CR35","unstructured":"Kristan, Matej, Leonardis, Ale\u0161, Matas, Ji\u0159\u00ed, Felsberg, Michael, Pflugfelder, Roman, K\u00e4m\u00e4r\u00e4inen, Joni-Kristian, Danelljan, Martin, Zajc, Luka\u00a0\u010cehovin, Luke\u017ei\u010d, Alan, Drbohlav, Ondrej, et\u00a0al. (2020). The eighth visual object tracking VOT2020 challenge results. In ECCV, pages 547\u2013601,"},{"key":"2500_CR36","unstructured":"Kristan, Matej, Matas, Ji\u0159\u00ed, Leonardis, Ale\u0161, Felsberg, Michael, Pflugfelder, Roman, K\u00e4m\u00e4r\u00e4inen, Joni-Kristian, Chang, Hyung\u00a0Jin, Danelljan, Martin, Cehovin, Luka, Luke\u017ei\u010d, Alan, et\u00a0al. (2021). The ninth visual object tracking vot2021 challenge results. In ICCV, pages 2711\u20132738,"},{"key":"2500_CR37","unstructured":"Krizhevsky, Alex, Sutskever, Ilya, & Hinton, Geoffrey\u00a0E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106\u20131114,"},{"key":"2500_CR38","doi-asserted-by":"crossref","unstructured":"Li, Bo, Wu, Wei, Wang, Qiang, Zhang, Fangyi, Xing, Junliang, & Yan, Junjie. (2019). SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. 
In CVPR, pages 4282\u20134291,","DOI":"10.1109\/CVPR.2019.00441"},{"key":"2500_CR39","doi-asserted-by":"crossref","unstructured":"Li, Bo, Yan, Junjie, Wu, Wei, Zhu, Zheng, & Hu, Xiaolin. (2018). High Performance Visual Tracking With Siamese Region Proposal Network. In CVPR, pages 8971\u20138980,","DOI":"10.1109\/CVPR.2018.00935"},{"key":"2500_CR40","doi-asserted-by":"crossref","unstructured":"Li, Changlin, Wang, Guangrun, Wang, Bing, Liang, Xiaodan, Li, Zhihui, & Chang, Xiaojun. (2021). Dynamic slimmable network. In CVPR, pages 8607\u20138617,","DOI":"10.1109\/CVPR46437.2021.00850"},{"key":"2500_CR41","doi-asserted-by":"crossref","unstructured":"Li, Peixia, Wang, Dong, Wang, Lijun, & Lu, Huchuan. (2018). Deep visual tracking: Review and experimental comparison. PR, 76:323\u2013338,","DOI":"10.1016\/j.patcog.2017.11.007"},{"key":"2500_CR42","doi-asserted-by":"crossref","unstructured":"Lin, Tsung-Yi, Doll\u00e1r, Piotr, Girshick, Ross, He, Kaiming, Hariharan, Bharath, & Belongie, Serge. (2017). Feature Pyramid Networks for Object Detection. In CVPR, pages 936\u2013944,","DOI":"10.1109\/CVPR.2017.106"},{"key":"2500_CR43","doi-asserted-by":"crossref","unstructured":"Lin, Tsung-Yi, Maire, Michael, Belongie, Serge\u00a0J., Bourdev, Lubomir\u00a0D., Girshick, Ross\u00a0B., Hays, James, Perona, Pietro, Ramanan, Deva, Doll\u00e1r, Piotr, & Lawrence Zitnick, C. (2014). Microsoft COCO: Common Objects in Context. In ECCV, pages 740\u2013755,","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"2500_CR44","doi-asserted-by":"crossref","unstructured":"Liu, Ze, Lin, Yutong, Cao, Yue, Hu, Han, Wei, Yixuan, Zhang, Zheng, Lin, Stephen, & Guo, Baining. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 
In ICCV, pages 9992\u201310002,","DOI":"10.1109\/ICCV48922.2021.00986"},{"issue":"1","key":"2500_CR45","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1007\/s44267-024-00068-5","volume":"2","author":"Chang Liu","year":"2024","unstructured":"Liu, Chang, Yuan, Yongsheng, Chen, Xin, Huchuan, Lu., & Wang, Dong. (2024). Spatial-temporal initialization dilemma: towards realistic visual tracking. Visual Intelligence, 2(1), 35.","journal-title":"Visual Intelligence"},{"key":"2500_CR46","unstructured":"Loshchilov, Ilya, & Hutter, Frank. (2019). Decoupled Weight Decay Regularization. In ICLR,"},{"key":"2500_CR47","doi-asserted-by":"crossref","unstructured":"Mayer, Christoph, Danelljan, Martin, Bhat, Goutam, Paul, Matthieu, Paudel, Danda\u00a0Pani, Yu, Fisher, & Van\u00a0Gool, Luc. (2022). Transforming model prediction for tracking. In CVPR, pages 8731\u20138740,","DOI":"10.1109\/CVPR52688.2022.00853"},{"key":"2500_CR48","unstructured":"Mehta, Sachin, & Rastegari, Mohammad. (2022). MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In ICLR,"},{"key":"2500_CR49","doi-asserted-by":"crossref","unstructured":"Mueller, Matthias, Smith, Neil, & Ghanem, Bernard. (2016). A Benchmark and Simulator for UAV Tracking. In ECCV, pages 445\u2013461,","DOI":"10.1007\/978-3-319-46448-0_27"},{"key":"2500_CR50","doi-asserted-by":"crossref","unstructured":"Muller, Matthias, Bibi, Adel, Giancola, Silvio, Alsubaihi, Salman, & Ghanem, Bernard. (2018). TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. In ECCV, pages 310\u2013327,","DOI":"10.1007\/978-3-030-01246-5_19"},{"key":"2500_CR51","first-page":"13937","volume":"34","author":"Yongming Rao","year":"2021","unstructured":"Rao, Yongming, Zhao, Wenliang, Liu, Benlin, Jiwen, Lu., Zhou, Jie, & Hsieh, Cho-Jui. (2021). Dynamicvit: Efficient vision transformers with dynamic token sparsification. 
NIPS, 34, 13937\u201313949.","journal-title":"NIPS"},{"key":"2500_CR52","doi-asserted-by":"crossref","unstructured":"Rezatofighi, Hamid, Tsoi, Nathan, Gwak, JunYoung, Sadeghian, Amir, Reid, Ian\u00a0D., & Savarese, Silvio. (2019). Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In CVPR, pages 658\u2013666,","DOI":"10.1109\/CVPR.2019.00075"},{"issue":"3","key":"2500_CR53","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"Olga Russakovsky","year":"2015","unstructured":"Russakovsky, Olga, Deng, Jia, Hao, Su., Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, & Bernstein, Michael. (2015). ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3), 211\u2013252.","journal-title":"IJCV"},{"key":"2500_CR54","doi-asserted-by":"crossref","unstructured":"Song, Zikai, Yu, Junqing, Chen, Yi-Ping Phoebe, & Yang, Wei. (2022). Transformer tracking with cyclic shifting window attention. In CVPR, pages 8791\u20138800,","DOI":"10.1109\/CVPR52688.2022.00859"},{"key":"2500_CR55","doi-asserted-by":"crossref","unstructured":"Tao, Ran, Gavves, Efstratios, & Smeulders, Arnold W.\u00a0M. (2016). Siamese Instance Search for Tracking. In CVPR, pages 1420\u20131429,","DOI":"10.1109\/CVPR.2016.158"},{"key":"2500_CR56","doi-asserted-by":"publisher","first-page":"6021","DOI":"10.1007\/s11263-024-02182-9","volume":"132","author":"Xu Tianyang","year":"2024","unstructured":"Tianyang, Xu., Pan, Yifan, Feng, Zhenhua, Zhu, Xuefeng, Cheng, Chunyang, Xiao-Jun, Wu., & Kittler, Josef. (2024). Learning feature restoration transformer for robust dehazing visual object tracking. IJCV, 132, 6021\u20136038.","journal-title":"IJCV"},{"key":"2500_CR57","unstructured":"Touvron, Hugo, Cord, Matthieu, Douze, Matthijs, Massa, Francisco, Sablayrolles, Alexandre, & J\u00e9gou, Herv\u00e9. (2021). 
Training data-efficient image transformers & distillation through attention. In ICML, pages 10347\u201310357,"},{"key":"2500_CR58","unstructured":"Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan\u00a0N, Kaiser, Lukasz, & Polosukhin, Illia. (2017). Attention is all you need. In NIPS, pages 5998\u20136008,"},{"key":"2500_CR59","doi-asserted-by":"crossref","unstructured":"Wang, Ning, Zhou, Wengang, Wang, Jie, & Li, Houqiang. (2021). Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. In CVPR, pages 1571\u20131580,","DOI":"10.1109\/CVPR46437.2021.00162"},{"key":"2500_CR60","doi-asserted-by":"crossref","unstructured":"Wang, Qiang,\u00a0Zhang, Li, Bertinetto, Luca, Hu, Weiming, & Torr, Philip H.\u00a0S. (2019). Fast Online Object Tracking and Segmentation: A Unifying Approach. In CVPR, pages 1328\u20131338,","DOI":"10.1109\/CVPR.2019.00142"},{"key":"2500_CR61","doi-asserted-by":"crossref","unstructured":"Wang, Wenhai, Xie, Enze, Li, Xiang, Fan, Deng-Ping, Song, Kaitao, Liang, Ding, Lu, Tong, Luo, Ping, & Shao, Ling. (2021). Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In ICCV, pages 548\u2013558,","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"2500_CR62","doi-asserted-by":"crossref","unstructured":"Wang, Xiao, Shu, Xiujun, Zhang, Zhipeng,\u00a0Jiang, Bo, Wang, Yaowei, Tian, Yonghong, & Wu, Feng. (2021). Towards More Flexible and Accurate Object Tracking With Natural Language: Algorithms and Benchmark. In CVPR, pages 13763\u201313773,","DOI":"10.1109\/CVPR46437.2021.01355"},{"key":"2500_CR63","doi-asserted-by":"crossref","unstructured":"Wang, Xin, Yu, Fisher, Dou, Zi-Yi, Darrell, Trevor, & Gonzalez, Joseph\u00a0E. (2018). Skipnet: Learning dynamic routing in convolutional networks. 
In ECCV, pages 409\u2013424,","DOI":"10.1007\/978-3-030-01261-8_25"},{"issue":"2","key":"2500_CR64","doi-asserted-by":"publisher","first-page":"400","DOI":"10.1007\/s11263-020-01357-4","volume":"129","author":"Ning Wang","year":"2021","unstructured":"Wang, Ning, Zhou, Wengang, Song, Yibing, Ma, Chao, Liu, Wei, & Li, Houqiang. (2021). Unsupervised deep representation learning for real-time tracking. IJCV, 129(2), 400\u2013418.","journal-title":"IJCV"},{"key":"2500_CR65","first-page":"11960","volume":"34","author":"Yulin Wang","year":"2021","unstructured":"Wang, Yulin, Huang, Rui, Song, Shiji, Huang, Zeyi, & Huang, Gao. (2021). Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition. NIPS, 34, 11960\u201311973.","journal-title":"NIPS"},{"key":"2500_CR66","doi-asserted-by":"crossref","unstructured":"Wei, Xing, Bai, Yifan, Zheng, Yongchao, Shi, Dahu, & Gong, Yihong. (2023). Autoregressive visual tracking. In CVPR, pages 9697\u20139706,","DOI":"10.1109\/CVPR52729.2023.00935"},{"key":"2500_CR67","doi-asserted-by":"crossref","unstructured":"Wu, Kan, Zhang, Jinnian, Peng, Houwen, Liu, Mengchen, Xiao, Bin, Fu, Jianlong, &\u00a0Yuan, Lu. (2022). TinyViT: Fast Pretraining Distillation for Small Vision Transformers. In ECCV, pages 68\u201385,","DOI":"10.1007\/978-3-031-19803-8_5"},{"key":"2500_CR68","doi-asserted-by":"crossref","unstructured":"Xie, Fei, Wang, Chunyu, Wang, Guangting, Cao, Yue, Yang, Wankou, & Zeng, Wenjun. (2022). Correlation-aware deep tracking. In CVPR, pages 8751\u20138760,","DOI":"10.1109\/CVPR52688.2022.00855"},{"key":"2500_CR69","doi-asserted-by":"crossref","unstructured":"Xu, Yinda, Wang, Zeyu, Li, Zuoxin,\u00a0Yuan, Ye, & Yu, Gang. (2020). SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines. 
In AAAI, pages 12549\u201312556,","DOI":"10.1609\/aaai.v34i07.6944"},{"key":"2500_CR70","doi-asserted-by":"crossref","unstructured":"Yan, Bin, Peng, Houwen, Fu, Jianlong, Wang, Dong, & Lu, Huchuan. (2021). Learning Spatio-Temporal Transformer for Visual Tracking. In ICCV, pages 10428\u201310437,","DOI":"10.1109\/ICCV48922.2021.01028"},{"key":"2500_CR71","doi-asserted-by":"crossref","unstructured":"Yan, Bin, Peng, Houwen, Wu, Kan, Wang, Dong, Fu, Jianlong, & Lu, Huchuan. (2021). LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search. In CVPR, pages 15180\u201315189,","DOI":"10.1109\/CVPR46437.2021.01493"},{"key":"2500_CR72","doi-asserted-by":"crossref","unstructured":"Yang, Le, Han, Yizeng,\u00a0Chen, Xi, Song, Shiji, Dai, Jifeng, & Huang, Gao. (2020). Resolution adaptive networks for efficient inference. In CVPR, pages 2369\u20132378,","DOI":"10.1109\/CVPR42600.2020.00244"},{"key":"2500_CR73","doi-asserted-by":"crossref","unstructured":"Ye, Botao, Chang, Hong, Ma, Bingpeng, Shan, Shiguang, & Chen, Xilin. (2022). Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. In ECCV, pages 341\u2013357,","DOI":"10.1007\/978-3-031-20047-2_20"},{"key":"2500_CR74","doi-asserted-by":"crossref","unstructured":"Yuan, Li, Chen, Yunpeng, Wang, Tao, Yu, Weihao, Shi, Yujun, Jiang, Zi-Hang, Tay, Francis\u00a0EH, Feng, Jiashi, & Yan, Shuicheng. (2021). Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. In ICCV, pages 538\u2013547,","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"2500_CR75","doi-asserted-by":"crossref","unstructured":"Zhang,Zhipeng, & Peng, Houwen. (2019). Deeper and Wider Siamese Networks for Real-Time Visual Tracking. 
In CVPR, pages 4591\u20134600,","DOI":"10.1109\/CVPR.2019.00472"},{"key":"2500_CR76","doi-asserted-by":"publisher","first-page":"3509","DOI":"10.1007\/s11263-024-02034-6","volume":"132","author":"Jiangning Zhang","year":"2024","unstructured":"Zhang, Jiangning, Li, Xiangtai, Wang, Yabiao, Wang, Chengjie, Yang, Yibo, Liu, Yong, & Tao, Dacheng. (2024). Eatformer: Improving vision transformer inspired by evolutionary algorithm. IJCV, 132, 3509\u20133536.","journal-title":"IJCV"},{"issue":"5","key":"2500_CR77","first-page":"6168","volume":"45","author":"Jie Zhao","year":"2022","unstructured":"Zhao, Jie, Dai, Kenan, Zhang, Pengyu, Wang, Dong, & Huchuan, Lu. (2022). Robust online tracking with meta-updater. IEEE TPAMI, 45(5), 6168\u20136182.","journal-title":"IEEE TPAMI"},{"issue":"12","key":"2500_CR78","first-page":"25323","volume":"23","author":"Jie Zhao","year":"2022","unstructured":"Zhao, Jie, Zhang, Jingshu, Li, Dongdong, & Wang, Dong. (2022). Vision-based anti-uav detection and tracking. IEEE T-ITS, 23(12), 25323\u201325334.","journal-title":"IEEE T-ITS"},{"issue":"5","key":"2500_CR79","doi-asserted-by":"publisher","first-page":"1234","DOI":"10.1007\/s11263-023-01753-6","volume":"131","author":"Xiawu Zheng","year":"2023","unstructured":"Zheng, Xiawu, Yang, Chenyi, Zhang, Shaokun, Wang, Yan, Zhang, Baochang, Yongjian, Wu., Yunsheng, Wu., Shao, Ling, & Ji, Rongrong. (2023). Ddpnas: Efficient neural architecture search via dynamic distribution pruning. IJCV, 131(5), 1234\u20131249.","journal-title":"IJCV"},{"key":"2500_CR80","unstructured":"Zhu, Jiawen, Chen, Xin, Diao, Haiwen, Li, Shuai, He, Jun-Yan, Li, Chenyang, Luo, Bin, Wang, Dong, & Lu, Huchuan. (2024). Exploring dynamic transformer for efficient object tracking. 
arXiv preprint arXiv:2403.17651,"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02500-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-025-02500-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02500-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T08:48:19Z","timestamp":1760086099000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-025-02500-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,23]]},"references-count":80,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2025,10]]}},"alternative-id":["2500"],"URL":"https:\/\/doi.org\/10.1007\/s11263-025-02500-9","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"type":"print","value":"0920-5691"},{"type":"electronic","value":"1573-1405"}],"subject":[],"published":{"date-parts":[[2025,6,23]]},"assertion":[{"value":"27 December 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 June 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 June 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}