{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T13:57:43Z","timestamp":1770127063793,"version":"3.49.0"},"reference-count":66,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T00:00:00Z","timestamp":1769472000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T00:00:00Z","timestamp":1769472000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100007241","name":"Universit\u00e9 Paris-Saclay","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100007241","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["SN COMPUT. SCI."],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks integrate also early-exits to remove high-confident supertokens, lowering computational load. 
We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to\n                    <jats:inline-formula>\n                      <jats:tex-math>$$1024\\times1024$$<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    , and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard\n                    <jats:inline-formula>\n                      <jats:tex-math>$$16\\times$$<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    pixel patching scheme. This yields a\n                    <jats:inline-formula>\n                      <jats:tex-math>$$2.6\\times $$<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    reduction in computational cost and a\n                    <jats:inline-formula>\n                      <jats:tex-math>$$3.4\\times $$<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a\n                    <jats:inline-formula>\n                      <jats:tex-math>$$4\\times $$<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    reduction in computational complexity and a\n                    <jats:inline-formula>\n                      <jats:tex-math>$$1.7\\times $$<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    gain in inference speed, with a maximum accuracy drop of no more than 2.0%. 
With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.\n                  <\/jats:p>","DOI":"10.1007\/s42979-025-04707-6","type":"journal-article","created":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T12:07:42Z","timestamp":1769515662000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions"],"prefix":"10.1007","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-9061-4396","authenticated-orcid":false,"given":"Michal","family":"Szczepanski","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5102-7735","authenticated-orcid":false,"given":"Martyna","family":"Poreba","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6972-6019","authenticated-orcid":false,"given":"Karim","family":"Haroun","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2026,1,27]]},"reference":[{"key":"4707_CR1","doi-asserted-by":"crossref","unstructured":"Wang W, Xie E, Li X, Fan D-P, Song K, Liang D, Lu T, Luo P, Shao L. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE\/CVF international conference on computer vision. 2021. pp. 568\u201378.","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"4707_CR2","doi-asserted-by":"crossref","unstructured":"Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, Fu Y, Feng J, Xiang T, Torr PH, et\u00a0al. 
Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 2021. pp. 6881\u201390.","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"4707_CR3","doi-asserted-by":"crossref","unstructured":"Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE\/CVF international conference on computer vision (ICCV). 2021. pp. 10012\u201322.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"4707_CR4","doi-asserted-by":"crossref","unstructured":"Strudel R, Garcia R, Laptev I, Schmid C. Segmenter: transformer for semantic segmentation. In: Proceedings of the IEEE\/CVF international conference on computer vision. 2021. pp. 7262\u201372.","DOI":"10.1109\/ICCV48922.2021.00717"},{"key":"4707_CR5","first-page":"10326","volume":"34","author":"W Zhang","year":"2021","unstructured":"Zhang W, Pang J, Chen K, Loy CC. K-net: towards unified image segmentation. Adv Neural Inf Process Syst. 2021;34:10326\u201338.","journal-title":"Adv Neural Inf Process Syst"},{"key":"4707_CR6","first-page":"12077","volume":"34","author":"E Xie","year":"2021","unstructured":"Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P. Segformer: simple and efficient design for semantic segmentation with transformers. Adv Neural Inf Process Syst. 2021;34:12077\u201390.","journal-title":"Adv Neural Inf Process Syst"},{"key":"4707_CR7","unstructured":"Zhang B, Tian Z, Tang Q, Chu X, Wei X, Shen C, Liu Y. Segvit: semantic segmentation with plain vision transformers. NeurIPS. 2022."},{"key":"4707_CR8","doi-asserted-by":"crossref","unstructured":"Kerssies T, Cavagnero N, Hermans A, Norouzi, N, Averta, G, Leibe, B, Dubbelman, G, Geus, D. Your vit is secretly an image segmentation model. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR). 
2025.","DOI":"10.1109\/CVPR52734.2025.02356"},{"key":"4707_CR9","doi-asserted-by":"crossref","unstructured":"Yoo J, Ko D, Kim G. Ccaseg: decoding multi-scale context with convolutional cross-attention for semantic segmentation. In: Proceedings of the winter conference on applications of computer vision (WACV). 2025. pp. 9461\u201370.","DOI":"10.1109\/WACV61041.2025.00918"},{"key":"4707_CR10","doi-asserted-by":"crossref","unstructured":"Yeom S, Klitzing J. U-mixformer: unet-like transformer with mix-attention for efficient semantic segmentation. In: Proceedings of the winter conference on applications of computer vision (WACV). 2025. pp. 7710\u20139.","DOI":"10.1109\/WACV61041.2025.00750"},{"key":"4707_CR11","doi-asserted-by":"crossref","unstructured":"Hu X, Jiang L, Schiele B. Training vision transformers for semi-supervised semantic segmentation. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR). 2024. pp. 4007\u201317.","DOI":"10.1109\/CVPR52733.2024.00384"},{"key":"4707_CR12","doi-asserted-by":"crossref","unstructured":"Lin Y, Zhang T, Sun P, Li Z, Zhou S. Fq-vit: post-training quantization for fully quantized vision transformer. In: Proceedings of the thirty-first international joint conference on artificial intelligence, IJCAI-22. 2022. pp. 1173\u20139.","DOI":"10.24963\/ijcai.2022\/164"},{"key":"4707_CR13","doi-asserted-by":"publisher","unstructured":"Yuan Z, Xue C, Chen Y, Wu Q, Sun G. Ptq4vit: post-training quantization for vision transformers with twin uniform quantization. In: Computer vision\u2014ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23\u201327, 2022, Proceedings, Part XII. Berlin: Springer; 2022. pp. 191\u2013207. https:\/\/doi.org\/10.1007\/978-3-031-19775-8_12.","DOI":"10.1007\/978-3-031-19775-8_12"},{"key":"4707_CR14","doi-asserted-by":"crossref","unstructured":"Li Z, Gu Q. I-vit: integer-only quantization for efficient vision transformer inference. 
In: Proceedings of the IEEE\/CVF international conference on computer vision. 2023. pp. 17065\u201375.","DOI":"10.1109\/ICCV51070.2023.01565"},{"key":"4707_CR15","unstructured":"Huang X, Shen Z, Dong P, Cheng K-T. Quantization variation: a new perspective on training transformers with low-bit precision. Trans Mach Learn Res. 2024."},{"key":"4707_CR16","doi-asserted-by":"publisher","unstructured":"Shang Y, Liu G, Kompella R, Yan Y. Quantized-vit efficient training via fisher matrix regularization. In: MultiMedia modeling: 31st international conference on multimedia modeling, MMM 2025, Nara, Japan, January 8\u201310, 2025, Proceedings, Part III. Berlin: Springer; 2025. pp. 270\u201384. https:\/\/doi.org\/10.1007\/978-981-96-2064-7_20.","DOI":"10.1007\/978-981-96-2064-7_20"},{"key":"4707_CR17","doi-asserted-by":"crossref","unstructured":"Wu, K, Zhang, J, Peng, H, Liu, M, Xiao, B, Fu, J, Yuan, L. Tinyvit: fast pretraining distillation for small vision transformers. In: European conference on computer vision (ECCV). 2022.","DOI":"10.1007\/978-3-031-19803-8_5"},{"key":"4707_CR18","doi-asserted-by":"publisher","unstructured":"Yang, Z, Li, Z, Zeng, A, Li, Z, Yuan, C, Li, Y. ViTKD: feature-based knowledge distillation for vision transformers. In: 2024 IEEE\/CVF conference on computer vision and pattern recognition workshops (CVPRW). IEEE Computer Society, Los Alamitos, CA, USA. 2024. pp. 1379\u201388. https:\/\/doi.org\/10.1109\/CVPRW63382.2024.00145.","DOI":"10.1109\/CVPRW63382.2024.00145"},{"key":"4707_CR19","doi-asserted-by":"publisher","unstructured":"Proust M, Poreba M, Szczepanski M, Haroun K. Step: supertoken and early-pruning for efficient semantic segmentation. In: VISIGRAPP 2025-20th international joint conference on computer vision, imaging and computer graphics theory and applications. 2025. pp. 56\u201361. https:\/\/doi.org\/10.5220\/0013132800003912. 
https:\/\/www.scitepress.org\/Papers\/2025\/131328\/131328.pdf.","DOI":"10.5220\/0013132800003912"},{"key":"4707_CR20","doi-asserted-by":"crossref","unstructured":"Lu C, de Geus D, Dubbelman G. Content-aware token sharing for efficient semantic segmentation with vision transformers. In: IEEE\/CVF conference on computer vision and pattern recognition (CVPR). 2023.","DOI":"10.1109\/CVPR52729.2023.02263"},{"key":"4707_CR21","doi-asserted-by":"crossref","unstructured":"Havtorn JD, Royer A, Blankevoort T, Bejnordi BE. MSViT: dynamic mixed-scale tokenization for vision transformers. In: Proceedings of the IEEE\/CVF international conference on computer vision. 2023. pp. 838\u201348.","DOI":"10.1109\/ICCVW60793.2023.00091"},{"key":"4707_CR22","doi-asserted-by":"crossref","unstructured":"Chen M, Lin M, Li K, Shen Y, Wu Y, Chao F, Ji R. Cf-vit: a general coarse-to-fine method for vision transformer. In: Proceedings of the AAAI conference on artificial intelligence, vol. 37.","DOI":"10.1609\/aaai.v37i6.25860"},{"key":"4707_CR23","doi-asserted-by":"crossref","unstructured":"Ronen T, Levy O, Golbert A. Vision transformers with mixed-resolution tokenization. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. pp. 4613\u2013622.","DOI":"10.1109\/CVPRW59228.2023.00486"},{"key":"4707_CR24","doi-asserted-by":"crossref","unstructured":"Mahmud T, Yaman B, Liu C-H, Marculescu D. PaPr: training-free one-step patch pruning with lightweight ConvNets for faster inference. 2024. https:\/\/arxiv.org\/abs\/2403.16020.","DOI":"10.1007\/978-3-031-73337-6_7"},{"key":"4707_CR25","unstructured":"Rao Y, Zhao W, Liu B, Lu J, Zhou J, Hsieh C-J. Dynamicvit: efficient vision transformers with dynamic token sparsification. In: Advances in neural information processing systems (NeurIPS). 
2021."},{"key":"4707_CR26","doi-asserted-by":"crossref","unstructured":"Fayyaz M, Abbasi\u00a0Kouhpayegani S, Rezaei\u00a0Jafari F, Sommerlade E, Vaezi\u00a0Joze HR, Pirsiavash H, Gall J. Adaptive token sampling for efficient vision transformers. In: European conference on computer vision (ECCV). 2022.","DOI":"10.1007\/978-3-031-20083-0_24"},{"key":"4707_CR27","doi-asserted-by":"publisher","unstructured":"Kim S, Shen S, Thorsley D, Gholami A, Kwon W, Hassoun J, Keutzer K. Learned token pruning for transformers. In: Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining. KDD \u201922. Association for Computing Machinery, New York, NY, USA. 2022. pp. 784\u201394. https:\/\/doi.org\/10.1145\/3534678.3539260.","DOI":"10.1145\/3534678.3539260"},{"key":"4707_CR28","doi-asserted-by":"crossref","unstructured":"Kong Z, Dong P, Ma X, Meng X, Niu W, Sun M, Shen X, Yuan G, Ren B, Tang H, et\u00a0al. Spvit: enabling faster vision transformers via latency-aware soft token pruning. In: Computer Vision\u2013ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23\u201327, 2022, Proceedings, Part XI. Springer; 2022. pp. 620\u201340.","DOI":"10.1007\/978-3-031-20083-0_37"},{"key":"4707_CR29","unstructured":"Liang Y, Ge C, Tong Z, Song Y, Wang J, Xie P. Not all patches are what you need: expediting vision transformers via token reorganizations. In: International conference on learning representations. 2022. https:\/\/openreview.net\/forum?id=BjyvwnXXVn_."},{"key":"4707_CR30","doi-asserted-by":"publisher","unstructured":"Meng L, Li H, Chen B-C, Lan S, Wu Z, Jiang Y-G, Lim S-N. AdaViT: adaptive vision transformers for efficient image recognition. In: 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA. 2022. pp. 12299\u2013308. 
https:\/\/doi.org\/10.1109\/CVPR52688.2022.01199.","DOI":"10.1109\/CVPR52688.2022.01199"},{"key":"4707_CR31","doi-asserted-by":"publisher","unstructured":"Song Z, Xu Y, He Z, Jiang L, Jing N, Liang X. CP-ViT: cascade vision transformer pruning via progressive sparsity prediction. https:\/\/doi.org\/10.48550\/arXiv.2203.04570.","DOI":"10.48550\/arXiv.2203.04570"},{"key":"4707_CR32","doi-asserted-by":"publisher","unstructured":"Marin D, Chang J-HR, Ranjan A, Prabhu A, Rastegari M, Tuzel O. Token pooling in vision transformers for image classification. In: 2023 IEEE\/CVF winter conference on applications of computer vision (WACV). 2023. pp. 12\u201321. https:\/\/doi.org\/10.1109\/WACV56688.2023.00010.","DOI":"10.1109\/WACV56688.2023.00010"},{"key":"4707_CR33","unstructured":"Bolya D, Fu C-Y, Dai X, Zhang P, Feichtenhofer C, Hoffman J. Token merging: your ViT but faster. In: International conference on learning representations. 2023."},{"key":"4707_CR34","doi-asserted-by":"publisher","unstructured":"Tang Q, Zhang B, Liu J, Liu F, Liu Y. Dynamic token pruning in plain vision transformers for semantic segmentation. In: 2023 IEEE\/CVF international conference on computer vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA. 2023. pp. 777\u201386. https:\/\/doi.org\/10.1109\/ICCV51070.2023.00078.","DOI":"10.1109\/ICCV51070.2023.00078"},{"key":"4707_CR35","unstructured":"Wu X, Zeng F, Wang X, Chen X. Ppt: token pruning and pooling for efficient vision transformers. 2023. arXiv:2310.01812."},{"key":"4707_CR36","doi-asserted-by":"publisher","unstructured":"Liu X , Wu T, Guo G. Adaptive sparse vit: towards learnable adaptive token pruning by fully exploiting self-attention. 2023. pp. 1222\u201330. 
https:\/\/doi.org\/10.24963\/ijcai.2023\/136.","DOI":"10.24963\/ijcai.2023\/136"},{"key":"4707_CR37","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2025.127449","volume":"279","author":"M Marchetti","year":"2025","unstructured":"Marchetti M, Traini D, Ursino D, Virgili L. Efficient token pruning in vision transformers using an attention-based multilayer network. Expert Syst Appl. 2025;279:127449. https:\/\/doi.org\/10.1016\/j.eswa.2025.127449.","journal-title":"Expert Syst Appl"},{"key":"4707_CR38","doi-asserted-by":"publisher","unstructured":"Wang H, Dedhia B, Jha NK. Zero-tprune: zero-shot token pruning through leveraging of the attention graph in pre-trained transformers. 2024. pp. 16070\u201379. https:\/\/doi.org\/10.1109\/CVPR52733.2024.01521.","DOI":"10.1109\/CVPR52733.2024.01521"},{"key":"4707_CR39","doi-asserted-by":"crossref","unstructured":"Xu Y, Zhang Z, Zhang M, Sheng K, Li K, Dong W, Zhang L, Xu C, Sun X. Evo-vit: slow-fast token evolution for dynamic vision transformer. In: Proceedings of the AAAI conference on artificial intelligence, vol. 36. 2022. pp. 2964\u201372.","DOI":"10.1609\/aaai.v36i3.20202"},{"key":"4707_CR40","unstructured":"Courdier E, Sivaprasad PT, Fleuret F. PAUMER: patch pausing transformer for semantic segmentation. 2023. https:\/\/arxiv.org\/abs\/2311.00586."},{"key":"4707_CR41","doi-asserted-by":"publisher","unstructured":"Liu Y, Zhou Q, Wang J, Wang Z, Wang F, Wang J, Zhang W. Dynamic token-pass transformers for semantic segmentation. In: Proceedings of the IEEE\/CVF winter conference on applications of computer vision (WACV). 2024. pp. 1816\u201325. https:\/\/doi.org\/10.1109\/WACV57701.2024.00184.","DOI":"10.1109\/WACV57701.2024.00184"},{"key":"4707_CR42","doi-asserted-by":"publisher","unstructured":"Liu Y, Gehrig M, Messikommer N, Cannici M, Scaramuzza D. Revisiting token pruning for object detection and instance segmentation. In: 2024 IEEE\/CVF winter conference on applications of computer vision (WACV). 
IEEE Computer Society, Los Alamitos, CA, USA. 2024. pp. 2646\u201356. https:\/\/doi.org\/10.1109\/WACV57701.2024.00264.","DOI":"10.1109\/WACV57701.2024.00264"},{"key":"4707_CR43","doi-asserted-by":"crossref","unstructured":"Yin H, Vahdat A, Alvarez JM, Mallya A, Kautz J, Molchanov P. A-ViT: adaptive tokens for efficient vision transformer. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR). 2022. pp. 10809\u201318.","DOI":"10.1109\/CVPR52688.2022.01054"},{"key":"4707_CR44","doi-asserted-by":"crossref","unstructured":"Zeng W, Jin S, Liu W, Qian C, Luo P, Ouyang W, Wang X. Not all tokens are equal: human-centric visual analysis via token clustering transformer. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 2022. pp. 11101\u201311.","DOI":"10.1109\/CVPR52688.2022.01082"},{"key":"4707_CR45","doi-asserted-by":"crossref","unstructured":"Zeng W, Jin S, Xu L, Liu W, Qian C, Ouyang W, Luo P, Wang X. Tcformer: visual recognition via token clustering transformer. IEEE Trans Pattern Anal Mach Intell. 2024.","DOI":"10.1109\/TPAMI.2024.3425768"},{"key":"4707_CR46","unstructured":"Li J, Wang Y, Zhang X, Shi B, Jiang D, Li C, Dai W, Xiong H, Tian Q. Ailurus: a scalable vit framework for dense prediction. In: Oh A, Naumann T , Globerson A, Saenko K, Hardt M, Levine S, editors. Advances in neural information processing systems, vol. 36. Curran Associates, Inc. 2023. pp. 30979\u201396. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2023\/file\/62c9aa4d48329a85d1e36d5b6d0a6a32-Paper-Conference.pdf."},{"key":"4707_CR47","doi-asserted-by":"crossref","unstructured":"Marin D, Chang J-HR, Ranjan A, Prabhu A, Rastegari M, Tuzel O. Token pooling in vision transformers for image classification. In: Proceedings of the IEEE\/CVF winter conference on applications of computer vision. 2023. pp. 
12\u201321.","DOI":"10.1109\/WACV56688.2023.00010"},{"key":"4707_CR48","doi-asserted-by":"crossref","unstructured":"Norouzi N, Orlova S, De\u00a0Geus D, Dubbelman G. Algm: adaptive local-then-global token merging for efficient semantic segmentation with plain vision transformers. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 2024. pp. 15773\u201382.","DOI":"10.1109\/CVPR52733.2024.01493"},{"key":"4707_CR49","doi-asserted-by":"crossref","unstructured":"Haroun K, Martinet J, Chehida KB, Allenet T. Leveraging local similarity for token merging in vision transformers. In: ICONIP 2024-31th international conference on neural information processing. 2024.","DOI":"10.1007\/978-981-96-6688-1_20"},{"key":"4707_CR50","doi-asserted-by":"crossref","unstructured":"Haroun K, Allenet T, Chehida KB, Martinet J. Dynamic hierarchical token merging for vision transformers. In: VISAPP-2025-20th international joint conference on computer vision, imaging and computer graphics theory and applications. 2025.","DOI":"10.5220\/0013284100003912"},{"key":"4707_CR51","doi-asserted-by":"crossref","unstructured":"Lee DH, Hong S. Learning to merge tokens via decoupled embedding for efficient vision transformers. In: Conference on neural information processing systems. 2024.","DOI":"10.52202\/079017-1713"},{"key":"4707_CR52","unstructured":"Bonnaerens M, Dambre J. Learned thresholds token merging and pruning for vision transformers. Trans Mach Learn Res. 2023."},{"key":"4707_CR53","doi-asserted-by":"crossref","unstructured":"Kim M, Gao S, Hsu Y-C, Shen Y, Jin H. Token fusion: bridging the gap between token pruning and token merging. In: Proceedings of the IEEE\/CVF winter conference on applications of computer vision. 2024. pp. 1383\u201392.","DOI":"10.1109\/WACV57701.2024.00141"},{"key":"4707_CR54","unstructured":"Wu X, Zeng F, Wang X, Chen X. PPT: token pruning and pooling for efficient vision transformers. 2024. 
https:\/\/arxiv.org\/abs\/2310.01812."},{"key":"4707_CR55","doi-asserted-by":"crossref","unstructured":"Chen M, Shao W, Xu P, Lin M, Zhang K, Chao F, Ji R, Qiao Y, Luo P. Diffrate: differentiable compression rate for efficient vision transformers. In: Proceedings of the IEEE\/CVF international conference on computer vision. 2023. pp. 17164\u201374.","DOI":"10.1109\/ICCV51070.2023.01574"},{"key":"4707_CR56","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2024.128747","volume":"612","author":"D Chen","year":"2025","unstructured":"Chen D, Lin K, Deng Q. Ucc: a unified cascade compression framework for vision transformer models. Neurocomputing. 2025;612:128747. https:\/\/doi.org\/10.1016\/j.neucom.2024.128747.","journal-title":"Neurocomputing"},{"key":"4707_CR57","doi-asserted-by":"publisher","unstructured":"Mao J, Shen Y, Guo J, Yao Y, Hua X, Shen H. Prune and merge: efficient token compression for vision transformer with spatial information preserved. IEEE Trans Multim. 2025. https:\/\/doi.org\/10.1109\/TMM.2025.3535405.","DOI":"10.1109\/TMM.2025.3535405"},{"key":"4707_CR58","unstructured":"Huang H, Zhou X, Cao J, He R, Tan T. Vision transformer with super token sampling. 2022. arXiv:2211.11167."},{"key":"4707_CR59","doi-asserted-by":"crossref","unstructured":"Zeng W, Jin S, Liu W, Qian C, Luo P, Ouyang W, Wang X. Not all tokens are equal: human-centric visual analysis via token clustering transformer. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 2022. pp. 11101\u201311.","DOI":"10.1109\/CVPR52688.2022.01082"},{"issue":"9","key":"4707_CR60","doi-asserted-by":"publisher","first-page":"10883","DOI":"10.1109\/TPAMI.2023.3263826","volume":"45","author":"Y Rao","year":"2023","unstructured":"Rao Y, Liu Z, Zhao W, Zhou J, Lu J. Dynamic spatial sparsification for efficient vision transformers and convolutional neural networks. IEEE Trans Pattern Anal Mach Intell. 2023;45(9):10883\u201397. 
https:\/\/doi.org\/10.1109\/TPAMI.2023.3263826.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"4707_CR61","unstructured":"Tan M, Le QV. Efficientnet: rethinking model scaling for convolutional neural networks. 2019. arXiv: abs\/1905.11946."},{"issue":"3","key":"4707_CR62","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"O Russakovsky","year":"2015","unstructured":"Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis. 2015;115(3):211\u201352. https:\/\/doi.org\/10.1007\/s11263-015-0816-y.","journal-title":"Int J Comput Vis"},{"key":"4707_CR63","unstructured":"MMSegmentation Contributors: MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. 2020. https:\/\/github.com\/open-mmlab\/mmsegmentation."},{"key":"4707_CR64","doi-asserted-by":"publisher","unstructured":"Caesar H, Uijlings J, Ferrari V. Coco-stuff: thing and stuff classes in context. In: 2018 IEEE\/CVF conference on computer vision and pattern recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA. 2018. pp. 1209\u201318. https:\/\/doi.org\/10.1109\/CVPR.2018.00132.","DOI":"10.1109\/CVPR.2018.00132"},{"key":"4707_CR65","doi-asserted-by":"crossref","unstructured":"Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, Torralba A. Scene parsing through ade20k dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.","DOI":"10.1109\/CVPR.2017.544"},{"key":"4707_CR66","doi-asserted-by":"crossref","unstructured":"Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 
2016.","DOI":"10.1109\/CVPR.2016.350"}],"container-title":["SN Computer Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42979-025-04707-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s42979-025-04707-6","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42979-025-04707-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T12:07:51Z","timestamp":1769515671000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s42979-025-04707-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,27]]},"references-count":66,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2026,2]]}},"alternative-id":["4707"],"URL":"https:\/\/doi.org\/10.1007\/s42979-025-04707-6","relation":{},"ISSN":["2661-8907"],"issn-type":[{"value":"2661-8907","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,27]]},"assertion":[{"value":"20 September 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 December 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 January 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"On behalf of all authors, the corresponding author states that there is no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"137"}}