{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,2]],"date-time":"2026-06-02T09:44:08Z","timestamp":1780393448928,"version":"3.54.1"},"reference-count":69,"publisher":"Springer Science and Business Media LLC","issue":"9","license":[{"start":{"date-parts":[[2025,5,26]],"date-time":"2025-05-26T00:00:00Z","timestamp":1748217600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,26]],"date-time":"2025-05-26T00:00:00Z","timestamp":1748217600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Northeastern University USA"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2025,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Despite their impressive performance on various tasks, vision transformers (ViTs) are heavy for mobile vision applications. Recent works have proposed combining the strengths of ViTs and convolutional neural networks (CNNs) to build lightweight networks. Still, these approaches rely on hand-designed architectures with a pre-determined number of parameters. In this work, we address the challenge of finding optimal light-weight ViTs given constraints on model size and computational cost using neural architecture search. We use a search algorithm that considers both model parameters and on-device deployment latency. This method analyzes network properties, hardware memory access pattern, and degree of parallelism to directly and accurately estimate the network latency. 
To prevent the need for extensive testing during the search process, we use a lookup table based on a detailed breakdown of the speed of each component and operation, which can be reused to evaluate the whole latency of each search structure. Our approach leads to improved efficiency compared to testing the speed of the whole model during the search process. Extensive experiments demonstrate that, under similar parameters and FLOPs, our searched lightweight ViTs achieve higher accuracy and lower latency than state-of-the-art models. For instance, on ImageNet-1K, AutoViT_XXS (71.3% Top-1 accuracy, 10.2ms latency) outperforms MobileViTv3_XXS (71.0% Top-1 accuracy, 12.5ms latency) with 0.3% higher accuracy and 2.3ms lower latency.<\/jats:p>","DOI":"10.1007\/s11263-025-02480-w","type":"journal-article","created":{"date-parts":[[2025,5,26]],"date-time":"2025-05-26T08:56:43Z","timestamp":1748249803000},"page":"6170-6186","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["AutoViT: Achieving Real-Time Vision Transformers on Mobile via Latency-aware Coarse-to-Fine 
Search"],"prefix":"10.1007","volume":"133","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8120-4456","authenticated-orcid":false,"given":"Zhenglun","family":"Kong","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dongkuan","family":"Xu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhengang","family":"Li","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Peiyan","family":"Dong","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Hao","family":"Tang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yanzhi","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Subhabrata","family":"Mukherjee","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,5,26]]},"reference":[{"key":"2480_CR1","unstructured":"Anonymous. (2022). NASVit: Neural architecture search for efficient vision transformers with gradient conflict aware supernet training. In  Submitted to The Tenth International Conference on Learning Representations, under review."},{"key":"2480_CR2","doi-asserted-by":"crossref","unstructured":"Bender, Gabriel, Liu, Hanxiao, Chen, Bo, Chu, Grace, Cheng, Shuyang, Kindermans, Pieter-Jan, Le, Quoc\u00a0V. (2020). Can weight sharing outperform random architecture search? an investigation with tunas. In  Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pages 14323\u201314332","DOI":"10.1109\/CVPR42600.2020.01433"},{"key":"2480_CR3","unstructured":"Cai, Han, Gan, Chuang, Wang, Tianzhe, Zhang, Zhekai, & Han, Song. (2019). Once-for-all: Train one network and specialize it for efficient deployment. 
arXiv preprint arXiv:1908.09791"},{"key":"2480_CR4","unstructured":"Cai, Han, Zhu, Ligeng, & Han, Song. (2018). Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332"},{"key":"2480_CR5","doi-asserted-by":"crossref","unstructured":"Chen, Boyu, Li, Peixia, Li, Chuming, Li, Baopu, Bai, Lei, Lin, Chen, Sun, Ming, Yan, Junjie, & Ouyang, Wanli. (2021). Glit: Neural architecture search for global and local image transformer. In  Proceedings of the IEEE\/CVF International Conference on Computer Vision, pages 12\u201321","DOI":"10.1109\/ICCV48922.2021.00008"},{"key":"2480_CR6","doi-asserted-by":"crossref","unstructured":"Chen, Chun-Fu, Fan, Quanfu, & Panda, Rameswar. (2021). Crossvit: Cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899","DOI":"10.1109\/ICCV48922.2021.00041"},{"key":"2480_CR7","doi-asserted-by":"crossref","unstructured":"Chen, Hanting, Wang, Yunhe, Guo, Tianyu, Xu, Chang, Deng, Yiping, Liu, Zhenhua, Ma, Siwei, Xu, Chunjing, Xu, Chao, & Gao, Wen. (2021). Pre-trained image processing transformer. In  Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12299\u201312310","DOI":"10.1109\/CVPR46437.2021.01212"},{"key":"2480_CR8","doi-asserted-by":"crossref","unstructured":"Chen, Minghao, Peng, Houwen, Fu, Jianlong, & Ling, Haibin. (2021). Autoformer: Searching transformers for visual recognition. In  Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), pages 12270\u201312280","DOI":"10.1109\/ICCV48922.2021.01205"},{"key":"2480_CR9","unstructured":"Chen, Minghao, Wu, Kan, Ni, Bolin, Peng, Houwen, Liu, Bei, Fu, Jianlong, Chao, Hongyang, & Ling, Haibin. (2021). Searching the search space of vision transformer.  Advances in Neural Information Processing Systems, 34"},{"key":"2480_CR10","unstructured":"Chen, Tianlong, Cheng, Yu, Gan, Zhe, Yuan, Lu, Zhang, Lei, & Wang, Zhangyang. (2021). 
Chasing sparsity in vision transformers: An end-to-end exploration. In  Advances in Neural Information Processing Systems"},{"key":"2480_CR11","unstructured":"Cheng, Bowen, Schwing, Alexander\u00a0G., & Kirillov, Alexander. (2021). Per-pixel classification is not all you need for semantic segmentation. arXiv preprint arXiv:2107.06278"},{"key":"2480_CR12","unstructured":"Chu, Xiangxiang, Tian, Zhi, Zhang, Bo, Wang, Xinlong, Wei, Xiaolin, Xia, Huaxia, & Shen, Chunhua. (2021). Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882"},{"key":"2480_CR13","doi-asserted-by":"crossref","unstructured":"Cubuk, Ekin\u00a0D., Zoph, Barret, Shlens, Jonathon, & Le, Quoc\u00a0V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In  Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702\u2013703","DOI":"10.1109\/CVPRW50498.2020.00359"},{"key":"2480_CR14","doi-asserted-by":"crossref","unstructured":"Dai, Zhigang, Cai, Bolun, Lin, Yugeng, & Chen, Junying. (2021). Up-detr: Unsupervised pre-training for object detection with transformers. In  Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1601\u20131610","DOI":"10.1109\/CVPR46437.2021.00165"},{"key":"2480_CR15","unstructured":"Dai, Zihang, Liu, Hanxiao, Le, Quoc\u00a0V., & Tan, Mingxing. (2021). Coatnet: Marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803"},{"key":"2480_CR16","doi-asserted-by":"crossref","unstructured":"d\u2019Ascoli, St\u00e9phane, Touvron, Hugo, Leavitt, Matthew, Morcos, Ari, Biroli, Giulio, & Sagun, Levent. (2021). Convit: Improving vision transformers with soft convolutional inductive biases. arXiv preprint arXiv:2103.10697","DOI":"10.1088\/1742-5468\/ac9830"},{"key":"2480_CR17","doi-asserted-by":"crossref","unstructured":"Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, & Fei-Fei, Li. (2009). 
Imagenet: A large-scale hierarchical image database. In  Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248\u2013255. Ieee","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"2480_CR18","unstructured":"Dong, P., Kong, Z., Meng, X., Yu, P., Gong, Y., Yuan, G., Tang, H., & Wang, Y. (2023a). HotBEV: Hardware-oriented transformer-based multi-view 3D detector for BEV perception. Advances in Neural Information Processing Systems, 36, 2824\u20132836."},{"key":"2480_CR19","unstructured":"Dong, P., Kong, Z., Meng, X., Zhang, P., Tang, H., Wang, Y., & Chou, C. H. (2023b). SpeedDETR: Speed-aware transformers for end-to-end object detection. In Proceedings of the 40th International Conference on Machine Learning. JMLR.org."},{"key":"2480_CR20","unstructured":"Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, Uszkoreit, Jakob, & Houlsby, Neil. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In  International Conference on Learning Representations (ICLR)"},{"key":"2480_CR21","doi-asserted-by":"crossref","unstructured":"Graham, Benjamin, El-Nouby, Alaaeldin, Touvron, Hugo, Stock, Pierre, Joulin, Armand, Jegou, Herve, & Douze, Matthijs. (2021). Levit: A vision transformer in convnet\u2019s clothing for faster inference. In  Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), pages 12259\u201312269, October","DOI":"10.1109\/ICCV48922.2021.01204"},{"key":"2480_CR22","doi-asserted-by":"crossref","unstructured":"Guo, Zichao, Zhang, Xiangyu, Mu, Haoyuan, Heng, Wen, Liu, Zechun, Wei, Yichen, & Sun, Jian. (2020). Single path one-shot neural architecture search with uniform sampling. In  European Conference on Computer Vision, pages 544\u2013560. 
Springer","DOI":"10.1007\/978-3-030-58517-4_32"},{"key":"2480_CR23","doi-asserted-by":"crossref","unstructured":"He, Kaiming, Gkioxari, Georgia, Doll\u00e1r, Piotr, & Girshick, Ross. (2017). Mask r-cnn. In  Proceedings of the IEEE international conference on computer vision, pages 2961\u20132969","DOI":"10.1109\/ICCV.2017.322"},{"key":"2480_CR24","doi-asserted-by":"crossref","unstructured":"Heo, Byeongho, Yun, Sangdoo, Han, Dongyoon, Chun, Sanghyuk, Choe, Junsuk, & Oh, Seong\u00a0Joon. (2021). Rethinking spatial dimensions of vision transformers. In  International Conference on Computer Vision (ICCV)","DOI":"10.1109\/ICCV48922.2021.01172"},{"key":"2480_CR25","unstructured":"Hinton, Geoffrey, Vinyals, Oriol, & Dean, Jeff. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531"},{"key":"2480_CR26","unstructured":"Howard, Andrew\u00a0G., Zhu, Menglong, Chen, Bo, Kalenichenko, Dmitry, Wang, Weijun, Weyand, Tobias, Andreetto, Marco, & Adam, Hartwig. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861"},{"key":"2480_CR27","unstructured":"Hu, Shoukang, Wang, Ruochen, Hong, Lanqing, Li, Zhenguo, Hsieh, Cho-Jui, & Feng, Jiashi. (2022). Generalizing few-shot nas with gradient matching. arXiv preprint arXiv:2203.15207"},{"key":"2480_CR28","doi-asserted-by":"crossref","unstructured":"Huang, Gao, Liu, Zhuang, Der\u00a0Maaten, Laurens Van, & Weinberger, Kilian\u00a0Q. (2017). Densely connected convolutional networks. In  Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700\u20134708","DOI":"10.1109\/CVPR.2017.243"},{"key":"2480_CR29","doi-asserted-by":"crossref","unstructured":"Krause, Jonathan, Stark, Michael, Deng, Jia, & Fei-Fei, Li. (2013). 3d object representations for fine-grained categorization. 
In  Proceedings of the IEEE international conference on computer vision workshops, pages 554\u2013561","DOI":"10.1109\/ICCVW.2013.77"},{"key":"2480_CR30","unstructured":"Krizhevsky, Alex, Hinton, Geoffrey, et\u00a0al. (2009). Learning multiple layers of features from tiny images."},{"key":"2480_CR31","first-page":"12934","volume":"35","author":"Yanyu Li","year":"2022","unstructured":"Li, Yanyu, Yuan, Geng, Wen, Yang, Hu, Ju, Evangelidis, Georgios, Tulyakov, Sergey, Wang, Yanzhi, & Ren, Jian. (2022). Efficientformer: Vision transformers at mobilenet speed. Advances in Neural Information Processing Systems, 35, 12934\u201312949.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2480_CR32","volume-title":"and Luc Van Gool","author":"Yawei Li","year":"2021","unstructured":"Li, Yawei, Zhang, Kai, Cao, Jiezhang, Timofte, Radu, & Van Gool, Luc. (2021). Bringing locality to vision transformers: Localvit."},{"key":"2480_CR33","doi-asserted-by":"crossref","unstructured":"Li, Z., Yuan, G., Niu, W., Zhao, P., Li, Y., Cai, Y., Shen, X., Zhan, Z., Kong, Z., Jin, Q., Chen, Z., Liu, S., Yang, K., Ren, B., Wang, Y., & Lin, X. (2021). Npas: A compiler-aware framework of unified network pruning and architecture search for beyond real-time mobile acceleration. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 14255-14266).","DOI":"10.1109\/CVPR46437.2021.01403"},{"key":"2480_CR34","doi-asserted-by":"crossref","unstructured":"Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Doll\u00e1r, Piotr, & Zitnick, C\u00a0Lawrence. (2014). Microsoft coco: Common objects in context. In  European conference on computer vision, pages 740\u2013755. Springer","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"2480_CR35","doi-asserted-by":"crossref","unstructured":"Liu, Ze, Lin, Yutong, Cao, Yue, Hu, Han, Wei, Yixuan, Zhang, Zheng, Lin, Stephen, & Guo, Baining. (2021). 
Swin transformer: Hierarchical vision transformer using shifted windows. In  Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV)","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"2480_CR36","unstructured":"Lu, Zhisheng, Liu, Hong, Li, Juncheng, & Zhang, Linlin. (2021). Efficient transformer for single image super-resolution. arXiv preprint arXiv:2108.11084"},{"key":"2480_CR37","doi-asserted-by":"crossref","unstructured":"Ma, Ningning, Zhang, Xiangyu, Zheng, Hai-Tao, & Sun, Jian. (2018). Shufflenet v2: Practical guidelines for efficient cnn architecture design. In  Proceedings of the European Conference on Computer Vision (ECCV), September","DOI":"10.1007\/978-3-030-01264-9_8"},{"key":"2480_CR38","unstructured":"Mehta, Sachin, & Rastegari, Mohammad. (2021). Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178"},{"key":"2480_CR39","doi-asserted-by":"crossref","unstructured":"Mehta, Sachin, Rastegari, Mohammad, Shapiro, Linda, & Hajishirzi, Hannaneh. (2019). Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network. In  Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 9190\u20139200","DOI":"10.1109\/CVPR.2019.00941"},{"key":"2480_CR40","doi-asserted-by":"crossref","unstructured":"Misra, Ishan, Girdhar, Rohit, & Joulin, Armand. (2021). An End-to-End Transformer Model for 3D Object Detection. In  Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV)","DOI":"10.1109\/ICCV48922.2021.00290"},{"key":"2480_CR41","doi-asserted-by":"crossref","unstructured":"Nilsback, Maria-Elena, & Zisserman, Andrew. (2008). Automated flower classification over a large number of classes. In  2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722\u2013729. 
IEEE","DOI":"10.1109\/ICVGIP.2008.47"},{"key":"2480_CR42","unstructured":"Pham, Hieu, Guan, Melody, Zoph, Barret, Le, Quoc, & Dean, Jeff. (2018). Efficient neural architecture search via parameters sharing. In  International conference on machine learning, pages 4095\u20134104. PMLR"},{"key":"2480_CR43","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2023.110052","volume":"147","author":"Matt Poyser","year":"2024","unstructured":"Poyser, Matt, & Breckon, Toby P. (2024). Neural architecture search: A contemporary literature review for computer vision applications. Pattern Recognition, 147, Article 110052.","journal-title":"Pattern Recognition"},{"key":"2480_CR44","doi-asserted-by":"crossref","unstructured":"Sandler, Mark, Howard, Andrew, Zhu, Menglong, Zhmoginov, Andrey, & Chen, Liang-Chieh. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In  Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510\u20134520","DOI":"10.1109\/CVPR.2018.00474"},{"key":"2480_CR45","doi-asserted-by":"crossref","unstructured":"Tan, Mingxing, Chen, Bo, Pang, Ruoming, Vasudevan, Vijay, Sandler, Mark, Howard, Andrew, & Le, Quoc\u00a0V. (2019). Mnasnet: Platform-aware neural architecture search for mobile. In  Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 2820\u20132828","DOI":"10.1109\/CVPR.2019.00293"},{"key":"2480_CR46","unstructured":"Tan, Mingxing, & Le, Quoc. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In  International Conference on Machine Learning, pages 6105\u20136114. PMLR"},{"key":"2480_CR47","unstructured":"Touvron, Hugo, Cord, Matthieu, Douze, Matthijs, Massa, Francisco, Sablayrolles, Alexandre, & J\u00e9gou, Herv\u00e9. (2021). Training data-efficient image transformers & distillation through attention. 
In  ICML"},{"key":"2480_CR48","doi-asserted-by":"crossref","unstructured":"Wan, Xingchen, Ru, Binxin, Esparan\u00e7a, Pedro\u00a0M., & Carlucci, Fabio\u00a0Maria. (2022). Approximate neural architecture search via operation distribution learning. In  Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, pages 2377\u20132386","DOI":"10.1109\/WACV51458.2022.00360"},{"key":"2480_CR49","unstructured":"Wang, Dilin, Gong, Chengyue, Li, Meng, Liu, Qiang, & Chandra, Vikas. (2021). Alphanet: Improved training of supernet with alpha-divergence. arXiv preprint arXiv:2102.07954"},{"key":"2480_CR50","doi-asserted-by":"crossref","unstructured":"Wang, Dilin, Li, Meng, Gong, Chengyue, & Chandra, Vikas. (2021). Attentivenas: Improving neural architecture search via attentive sampling. In  Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pages 6418\u20136427","DOI":"10.1109\/CVPR46437.2021.00635"},{"key":"2480_CR51","doi-asserted-by":"crossref","unstructured":"Wang, Hanrui, Wu, Zhanghao, Liu, Zhijian, Cai, Han, Zhu, Ligeng, Gan, Chuang, & Han, Song. (2020). Hat: Hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187","DOI":"10.18653\/v1\/2020.acl-main.686"},{"key":"2480_CR52","doi-asserted-by":"crossref","unstructured":"Wang, Wenhai, Xie, Enze, Li, Xiang, Fan, Deng-Ping, Song, Kaitao, Liang, Ding, Lu, Tong, Luo, Ping, & Shao, Ling. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV)","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"2480_CR53","doi-asserted-by":"crossref","unstructured":"Xia, Xin, Xiao, Xuefeng, Wang, Xing, & Zheng, Min. (2022). Progressive automatic design of search space for one-shot neural architecture search. 
In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, pages 2455\u20132464","DOI":"10.1109\/WACV51458.2022.00358"},{"key":"2480_CR54","unstructured":"Yang, Chenglin, Qiao, Siyuan, Yu, Qihang, Yuan, Xiaoding, Zhu, Yukun, Yuille, Alan, Adam, Hartwig, & Chen, Liang-Chieh. (2022). Moat: Alternating mobile convolution and attention brings strong vision models. arXiv preprint arXiv:2210.01820"},{"key":"2480_CR55","doi-asserted-by":"crossref","unstructured":"Yu, Jiahui, Jin, Pengchong, Liu, Hanxiao, Bender, Gabriel, Kindermans, Pieter-Jan, Tan, Mingxing, Huang, Thomas, Song, Xiaodan, Pang, Ruoming, & Le, Quoc. (2020). Bignas: Scaling up neural architecture search with big single-stage models. In  European Conference on Computer Vision, pages 702\u2013717. Springer","DOI":"10.1007\/978-3-030-58571-6_41"},{"key":"2480_CR56","doi-asserted-by":"crossref","unstructured":"Yu, Weihao, Luo, Mi, Zhou, Pan, Si, Chenyang, Zhou, Yichen, Wang, Xinchao, Feng, Jiashi, & Yan, Shuicheng. (2022). Metaformer is actually what you need for vision. In  Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pages 10819\u201310829","DOI":"10.1109\/CVPR52688.2022.01055"},{"key":"2480_CR57","doi-asserted-by":"crossref","unstructured":"Yuan, Kun, Guo, Shaopeng, Liu, Ziwei, Zhou, Aojun, Yu, Fengwei, & Wu, Wei. (2021). Incorporating convolution designs into visual transformers. In  Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), pages 579\u2013588, October","DOI":"10.1109\/ICCV48922.2021.00062"},{"key":"2480_CR58","doi-asserted-by":"crossref","unstructured":"Yuan, Li, Chen, Yunpeng, Wang, Tao, Yu, Weihao, Shi, Yujun, Jiang, Zi-Hang, Tay, Francis\u00a0E.H., Feng, Jiashi, & Yan, Shuicheng. (2021). Tokens-to-token vit: Training vision transformers from scratch on imagenet. 
In  Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), pages 558\u2013567, October","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"2480_CR59","doi-asserted-by":"crossref","unstructured":"Yun, Sangdoo, Han, Dongyoon, Oh, Seong\u00a0Joon, Chun, Sanghyuk, Choe, Junsuk, & Yoo, Youngjoon. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In  Proceedings of the IEEE\/CVF International Conference on Computer Vision, pages 6023\u20136032","DOI":"10.1109\/ICCV.2019.00612"},{"key":"2480_CR60","doi-asserted-by":"crossref","unstructured":"Zhang, Hongyi, Cisse, Moustapha, Dauphin, Yann\u00a0N., & Lopez-Paz, David. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412","DOI":"10.1007\/978-1-4899-7687-1_79"},{"key":"2480_CR61","unstructured":"Zhao, Yiyang, Wang, Linnan, Tian, Yuandong, Fonseca, Rodrigo, & Guo, Tian. (2021). Few-shot neural architecture search. In  International Conference on Machine Learning, pages 12707\u201312718. PMLR"},{"key":"2480_CR62","doi-asserted-by":"crossref","unstructured":"Zheng, Sixiao, Lu, Jiachen, Zhao, Hengshuang, Zhu, Xiatian, Luo, Zekun, Wang, Yabiao, Fu, Yanwei, Feng, Jianfeng, Xiang, Tao, Torr, Philip\u00a0H.S. et\u00a0al. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In  Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6881\u20136890","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"2480_CR63","doi-asserted-by":"crossref","unstructured":"Xin, Y., Luo, S., Liu, X., Du, Y., Zhou, H., Cheng, X., Lee, C. E., Du, J., Wang, H., Chen, M., Liu, T., Hu, G., Wan, Z., Zhang, R., Li, A., Yi, M., Liu, X. (2024a). V-petl bench: A unified visual parameter-efficient transfer learning benchmark. 
Advances in Neural Information Processing Systems, 37, 80522\u201380535.","DOI":"10.52202\/079017-2560"},{"key":"2480_CR64","doi-asserted-by":"crossref","unstructured":"Xin, Y., Du, J., Wang, Q., Lin, Z., Yan, K. (2024b). Vmt-adapter: Parameter-efficient transfer learning for multi-task dense scene understanding. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 14, pp. 16085\u201316093).","DOI":"10.1609\/aaai.v38i14.29541"},{"key":"2480_CR65","unstructured":"Xin, Y., Luo, S., Zhou, H., Du, J., Liu, X., Fan, Y., Li, Q., & Du, Y. (2024c). Parameter-efficient fine-tuning for pre-trained vision models: A survey. arXiv:2402.02242"},{"key":"2480_CR66","doi-asserted-by":"crossref","unstructured":"Xin, Y., Du, J., Wang, Q., Yan, K., & Ding, S. (2024d). Mmap: Multi-modal alignment prompt for cross-domain multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 14, pp. 16076\u201316084).","DOI":"10.1609\/aaai.v38i14.29540"},{"key":"2480_CR67","doi-asserted-by":"crossref","unstructured":"Xin, Y., Luo, S., Jin, P., Du, Y., & Wang, C. (2023). Self-training with label-feature-consistency for domain adaptation. In International Conference on Database Systems for Advanced Applications (pp. 84\u201399). Springer Nature Switzerland.","DOI":"10.1007\/978-3-031-30678-5_7"},{"key":"2480_CR68","doi-asserted-by":"publisher","first-page":"13001","DOI":"10.1609\/aaai.v34i07.7000","volume":"34","author":"Zhun Zhong","year":"2020","unstructured":"Zhong, Zhun, Zheng, Liang, Kang, Guoliang, Li, Shaozi, & Yang, Yi. (2020). Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, 13001-13008.","journal-title":"In Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"2480_CR69","doi-asserted-by":"crossref","unstructured":"Yu, P., Kong, Z., Zhao, P., Dong, P., Tang, H., Sun, F., Lin, X., & Wang, Y. (2025, February). 
Q-TempFusion: Quantization-Aware Temporal Multi-Sensor Fusion on Bird\u2019s-Eye View Representation. In 2025 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV) (pp. 5489\u20135499). IEEE.","DOI":"10.1109\/WACV61041.2025.00536"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02480-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-025-02480-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02480-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,13]],"date-time":"2025-11-13T06:59:58Z","timestamp":1763017198000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-025-02480-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,26]]},"references-count":69,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,9]]}},"alternative-id":["2480"],"URL":"https:\/\/doi.org\/10.1007\/s11263-025-02480-w","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,26]]},"assertion":[{"value":"2 September 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 May 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 May 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}