{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,15]],"date-time":"2026-04-15T18:33:26Z","timestamp":1776278006813,"version":"3.50.1"},"reference-count":74,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2024,4,19]],"date-time":"2024-04-19T00:00:00Z","timestamp":1713484800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,4,19]],"date-time":"2024-04-19T00:00:00Z","timestamp":1713484800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach. Intell. Res."],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This paper tackles the high computational\/space complexity associated with multi-head self-attention (MHSA) in vanilla vision transformers. To this end, we propose hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. 
With the H-MHSA module incorporated, we build a family of hierarchical-attention-based transformer networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/yun-liu\/HAT-Net\">https:\/\/github.com\/yun-liu\/HAT-Net<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s11633-024-1393-8","type":"journal-article","created":{"date-parts":[[2024,4,19]],"date-time":"2024-04-19T12:01:30Z","timestamp":1713528090000},"page":"670-683","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":45,"title":["Vision Transformers with Hierarchical Attention"],"prefix":"10.1007","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6143-0264","authenticated-orcid":false,"given":"Yun","family":"Liu","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8666-3435","authenticated-orcid":false,"given":"Yu-Huan","family":"Wu","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8667-9656","authenticated-orcid":false,"given":"Guolei","family":"Sun","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6930-8674","authenticated-orcid":false,"given":"Le","family":"Zhang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2051-2209","authenticated-orcid":false,"given":"Ajad","family":"Chhatkuli","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3445-5711","authenticated-orcid":false,"given":"Luc","family":"Van 
Gool","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,4,19]]},"reference":[{"key":"1393_CR1","doi-asserted-by":"publisher","unstructured":"A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 1097\u20131105, 2012. DOI: https:\/\/doi.org\/10.5555\/2999134.2999257.","DOI":"10.5555\/2999134.2999257"},{"key":"1393_CR2","unstructured":"K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015."},{"key":"1393_CR3","doi-asserted-by":"publisher","unstructured":"K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 770\u2013778, 2016. DOI: https:\/\/doi.org\/10.1109\/CVPR.2016.90.","DOI":"10.1109\/CVPR.2016.90"},{"issue":"6","key":"1393_CR4","doi-asserted-by":"publisher","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","volume":"39","author":"S Q Ren","year":"2017","unstructured":"S. Q. Ren, K. M. He, R. Girshick, J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137\u20131149, 2017. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2016.2577031.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1393_CR5","doi-asserted-by":"publisher","unstructured":"K. M. He, G. Gkioxari, P. Doll\u00e1r, R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, pp. 2980\u20132988, 2017. 
DOI: https:\/\/doi.org\/10.1109\/ICCV.2017.322.","DOI":"10.1109\/ICCV.2017.322"},{"key":"1393_CR6","doi-asserted-by":"publisher","unstructured":"H. S. Zhao, J. P. Shi, X. J. Qi, X. G. Wang, J. Y. Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 6230\u20136239, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.660.","DOI":"10.1109\/CVPR.2017.660"},{"issue":"8","key":"1393_CR7","doi-asserted-by":"publisher","first-page":"1939","DOI":"10.1109\/TPAMI.2018.2878849","volume":"41","author":"Y Liu","year":"2019","unstructured":"Y. Liu, M. M. Cheng, X. W. Hu, J. W. Bian, L. Zhang, X. Bai, J. H. Tang. Richer convolutional features for edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1939\u20131946, 2019. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2018.2878849.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"issue":"1","key":"1393_CR8","doi-asserted-by":"publisher","first-page":"179","DOI":"10.1007\/s11263-021-01539-8","volume":"130","author":"Y Liu","year":"2022","unstructured":"Y. Liu, M. M. Cheng, D. P. Fan, L. Zhang, J. W. Bian, D. C. Tao. Semantic edge detection with diverse deep supervision. International Journal of Computer Vision, vol. 130, no. 1, pp. 179\u2013198, 2022. DOI: https:\/\/doi.org\/10.1007\/s11263-021-01539-8.","journal-title":"International Journal of Computer Vision"},{"issue":"7","key":"1393_CR9","doi-asserted-by":"publisher","first-page":"6131","DOI":"10.1109\/TCYB.2021.3051350","volume":"52","author":"Y Liu","year":"2022","unstructured":"Y. Liu, M. M. Cheng, X. Y. Zhang, G. Y. Nie, M. Wang. DNA: Deeply supervised nonlinear aggregation for salient object detection. IEEE Transactions on Cybernetics, vol. 52, no. 7, pp. 6131\u20136142, 2022. 
DOI: https:\/\/doi.org\/10.1109\/TCYB.2021.3051350.","journal-title":"IEEE Transactions on Cybernetics"},{"issue":"9","key":"1393_CR10","doi-asserted-by":"publisher","first-page":"4439","DOI":"10.1109\/TCYB.2020.3035613","volume":"51","author":"Y Liu","year":"2021","unstructured":"Y. Liu, Y. C. Gu, X. Y. Zhang, W. W. Wang, M. M. Cheng. Lightweight salient object detection via hierarchical visual perception learning. IEEE Transactions on Cybernetics, vol. 51, no. 9, pp. 4439\u20134449, 2021. DOI: https:\/\/doi.org\/10.1109\/TCYB.2020.3035613.","journal-title":"IEEE Transactions on Cybernetics"},{"key":"1393_CR11","doi-asserted-by":"publisher","unstructured":"Y. Liu, Y. H. Wu, Y. F. Ban, H. F. Wang, M. M. Cheng. Rethinking computer-aided tuberculosis diagnosis. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 2643\u20132652, 2020. DOI: https:\/\/doi.org\/10.1109\/CVPR42600.2020.00272.","DOI":"10.1109\/CVPR42600.2020.00272"},{"issue":"3","key":"1393_CR12","doi-asserted-by":"publisher","first-page":"1415","DOI":"10.1109\/TPAMI.2020.3023152","volume":"44","author":"Y Liu","year":"2022","unstructured":"Y. Liu, Y. H. Wu, P. S. Wen, Y. J. Shi, Y. Qiu, M. M. Cheng. Leveraging instance-, image- and dataset-level information for weakly supervised instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1415\u20131428, 2022. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2020.3023152.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1393_CR13","doi-asserted-by":"publisher","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000\u20136010, 2017. 
DOI: https:\/\/doi.org\/10.5555\/3295222.3295349.","DOI":"10.5555\/3295222.3295349"},{"key":"1393_CR14","doi-asserted-by":"publisher","unstructured":"J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 4171\u20134186, 2019. DOI: https:\/\/doi.org\/10.18653\/v1\/N19-1423.","DOI":"10.18653\/v1\/N19-1423"},{"key":"1393_CR15","doi-asserted-by":"publisher","unstructured":"Z. H. Dai, Z. L. Yang, Y. M. Yang, J. Carbonell, Q. Le, R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2978\u20132988, 2019. DOI: https:\/\/doi.org\/10.18653\/v1\/P19-1285.","DOI":"10.18653\/v1\/P19-1285"},{"key":"1393_CR16","unstructured":"A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16\u00d716 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021."},{"key":"1393_CR17","doi-asserted-by":"publisher","unstructured":"B. Heo, S. Yun, D. Han, S. Chun, J. Choe, S. J. Oh. Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 11916\u201311925, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.01172.","DOI":"10.1109\/ICCV48922.2021.01172"},{"key":"1393_CR18","doi-asserted-by":"publisher","unstructured":"Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. 
In Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 9992\u201310002, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00986.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"1393_CR19","doi-asserted-by":"publisher","unstructured":"W. H. Wang, E. Z. Xie, X. Li, D. P. Fan, K. T. Song, D. Liang, T. Lu, P. Luo, L. Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 548\u2013558, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00061.","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"1393_CR20","doi-asserted-by":"publisher","unstructured":"W. J. Xu, Y. F. Xu, T. Chang, Z. W. Tu. Co-scale conv-attentional image transformers. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 9961\u20139970, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00983.","DOI":"10.1109\/ICCV48922.2021.00983"},{"key":"1393_CR21","doi-asserted-by":"publisher","unstructured":"H. Q. Fan, B. Xiong, K. Mangalam, Y. H. Li, Z. C. Yan, J. Malik, C. Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 6804\u20136815, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00675.","DOI":"10.1109\/ICCV48922.2021.00675"},{"key":"1393_CR22","unstructured":"D. Bolya, C. Y. Fu, X. L. Dai, P. Z. Zhang, C. Feichtenhofer, J. Hoffman. Token merging: Your ViT but faster. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023."},{"issue":"11","key":"1393_CR23","doi-asserted-by":"publisher","first-page":"2278","DOI":"10.1109\/5.726791","volume":"86","author":"Y LeCun","year":"1998","unstructured":"Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, vol. 86, no. 
11, pp. 2278\u20132324, 1998. DOI: https:\/\/doi.org\/10.1109\/5.726791.","journal-title":"Proceedings of the IEEE"},{"issue":"3","key":"1393_CR24","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","volume":"115","author":"O Russakovsky","year":"2015","unstructured":"O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. A. Ma, Z. H. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, vol. 115, no. 3, pp. 211\u2013252, 2015. DOI: https:\/\/doi.org\/10.1007\/s11263-015-0816-y.","journal-title":"International Journal of Computer Vision"},{"key":"1393_CR25","unstructured":"R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks, [Online], Available: https:\/\/arxiv.org\/abs\/1505.00387, 2015."},{"key":"1393_CR26","doi-asserted-by":"publisher","unstructured":"C. Szegedy, W. Liu, Y. Q. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, pp. 1\u20139, 2015. DOI: https:\/\/doi.org\/10.1109\/CVPR.2015.7298594.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"1393_CR27","doi-asserted-by":"publisher","unstructured":"C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 2818\u20132826, 2016. DOI: https:\/\/doi.org\/10.1109\/CVPR.2016.308.","DOI":"10.1109\/CVPR.2016.308"},{"key":"1393_CR28","doi-asserted-by":"publisher","unstructured":"C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, USA, pp. 4278\u20134284, 2017. 
DOI: https:\/\/doi.org\/10.5555\/3298023.3298188.","DOI":"10.5555\/3298023.3298188"},{"key":"1393_CR29","doi-asserted-by":"publisher","unstructured":"S. N. Xie, R. Girshick, P. Dollar, Z. W. Tu, K. M. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 5987\u20135995, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.634.","DOI":"10.1109\/CVPR.2017.634"},{"key":"1393_CR30","doi-asserted-by":"publisher","unstructured":"G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 2261\u20132269, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.243.","DOI":"10.1109\/CVPR.2017.243"},{"key":"1393_CR31","unstructured":"A. G. Howard, M. L. Zhu, B. Chen, D. Kalenichenko, W. J. Wang, T. Weyand, M. Andreetto, H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications, [Online], Available: https:\/\/arxiv.org\/abs\/1704.04861, 2017."},{"key":"1393_CR32","doi-asserted-by":"publisher","unstructured":"M. Sandler, A. Howard, M. L. Zhu, A. Zhmoginov, L. C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, pp. 4510\u20134520, 2018. DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00474.","DOI":"10.1109\/CVPR.2018.00474"},{"key":"1393_CR33","doi-asserted-by":"publisher","unstructured":"X. Y. Zhang, X. Y. Zhou, M. X. Lin, J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, pp. 6848\u20136856, 2018. 
DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00716.","DOI":"10.1109\/CVPR.2018.00716"},{"key":"1393_CR34","doi-asserted-by":"publisher","unstructured":"N. N. Ma, X. Y. Zhang, H. T. Zheng, J. Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, pp. 122\u2013138, 2018. DOI: https:\/\/doi.org\/10.1007\/978-3-030-01264-9_8.","DOI":"10.1007\/978-3-030-01264-9_8"},{"key":"1393_CR35","doi-asserted-by":"publisher","unstructured":"M. X. Tan, B. Chen, R. M. Pang, V. Vasudevan, M. Sandler, A. Howard, Q. V. Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, pp. 2815\u20132823, 2019. DOI: https:\/\/doi.org\/10.1109\/CVPR.2019.00293.","DOI":"10.1109\/CVPR.2019.00293"},{"key":"1393_CR36","unstructured":"M. X. Tan, Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 6105\u20136114, 2019."},{"key":"1393_CR37","doi-asserted-by":"publisher","unstructured":"M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu. Spatial transformer networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 2017\u20132025, 2015. DOI: https:\/\/doi.org\/10.5555\/2969442.2969465.","DOI":"10.5555\/2969442.2969465"},{"key":"1393_CR38","doi-asserted-by":"publisher","unstructured":"L. Chen, H. W. Zhang, J. Xiao, L. Q. Nie, J. Shao, W. Liu, T. S. Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 6298\u20136306, 2017. 
DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.667.","DOI":"10.1109\/CVPR.2017.667"},{"key":"1393_CR39","doi-asserted-by":"publisher","unstructured":"F. Wang, M. Q. Jiang, C. Qian, S. Yang, C. Li, H. G. Zhang, X. G. Wang, X. O. Tang. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 6450\u20136458, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.683.","DOI":"10.1109\/CVPR.2017.683"},{"issue":"8","key":"1393_CR40","doi-asserted-by":"publisher","first-page":"2011","DOI":"10.1109\/TPAMI.2019.2913372","volume":"42","author":"J Hu","year":"2020","unstructured":"J. Hu, L. Shen, S. Albanie, G. Sun, E. H. Wu. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 8, pp. 2011\u20132023, 2020. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2019.2913372.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1393_CR41","doi-asserted-by":"publisher","unstructured":"S. Woo, J. Park, J. Y. Lee, I. S. Kweon. CBAM: Convolutional block attention module. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, pp. 3\u201319, 2018. DOI: https:\/\/doi.org\/10.1007\/978-3-030-01234-2_1.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"1393_CR42","unstructured":"J. Park, S. Woo, J. Y. Lee, I. S. Kweon. BAM: Bottleneck attention module. In Proceedings of the British Machine Vision Conference, Newcastle, UK, Article number 147, 2018."},{"key":"1393_CR43","doi-asserted-by":"publisher","unstructured":"X. Li, W. H. Wang, X. L. Hu, J. Yang. Selective kernel networks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, pp. 510\u2013519, 2019. DOI: https:\/\/doi.org\/10.1109\/CVPR.2019.00060.","DOI":"10.1109\/CVPR.2019.00060"},{"key":"1393_CR44","doi-asserted-by":"publisher","unstructured":"X. L. Wang, R. Girshick, A. Gupta, K. M. He. 
Non-local neural networks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, pp. 7794\u20137803, 2018. DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00813.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"1393_CR45","doi-asserted-by":"publisher","unstructured":"H. Zhang, C. R. Wu, Z. Y. Zhang, Y. Zhu, H. B. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha, M. Li, A. Smola. ResNeSt: Split-attention networks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops, New Orleans, USA, pp. 2735\u20132745, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPRW56347.2022.00309.","DOI":"10.1109\/CVPRW56347.2022.00309"},{"key":"1393_CR46","doi-asserted-by":"publisher","unstructured":"L. Yuan, Y. P. Chen, T. Wang, W. H. Yu, Y. J. Shi, Z. H. Jiang, F. E. H. Tay, J. S. Feng, S. C. Yan. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 538\u2013547, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00060.","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"1393_CR47","doi-asserted-by":"publisher","unstructured":"H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, H. J\u00e9gou. Going deeper with image transformers. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 32\u201342, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00010.","DOI":"10.1109\/ICCV48922.2021.00010"},{"key":"1393_CR48","unstructured":"D. Q. Zhou, B. Y. Kang, X. J. Jin, L. J. Yang, X. C. Lian, Z. H. Jiang, Q. B. Hou, J. S. Feng. DeepViT: Towards deeper vision transformer, [Online], Available: https:\/\/arxiv.org\/abs\/2103.11886, 2021."},{"key":"1393_CR49","unstructured":"H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. J\u00e9gou. Training data-efficient image transformers & distillation through attention. 
In Proceedings of the 38th International Conference on Machine Learning, pp. 10347\u201310357, 2021."},{"key":"1393_CR50","doi-asserted-by":"publisher","unstructured":"A. Srinivas, T. Y. Lin, N. Parmar, J. Shlens, P. Abbeel, A. Vaswani. Bottleneck transformers for visual recognition. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, pp. 16514\u201316524, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.01625.","DOI":"10.1109\/CVPR46437.2021.01625"},{"key":"1393_CR51","unstructured":"I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. H. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, A. Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision. In Proceedings of the 34th Advances in Neural Information Processing Systems, pp. 24261\u201324272, 2021."},{"key":"1393_CR52","unstructured":"H. X. Liu, Z. H. Dai, D. R. So, Q. V. Le. Pay attention to MLPs. In Proceedings of the 34th Advances in Neural Information Processing Systems, pp. 9204\u20139215, 2021."},{"issue":"1","key":"1393_CR53","doi-asserted-by":"publisher","first-page":"1328","DOI":"10.1109\/TPAMI.2022.3145427","volume":"45","author":"Q B Hou","year":"2023","unstructured":"Q. B. Hou, Z. H. Jiang, L. Yuan, M. M. Cheng, S. C. Yan, J. S. Feng. Vision Permutator: A permutable MLP-like architecture for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 1328\u20131334, 2023. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2022.3145427.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1393_CR54","doi-asserted-by":"publisher","unstructured":"Z. C. Wang, Y. B. Hao, X. Y. Gao, H. Zhang, S. Wang, T. T. Mu, X. N. He. Parameterization of cross-token relations with relative positional encoding for vision MLP. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, pp. 6288\u20136299, 2022. 
DOI: https:\/\/doi.org\/10.1145\/3503161.3547953.","DOI":"10.1145\/3503161.3547953"},{"key":"1393_CR55","unstructured":"K. Han, A. Xiao, E. H. Wu, J. Y. Guo, C. J. Xu, Y. H. Wang. Transformer in transformer. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 15908\u201315919, 2021."},{"key":"1393_CR56","unstructured":"Y. W. Li, K. Zhang, J. Z. Cao, R. Timofte, L. Van Gool. LocalViT: Bringing locality to vision transformers, [Online], Available: https:\/\/arxiv.org\/abs\/2104.05707, 2021."},{"key":"1393_CR57","doi-asserted-by":"publisher","unstructured":"K. Yuan, S. P. Guo, Z. W. Liu, A. J. Zhou, F. W. Yu, W. Wu. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 559\u2013568, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00062.","DOI":"10.1109\/ICCV48922.2021.00062"},{"key":"1393_CR58","unstructured":"D. Hendrycks, K. Gimpel. Gaussian error linear units (GELUs), [Online], Available: https:\/\/arxiv.org\/abs\/1606.08415, 2016."},{"key":"1393_CR59","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1016\/j.neunet.2017.12.012","volume":"107","author":"S Elfwing","year":"2018","unstructured":"S. Elfwing, E. Uchibe, K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, vol. 107, pp. 3\u201311, 2018. DOI: https:\/\/doi.org\/10.1016\/j.neunet.2017.12.012.","journal-title":"Neural Networks"},{"key":"1393_CR60","doi-asserted-by":"publisher","unstructured":"A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. M. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. K\u00f6pf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. J. Bai, S. Chintala. PyTorch: An imperative style, high-performance deep learning library. 
In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, USA, Article number 721, 2019. DOI: https:\/\/doi.org\/10.5555\/3454287.3455008.","DOI":"10.5555\/3454287.3455008"},{"key":"1393_CR61","unstructured":"H. Y. Zhang, M. Ciss\u00e9, Y. N. Dauphin, D. Lopez-Paz. mixup: Beyond empirical risk minimization. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 2018."},{"key":"1393_CR62","unstructured":"I. Loshchilov, F. Hutter. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019."},{"key":"1393_CR63","unstructured":"I. Loshchilov, F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017."},{"key":"1393_CR64","doi-asserted-by":"publisher","unstructured":"I. Radosavovic, R. P. Kosaraju, R. Girshick, K. M. He, P. Doll\u00e1r. Designing network design spaces. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 10425\u201310433, 2020. DOI: https:\/\/doi.org\/10.1109\/CVPR42600.2020.01044.","DOI":"10.1109\/CVPR42600.2020.01044"},{"key":"1393_CR65","doi-asserted-by":"publisher","unstructured":"H. P. Wu, B. Xiao, N. Codella, M. C. Liu, X. Y. Dai, L. Yuan, L. Zhang. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 22\u201331, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00009.","DOI":"10.1109\/ICCV48922.2021.00009"},{"key":"1393_CR66","unstructured":"X. X. Chu, Z. Tian, Y. Q. Wang, B. Zhang, H. B. Ren, X. L. Wei, H. X. Xia, C. H. Shen. Twins: Revisiting the design of spatial attention in vision transformers. In Proceedings of the 34th Advances in Neural Information Processing Systems, pp. 
9355\u20139366, 2021."},{"issue":"3","key":"1393_CR67","doi-asserted-by":"publisher","first-page":"415","DOI":"10.1007\/s41095-022-0274-8","volume":"8","author":"W H Wang","year":"2022","unstructured":"W. H. Wang, E. Z. Xie, X. Li, D. P. Fan, K. T. Song, D. Liang, T. Lu, P. Luo, L. Shao. PVT v2: Improved baselines with pyramid vision transformer. Computational Visual Media, vol. 8, no. 3, pp. 415\u2013424, 2022. DOI: https:\/\/doi.org\/10.1007\/s41095-022-0274-8.","journal-title":"Computational Visual Media"},{"key":"1393_CR68","doi-asserted-by":"crossref","unstructured":"A. Kirillov, R. Girshick, K. M. He, P. Doll\u00e1r. Panoptic feature pyramid networks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, pp. 6392\u20136401, 2019. DOI: https:\/\/doi.org\/10.1109\/CVPR.2019.00656.","DOI":"10.1109\/CVPR.2019.00656"},{"key":"1393_CR69","doi-asserted-by":"publisher","unstructured":"B. L. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, A. Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 5122\u20135130, 2017. DOI: https:\/\/doi.org\/10.1109\/CVPR.2017.544.","DOI":"10.1109\/CVPR.2017.544"},{"key":"1393_CR70","unstructured":"MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark, [Online], Available: https:\/\/github.com\/open-mmlab\/mmsegmentation, 2020."},{"key":"1393_CR71","doi-asserted-by":"publisher","unstructured":"T. Y. Lin, P. Goyal, R. Girshick, K. M. He, P. Doll\u00e1r. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, pp. 2999\u20133007, 2017. DOI: https:\/\/doi.org\/10.1109\/ICCV.2017.324.","DOI":"10.1109\/ICCV.2017.324"},{"key":"1393_CR72","doi-asserted-by":"publisher","unstructured":"T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C. L. Zitnick. Microsoft COCO: Common objects in context. 
In Proceedings of the 13th European Conference on Computer Vision, Z\u00fcrich, Switzerland, pp. 740\u2013755, 2014. DOI: https:\/\/doi.org\/10.1007\/978-3-319-10602-1.","DOI":"10.1007\/978-3-319-10602-1"},{"key":"1393_CR73","unstructured":"K. Chen, J. Q. Wang, J. M. Pang, Y. H. Cao, Y. Xiong, X. X. Li, S. Y. Sun, W. S. Feng, Z. W. Liu, J. R. Xu, Z. Zhang, D. Z. Cheng, C. C. Zhu, T. H. Cheng, Q. J. Zhao, B. Y. Li, X. Lu, R. Zhu, Y. Wu, J. F. Dai, J. D. Wang, J. P. Shi, W. L. Ouyang, C. C. Loy, D. H. Lin. MMDetection: Open MMLab detection toolbox and benchmark, [Online], Available: https:\/\/arxiv.org\/abs\/1906.07155, 2019."},{"key":"1393_CR74","doi-asserted-by":"publisher","unstructured":"P. C. Zhang, X. Y. Dai, J. W. Yang, B. Xiao, L. Yuan, L. Zhang, J. F. Gao. Multi-scale vision Longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 2978\u20132988, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00299.","DOI":"10.1109\/ICCV48922.2021.00299"}],"container-title":["Machine Intelligence 
Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-024-1393-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11633-024-1393-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-024-1393-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,16]],"date-time":"2024-07-16T08:14:18Z","timestamp":1721117658000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11633-024-1393-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,19]]},"references-count":74,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["1393"],"URL":"https:\/\/doi.org\/10.1007\/s11633-024-1393-8","relation":{},"ISSN":["2731-538X","2731-5398"],"issn-type":[{"value":"2731-538X","type":"print"},{"value":"2731-5398","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,4,19]]},"assertion":[{"value":"3 September 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 January 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 April 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declared that they have no conflicts of interest to this work.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations of conflict of interest"}}]}}