{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T04:35:40Z","timestamp":1777696540091,"version":"3.51.4"},"reference-count":46,"publisher":"SAGE Publications","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IDA"],"published-print":{"date-parts":[[2023,11,20]]},"abstract":"<jats:p>Transformer-based networks have demonstrated their powerful performance in various vision tasks. However, these transformer-based networks are heavyweight and cannot be applied to edge computing (mobile) devices. Despite that the lightweight transformer network has emerged, several problems remain, i.e., weak feature extraction ability, feature redundancy, and lack of convolutional inductive bias. To address these three problems, we propose a lightweight visual transformer (Symmetric Former, SFormer), which contains two novel modules (Symmetric Block and Symmetric FFN). Specifically, we design Symmetric Block to expand feature capacity inside the module and enhance the long-range modeling capability of attention mechanism. To increase the compactness of the model and introduce inductive bias, we introduce convolutional cheap operations to design Symmetric FFN. We compared the SFormer with existing lightweight transformers on several vision tasks. Remarkably, on the image recognition task of ImageNet [13], SFormer gains 1.2% and 1.6% accuracy improvements compared to PVTv2-b0 and Swin Transformer, respectively. On the semantic segmentation task of ADE20K [64], SFormer delivers performance improvements of 0.2% and 0.7% compared to PVTv2-b0 and Swin Transformer, respectively. On the cityscapes dataset [11], SFormer delivers performance improvements of 2.5% and 4.2% compared to PVTv2-b0 and Swin Transformer, respectively. The code is open-source and available at: https:\/\/github.com\/ISCLab-Bistu\/Symmetric_Former.git.<\/jats:p>","DOI":"10.3233\/ida-227205","type":"journal-article","created":{"date-parts":[[2023,10,27]],"date-time":"2023-10-27T11:20:49Z","timestamp":1698405649000},"page":"1741-1757","source":"Crossref","is-referenced-by-count":5,"title":["A lightweight vision transformer with symmetric modules for vision tasks"],"prefix":"10.1177","volume":"27","author":[{"given":"Shengjun","family":"Liang","sequence":"first","affiliation":[{"name":"Key Laboratory of the Ministry of Education for Optoelectronic Measurement Technology and Instrument, Beijing Information Science and Technology University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mingxin","family":"Yu","sequence":"additional","affiliation":[{"name":"Key Laboratory of the Ministry of Education for Optoelectronic Measurement Technology and Instrument, Beijing Information Science and Technology University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenshuai","family":"Lu","sequence":"additional","affiliation":[{"name":"Department of Precision Instrument, Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xinglong","family":"Ji","sequence":"additional","affiliation":[{"name":"Department of Precision Instrument, Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiongxin","family":"Tang","sequence":"additional","affiliation":[{"name":"Science&Technology on Integrated Information System Laboratory, Institute of Software, The Chinese Academy of Sciences, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaolin","family":"Liu","sequence":"additional","affiliation":[{"name":"Beijing Institute of Space Mechanic & Electricity, Beijing Key Laboratory of Advanced Optical Remote Sensing, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rui","family":"You","sequence":"additional","affiliation":[{"name":"Key Laboratory of the Ministry of Education for Optoelectronic Measurement Technology and Instrument, Beijing Information Science and Technology University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","reference":[{"key":"10.3233\/IDA-227205_ref3","first-page":"1","article-title":"Tukey\u2019s honestly significant difference (HSD) test","volume":"3","author":"Abdi","year":"2010","journal-title":"Encyclopedia of Research Design"},{"key":"10.3233\/IDA-227205_ref4","doi-asserted-by":"crossref","unstructured":"A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lu\u010di\u0107 and C. Schmid, Vivit: A video vision transformer, in: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2021, pp. 6836\u20136846.","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"10.3233\/IDA-227205_ref5","doi-asserted-by":"crossref","unstructured":"L. Bai, Y. Zhao and X. Huang, A Near Sensor Edge Computing System for Point Cloud Semantic Segmentation, in: 2022 IEEE International Symposium on Circuits and Systems (ISCAS), 2022, pp. 1818\u20131822.","DOI":"10.1109\/ISCAS48785.2022.9937678"},{"key":"10.3233\/IDA-227205_ref7","doi-asserted-by":"crossref","unstructured":"Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan and Z. Liu, Mobile-former: Bridging mobilenet and transformer, in: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5270\u20135279.","DOI":"10.1109\/CVPR52688.2022.00520"},{"key":"10.3233\/IDA-227205_ref8","first-page":"1","article-title":"RetinaNet With Difference Channel Attention and Adaptively Spatial Feature Fusion for Steel Surface Defect Detection","volume":"70","author":"Cheng","year":"2021","journal-title":"IEEE Transactions on Instrumentation and Measurement"},{"key":"10.3233\/IDA-227205_ref11","doi-asserted-by":"crossref","unstructured":"M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213\u20133223.","DOI":"10.1109\/CVPR.2016.350"},{"key":"10.3233\/IDA-227205_ref12","unstructured":"Z. Dai, H. Liu, Q.V. Le and M. Tan, Coatnet: Marrying convolution and attention for all data sizes, Advances in Neural Information Processing Systems 34 (2021), 3965\u20133977."},{"key":"10.3233\/IDA-227205_ref13","doi-asserted-by":"crossref","unstructured":"J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248\u2013255.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"10.3233\/IDA-227205_ref14","doi-asserted-by":"crossref","unstructured":"X. Ding, X. Zhang, N. Ma, J. Han, G. Ding and J. Sun, Repvgg: Making vgg-style convnets great again, in: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13733\u201313742.","DOI":"10.1109\/CVPR46437.2021.01352"},{"key":"10.3233\/IDA-227205_ref15","doi-asserted-by":"crossref","unstructured":"X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen and B. Guo, Cswin transformer: A general vision transformer backbone with cross-shaped windows, in: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12124\u201312134.","DOI":"10.1109\/CVPR52688.2022.01181"},{"key":"10.3233\/IDA-227205_ref17","doi-asserted-by":"crossref","unstructured":"H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik and C. Feichtenhofer, Multiscale vision transformers, in: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2021, pp. 6824\u20136835.","DOI":"10.1109\/ICCV48922.2021.00675"},{"key":"10.3233\/IDA-227205_ref18","unstructured":"X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249\u2013256."},{"key":"10.3233\/IDA-227205_ref19","doi-asserted-by":"crossref","unstructured":"B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin, H. J\u00e9gou and M. Douze, Levit: a vision transformer in convnet\u2019s clothing for faster inference, in: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2021, pp.\u00a012259\u201312269.","DOI":"10.1109\/ICCV48922.2021.01204"},{"key":"10.3233\/IDA-227205_ref20","doi-asserted-by":"crossref","unstructured":"K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu and C. Xu, Ghostnet: More features from cheap operations, in: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1580\u20131589.","DOI":"10.1109\/CVPR42600.2020.00165"},{"key":"10.3233\/IDA-227205_ref21","doi-asserted-by":"crossref","unstructured":"K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770\u2013778.","DOI":"10.1109\/CVPR.2016.90"},{"key":"10.3233\/IDA-227205_ref22","doi-asserted-by":"crossref","unstructured":"K. He, G. Gkioxari, P. Doll\u00e1r and R. Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961\u20132969.","DOI":"10.1109\/ICCV.2017.322"},{"key":"10.3233\/IDA-227205_ref24","doi-asserted-by":"crossref","unstructured":"B. Heo, S. Yun, D. Han, S. Chun, J. Choe and S.J. Oh, Rethinking spatial dimensions of vision transformers, in: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2021, pp. 11936\u201311945.","DOI":"10.1109\/ICCV48922.2021.01172"},{"key":"10.3233\/IDA-227205_ref27","unstructured":"A. Katharopoulos, A. Vyas, N. Pappas and F. Fleuret, Transformers are rnns: Fast autoregressive transformers with linear attention, in: International Conference on Machine Learning, 2020, pp. 5156\u20135165."},{"key":"10.3233\/IDA-227205_ref28","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3505244","article-title":"Transformers in vision: A survey","volume":"54","author":"Khan","year":"2022","journal-title":"ACM Computing Surveys"},{"key":"10.3233\/IDA-227205_ref29","doi-asserted-by":"crossref","unstructured":"A. Kirillov, R. Girshick, K. He and P. Doll\u00e1r, Panoptic feature pyramid networks, in: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399\u20136408.","DOI":"10.1109\/CVPR.2019.00656"},{"key":"10.3233\/IDA-227205_ref30","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1145\/3065386","article-title":"Imagenet classification with deep convolutional neural networks","volume":"60","author":"Krizhevsky","year":"2017","journal-title":"Communications of the ACM"},{"key":"10.3233\/IDA-227205_ref31","unstructured":"J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi and Y.W. Teh, Set transformer: A framework for attention-based permutation-invariant neural networks, in: International Conference on Machine Learning, 2019, pp. 3744\u20133753."},{"key":"10.3233\/IDA-227205_ref34","doi-asserted-by":"crossref","unstructured":"T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r and C.L. Zitnick, Microsoft coco: Common objects in context, in: Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6\u201312, 2014, Proceedings, Part V 13, 2014, pp. 740\u2013755.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"10.3233\/IDA-227205_ref35","doi-asserted-by":"crossref","unstructured":"Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin and B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2021, pp.\u00a010012\u201310022.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"10.3233\/IDA-227205_ref36","doi-asserted-by":"crossref","unstructured":"Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang and L. Dong, Swin transformer v2: Scaling up capacity and resolution, in: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12009\u201312019.","DOI":"10.1109\/CVPR52688.2022.01170"},{"key":"10.3233\/IDA-227205_ref39","doi-asserted-by":"crossref","first-page":"4481","DOI":"10.3390\/rs13214481","article-title":"Drone-based autonomous motion planning system for outdoor environments under object detection uncertainty","volume":"13","author":"Sandino","year":"2021","journal-title":"Remote Sensing"},{"key":"10.3233\/IDA-227205_ref40","doi-asserted-by":"crossref","unstructured":"M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L.-C. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510\u20134520.","DOI":"10.1109\/CVPR.2018.00474"},{"key":"10.3233\/IDA-227205_ref42","doi-asserted-by":"crossref","first-page":"259","DOI":"10.1016\/0169-7439(89)80095-4","article-title":"Analysis of variance (ANOVA)","volume":"6","author":"St","year":"1989","journal-title":"Chemometrics and Intelligent Laboratory Systems"},{"key":"10.3233\/IDA-227205_ref43","doi-asserted-by":"crossref","unstructured":"R. Strudel, R. Garcia, I. Laptev and C. Schmid, Segmenter: Transformer for semantic segmentation, in: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2021, pp. 7262\u20137272.","DOI":"10.1109\/ICCV48922.2021.00717"},{"key":"10.3233\/IDA-227205_ref44","doi-asserted-by":"crossref","unstructured":"Z. Sun, S. Cao, Y. Yang and K.M. Kitani, Rethinking transformer-based set prediction for object detection, in: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2021, pp. 3611\u20133620.","DOI":"10.1109\/ICCV48922.2021.00359"},{"key":"10.3233\/IDA-227205_ref45","doi-asserted-by":"crossref","unstructured":"C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1\u20139.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"10.3233\/IDA-227205_ref46","doi-asserted-by":"crossref","unstructured":"C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818\u20132826.","DOI":"10.1109\/CVPR.2016.308"},{"key":"10.3233\/IDA-227205_ref48","unstructured":"H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles and H. J\u00e9gou, Training data-efficient image transformers & distillation through attention, in: International Conference on Machine Learning, 2021, pp. 10347\u201310357."},{"key":"10.3233\/IDA-227205_ref51","doi-asserted-by":"crossref","unstructured":"W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo and L. Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2021, pp. 568\u2013578.","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"10.3233\/IDA-227205_ref52","doi-asserted-by":"crossref","first-page":"415","DOI":"10.1007\/s41095-022-0274-8","article-title":"PVT v2: Improved baselines with Pyramid Vision Transformer","volume":"8","author":"Wang","year":"2022","journal-title":"Computational Visual Media"},{"key":"10.3233\/IDA-227205_ref53","doi-asserted-by":"crossref","unstructured":"X. Wang, R. Girshick, A. Gupta and K. He, Non-local neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794\u20137803.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"10.3233\/IDA-227205_ref54","doi-asserted-by":"crossref","unstructured":"H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan and L. Zhang, Cvt: Introducing convolutions to vision transformers, in: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2021, pp. 22\u201331.","DOI":"10.1109\/ICCV48922.2021.00009"},{"key":"10.3233\/IDA-227205_ref55","unstructured":"T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Doll\u00e1r and R. Girshick, Early convolutions help transformers see better, Advances in Neural Information Processing Systems 34 (2021), 30392\u201330400."},{"key":"10.3233\/IDA-227205_ref56","doi-asserted-by":"crossref","unstructured":"S. Xie, R. Girshick, P. Doll\u00e1r, Z. Tu and K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492\u20131500.","DOI":"10.1109\/CVPR.2017.634"},{"key":"10.3233\/IDA-227205_ref58","doi-asserted-by":"crossref","unstructured":"C. Yang, Y. Wang, J. Zhang, H. Zhang, Z. Wei, Z. Lin and A. Yuille, Lite vision transformer with enhanced self-attention, in: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11998\u201312008.","DOI":"10.1109\/CVPR52688.2022.01169"},{"key":"10.3233\/IDA-227205_ref59","doi-asserted-by":"crossref","unstructured":"H. Yang, Z. Shen and Y. Zhao, AsymmNet: Towards ultralight convolution neural networks using asymmetrical bottlenecks, in: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2339\u20132348.","DOI":"10.1109\/CVPRW53098.2021.00266"},{"key":"10.3233\/IDA-227205_ref60","doi-asserted-by":"crossref","unstructured":"L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F.E. Tay, J. Feng and S. Yan, Tokens-to-token vit: Training vision transformers from scratch on imagenet, in: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2021, pp. 558\u2013567.","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"10.3233\/IDA-227205_ref61","unstructured":"M. Zaheer, G. Guruganesh, K.A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang and L. Yang, Big bird: Transformers for longer sequences, in: Advances in Neural Information Processing Systems, 2020, pp. 17283\u201317297."},{"key":"10.3233\/IDA-227205_ref62","doi-asserted-by":"crossref","unstructured":"X. Zhang, X. Zhou, M. Lin and J. Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848\u20136856.","DOI":"10.1109\/CVPR.2018.00716"},{"key":"10.3233\/IDA-227205_ref63","doi-asserted-by":"crossref","unstructured":"Z. Zhong, L. Zheng, G. Kang, S. Li and Y. Yang, Random erasing data augmentation, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 13001\u201313008.","DOI":"10.1609\/aaai.v34i07.7000"},{"key":"10.3233\/IDA-227205_ref64","doi-asserted-by":"crossref","unstructured":"B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba, Scene parsing through ade20k dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633\u2013641.","DOI":"10.1109\/CVPR.2017.544"}],"container-title":["Intelligent Data Analysis"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/IDA-227205","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T09:20:18Z","timestamp":1777454418000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/IDA-227205"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,20]]},"references-count":46,"journal-issue":{"issue":"6"},"URL":"https:\/\/doi.org\/10.3233\/ida-227205","relation":{},"ISSN":["1088-467X","1571-4128"],"issn-type":[{"value":"1088-467X","type":"print"},{"value":"1571-4128","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11,20]]}}}