{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:44:18Z","timestamp":1760147058478,"version":"build-2065373602"},"reference-count":50,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2023,1,4]],"date-time":"2023-01-04T00:00:00Z","timestamp":1672790400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003052","name":"Technology Innovation Program of the Ministry of Trade, Industry & Energy (MOTIE, Republic of Korea)","doi-asserted-by":"publisher","award":["1415181272"],"award-info":[{"award-number":["1415181272"]}],"id":[{"id":"10.13039\/501100003052","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Transformer-based semantic segmentation methods have achieved excellent performance in recent years. Mask2Former is one of the well-known transformer-based methods which unifies common image segmentation into a universal model. However, it performs relatively poorly in obtaining local features and segmenting small objects due to relying heavily on transformers. To this end, we propose a simple yet effective architecture that introduces auxiliary branches to Mask2Former during training to capture dense local features on the encoder side. The obtained features help improve the performance of learning local information and segmenting small objects. Since the proposed auxiliary convolution layers are required only for training and can be removed during inference, the performance gain can be obtained without additional computation at inference. 
Experimental results show that our model can achieve state-of-the-art performance (57.6% mIoU) on the ADE20K and (84.8% mIoU) on the Cityscapes datasets.<\/jats:p>","DOI":"10.3390\/s23020581","type":"journal-article","created":{"date-parts":[[2023,1,4]],"date-time":"2023-01-04T05:31:44Z","timestamp":1672810304000},"page":"581","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Enhancing Mask Transformer with Auxiliary Convolution Layers for Semantic Segmentation"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5225-5580","authenticated-orcid":false,"given":"Zhengyu","family":"Xia","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8833-0319","authenticated-orcid":false,"given":"Joohee","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA"}]}],"member":"1968","published-online":{"date-parts":[[2023,1,4]]},"reference":[{"unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4\u20139). Attention Is All You Need. Proceedings of the Conference Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA.","key":"ref_1"},{"unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020, January 26\u201330). An Image is Worth 16 \u00d7 16 Words: Transformers for Image Recognition at Scale. Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia.","key":"ref_2"},{"unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J\u00e9gou, H. 
(2021, January 18\u201324). Training Data-efficient Image Transformers & Distillation through Attention. Proceedings of the International Conference on Machine Learning (ICML), Virtual.","key":"ref_3"},{"unstructured":"Bao, H., Dong, L., Piao, S., and Wei, F. (2021). BEiT: BERT Pre-Training of Image Transformers. arXiv.","key":"ref_4"},{"unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20136). ImageNet Classification with Deep Convolutional Neural Networks. Proceedings of the Conference Neural Information Processing Systems (NeurIPS), Lake Tahoe, Nevada, USA.","key":"ref_5"},{"doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","key":"ref_6","DOI":"10.1109\/CVPR.2016.90"},{"doi-asserted-by":"crossref","unstructured":"Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, X.Z., Wang, Y., Fu, Y., Feng, J., Xing, T., and Torr, P.H.S. (2021, January 20\u201325). Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","key":"ref_7","DOI":"10.1109\/CVPR46437.2021.00681"},{"unstructured":"Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., and Luo, P. (2021, January 6\u201314). SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Proceedings of the Conference Neural Information Processing Systems (NeurIPS), Virtual.","key":"ref_8"},{"doi-asserted-by":"crossref","unstructured":"Strudel, R., Garcia, R., Laptev, I., and Schmid, C. (2021, January 11\u201317). Segmenter: Transformer for Semantic Segmentation. 
Proceedings of the IEEE International Conference on Computer Vision (ICCV), Virtual.","key":"ref_9","DOI":"10.1109\/ICCV48922.2021.00717"},{"unstructured":"Cheng, B., Schwing, A.G., and Kirillov, A. (2021, January 6\u201314). Per-Pixel Classification is Not All You Need for Semantic Segmentation. Proceedings of the Conference Neural Information Processing Systems (NeurIPS), Virtual.","key":"ref_10"},{"doi-asserted-by":"crossref","unstructured":"Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., and Girdhar, R. (2022, January 19\u201324). Masked-attention Mask Transformer for Universal Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","key":"ref_11","DOI":"10.1109\/CVPR52688.2022.00135"},{"doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Doll\u00e1r, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature Pyramid Networks for Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","key":"ref_12","DOI":"10.1109\/CVPR.2017.106"},{"unstructured":"Xiao, T., Singh, M., Mintun, E., Darrell, T., Doll\u00e1r, P., and Girshick, R. (2021, January 6\u201314). Early Convolutions Help Transformers See Better. Proceedings of the Conference Neural Information Processing Systems (NeurIPS), Virtual.","key":"ref_13"},{"doi-asserted-by":"crossref","unstructured":"Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., and Vaswani, A. (2021, January 20\u201325). Bottleneck Transformers for Visual Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","key":"ref_14","DOI":"10.1109\/CVPR46437.2021.01625"},{"doi-asserted-by":"crossref","unstructured":"Chen, Z., Xie, L., Niu, J., Liu, X., Wei, L., and Tian, Q. (2021, January 11\u201317). Visformer: The Vision-friendly Transformer. 
Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, BC, Canada.","key":"ref_15","DOI":"10.1109\/ICCV48922.2021.00063"},{"doi-asserted-by":"crossref","unstructured":"Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., and Torralba, A. (2017, January 21\u201326). Scene Parsing through ADE20K Dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","key":"ref_16","DOI":"10.1109\/CVPR.2017.544"},{"doi-asserted-by":"crossref","unstructured":"Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (2016, January 27\u201330). The Cityscapes Dataset for Semantic Urban Scene Understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","key":"ref_17","DOI":"10.1109\/CVPR.2016.350"},{"doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully Convolutional Networks for Semantic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","key":"ref_18","DOI":"10.1109\/CVPR.2015.7298965"},{"unstructured":"Badrinarayanan, V., Kendall, A., and Cipolla, R. (2015, January 7\u201312). SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","key":"ref_19"},{"doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). U-Net: Convolutional Networks for Biomedical Image Segmentation. Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), Munich, Germany.","key":"ref_20","DOI":"10.1007\/978-3-319-24574-4_28"},{"unstructured":"Liu, W., Rabinovich, A., and Berg, A.C. (2016, January 2\u20134). 
ParseNet: Looking Wider to See Better. Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico.","key":"ref_21"},{"doi-asserted-by":"crossref","unstructured":"Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017, January 21\u201326). Pyramid Scene Parsing Network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","key":"ref_22","DOI":"10.1109\/CVPR.2017.660"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","article-title":"DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs","volume":"40","author":"Chen","year":"2017","journal-title":"TPAMI"},{"unstructured":"Chen, L.C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv.","key":"ref_24"},{"doi-asserted-by":"crossref","unstructured":"Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018, January 8\u201314). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","key":"ref_25","DOI":"10.1007\/978-3-030-01234-2_49"},{"doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020, January 23\u201328). End-to-End Object Detection with Transformers. Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK.","key":"ref_26","DOI":"10.1007\/978-3-030-58452-8_13"},{"doi-asserted-by":"crossref","unstructured":"Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., and Lu, H. (2019, January 15\u201320). Dual Attention Network for Scene Segmentation. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","key":"ref_27","DOI":"10.1109\/CVPR.2019.00326"},{"unstructured":"Huang, Z., Wang, X., Wei, Y., Huang, L., Shi, H., Liu, W., and Huang, T.S. (2019, October 27\u2013November 2). CCNet: Criss-Cross Attention for Semantic Segmentation. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","key":"ref_28"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"2375","DOI":"10.1007\/s11263-021-01465-9","article-title":"OCNet: Object Context for Semantic Segmentation","volume":"129","author":"Yuan","year":"2021","journal-title":"IJCV"},{"doi-asserted-by":"crossref","unstructured":"Kirillov, A., He, K., Girshick, R., Rother, C., and Doll\u00e1r, P. (2019, January 15\u201320). Panoptic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","key":"ref_30","DOI":"10.1109\/CVPR.2019.00963"},{"doi-asserted-by":"crossref","unstructured":"Cheng, B., Collins, M.D., Zhu, Y., Liu, T., Huang, T.S., Adam, H., and Chen, L.C. (2019, January 15\u201320). Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","key":"ref_31","DOI":"10.1109\/CVPR42600.2020.01249"},{"unstructured":"Li, J., Raventos, A., Bhargava, A., Tagawa, T., and Gaidon, A. (2018). Learning to Fuse Things and Stuff. arXiv.","key":"ref_32"},{"doi-asserted-by":"crossref","unstructured":"Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., and Urtasun, R. (2019, January 15\u201320). UPSNet: A Unified Panoptic Segmentation Network. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","key":"ref_33","DOI":"10.1109\/CVPR.2019.00902"},{"doi-asserted-by":"crossref","unstructured":"Wu, Y., Zhang, G., Gao, Y., Deng, X., Gong, K., Liang, X., and Lin, L. (2020, January 13\u201319). Bidirectional Graph Reasoning Network for Panoptic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","key":"ref_34","DOI":"10.1109\/CVPR42600.2020.00910"},{"unstructured":"Wu, Y., Zhang, G., Xu, H., Liang, X., and Lin, L. (2020, January 6\u201312). Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation. Proceedings of the Conference Neural Information Processing Systems (NeurIPS), Vancouver, Canada.","key":"ref_35"},{"doi-asserted-by":"crossref","unstructured":"Li, Y., Zhao, H., Qi, X., Chen, Y., Qi, L., Wang, L., Li, Z., Sun, J., and Jia, J. (2021). Fully Convolutional Networks for Panoptic Segmentation with Point-based Supervision. arXiv.","key":"ref_36","DOI":"10.1109\/CVPR46437.2021.00028"},{"unstructured":"Lu, J., Batra, D., Parikh, D., and Lee, S. (2019, January 8\u201314). ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Proceedings of the Conference Neural Information Processing Systems (NeurIPS), Vancouver, Canada.","key":"ref_37"},{"unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. (2015, January 11\u201312). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Proceedings of the Conference Neural Information Processing Systems (NeurIPS), Montr\u00e9al, Canada.","key":"ref_38"},{"doi-asserted-by":"crossref","unstructured":"Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. (2021, January 11\u201317). Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. 
Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, BC, Canada.","key":"ref_39","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"ref_40","first-page":"1","article-title":"PVT v2: Improved Baselines with Pyramid Vision Transformer","volume":"8","author":"Wang","year":"2022","journal-title":"CVMJ"},{"key":"ref_41","first-page":"1","article-title":"P2T: Pyramid Pooling Transformer for Scene Understanding","volume":"99","author":"Wu","year":"2022","journal-title":"TPAMI"},{"doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 11\u201317). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, BC, Canada.","key":"ref_42","DOI":"10.1109\/ICCV48922.2021.00986"},{"unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., and Antiga, L. (2019, January 8\u201314). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Proceedings of the Conference Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada.","key":"ref_43"},{"unstructured":"Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., and Girshick, R. (2020, February 06). Detectron2. Available online: https:\/\/github.com\/facebookresearch\/detectron2.","key":"ref_44"},{"unstructured":"Loshchilov, I., and Hutter, F. (2019, January 6\u20139). Decoupled Weight Decay Regularization. Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA.","key":"ref_45"},{"doi-asserted-by":"crossref","unstructured":"Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T., Cubuk, E.D., Le, Q.V., and Zoph, B. (2021, January 20\u201325). Simple Copy-paste is A Strong Data Augmentation Method for Instance Segmentation. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","key":"ref_46","DOI":"10.1109\/CVPR46437.2021.00294"},{"doi-asserted-by":"crossref","unstructured":"Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. (2018, January 8\u201314). Unified Perceptual Parsing for Scene Understanding. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","key":"ref_47","DOI":"10.1007\/978-3-030-01228-1_26"},{"doi-asserted-by":"crossref","unstructured":"Huang, S., Lu, Z., Cheng, R., and He, C. (2021, January 11\u201317). FaPN: Feature-aligned Pyramid Network for Dense Image Prediction. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, BC, Canada.","key":"ref_48","DOI":"10.1109\/ICCV48922.2021.00090"},{"doi-asserted-by":"crossref","unstructured":"Ma, N., Zhang, X., Zheng, H.T., and Sun, J. (2018, January 8\u201314). ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","key":"ref_49","DOI":"10.1007\/978-3-030-01264-9_8"},{"unstructured":"Chen, L., Wang, H., and Qiao, S. (2020). Scaling Wide Residual Networks for Panoptic Segmentation. 
arXiv.","key":"ref_50"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/2\/581\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T17:58:40Z","timestamp":1760119120000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/2\/581"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,4]]},"references-count":50,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2023,1]]}},"alternative-id":["s23020581"],"URL":"https:\/\/doi.org\/10.3390\/s23020581","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2023,1,4]]}}}