{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T16:57:11Z","timestamp":1767891431393,"version":"3.49.0"},"reference-count":31,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2024,2,24]],"date-time":"2024-02-24T00:00:00Z","timestamp":1708732800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,24]],"date-time":"2024-02-24T00:00:00Z","timestamp":1708732800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2024,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The existing image semantic segmentation models have low accuracy in detecting tiny targets or multi-targets at overlapping regions. This work proposes a hybrid vision transformer with unified-perceptual-parsing network (ViT-UperNet) for medical image segmentation. A self-attention mechanism is embedded in a vision transformer to extract multi-level features. The image features are extracted hierarchically from low to high dimensions using 4 groups of Transformer blocks with different numbers. Then, it uses a unified-perceptual-parsing network based on a feature pyramid network (FPN) and a pyramid pooling module (PPM) for the fusion of multi-scale contextual features and semantic segmentation. FPN can naturally use hierarchical features, and generate strong semantic information on all scales. PPM can better use the global prior knowledge to understand complex scenes, and extract features with global context information to improve segmentation results. 
In the training process, a scalable self-supervised learner named masked autoencoder is used for pre-training, which strengthens the visual representation ability and improves the efficiency of feature learning. Experiments are conducted on cardiac magnetic resonance image segmentation, where the left and right atria and ventricles are selected for segmentation. The pixel accuracy is 93.85%, the Dice coefficient is 92.61%, and the Hausdorff distance is 11.16, all improved over the other methods. The results show the superiority of ViT-UperNet in medical image segmentation, especially for targets with low recognizability and severe occlusion.<\/jats:p>","DOI":"10.1007\/s40747-024-01359-6","type":"journal-article","created":{"date-parts":[[2024,2,24]],"date-time":"2024-02-24T19:01:58Z","timestamp":1708801318000},"page":"3819-3831","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["ViT-UperNet: a hybrid vision transformer with unified-perceptual-parsing network for medical image segmentation"],"prefix":"10.1007","volume":"10","author":[{"given":"Yang","family":"Ruiping","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7349-8730","authenticated-orcid":false,"given":"Liu","family":"Kun","sequence":"additional","affiliation":[]},{"given":"Xu","family":"Shaohua","sequence":"additional","affiliation":[]},{"given":"Yin","family":"Jian","sequence":"additional","affiliation":[]},{"given":"Zhang","family":"Zhen","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,2,24]]},"reference":[{"issue":"1","key":"1359_CR1","doi-asserted-by":"publisher","first-page":"19","DOI":"10.1007\/s13735-021-00218-1","volume":"11","author":"S Suganyadevi","year":"2022","unstructured":"Suganyadevi S, Seethalakshmi V, Balasamy K (2022) A review on deep learning in medical image analysis. 
Int J Multimed Inf Retr 11(1):19\u201338","journal-title":"Int J Multimed Inf Retr"},{"issue":"5","key":"1359_CR2","doi-asserted-by":"publisher","first-page":"1243","DOI":"10.1049\/ipr2.12419","volume":"16","author":"R Wang","year":"2022","unstructured":"Wang R, Lei T, Cui R, Zhang B, Meng H, Nandi AK (2022) Medical image segmentation using deep learning: a survey. IET Image Proc 16(5):1243\u20131267","journal-title":"IET Image Proc"},{"issue":"11","key":"1359_CR3","doi-asserted-by":"publisher","first-page":"11150","DOI":"10.1109\/TII.2023.3244344","volume":"19","author":"S Alagarsamy","year":"2023","unstructured":"Alagarsamy S, Govindaraj V et al (2023) Automated brain tumor segmentation for MR brain images using artificial bee colony combined with interval type-II fuzzy technique. IEEE Trans Ind Inf 19(11):11150\u201311159","journal-title":"IEEE Trans Ind Inf"},{"key":"1359_CR4","doi-asserted-by":"publisher","DOI":"10.1016\/j.compbiomed.2021.105063","volume":"140","author":"S Xun","year":"2022","unstructured":"Xun S, Li D, Zhu H, Chen M, Wang J, Li J, Chen M, Wu B, Zhang H, Chai X et al (2022) Generative adversarial networks in medical image segmentation: a review. Comput Biol Med 140:105063","journal-title":"Comput Biol Med"},{"key":"1359_CR5","first-page":"1","volume":"71","author":"A Lin","year":"2022","unstructured":"Lin A, Chen B, Xu J, Zhang Z, Lu G, Zhang D (2022) Ds-transunet: dual swin transformer u-net for medical image segmentation. IEEE Trans Instrum Meas 71:1\u201315","journal-title":"IEEE Trans Instrum Meas"},{"key":"1359_CR6","doi-asserted-by":"crossref","unstructured":"Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M (2023) Swin-unet: Unet-like pure transformer for medical image segmentation. In: Computer Vision\u2014ECCV 2022 Workshops: Tel Aviv, Israel, October 23\u201327, 2022, Proceedings, Part III. 
Springer, pp 205\u2013218","DOI":"10.1007\/978-3-031-25066-8_9"},{"key":"1359_CR7","doi-asserted-by":"crossref","unstructured":"Zhou Z, Rahman\u00a0Siddiquee MM, Tajbakhsh N, Liang J (2018) Unet++: a nested u-net architecture for medical image segmentation. In: Deep learning in medical image analysis and multimodal learning for clinical decision support: 4th international workshop, DLMIA 2018, and 8th international workshop, ML-CDS 2018, held in conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer, pp 3\u201311","DOI":"10.1007\/978-3-030-00889-5_1"},{"key":"1359_CR8","doi-asserted-by":"crossref","unstructured":"Xiao X, Lian S, Luo Z, Li S (2018) Weighted res-unet for high-quality retina vessel segmentation. In: 2018 9th international conference on information technology in medicine and education (ITME). IEEE, pp 327\u2013331","DOI":"10.1109\/ITME.2018.00080"},{"issue":"12","key":"1359_CR9","doi-asserted-by":"publisher","first-page":"2663","DOI":"10.1109\/TMI.2018.2845918","volume":"37","author":"X Li","year":"2018","unstructured":"Li X, Chen H, Qi X, Dou Q, Fu C-W, Heng P-A (2018) H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Trans Med Imaging 37(12):2663\u20132674","journal-title":"IEEE Trans Med Imaging"},{"key":"1359_CR10","doi-asserted-by":"crossref","unstructured":"Alom MZ, Hasan M, Yakopcic C, Taha TM, Asari VK (2018) Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. arXiv preprint arXiv:1802.06955","DOI":"10.1109\/NAECON.2018.8556686"},{"key":"1359_CR11","doi-asserted-by":"crossref","unstructured":"Valanarasu JMJ, Sindagi VA, Hacihaliloglu I, Patel VM (2020) Kiu-net: towards accurate segmentation of biomedical images using over-complete representations. In: International conference on medical image computing and computer-assisted intervention. 
Springer, pp 363\u2013373","DOI":"10.1007\/978-3-030-59719-1_36"},{"key":"1359_CR12","doi-asserted-by":"crossref","unstructured":"Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, Han X, Chen Y-W, Wu J (2020) UNet 3+: a full-scale connected UNet for medical image segmentation. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 1055\u20131059","DOI":"10.1109\/ICASSP40776.2020.9053405"},{"key":"1359_CR13","doi-asserted-by":"crossref","unstructured":"Gillioz A, Casas J, Mugellini E, Abou\u00a0Khaled O (2020) Overview of the transformer-based models for NLP tasks. In: 2020 15th conference on computer science and information systems (FedCSIS). IEEE, pp 179\u2013183","DOI":"10.15439\/2020F20"},{"key":"1359_CR14","doi-asserted-by":"crossref","unstructured":"Meng L, Li H, Chen B-C, Lan S, Wu Z, Jiang Y-G, Lim S-N (2022) Adavit: adaptive vision transformers for efficient image recognition. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 12309\u201312318","DOI":"10.1109\/CVPR52688.2022.01199"},{"key":"1359_CR15","doi-asserted-by":"publisher","first-page":"1141","DOI":"10.1007\/s11263-022-01739-w","volume":"131","author":"Q Zhang","year":"2023","unstructured":"Zhang Q, Xu Y, Zhang J, Tao D (2023) Vitaev2: vision transformer advanced by exploring inductive bias for image recognition and beyond. Int J Comput Vis 131:1141\u20131162","journal-title":"Int J Comput Vis"},{"key":"1359_CR16","doi-asserted-by":"crossref","unstructured":"Wang Y, Xu Z, Wang X, Shen C, Cheng B, Shen H, Xia H (2021) End-to-end video instance segmentation with transformers. 
In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 8741\u20138750","DOI":"10.1109\/CVPR46437.2021.00863"},{"key":"1359_CR17","doi-asserted-by":"crossref","unstructured":"Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R (2022) Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 1290\u20131299","DOI":"10.1109\/CVPR52688.2022.00135"},{"key":"1359_CR18","doi-asserted-by":"crossref","unstructured":"Han G, Ma J, Huang S, Chen L, Chang S-F (2022) Few-shot object detection with fully cross-transformer. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 5321\u20135330","DOI":"10.1109\/CVPR52688.2022.00525"},{"key":"1359_CR19","doi-asserted-by":"crossref","unstructured":"Fan L, Pang Z, Zhang T, Wang Y-X, Zhao H, Wang F, Wang N, Zhang Z (2022) Embracing single stride 3d object detector with sparse transformer. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 8458\u20138468","DOI":"10.1109\/CVPR52688.2022.00827"},{"key":"1359_CR20","doi-asserted-by":"crossref","unstructured":"Zhang B, Gu S, Zhang B, Bao J, Chen D, Wen F, Wang Y, Guo B (2022) Styleswin: transformer-based gan for high-resolution image generation. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 11304\u201311314","DOI":"10.1109\/CVPR52688.2022.01102"},{"key":"1359_CR21","unstructured":"Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y (2021) Transunet: transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306"},{"key":"1359_CR22","doi-asserted-by":"crossref","unstructured":"Zhang Y, Liu H, Hu Q (2021) Transfuse: Fusing transformers and cnns for medical image segmentation. In: International conference on medical image computing and computer-assisted intervention. 
Springer, pp 14\u201324","DOI":"10.1007\/978-3-030-87193-2_2"},{"key":"1359_CR23","doi-asserted-by":"crossref","unstructured":"Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 10012\u201310022","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"1359_CR24","doi-asserted-by":"crossref","unstructured":"He K, Chen X, Xie S, Li Y, Doll\u00e1r P, Girshick R (2022) Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 16000\u201316009","DOI":"10.1109\/CVPR52688.2022.01553"},{"issue":"11","key":"1359_CR25","doi-asserted-by":"publisher","first-page":"2514","DOI":"10.1109\/TMI.2018.2837502","volume":"37","author":"O Bernard","year":"2018","unstructured":"Bernard O, Lalande A, Zotti C, Cervenansky F, Yang X, Heng P-A, Cetin I, Lekadir K, Camara O, Ballester MAG et al (2018) Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Trans Med Imaging 37(11):2514\u20132525","journal-title":"IEEE Trans Med Imaging"},{"key":"1359_CR26","doi-asserted-by":"crossref","unstructured":"Yong H, Huang J, Hua X, Zhang L (2020) Gradient centralization: a new optimization technique for deep neural networks. In: European conference on computer vision. Springer, pp 635\u2013652","DOI":"10.1007\/978-3-030-58452-8_37"},{"key":"1359_CR27","doi-asserted-by":"crossref","unstructured":"He T, Zhang Z, Zhang H, Zhang Z, Xie J, Li M (2019) Bag of tricks for image classification with convolutional neural networks. 
In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 558\u2013567","DOI":"10.1109\/CVPR.2019.00065"},{"key":"1359_CR28","doi-asserted-by":"crossref","unstructured":"Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV (2019) Autoaugment: learning augmentation strategies from data. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 113\u2013123","DOI":"10.1109\/CVPR.2019.00020"},{"key":"1359_CR29","doi-asserted-by":"crossref","unstructured":"Cui Y, Che W, Liu T, Qin B, Yang Z (2021) Pre-training with whole word masking for Chinese Bert. IEEE ACM Trans Audio Speech Lang Process 29:3504\u20133514","DOI":"10.1109\/TASLP.2021.3124365"},{"key":"1359_CR30","doi-asserted-by":"crossref","unstructured":"Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV), pp 801\u2013818","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"1359_CR31","doi-asserted-by":"crossref","unstructured":"Xiao T, Liu Y, Zhou B, Jiang Y, Sun J (2018) Unified perceptual parsing for scene understanding. 
In: Proceedings of the European conference on computer vision (ECCV), pp 418\u2013434","DOI":"10.1007\/978-3-030-01228-1_26"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01359-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-024-01359-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01359-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,16]],"date-time":"2024-05-16T18:17:31Z","timestamp":1715883451000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-024-01359-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,24]]},"references-count":31,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,6]]}},"alternative-id":["1359"],"URL":"https:\/\/doi.org\/10.1007\/s40747-024-01359-6","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,24]]},"assertion":[{"value":"28 March 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 January 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 February 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they do not have any commercial or 
associative interest that represents a conflict of interest in connection with the work submitted.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}