{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,9]],"date-time":"2026-04-09T04:39:57Z","timestamp":1775709597017,"version":"3.50.1"},"reference-count":59,"publisher":"Springer Science and Business Media LLC","issue":"10","license":[{"start":{"date-parts":[[2025,8,28]],"date-time":"2025-08-28T00:00:00Z","timestamp":1756339200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,8,28]],"date-time":"2025-08-28T00:00:00Z","timestamp":1756339200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100019904","name":"Vellore Institute of Technology, Chennai","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100019904","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2025,10]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Vision Transformers (ViTs) have redefined image classification by leveraging self-attention to capture complex patterns and long-range dependencies between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which is inherent in convolutional neural networks (CNNs) through their hierarchical structure. Graph transformers have made strides in addressing this by leveraging graph-based modeling, but they often lose or insufficiently represent spatial hierarchies, especially since redundant or less relevant areas dilute the image\u2019s contextual representation. To bridge this gap, we propose SAG-ViT, a Scale-Aware Graph Attention ViT that integrates the multi-scale feature capabilities of CNNs, the representational power of ViTs, and graph-attended patching to enable richer contextual representation. Using EfficientNetV2 as a backbone, the model extracts multi-scale feature maps, dividing them into patches to preserve richer semantic information compared to directly patching the input images. The patches are structured into a graph using spatial and feature similarities, where a Graph Attention Network (GAT) refines the node embeddings. This refined graph representation is then processed by a Transformer encoder, capturing long-range dependencies and complex interactions. Evaluated on six diverse image classification benchmarks, SAG-ViT achieves an F1 score of 0.9574 on CIFAR-10 and 0.9958 on GTSRB, to validate its consistent improvements across multiple backbones. 
Our code and weights are available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/shravan-18\/SAG-ViT\" ext-link-type=\"uri\">https:\/\/github.com\/shravan-18\/SAG-ViT<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s40747-025-02043-z","type":"journal-article","created":{"date-parts":[[2025,8,28]],"date-time":"2025-08-28T08:16:30Z","timestamp":1756368990000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["SAG-ViT: a scale-aware, high-fidelity patching approach with graph attention for vision transformers"],"prefix":"10.1007","volume":"11","author":[{"given":"Shravan","family":"Venkatraman","sequence":"first","affiliation":[]},{"given":"Jaskaran Singh","family":"Walia","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9022-9145","authenticated-orcid":false,"given":"P. R.","family":"Joe Dhanith","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,8,28]]},"reference":[{"issue":"6","key":"2043_CR1","doi-asserted-by":"publisher","first-page":"84","DOI":"10.1145\/3065386","volume":"60","author":"A Krizhevsky","year":"2017","unstructured":"Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84\u201390. https:\/\/doi.org\/10.1145\/3065386","journal-title":"Commun ACM"},{"key":"2043_CR2","unstructured":"Luo W, Li Y, Urtasun R, Zemel R (2016) Understanding the effective receptive field in deep convolutional neural networks. In: Proceedings of the 30th international conference on neural information processing systems, NIPS\u201916. Curran Associates Inc., Red Hook, NY, USA, pp 4905\u20134913"},{"key":"2043_CR3","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems, NIPS\u201917. Curran Associates Inc., Red Hook, NY, USA, pp 6000\u20136010"},{"key":"2043_CR4","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929"},{"key":"2043_CR5","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2025.103165","volume":"121","author":"H Khan","year":"2025","unstructured":"Khan H, Usman MT, Koo J (2025) Bilateral feature fusion with hexagonal attention for robust saliency detection under uncertain environments. Inf Fusion 121:103165","journal-title":"Inf Fusion"},{"key":"2043_CR6","unstructured":"Tan M, Le QV (2021) EfficientNetv2: smaller models and faster training. arXiv:2104.00298"},{"key":"2043_CR7","unstructured":"Veli\u010dkovi\u0107 P, Cucurull G, Casanova A, Romero A, Li\u00f2 P, Bengio Y (2018) Graph attention networks. arXiv:1710.10903"},{"issue":"1","key":"2043_CR8","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1109\/TNNLS.2020.2978386","volume":"32","author":"Z Wu","year":"2021","unstructured":"Wu Z, Pan S, Chen F, Long G, Zhang C, Yu PS (2021) A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst 32(1):4\u201324. 
https:\/\/doi.org\/10.1109\/TNNLS.2020.2978386","journal-title":"IEEE Trans Neural Netw Learn Syst"},{"key":"2043_CR9","unstructured":"Krizhevsky A (2009) Learning multiple layers of features from tiny images. https:\/\/api.semanticscholar.org\/CorpusID:18268744"},{"key":"2043_CR10","doi-asserted-by":"crossref","unstructured":"Ignatov A, Malivenko G (2024) NCT-CRC-HE: not all histopathological datasets are equally useful. arXiv:2409.11546","DOI":"10.1007\/978-3-031-91721-9_19"},{"key":"2043_CR11","unstructured":"Hughes DP, Salathe M (2016) An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv:1511.08060"},{"issue":"10","key":"2043_CR12","doi-asserted-by":"publisher","first-page":"1865","DOI":"10.1109\/JPROC.2017.2675998","volume":"105","author":"G Cheng","year":"2017","unstructured":"Cheng G, Han J, Lu X (2017) Remote sensing image scene classification: benchmark and state of the art. Proc IEEE 105(10):1865\u20131883. https:\/\/doi.org\/10.1109\/JPROC.2017.2675998","journal-title":"Proc IEEE"},{"key":"2043_CR13","doi-asserted-by":"publisher","first-page":"323","DOI":"10.1016\/j.neunet.2012.02.016","volume":"32","author":"J Stallkamp","year":"2012","unstructured":"Stallkamp J, Schlipsing M, Salmen J, Igel C (2012) Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Netw 32:323\u2013332","journal-title":"Neural Netw"},{"key":"2043_CR14","doi-asserted-by":"crossref","unstructured":"Walia JS, Pavithra LK (2024) Deep learning innovations for underwater waste detection: an in-depth analysis. arXiv:2405.18299","DOI":"10.1109\/ACCESS.2025.3569344"},{"key":"2043_CR15","doi-asserted-by":"publisher","first-page":"292","DOI":"10.1007\/978-3-031-43360-3_24","volume-title":"Towards autonomous robotic systems","author":"JS Walia","year":"2023","unstructured":"Walia JS, Seemakurthy K (2023) Optimized custom dataset for efficient detection of underwater trash. In: Iida F, Maiolino P, Abdulali A, Wang M (eds) Towards autonomous robotic systems. Springer, Cham, pp 292\u2013303"},{"key":"2043_CR16","unstructured":"Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jegou H (2021) Training data-efficient image transformers & distillation through attention. In: Meila M, Zhang T (eds) Proceedings of the 38th international conference on machine learning, vol 139. Proceedings of machine learning research. PMLR. pp 10347\u201310357. https:\/\/proceedings.mlr.press\/v139\/touvron21a.html"},{"key":"2043_CR17","doi-asserted-by":"publisher","unstructured":"Yuan L, Chen Y, Wang T, Yu W, Shi Y, Jiang Z, Tay FEH, Feng J, Yan S (2021) Tokens-to-token ViT: training vision transformers from scratch on ImageNet. In: 2021 IEEE\/CVF International conference on computer vision (ICCV), IEEE Computer Society, Los Alamitos, CA, USA, pp 538\u2013547. https:\/\/doi.org\/10.1109\/ICCV48922.2021.00060","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"2043_CR18","unstructured":"Jaegle A, Gimeno F, Brock A, Zisserman A, Vinyals O, Carreira J (2021) Perceiver: general perception with iterative attention. arXiv:2103.03206"},{"key":"2043_CR19","doi-asserted-by":"crossref","unstructured":"Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, Zhang L (2021) CVT: introducing convolutions to vision transformers. 
arXiv:2103.15808","DOI":"10.1109\/ICCV48922.2021.00009"},{"key":"2043_CR20","doi-asserted-by":"publisher","unstructured":"Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical Vision transformer using shifted windows. In: 2021 IEEE\/CVF international conference on computer vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, pp 9992\u201310002. https:\/\/doi.org\/10.1109\/ICCV48922.2021.00986","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"2043_CR21","doi-asserted-by":"publisher","unstructured":"Chen C-FR, Fan Q, Panda R (2021) CrossViT: cross-attention multi-scale vision transformer for image classification. In: 2021 IEEE\/CVF international conference on computer vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA. pp 347\u2013356. https:\/\/doi.org\/10.1109\/ICCV48922.2021.00041","DOI":"10.1109\/ICCV48922.2021.00041"},{"key":"2043_CR22","doi-asserted-by":"publisher","unstructured":"Lin T-Y, Doll\u00e1r P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 936\u2013944. https:\/\/doi.org\/10.1109\/CVPR.2017.106","DOI":"10.1109\/CVPR.2017.106"},{"key":"2043_CR23","doi-asserted-by":"publisher","unstructured":"Chen Y et al (2019) Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. In: 2019 IEEE\/CVF international conference on computer vision (ICCV). IEEE, pp 3434\u20133443. https:\/\/doi.org\/10.1109\/ICCV.2019.00353","DOI":"10.1109\/ICCV.2019.00353"},{"key":"2043_CR24","doi-asserted-by":"crossref","unstructured":"Li Y, Wu C-Y, Fan H, Mangalam K, Xiong B, Malik J, Feichtenhofer C (2022) MViTv2: improved multiscale vision transformers for classification and detection. arXiv:2112.01526","DOI":"10.1109\/CVPR52688.2022.00476"},{"key":"2043_CR25","doi-asserted-by":"crossref","unstructured":"Liu Z, Mao H, Wu C-Y, Feichtenhofer C, Darrell T, Xie S (2022) A convnet for the 2020s. arXiv:2201.03545","DOI":"10.1109\/CVPR52688.2022.01167"},{"key":"2043_CR26","doi-asserted-by":"publisher","first-page":"58625","DOI":"10.1109\/ACCESS.2024.3389808","volume":"12","author":"G-I Kim","year":"2024","unstructured":"Kim G-I, Chung K (2024) ViT-based multi-scale classification using digital signal processing and image transformation. IEEE Access 12:58625\u201358638. https:\/\/doi.org\/10.1109\/ACCESS.2024.3389808","journal-title":"IEEE Access"},{"key":"2043_CR27","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2025.110917","volume":"155","author":"MT Usman","year":"2025","unstructured":"Usman MT, Khan H, Rida I, Koo J (2025) Lightweight transformer-driven multi-scale trapezoidal attention network for saliency detection. Eng Appl Artif Intell 155:110917","journal-title":"Eng Appl Artif Intell"},{"key":"2043_CR28","unstructured":"Zhu Y, Xu W, Zhang J, Du Y, Zhang J, Liu Q, Yang C, Wu S (2022) A survey on graph structure learning: progress and opportunities. arXiv:2103.03036"},{"key":"2043_CR29","doi-asserted-by":"publisher","unstructured":"Gao H, Wang Z, Ji S (2018) Large-scale learnable graph convolutional networks. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, KDD \u201918. Association for Computing Machinery, New York, NY, USA, pp 1416\u20131424. 
https:\/\/doi.org\/10.1145\/3219819.3219947","DOI":"10.1145\/3219819.3219947"},{"key":"2043_CR30","doi-asserted-by":"publisher","unstructured":"Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. In: 2018 IEEE\/CVF conference on computer vision and pattern recognition. IEEE, pp 7794\u20137803. https:\/\/doi.org\/10.1109\/CVPR.2018.00813","DOI":"10.1109\/CVPR.2018.00813"},{"key":"2043_CR31","doi-asserted-by":"publisher","unstructured":"Srinivas A, Lin T-Y, Parmar N, Shlens J, Abbeel P, Vaswani A (2021) Bottleneck transformers for visual recognition. In: 2021 IEEE\/CVF conference on computer vision and pattern recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, pp 16514\u201316524. https:\/\/doi.org\/10.1109\/CVPR46437.2021.01625","DOI":"10.1109\/CVPR46437.2021.01625"},{"key":"2043_CR32","doi-asserted-by":"publisher","unstructured":"Guo J, Han K, Wu H, Tang Y, Chen X, Wang Y, Xu C (2022) CMT: convolutional neural networks meet vision transformers. In: 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, pp 12165\u201312175. https:\/\/doi.org\/10.1109\/CVPR52688.2022.01186","DOI":"10.1109\/CVPR52688.2022.01186"},{"key":"2043_CR33","doi-asserted-by":"publisher","unstructured":"Graham B, El-Nouby A, Touvron H, Stock P, Joulin A, Jegou H, Douze M (2021) LeViT: a vision transformer in ConvNet\u2019s clothing for faster inference. In: 2021 IEEE\/CVF international conference on computer vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, pp 12239\u201312249. https:\/\/doi.org\/10.1109\/ICCV48922.2021.01204","DOI":"10.1109\/ICCV48922.2021.01204"},{"key":"2043_CR34","unstructured":"Mehta S, Rastegari M (2022) MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv:2110.02178"},{"key":"2043_CR35","doi-asserted-by":"publisher","unstructured":"Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) MobileNetV2: inverted residuals and linear bottlenecks. In: 2018 IEEE\/CVF conference on computer vision and pattern recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, pp 4510\u20134520. https:\/\/doi.org\/10.1109\/CVPR.2018.00474","DOI":"10.1109\/CVPR.2018.00474"},{"key":"2043_CR36","doi-asserted-by":"publisher","unstructured":"Chen Y, Dai X, Chen D, Liu M, Dong X, Yuan L, Liu Z (2022) Mobile-former: bridging MobileNet and transformer. In: 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, pp 5260\u20135269. https:\/\/doi.org\/10.1109\/CVPR52688.2022.00520","DOI":"10.1109\/CVPR52688.2022.00520"},{"key":"2043_CR37","doi-asserted-by":"publisher","unstructured":"Ding Y, Wang A, Zhang L(2024) Multidimensional semantic disentanglement network for clothes-changing person re-identification. In: Proceedings of the 2024 international conference on multimedia retrieval, ICMR \u201924. Association for Computing Machinery, New York, NY, USA, pp 1025\u20131033. https:\/\/doi.org\/10.1145\/3652583.3658037","DOI":"10.1145\/3652583.3658037"},{"key":"2043_CR38","doi-asserted-by":"publisher","DOI":"10.1007\/s40747-024-01646-2","author":"Y Ding","year":"2025","unstructured":"Ding Y, Li J, Wang H et al (2025) Attention-enhanced multimodal feature fusion network for clothes-changing person re-identification. Complex Intell Syst. 
https:\/\/doi.org\/10.1007\/s40747-024-01646-2","journal-title":"Complex Intell Syst"},{"key":"2043_CR39","unstructured":"Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980"},{"key":"2043_CR40","unstructured":"Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556"},{"key":"2043_CR41","doi-asserted-by":"publisher","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 770\u2013778. https:\/\/doi.org\/10.1109\/CVPR.2016.90","DOI":"10.1109\/CVPR.2016.90"},{"key":"2043_CR42","doi-asserted-by":"crossref","unstructured":"Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818\u20132826","DOI":"10.1109\/CVPR.2016.308"},{"key":"2043_CR43","doi-asserted-by":"crossref","unstructured":"Huang G, Liu Z, Van Der\u00a0Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700\u20134708","DOI":"10.1109\/CVPR.2017.243"},{"key":"2043_CR44","doi-asserted-by":"crossref","unstructured":"Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1\u20139","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"2043_CR45","doi-asserted-by":"crossref","unstructured":"Ma N, Zhang X, Zheng H-T, Sun J (2018) ShuffleNet v2: practical guidelines for efficient CNN architecture design. In: Proceedings of the European conference on computer vision (ECCV), pp 116\u2013131","DOI":"10.1007\/978-3-030-01264-9_8"},{"key":"2043_CR46","unstructured":"Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and $$<$$0.5mb model size. arXiv:1602.07360"},{"key":"2043_CR47","unstructured":"Tan M, Le QV (2020) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv:1905.11946"},{"issue":"12","key":"2043_CR48","doi-asserted-by":"publisher","first-page":"4190","DOI":"10.1109\/TMI.2024.3417007","volume":"43","author":"Z Huang","year":"2024","unstructured":"Huang Z, Sun J, Shao Y, Wang Z, Wang S, Li Q, Li J, Yu Q (2024) PolarFormer: a transformer-based method for multi-lesion segmentation in intravascular oct. IEEE Trans Med Imaging 43(12):4190\u20134199. https:\/\/doi.org\/10.1109\/TMI.2024.3417007","journal-title":"IEEE Trans Med Imaging"},{"issue":"12","key":"2043_CR49","doi-asserted-by":"publisher","first-page":"4404","DOI":"10.1109\/TMI.2024.3421644","volume":"43","author":"J Xiao","year":"2024","unstructured":"Xiao J, Li S, Lin T, Zhu J, Yuan X, Feng DD, Sheng B (2024) Multi-label chest X-ray image classification with single positive labels. IEEE Trans Med Imaging 43(12):4404\u20134418. 
https:\/\/doi.org\/10.1109\/TMI.2024.3421644","journal-title":"IEEE Trans Med Imaging"},{"key":"2043_CR50","series-title":"Lecture notes in computer science","doi-asserted-by":"publisher","first-page":"472","DOI":"10.1007\/978-3-031-72378-0_44","volume-title":"Medical image computing and computer assisted intervention\u2014MICCAI 2024","author":"J Lee","year":"2024","unstructured":"Lee J, Kim J, Lee H (2024) Covid19 to pneumonia: multi region lung severity classification using CNN transformer position-aware feature encoding network. In: Linguraru M et al (eds) Medical image computing and computer assisted intervention\u2014MICCAI 2024, vol 15001. Lecture notes in computer science. Springer, Cham, pp 472\u2013481. https:\/\/doi.org\/10.1007\/978-3-031-72378-0_44"},{"key":"2043_CR51","series-title":"Lecture notes in computer science","doi-asserted-by":"publisher","first-page":"621","DOI":"10.1007\/978-3-031-72378-0_58","volume-title":"Medical image computing and computer assisted intervention\u2014MICCAI 2024","author":"L Chen","year":"2024","unstructured":"Chen L et al (2024) Hybrid-structure-oriented transformer for arm musculoskeletal ultrasound segmentation. In: Linguraru M et al (eds) Medical image computing and computer assisted intervention\u2014MICCAI 2024, vol 15001. Lecture notes in computer science. Springer, Cham, pp 621\u2013631. https:\/\/doi.org\/10.1007\/978-3-031-72378-0_58"},{"key":"2043_CR52","doi-asserted-by":"crossref","unstructured":"Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. arXiv:2005.12872","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"2043_CR53","doi-asserted-by":"crossref","unstructured":"Chen P, Zhang M, Shen Y, Sheng K, Gao Y, Sun X, Li K, Shen C (2022) Efficient decoder-free object detection with transformers. arXiv:2206.06829","DOI":"10.1007\/978-3-031-20080-9_5"},{"key":"2043_CR54","unstructured":"Song H, Sun D, Chun S, Jampani V, Han D, Heo B, Kim W, Yang M-H (2021) ViDT: an efficient and effective fully transformer-based object detector. arXiv:2110.03921"},{"key":"2043_CR55","unstructured":"Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv:2105.15203"},{"key":"2043_CR56","unstructured":"Tang Q, Liu C, Liu F, Liu Y, Jiang J, Zhang B, Han K, Wang Y (2023) Category feature transformer for semantic segmentation. arXiv:2308.05581"},{"issue":"12","key":"2043_CR57","doi-asserted-by":"publisher","first-page":"4419","DOI":"10.1109\/TMI.2024.3422102","volume":"43","author":"Z Cai","year":"2024","unstructured":"Cai Z, Lin L, He H, Cheng P, Tang X (2024) Uni4Eye++: a general masked image modeling multi-modal pre-training framework for ophthalmic image classification and segmentation. IEEE Trans Med Imaging 43(12):4419\u20134429. https:\/\/doi.org\/10.1109\/TMI.2024.3422102","journal-title":"IEEE Trans Med Imaging"},{"key":"2043_CR58","first-page":"775","volume-title":"Medical image computing and computer assisted intervention\u2014MICCAI 2024","author":"X Tian","year":"2024","unstructured":"Tian X, Anantrasirichai N, Nicholson L, Achim A (2024) TaGAT: topology-aware graph attention network for multi-modal retinal image fusion. In: Linguraru MG, Dou Q, Feragen A, Giannarou S, Glocker B, Lekadir K, Schnabel JA (eds) Medical image computing and computer assisted intervention\u2014MICCAI 2024. 
Springer, Cham, pp 775\u2013784"},{"key":"2043_CR59","doi-asserted-by":"publisher","unstructured":"Zhou Q, Zou H, Wang Z, Jiang H, Wang Y (2024) Refining intraocular lens power calculation: a multi-modal framework using cross-layer attention and effective channel attention. In: Linguraru M et al (eds) Medical image computing and computer assisted intervention\u2014MICCAI 2024, vol 15001. Lecture notes in computer science. Springer, Cham, pp 754\u2013763. https:\/\/doi.org\/10.1007\/978-3-031-72378-0_70","DOI":"10.1007\/978-3-031-72378-0_70"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-025-02043-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-025-02043-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-025-02043-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T13:32:22Z","timestamp":1758807142000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-025-02043-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,28]]},"references-count":59,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2025,10]]}},"alternative-id":["2043"],"URL":"https:\/\/doi.org\/10.1007\/s40747-025-02043-z","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,28]]},"assertion":[{"value":"17 February 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 July 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 August 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that there are no conflict of interest associated with this work.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"428"}}
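The abstract in this record describes a four-stage pipeline: an EfficientNetV2 backbone produces multi-scale feature maps, the maps are divided into patches that become graph nodes, a Graph Attention Network (GAT) refines the node embeddings, and a Transformer encoder models long-range dependencies over the refined nodes. Below is a minimal PyTorch sketch of that flow, not the authors' implementation (their code and weights are at https://github.com/shravan-18/SAG-ViT); the k-nearest-neighbor graph construction, the 256-dimensional embedding, the layer counts, and the use of PyTorch Geometric's GATConv are all illustrative assumptions.

# Illustrative sketch of the SAG-ViT pipeline described in the abstract.
# Not the authors' code; module choices, sizes, and the k-NN graph here
# are assumptions. Requires torch, torchvision, and torch_geometric.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s
from torch_geometric.nn import GATConv

class SAGViTSketch(nn.Module):
    def __init__(self, num_classes=10, embed_dim=256, k=8):
        super().__init__()
        # 1) CNN backbone: multi-scale feature maps instead of raw-pixel patches.
        self.backbone = efficientnet_v2_s(weights="DEFAULT").features
        self.proj = nn.Conv2d(1280, embed_dim, kernel_size=1)  # 1280 = EfficientNetV2-S output channels
        self.k = k
        # 2) Graph attention over patch nodes (one GAT layer for brevity).
        self.gat = GATConv(embed_dim, embed_dim, heads=4, concat=False)
        # 3) Transformer encoder for long-range dependencies among refined nodes.
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(embed_dim, num_classes)

    def knn_edges(self, x):
        # Connect each patch node to its k nearest neighbors in feature space;
        # the paper uses spatial and feature similarities, this sketch uses feature distance only.
        d = torch.cdist(x, x)                                    # (N, N) pairwise distances
        idx = d.topk(self.k + 1, largest=False).indices[:, 1:]   # k nearest, skipping self
        src = torch.arange(x.size(0), device=x.device).repeat_interleave(self.k)
        return torch.stack([src, idx.reshape(-1)])               # (2, N*k) edge index

    def forward(self, img):                             # img: (1, 3, H, W); batch of 1 for simplicity
        fmap = self.proj(self.backbone(img))            # (1, D, h, w) projected multi-scale features
        nodes = fmap.flatten(2).transpose(1, 2)[0]      # (h*w, D): one node per feature-map patch
        nodes = self.gat(nodes, self.knn_edges(nodes))  # GAT-refined node embeddings
        tokens = self.encoder(nodes.unsqueeze(0))       # Transformer over the refined graph nodes
        return self.head(tokens.mean(dim=1))            # mean-pool and classify

# Shape check: logits = SAGViTSketch()(torch.randn(1, 3, 224, 224)) -> (1, 10)

Under these assumptions, the design point the abstract argues for is visible in forward(): the graph is built over backbone feature-map patches rather than raw image patches, so each node already carries multi-scale semantics before graph attention and the Transformer encoder are applied.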