{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,4,23]],"date-time":"2025-04-23T06:31:13Z","timestamp":1745389873307,"version":"3.37.3"},"reference-count":24,"publisher":"Springer Science and Business Media LLC","issue":"7","license":[{"start":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T00:00:00Z","timestamp":1715126400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T00:00:00Z","timestamp":1715126400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100010269","name":"Wellcome Trust","doi-asserted-by":"publisher","award":["EPSRC[WT203148\/Z\/16\/Z; NS\/A000049\/1]"],"award-info":[{"award-number":["EPSRC[WT203148\/Z\/16\/Z; NS\/A000049\/1]"]}],"id":[{"id":"10.13039\/100010269","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100013915","name":"EPSRC Centre for Doctoral Training in Medical Imaging","doi-asserted-by":"publisher","award":["[EP\/S022104\/1]"],"award-info":[{"award-number":["[EP\/S022104\/1]"]}],"id":[{"id":"10.13039\/501100013915","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J CARS"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Purpose<\/jats:title>\n                <jats:p>In surgical image segmentation, a major challenge is the extensive time and resources required to gather large-scale annotated datasets. Given the scarcity of annotated data in this field, our work aims to develop a model that achieves competitive performance with training on limited datasets, while also enhancing model robustness in various surgical scenarios.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Methods<\/jats:title>\n                <jats:p>We propose a method that harnesses the strengths of pre-trained Vision Transformers (ViTs) and data efficiency of convolutional neural networks (CNNs). Specifically, we demonstrate how a CNN segmentation model can be used as a lightweight adapter for a frozen ViT feature encoder. Our novel feature adapter uses cross-attention modules that merge the multiscale features derived from the CNN encoder with feature embeddings from ViT, ensuring integration of the global insights from ViT along with local information from CNN.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>Extensive experiments demonstrate our method outperforms current models in surgical instrument segmentation. Specifically, it achieves superior performance in binary segmentation on the Robust-MIS 2019 dataset, as well as in multiclass segmentation tasks on the EndoVis 2017 and EndoVis 2018 datasets. It also showcases remarkable robustness through cross-dataset validation across these 3 datasets, along with the CholecSeg8k and AutoLaparo datasets. Ablation studies based on the datasets prove the efficacy of our novel adapter module.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusion<\/jats:title>\n                <jats:p>In this study, we presented a novel approach integrating ViT and CNN. Our unique feature adapter successfully combines the global insights of ViT with the local, multi-scale spatial capabilities of CNN. 
This integration effectively overcomes data limitations in surgical instrument segmentation. The source code is available at: <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/weimengmeng1999\/AdapterSIS.git\">https:\/\/github.com\/weimengmeng1999\/AdapterSIS.git<\/jats:ext-link>.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1007\/s11548-024-03140-z","type":"journal-article","created":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T06:02:29Z","timestamp":1715148149000},"page":"1313-1320","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Enhancing surgical instrument segmentation: integrating vision transformer insights with adapter"],"prefix":"10.1007","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-8167-7374","authenticated-orcid":false,"given":"Meng","family":"Wei","sequence":"first","affiliation":[]},{"given":"Miaojing","family":"Shi","sequence":"additional","affiliation":[]},{"given":"Tom","family":"Vercauteren","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,5,8]]},"reference":[{"key":"3140_CR1","doi-asserted-by":"publisher","unstructured":"Ross T, Reinke A, Full PM, Wagner M, Kenngott H, Apitz M, Hempe H, Filimon DM, Scholz P, Tran TN (2020) Robust medical instrument segmentation challenge 2019. arXiv preprint arXiv:2003.10299https:\/\/doi.org\/10.48550\/arXiv.2003.10299","DOI":"10.48550\/arXiv.2003.10299"},{"key":"3140_CR2","doi-asserted-by":"publisher","unstructured":"Isensee F, Maier-Hein K (2020) Or-unet: an optimized robust residual u-net for instrument segmentation in endoscopic images. arXiv:2004.12668https:\/\/doi.org\/10.48550\/arXiv.2004.12668","DOI":"10.48550\/arXiv.2004.12668"},{"key":"3140_CR3","doi-asserted-by":"crossref","unstructured":"Gonz\u00e1lez C, Bravo-S\u00e1nchez L, Arbelaez P (2020) Isinet: an instance-based approach for surgical instrument segmentation. In: MICCAI, pp. 595\u2013605. Springer","DOI":"10.1007\/978-3-030-59716-0_57"},{"key":"3140_CR4","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S (2021) An image is worth 16x16 words: transformers for image recognition at scale. ICLR"},{"key":"3140_CR5","doi-asserted-by":"crossref","unstructured":"Caron M, Touvron H, Misra I, J\u00e9gou H, Mairal J, Bojanowski P, Joulin A (2021) Emerging properties in self-supervised vision transformers. In: ICCV, pp. 9650\u20139660","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"3140_CR6","unstructured":"Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V, Fernandez P, Haziza D, Massa F, El-Nouby A (2024) Dinov2: learning robust visual features without supervision. TMLR"},{"key":"3140_CR7","first-page":"17864","volume":"34","author":"B Cheng","year":"2021","unstructured":"Cheng B, Schwing A, Kirillov A (2021) Per-pixel classification is not all you need for semantic segmentation. NeurIPS 34:17864\u201317875","journal-title":"NeurIPS"},{"key":"3140_CR8","doi-asserted-by":"crossref","unstructured":"Liu Z, Hu H, Lin Y, Yao Z, Xie Z, Wei Y, Ning J, Cao Y, Zhang Z, Dong L, Wei F, Guo B (2022) Swin transformer v2: Scaling up capacity and resolution. In: CVPR, pp. 
12009\u201312019","DOI":"10.1109\/CVPR52688.2022.01170"},{"key":"3140_CR9","doi-asserted-by":"crossref","unstructured":"Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R (2022) Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1290\u20131299","DOI":"10.1109\/CVPR52688.2022.00135"},{"key":"3140_CR10","unstructured":"Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y (2021) Transunet: transformers make strong encoders for medical image segmentation. ICML workshop"},{"key":"3140_CR11","doi-asserted-by":"crossref","unstructured":"Zhang, Y, Liu H, Hu Q (2021) Transfuse: fusing transformers and CNNS for medical image segmentation. In: MICCAI, pp. 14\u201324 Springer","DOI":"10.1007\/978-3-030-87193-2_2"},{"key":"3140_CR12","doi-asserted-by":"crossref","unstructured":"Gao Y, Zhou M, Metaxas DN (2021) Utnet: a hybrid transformer architecture for medical image segmentation. In: MICCAI, pp. 61\u201371 Springer","DOI":"10.1007\/978-3-030-87199-4_6"},{"key":"3140_CR13","doi-asserted-by":"publisher","first-page":"109228","DOI":"10.1016\/j.patcog.2022.109228","volume":"136","author":"F Yuan","year":"2023","unstructured":"Yuan F, Zhang Z, Fang Z (2023) An effective CNN and transformer complementary network for medical image segmentation. Pattern Recognit 136:109228","journal-title":"Pattern Recognit"},{"key":"3140_CR14","doi-asserted-by":"crossref","unstructured":"Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: MICCAI, pp. 234\u2013241 Springer","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"3140_CR15","doi-asserted-by":"crossref","unstructured":"Ayobi N, P\u00e9rez-Rond\u00f3n A, Rodr\u00edguez S, Arbel\u00e1ez P (2023) Matis: masked-attention transformers for surgical instrument segmentation. ISBI, pp. 1\u20135","DOI":"10.1109\/ISBI53787.2023.10230819"},{"key":"3140_CR16","doi-asserted-by":"crossref","unstructured":"Zhao Z, Jin Y, Heng P-A (2022) Trasetr: track-to-segment transformer with contrastive query for instance-level instrument segmentation in robotic surgery. In: ICRA, pp. 11186\u201311193 IEEE","DOI":"10.1109\/ICRA46639.2022.9811873"},{"key":"3140_CR17","doi-asserted-by":"crossref","unstructured":"Gheini M, Ren X, May J (2021) Cross-attention is all you need: Adapting pretrained transformers for machine translation. In: EMNLP, pp. 1754\u20131765 ACL","DOI":"10.18653\/v1\/2021.emnlp-main.132"},{"key":"3140_CR18","doi-asserted-by":"crossref","unstructured":"Liu M, Yin H (2019) Cross attention network for semantic segmentation. In: ICIP, pp. 2434\u20132438. IEEE","DOI":"10.1109\/ICIP.2019.8803320"},{"key":"3140_CR19","doi-asserted-by":"publisher","unstructured":"Allan M, Shvets A, Kurmann T, Zhang Z, Duggal R, Su Y-H, Rieke N, Laina I, Kalavakonda N, Bodenstedt S (2019) 2017 robotic instrument segmentation challenge. arXiv preprint arXiv:1902.06426https:\/\/doi.org\/10.48550\/arXiv.1902.06426","DOI":"10.48550\/arXiv.1902.06426"},{"key":"3140_CR20","doi-asserted-by":"publisher","unstructured":"Allan M, Kondo S, Bodenstedt S, Leger S, Kadkhodamohammadi R, Luengo I, Fuentes F, Flouty E, Mohammed A, Pedersen M (2020) 2018 robotic scene segmentation challenge. 
arXiv preprint arXiv:2001.11190https:\/\/doi.org\/10.48550\/arXiv.2001.11190","DOI":"10.48550\/arXiv.2001.11190"},{"key":"3140_CR21","doi-asserted-by":"publisher","unstructured":"Hong W-Y, Kao C-L, Kuo Y-H, Wang J-R, Chang W-L, Shih C-S (2020) Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80. arXiv preprint arXiv:2012.12453https:\/\/doi.org\/10.48550\/arXiv.2012.12453","DOI":"10.48550\/arXiv.2012.12453"},{"key":"3140_CR22","doi-asserted-by":"crossref","unstructured":"Wang Z, Lu B, Long Y, Zhong F, Cheung T-H, Dou Q, Liu Y (2022) Autolaparo: A new dataset of integrated multi-tasks for image-guided surgical automation in laparoscopic hysterectomy. In: MICCAI, pp. 486\u2013496. Springer","DOI":"10.1007\/978-3-031-16449-1_46"},{"key":"3140_CR23","doi-asserted-by":"crossref","unstructured":"Baby B, Thapar D, Chasmai M, Banerjee T, Dargan K, Suri A, Banerjee S, Arora C (2023) From forks to forceps: a new framework for instance segmentation of surgical instruments. In: WACV, pp. 6191\u20136201","DOI":"10.1109\/WACV56688.2023.00613"},{"issue":"2","key":"3140_CR24","doi-asserted-by":"publisher","first-page":"3858","DOI":"10.1109\/LRA.2022.3146544","volume":"7","author":"L Seenivasan","year":"2022","unstructured":"Seenivasan L, Mitheran S, Islam M, Ren H (2022) Global-reasoned multi-task learning model for surgical scene understanding. IEEE Robot Autom Lett 7(2):3858\u20133865","journal-title":"IEEE Robot Autom Lett"}],"container-title":["International Journal of Computer Assisted Radiology and Surgery"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11548-024-03140-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11548-024-03140-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11548-024-03140-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,8]],"date-time":"2024-07-08T17:14:39Z","timestamp":1720458879000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11548-024-03140-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,8]]},"references-count":24,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2024,7]]}},"alternative-id":["3140"],"URL":"https:\/\/doi.org\/10.1007\/s11548-024-03140-z","relation":{},"ISSN":["1861-6429"],"issn-type":[{"type":"electronic","value":"1861-6429"}],"subject":[],"published":{"date-parts":[[2024,5,8]]},"assertion":[{"value":"4 March 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 April 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 May 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"TV is a co-founder and shareholder of Hypervision Surgical. The authors declare that they have no other Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"This article only uses publicly available datasets. 
Their re-use did not require any ethical approval.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}}]}}
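Note: the Methods summary in the abstract above describes an architecture concrete enough to illustrate. The following is a minimal, hypothetical PyTorch sketch of the cross-attention feature-adapter idea it outlines (CNN multiscale features as queries, frozen ViT patch embeddings as keys/values). All class, parameter, and tensor names here are illustrative assumptions, not the authors' implementation; their actual code is in the linked AdapterSIS repository.

# Illustrative sketch only (not the authors' code): fuse a CNN feature map
# with frozen ViT patch tokens via cross-attention, as the abstract describes.
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Queries from CNN features; keys/values from frozen ViT tokens."""

    def __init__(self, cnn_dim: int, vit_dim: int, num_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(cnn_dim, vit_dim)  # align CNN channels to ViT width
        self.attn = nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vit_dim)

    def forward(self, cnn_feat: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # cnn_feat: (B, C, H, W) feature map from one scale of the CNN encoder
        # vit_tokens: (B, N, D) patch embeddings from the frozen ViT encoder
        b, c, h, w = cnn_feat.shape
        queries = self.proj(cnn_feat.flatten(2).transpose(1, 2))  # (B, H*W, D)
        fused, _ = self.attn(queries, vit_tokens, vit_tokens)     # cross-attention
        fused = self.norm(fused + queries)                        # residual + norm
        return fused.transpose(1, 2).reshape(b, -1, h, w)         # back to a map

if __name__ == "__main__":
    # Toy shapes: 256-channel CNN features at 16x16, 196 ViT tokens of width 768.
    adapter = CrossAttentionAdapter(cnn_dim=256, vit_dim=768)
    cnn_feat = torch.randn(2, 256, 16, 16)
    vit_tokens = torch.randn(2, 196, 768)
    print(adapter(cnn_feat, vit_tokens).shape)  # torch.Size([2, 768, 16, 16])

In a full model, one such adapter per CNN scale would feed a segmentation decoder while the ViT weights stay frozen, which is what keeps the trainable parameter count low on small datasets.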