{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,22]],"date-time":"2025-12-22T12:43:54Z","timestamp":1766407434824,"version":"3.48.0"},"reference-count":24,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T00:00:00Z","timestamp":1761523200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T00:00:00Z","timestamp":1761523200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Hum-Cent Intell Syst"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Recent advancements in deep learning have greatly improved computer vision tasks like object detection, image classification, and segmentation. Despite these successes, traditional supervised learning methods still depend on large annotated datasets, which are often expensive and time consuming to create. To overcome this limitation, we present a zero-shot segmentation framework that combines the strengths of CLIP (Contrastive Language-Image Pretraining), its segmentation-focused variant CLIPSeg, and the Segment Anything Model (SAM). This approach first uses the zero-shot classification ability of CLIP or CLIPSeg to produce initial segmentation cues. These cues, such as point and box prompts, are then refined by SAM to generate accurate segmentation masks. Using this prompt-based strategy, the system can perform segmentation without requiring labeled data, making it suitable for a wide range of domains, including both natural scenes and medical imaging. Our experiments on benchmarks such as MS-COCO, Pascal VOC, and chest X-ray datasets highlight the effectiveness of the method. In particular, the CLIPSeg+SAM combination achieves a mean IoU of 0.793 and a Dice score of 0.873 in the chest X-ray dataset, outperforming both CLIPSeg and SAM when used alone. Visual results also show that this method produces clearer and more precise mask boundaries, even in challenging or cluttered environments. In summary, the proposed training-free framework offers a scalable and generalizable solution for zero-shot segmentation, significantly reducing the reliance on annotated datasets while delivering strong performance on unseen classes.<\/jats:p>","DOI":"10.1007\/s44230-025-00115-4","type":"journal-article","created":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T16:16:49Z","timestamp":1761581809000},"page":"431-449","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Prompt-SAM: A Vision-Language and SAM based Hybrid Framework for Prompt-Augmented Zero-Shot Segmentation"],"prefix":"10.1007","volume":"5","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8816-6333","authenticated-orcid":false,"given":"Uma","family":"Gurav","sequence":"first","affiliation":[]},{"given":"Sanket","family":"Jadhav","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"115_CR1","doi-asserted-by":"publisher","unstructured":"Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). 
Lecture Notes in Computer Science, vol. 9351, pp. 234\u2013241. Springer, Cham; 2015. https:\/\/doi.org\/10.1007\/978-3-319-24574-4_28.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"115_CR2","doi-asserted-by":"publisher","unstructured":"Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. Unet++: A nested u-net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support \u2013 4th International Workshop, DLMIA 2018 and 8th International Workshop, ML-CDS 2018 (MICCAI). Lecture Notes in Computer Science, vol. 11045, pp. 3\u201311. Springer (2018). https:\/\/doi.org\/10.1007\/978-3-030-00889-5_1","DOI":"10.1007\/978-3-030-00889-5_1"},{"issue":"4","key":"115_CR3","doi-asserted-by":"publisher","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","volume":"40","author":"L-C Chen","year":"2018","unstructured":"Chen L-C, Papandreou G, Kokkinos I, Murphy KP, Yuille AL. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell. 2018;40(4):834\u201348. https:\/\/doi.org\/10.1109\/TPAMI.2017.2699184.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"12","key":"115_CR4","doi-asserted-by":"publisher","first-page":"2481","DOI":"10.1109\/TPAMI.2016.2644615","volume":"39","author":"V Badrinarayanan","year":"2017","unstructured":"Badrinarayanan V, Kendall A, Cipolla R. Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39(12):2481\u201395. https:\/\/doi.org\/10.1109\/TPAMI.2016.2644615.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"115_CR5","doi-asserted-by":"publisher","unstructured":"Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y, Doll\u00e1r P, Girshick R. Segment anything. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), 2023:4015\u20134026. https:\/\/doi.org\/10.1109\/ICCV51070.2023.00371. https:\/\/openaccess.thecvf.com\/content\/ICCV2023\/html\/Kirillov_Segment_Anything_ICCV_2023_paper.html","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"115_CR6","doi-asserted-by":"publisher","unstructured":"He K, Chen X, Xie S, Li Y, Doll\u00e1r P, Girshick R. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022:15979\u201315988. https:\/\/doi.org\/10.1109\/CVPR52688.2022.01553. https:\/\/openaccess.thecvf.com\/content\/CVPR2022\/html\/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.html","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"115_CR7","unstructured":"Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML); 2021. OpenAI (CLIP), arXiv preprint arXiv:2103.00020."},{"key":"115_CR8","doi-asserted-by":"publisher","unstructured":"L\u00fcddecke T, Ecker AS. Image segmentation using text and image prompts. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022:7076\u20137086. https:\/\/doi.org\/10.1109\/CVPR52688.2022.00695 . 
https:\/\/openaccess.thecvf.com\/content\/CVPR2022\/papers\/Luddecke_Image_Segmentation_Using_Text_and_Image_Prompts_CVPR_2022_paper.pdf","DOI":"10.1109\/CVPR52688.2022.00695"},{"key":"115_CR9","doi-asserted-by":"publisher","unstructured":"Aleem S, Wang F, Maniparambil M, Arazo E, Dietlmeier J, Curran K, O\u2019Connor NE, Little S. Test-time adaptation with salip: A cascade of sam and clip for zero-shot medical image segmentation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024:5184\u20135193. https:\/\/doi.org\/10.1109\/CVPRW63382.2024.00526 . https:\/\/openaccess.thecvf.com\/content\/CVPR2024W\/DEF-AI-MIA\/html\/Aleem_Test-Time_Adaptation_with_SaLIP_A_Cascade_of_SAM_and_CLIP_CVPRW_2024_paper.html","DOI":"10.1109\/CVPRW63382.2024.00526"},{"key":"115_CR10","doi-asserted-by":"publisher","unstructured":"Zhou Z, Lei Y, Zhang B, Liu L, Liu Y. Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023:11175\u201311185. https:\/\/doi.org\/10.1109\/CVPR52729.2023.01075","DOI":"10.1109\/CVPR52729.2023.01075"},{"key":"115_CR11","doi-asserted-by":"publisher","unstructured":"Xu J, De\u00a0Mello S, Liu S, Byeon W, Breuel T, Kautz J, Wang X. Groupvit: Semantic segmentation emerges from text supervision. In: 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022:18113\u201318123. https:\/\/doi.org\/10.1109\/CVPR52688.2022.01760","DOI":"10.1109\/CVPR52688.2022.01760"},{"key":"115_CR12","doi-asserted-by":"publisher","unstructured":"Dong X, Bao J, Zheng Y, Zhang T, Chen D, Yang H, Zeng M, Zhang W, Yuan L, Chen D, Wen F, Yu N. MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining . In: 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10995\u201311005. IEEE Computer Society, Los Alamitos, CA, USA; 2023. https:\/\/doi.org\/10.1109\/CVPR52729.2023.01058.","DOI":"10.1109\/CVPR52729.2023.01058"},{"key":"115_CR13","unstructured":"Zou X, Yang J, Zhang H, Li F, Li L, Wang J, Wang L, Gao J, Lee YJ. Segment everything everywhere all at once. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS \u201923. Curran Associates Inc., Red Hook, NY, USA; 2023. https:\/\/papers.nips.cc\/paper\/2023\/hash\/3ef61f7e4afacf9a2c5b71c726172b86-Abstract-Conference.html"},{"key":"115_CR14","doi-asserted-by":"publisher","unstructured":"Ghiasi G, Gu X, Cui Y, Lin T-Y. Scaling open-vocabulary image segmentation with image-level labels. In: Avidan, S., Brostow, G., Ciss\u00e9, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision \u2013 ECCV 2022, pp. 540\u2013557. Springer, Cham 2022. https:\/\/doi.org\/10.1007\/978-3-031-20059-5_31","DOI":"10.1007\/978-3-031-20059-5_31"},{"key":"115_CR15","doi-asserted-by":"publisher","unstructured":"Rao Y, Zhao W, Chen G, Tang Y, Zheng Z, Huang G, Zhou J, Lu J. Denseclip: Language-guided dense prediction with context-aware prompting, 2022:18061\u201318070.https:\/\/doi.org\/10.1109\/CVPR52688.2022.01755","DOI":"10.1109\/CVPR52688.2022.01755"},{"key":"115_CR16","unstructured":"Bucher M, Vu T-H, Cord M, Perez P. Zero-shot semantic segmentation. In: Advances in Neural Information Processing Systems (NeurIPS); 2019. 
https:\/\/papers.neurips.cc\/paper\/2019\/hash\/0266e33d3f546cb5436a10798e657d97-Abstract.html"},{"key":"115_CR17","doi-asserted-by":"publisher","unstructured":"Caron M, Touvron H, Misra I, J\u00e9gou H, Mairal J, Bojanowski P, Joulin A. Emerging properties in self-supervised vision transformers, 2021:9630\u20139640. https:\/\/doi.org\/10.1109\/ICCV48922.2021.00951","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"115_CR18","doi-asserted-by":"publisher","unstructured":"Sim\u00e9oni O, Sekkat C, Puy G, Vobeck\u00fd A, Zablocki \u00c9, P\u00e9rez P. Unsupervised object localization: Observing the background to discover objects, 2023:3176\u20133186. https:\/\/doi.org\/10.1109\/CVPR52729.2023.00310","DOI":"10.1109\/CVPR52729.2023.00310"},{"key":"115_CR19","doi-asserted-by":"publisher","unstructured":"Wysocza\u0144ska M, Sim\u00e9oni O, Ramamonjisoa M, Bursuc A, Trzci\u0144ski T, Perez P. Clip-dinoiser: Teaching clip a few dino tricks for open-vocabulary semantic segmentation. In: Computer Vision \u2013 ECCV 2024, Lecture Notes in Computer Science, 2024:320\u2013337. https:\/\/doi.org\/10.1007\/978-3-031-73030-6_18 . Also available as arXiv preprint arXiv:2312.12359.","DOI":"10.1007\/978-3-031-73030-6_18"},{"key":"115_CR20","doi-asserted-by":"publisher","unstructured":"Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Doll\u00e1r P, Zitnick CL. Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 740\u2013755. Springer 2014.https:\/\/doi.org\/10.1007\/978-3-319-10602-1_48","DOI":"10.1007\/978-3-319-10602-1_48"},{"issue":"2","key":"115_CR21","doi-asserted-by":"publisher","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","volume":"88","author":"M Everingham","year":"2010","unstructured":"Everingham M, Gool LV, Williams CKI, Winn J, Zisserman A. The pascal visual object classes (voc) challenge. Int J Comput Vision. 2010;88(2):303\u201338. https:\/\/doi.org\/10.1007\/s11263-009-0275-4.","journal-title":"Int J Comput Vision"},{"key":"115_CR22","doi-asserted-by":"publisher","unstructured":"Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, et al. The cityscapes dataset for semantic urban scene understanding. 2016. https:\/\/doi.org\/10.1109\/CVPR.2016.350.","DOI":"10.1109\/CVPR.2016.350"},{"issue":"3\u20134","key":"115_CR23","doi-asserted-by":"publisher","first-page":"302","DOI":"10.1007\/s11263-018-1140-0","volume":"127","author":"B Zhou","year":"2019","unstructured":"Zhou B, Zhao H, Puig X, Xiao T, Fidler S, Barriuso A, et al. Semantic understanding of scenes through the ade20k dataset. Int J Comput Vision. 2019;127(3\u20134):302\u201321. https:\/\/doi.org\/10.1007\/s11263-018-1140-0.","journal-title":"Int J Comput Vision"},{"key":"115_CR24","doi-asserted-by":"publisher","unstructured":"Stojnic V, Kalantidis Y, Matas J, Tolias G. Lposs: Label propagation over patches and pixels for open-vocabulary semantic segmentation, 2025:9794\u20139803. 
https:\/\/doi.org\/10.1109\/CVPR52734.2025.00915","DOI":"10.1109\/CVPR52734.2025.00915"}],"container-title":["Human-Centric Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44230-025-00115-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44230-025-00115-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44230-025-00115-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,17]],"date-time":"2025-12-17T09:08:53Z","timestamp":1765962533000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44230-025-00115-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":24,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["115"],"URL":"https:\/\/doi.org\/10.1007\/s44230-025-00115-4","relation":{},"ISSN":["2667-1336"],"issn-type":[{"type":"electronic","value":"2667-1336"}],"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"28 March 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 September 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 October 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 October 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of Interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics Approval and Consent to Participate"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for Publication"}}]}}
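
Addendum: the abstract above describes a two-stage, training-free pipeline: CLIPSeg produces a text-conditioned relevance heatmap, point and box prompts are derived from that heatmap, and SAM refines them into a final mask, scored with mean IoU and Dice. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes the publicly released Hugging Face CLIPSeg checkpoint "CIDAS/clipseg-rd64-refined", Meta's segment-anything package with a ViT-B checkpoint at a hypothetical local path "sam_vit_b.pth", and a simple argmax-point / thresholded-box heuristic for deriving prompts; the record does not specify the paper's exact prompt-derivation rule.

import numpy as np
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from segment_anything import sam_model_registry, SamPredictor


def clipseg_heatmap(image: Image.Image, prompt: str) -> np.ndarray:
    """Zero-shot relevance map for `prompt`, resized to the image size."""
    processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
    model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze()  # low-res (352, 352) map
    heat = torch.sigmoid(logits).cpu().numpy().astype(np.float32)
    # Upsample the low-resolution map back to the original image size.
    return np.array(Image.fromarray(heat).resize(image.size, Image.BILINEAR))


def prompts_from_heatmap(heat: np.ndarray, thresh: float = 0.5):
    """Derive one foreground point (argmax) and a box around the
    above-threshold region -- an illustrative heuristic, not the paper's."""
    y, x = np.unravel_index(np.argmax(heat), heat.shape)
    point = np.array([[x, y]])              # SAM expects (x, y) coordinates
    ys, xs = np.where(heat >= thresh)
    if len(xs) == 0:                        # degenerate map: box at the point
        xs, ys = np.array([x]), np.array([y])
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])  # XYXY
    return point, box


def segment(image: Image.Image, prompt: str) -> np.ndarray:
    """CLIPSeg cues -> point/box prompts -> SAM-refined binary mask."""
    heat = clipseg_heatmap(image, prompt)
    point, box = prompts_from_heatmap(heat)
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # assumed path
    predictor = SamPredictor(sam)
    predictor.set_image(np.array(image.convert("RGB")))
    masks, _, _ = predictor.predict(
        point_coords=point,
        point_labels=np.array([1]),         # 1 marks a foreground point
        box=box,
        multimask_output=False,
    )
    return masks[0]                         # boolean (H, W) mask from SAM


def iou_and_dice(pred: np.ndarray, gt: np.ndarray):
    """Standard overlap metrics matching those reported in the abstract."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = inter / union if union else 1.0
    dice = 2.0 * inter / total if total else 1.0
    return iou, dice

Usage would look like `mask = segment(Image.open("chest_xray.png"), "lungs")`, after which `iou_and_dice(mask, gt_mask)` reproduces the kind of mIoU/Dice evaluation the abstract reports (0.793 / 0.873 for CLIPSeg+SAM on the chest X-ray benchmark). The heatmap threshold and the single-point prompt are placeholders; a real evaluation would tune how cues become prompts per dataset.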