{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,1]],"date-time":"2026-01-01T10:04:54Z","timestamp":1767261894918,"version":"build-2065373602"},"reference-count":66,"publisher":"Association for Computing Machinery (ACM)","issue":"11","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62206166 and 62302287"],"award-info":[{"award-number":["62206166 and 62302287"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Shanghai Sailing Program","award":["23YF1413000"],"award-info":[{"award-number":["23YF1413000"]}]},{"name":"Shanghai Committee of Science and Technology, Shanghai","award":["23ZR1423500"],"award-info":[{"award-number":["23ZR1423500"]}]},{"name":"Shanghai Pujiang Program","award":["22PJ1403800"],"award-info":[{"award-number":["22PJ1403800"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,11,30]]},"abstract":"<jats:p>\n                    Generalized Few-shot Segmentation (GFSS) aims to segment both base and novel classes in a query image, conditioning on richly annotated data of base classes and limited exemplars from novel classes. The learning of novel classes undoubtedly faces a disadvantage in this competition due to the highly unbalanced data, which skews the learned feature space toward the base classes. In this article, we present an innovative idea termed as \u201clearning from orthogonal space\u201d to avoid the conflict in the process of learning novel classes. Specifically, we first utilize textual modal information from labels to provide more distinguishable initial prototypes for different categories, ensuring that the prototypes for base and novel classes have distinct initial separations. 
Then, a simple but effective Feature Separating Module (FSM) is introduced to enhance the model\u2019s ability to differentiate between base and novel classes through learning the novel features from orthogonal space. In addition, we propose a Trigger-Promoting Framework (TPF) during the testing stage to further boost performance. The prediction results from the FSM serve as a multimodal prompt to leverage information residing in large models, such as CLIP and SAM, to enhance performance. Comprehensive experiments on two benchmarks demonstrate that our method achieves superior performance on novel classes without sacrificing accuracy on base classes. Notably, our Feature Separating with Trigger-promoting Network (FS-TPNet) outperforms the current state-of-the-art method by <jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(12.8\\%\\)<\/jats:tex-math><\/jats:inline-formula> overall IoU on novel classes on PASCAL-<jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(5^{i}\\)<\/jats:tex-math><\/jats:inline-formula> under the 1-shot scenario. 
Our codes will be available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/returnZXJ\/FS-TPNet\">https:\/\/github.com\/returnZXJ\/FS-TPNet<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1145\/3712597","type":"journal-article","created":{"date-parts":[[2025,1,20]],"date-time":"2025-01-20T11:25:56Z","timestamp":1737372356000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Learning from Orthogonal Space with Multimodal Large Models for Generalized Few-shot Segmentation"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-6819-9171","authenticated-orcid":false,"given":"Xiaojie","family":"Zhou","sequence":"first","affiliation":[{"name":"Shanghai University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3444-9992","authenticated-orcid":false,"given":"Hang","family":"Yu","sequence":"additional","affiliation":[{"name":"Shanghai University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-7174-5220","authenticated-orcid":false,"given":"Shengjie","family":"Yang","sequence":"additional","affiliation":[{"name":"Shanghai University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8504-455X","authenticated-orcid":false,"given":"Jing","family":"Huo","sequence":"additional","affiliation":[{"name":"Nanjing University, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5472-2469","authenticated-orcid":false,"given":"Pinzhuo","family":"Tian","sequence":"additional","affiliation":[{"name":"Shanghai University, Shanghai, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,11,7]]},"reference":[{"key":"e_1_3_1_2_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Antoniou Antreas","year":"2018","unstructured":"Antreas Antoniou, Harrison Edwards, and Amos Storkey. 2018. How to train your MAML. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3643850"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2699184"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3514250"},{"key":"e_1_3_1_6_2","first-page":"17864","volume-title":"Proceedings of the 34th Advances in Neural Information Processing Systems","author":"Cheng Bowen","year":"2021","unstructured":"Bowen Cheng, Alex Schwing, and Alexander Kirillov. 2021. Per-pixel classification is not all you need for semantic segmentation. In Proceedings of the 34th Advances in Neural Information Processing Systems, 17864\u201317875."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.350"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475389"},{"key":"e_1_3_1_9_2","first-page":"3","volume-title":"BMVC","author":"Dong Nanqing","year":"2018","unstructured":"Nanqing Dong and Eric P. Xing. 2018. Few-shot semantic segmentation with prototype learning. In BMVC, 3."},{"key":"e_1_3_1_10_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. 
Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-009-0275-4"},{"key":"e_1_3_1_12_2","first-page":"701","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Fan Qi","year":"2022","unstructured":"Qi Fan, Wenjie Pei, Yu-Wing Tai, and Chi-Keung Tang. 2022. Self-support few-shot semantic segmentation. In Proceedings of the European Conference on Computer Vision. Springer, 701\u2013719."},{"key":"e_1_3_1_13_2","first-page":"1126","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Finn Chelsea","year":"2017","unstructured":"Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning. PMLR, 1126\u20131135."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00459"},{"key":"e_1_3_1_15_2","first-page":"297","volume-title":"Proceedings of the 13th European Conference on Computer Vision (ECCV \u201914), Part VII","author":"Hariharan Bharath","year":"2014","unstructured":"Bharath Hariharan, Pablo Arbel\u00e1ez, Ross Girshick, and Jitendra Malik. 2014. Simultaneous detection and segmentation. In Proceedings of the 13th European Conference on Computer Vision (ECCV \u201914), Part VII. Springer, 297\u2013312."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_17_2","first-page":"12511","volume-title":"Proceedings of the 38th AAAI Conference on Artificial Intelligence","author":"Hu Jian","year":"2024","unstructured":"Jian Hu, Jiayi Lin, Shaogang Gong, and Weitong Cai. 2024. Relax image-specific prompt requirement in SAM: A single generic prompt for segmenting camouflaged objects. 
In Proceedings of the 38th AAAI Conference on Artificial Intelligence, 12511\u201312518."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3635153"},{"key":"e_1_3_1_20_2","first-page":"19256","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Huang Kai","year":"2023","unstructured":"Kai Huang, Feigege Wang, Ye Xi, and Yutao Gao. 2023. Prototypical kernel learning and open-set foreground perception for generalized few-shot semantic segmentation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 19256\u201319265."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.1611835114"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00789"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2022.3164083"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_1_26_2","unstructured":"Jinlu Liu and Yongqiang Qin. 2020. Prototype refinement network for few-shot segmentation. arXiv:2002.03579. Retrieved from https:\/\/arxiv.org\/abs\/2002.03579"},{"key":"e_1_3_1_27_2","first-page":"275","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Liu Quande","year":"2022","unstructured":"Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, and Xiaodan Liang. 2022. Open-world semantic segmentation via contrasting and clustering vision-language embedding. In Proceedings of the European Conference on Computer Vision. Springer, 275\u2013292."},{"key":"e_1_3_1_28_2","unstructured":"Shilong Liu Zhaoyang Zeng Tianhe Ren Feng Li Hao Zhang Jie Yang Qing Jiang Chunyuan Li Jianwei Yang Hang Su et al. 2023. 
Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv:2303.05499. Retrieved from https:\/\/arxiv.org\/abs\/2303.05499"},{"key":"e_1_3_1_29_2","first-page":"11319","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Liu Sun-Ao","year":"2023","unstructured":"Sun-Ao Liu, Yiheng Zhang, Zhaofan Qiu, Hongtao Xie, Yongdong Zhang, and Ting Yao. 2023. Learning orthogonal prototypes for generalized few-shot semantic segmentation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 11319\u201311328."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3650032"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3441577"},{"key":"e_1_3_1_32_2","first-page":"142","volume-title":"Proceedings of the 16th European Conference on Computer Vision (ECCV \u201920), Part IX","author":"Liu Yongfei","year":"2020","unstructured":"Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xuming He. 2020. Part-aware prototype network for few-shot semantic segmentation. In Proceedings of the 16th European Conference on Computer Vision (ECCV \u201920), Part IX. Springer, 142\u2013158."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00319"},{"key":"e_1_3_1_34_2","first-page":"475","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Long Fuchen","year":"2022","unstructured":"Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, and Tao Mei. 2022. Dynamic temporal filtering in video models. In Proceedings of the European Conference on Computer Vision. 
Springer, 475\u2013492."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3282070"},{"key":"e_1_3_1_36_2","first-page":"6941","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Min Juhong","year":"2021","unstructured":"Juhong Min, Dahyun Kang, and Minsu Cho. 2021. Hypercorrelation squeeze for few-shot segmentation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 6941\u20136952."},{"key":"e_1_3_1_37_2","unstructured":"Josh Myers-Dean Yinan Zhao Brian Price Scott Cohen and Danna Gurari. 2021. Generalized few-shot semantic segmentation: All you need is fine-tuning. arXiv:2112.10982. Retrieved from https:\/\/arxiv.org\/abs\/2112.10982"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00071"},{"key":"e_1_3_1_39_2","first-page":"1","volume-title":"ACM Transactions on Multimedia Computing, Communications, and Applications","author":"Punn Narinder Singh","year":"2020","unstructured":"Narinder Singh Punn and Sonali Agarwal. 2020. Inception u-net architecture for semantic segmentation to identify nuclei in microscopy cell images. ACM Transactions on Multimedia Computing, Communications, and Applications 16, 1 (2020), 1\u201315."},{"key":"e_1_3_1_40_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_1_41_2","unstructured":"Tianhe Ren Shilong Liu Ailing Zeng Jing Lin Kunchang Li He Cao Jiayu Chen Xinyu Huang Yukang Chen Feng Yan et al. 2024. 
Grounded sam: Assembling open-world models for diverse visual tasks. arXiv:2401.14159. Retrieved from https:\/\/arxiv.org\/abs\/2401.14159"},{"key":"e_1_3_1_42_2","doi-asserted-by":"crossref","unstructured":"Amirreza Shaban Shray Bansal Zhen Liu Irfan Essa and Byron Boots. 2017. One-shot learning for semantic segmentation. arXiv:1709.03410. Retrieved from https:\/\/arxiv.org\/abs\/1709.03410","DOI":"10.5244\/C.31.167"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611783"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3630257"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3555314"},{"key":"e_1_3_1_46_2","first-page":"12087","volume-title":"Proceedings of the 34th AAAI Conference on Artificial Intelligence","author":"Tian Pinzhuo","year":"2020","unstructured":"Pinzhuo Tian, Zhangkai Wu, Lei Qi, Lei Wang, Yinghuan Shi, and Yang Gao. 2020. Differentiable meta-learning model for few-shot semantic segmentation. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 12087\u201312094."},{"key":"e_1_3_1_47_2","first-page":"11563","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Tian Zhuotao","year":"2022","unstructured":"Zhuotao Tian, Xin Lai, Li Jiang, Shu Liu, Michelle Shu, Hengshuang Zhao, and Jiaya Jia. 2022. Generalized few-shot semantic segmentation. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 11563\u201311572."},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611816"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.3013717"},{"key":"e_1_3_1_50_2","first-page":"24261","article-title":"Mlp-mixer: An all-mlp architecture for vision","volume":"34","author":"Ilya O Tolstikhin","year":"2021","unstructured":"Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. 2021. Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems 34 (2021), 24261\u201324272.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_51_2","first-page":"3635","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Wang Haoxiang","year":"2024","unstructured":"Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, and Hadi Pouransari. 2024. Sam-clip: Merging vision foundation models towards semantic and spatial understanding. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 3635\u20133647."},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00929"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3572916"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1145\/3357233"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3548459"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00689"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3268446"},{"key":"e_1_3_1_58_2","first-page":"328","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Yao Ting","year":"2022","unstructured":"Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, and Tao Mei. 2022. Wave-vit: Unifying wavelet and transformers for visual representation learning. In Proceedings of the European Conference on Computer Vision. Springer, 328\u2013345."},{"issue":"6","key":"e_1_3_1_59_2","doi-asserted-by":"crossref","first-page":"1930","DOI":"10.1007\/s11263-020-01381-4","article-title":"Learning adaptive classifiers synthesis for generalized few-shot learning","volume":"129","author":"Han-Jia Ye","year":"2021","unstructured":"Han-Jia Ye, Hexiang Hu, and De-Chuan Zhan. 2021. Learning adaptive classifiers synthesis for generalized few-shot learning. 
International Journal of Computer Vision 129, 6 (2021), 1930\u20131953.","journal-title":"International Journal of Computer Vision"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/3321512"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547961"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00536"},{"key":"e_1_3_1_63_2","first-page":"21984","article-title":"Few-shot segmentation via cycle-consistent transformer","volume":"34","author":"Zhang Gengwei","year":"2021","unstructured":"Gengwei Zhang, Guoliang Kang, Yi Yang, and Yunchao Wei. 2021. Few-shot segmentation via cycle-consistent transformer. Advances in Neural Information Processing Systems 34 (2021), 21984\u201321996.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547835"},{"key":"e_1_3_1_65_2","unstructured":"Renrui Zhang Zhengkai Jiang Ziyu Guo Shilin Yan Junting Pan Xianzheng Ma Hao Dong Peng Gao and Hongsheng Li. 2023. Personalize segment anything model with one shot. arXiv:2305.03048. 
Retrieved from https:\/\/arxiv.org\/abs\/2305.03048"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.660"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475658"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3712597","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T15:10:19Z","timestamp":1762528219000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3712597"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,7]]},"references-count":66,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025,11,30]]}},"alternative-id":["10.1145\/3712597"],"URL":"https:\/\/doi.org\/10.1145\/3712597","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2025,11,7]]},"assertion":[{"value":"2024-06-18","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-01-05","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-07","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}