{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,27]],"date-time":"2026-01-27T22:13:03Z","timestamp":1769551983228,"version":"3.49.0"},"reference-count":67,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,6,27]],"date-time":"2024-06-27T00:00:00Z","timestamp":1719446400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,6,27]],"date-time":"2024-06-27T00:00:00Z","timestamp":1719446400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"NTU Presidential Postdoctoral Fellowship"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Vis. Intell."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In referring segmentation, modeling the complicated constraints in the multimodal information is one of the most challenging problems. As the information in a given language expression and image becomes increasingly abundant, most of the current one-stage methods that directly output the segmentation mask encounter difficulties in understanding the complicated relationships between the image and the expression. In this work, we propose a PrimitiveNet to decompose the difficult global constraints into a set of simple primitives. Each primitive produces a primitive mask that represents only simple semantic meanings, e.g., all instances from the same category. Then, the output segmentation mask is computed by selectively combining these primitives according to the language expression. Furthermore, we propose a cross-primitive attention (CPA) module and a language-primitive attention (LPA) module to exchange information among all primitives and the language expression, respectively. 
The proposed CPA and LPA help the network find appropriate weights for primitive masks, so as to recover the target object. Extensive experiments have proven the effectiveness of our design and verified that the proposed network outperforms current state-of-the-art referring segmentation methods on three RefCOCO datasets.<\/jats:p>","DOI":"10.1007\/s44267-024-00049-8","type":"journal-article","created":{"date-parts":[[2024,6,27]],"date-time":"2024-06-27T13:01:26Z","timestamp":1719493286000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":18,"title":["PrimitiveNet: decomposing the global constraints for referring segmentation"],"prefix":"10.1007","volume":"2","author":[{"given":"Chang","family":"Liu","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9104-2315","authenticated-orcid":false,"given":"Xudong","family":"Jiang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4868-6526","authenticated-orcid":false,"given":"Henghui","family":"Ding","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,6,27]]},"reference":[{"key":"49_CR1","first-page":"108","volume-title":"Proceedings of the 14th European conference of computer vision","author":"R. Hu","year":"2016","unstructured":"Hu, R., Rohrbach, M., & Darrell, T. (2016). Segmentation from natural language expressions. In B. Leibe, J. Matas, N. Sebe, et al. (Eds.), Proceedings of the 14th European conference of computer vision (pp. 108\u2013124). Cham: Springer."},{"key":"49_CR2","first-page":"3431","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"J. Long","year":"2015","unstructured":"Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431\u20133440). 
Piscataway: IEEE."},{"issue":"1","key":"49_CR3","doi-asserted-by":"publisher","first-page":"134","DOI":"10.1109\/TNNLS.2021.3090426","volume":"34","author":"Y. Zhou","year":"2023","unstructured":"Zhou, Y., Ji, R., Luo, G., Sun, X., Su, J., Ding, X., et al. (2023). A real-time global inference network for one-stage referring expression comprehension. IEEE Transactions on Neural Networks and Learning Systems, 34(1), 134\u2013143.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"49_CR4","doi-asserted-by":"publisher","first-page":"3689","DOI":"10.1109\/TMM.2023.3314153","volume":"26","author":"G. Luo","year":"2024","unstructured":"Luo, G., Zhou, Y., Sun, J., Sun, X., & Ji, R. (2024). A survivor in the era of large-scale pretraining: an empirical study of one-stage referring expression comprehension. IEEE Transactions on Multimedia, 26, 3689\u20133700.","journal-title":"IEEE Transactions on Multimedia"},{"key":"49_CR5","unstructured":"He, S., Ding, H., Liu, C., & Jiang, X. (2023). GREC: generalized referring expression comprehension. arXiv preprint. arXiv:2308.16182."},{"key":"49_CR6","first-page":"19144","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"J. Sun","year":"2023","unstructured":"Sun, J., Luo, G., Zhou, Y., Sun, X., Jiang, G., Wang, Z., et al. (2023). RefTeacher: a strong baseline for semi-supervised referring expression comprehension. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition 19144\u201319154. Piscataway: IEEE."},{"key":"49_CR7","first-page":"1280","volume-title":"Proceedings of the IEEE international conference on computer vision","author":"C. Liu","year":"2017","unstructured":"Liu, C., Lin, Z., Shen, X., Yang, J., Lu, X., & Yuille, A. (2017). Recurrent multimodal interaction for referring image segmentation. In Proceedings of the IEEE international conference on computer vision 1280\u20131289. 
Piscataway: IEEE."},{"key":"49_CR8","first-page":"5745","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"R. Li","year":"2018","unstructured":"Li, R., Li, K., Kuo, Y.-C., Shu, M., Qi, X., Shen, X., et al. (2018). Referring image segmentation via recurrent refinement networks. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 5745\u20135753). Piscataway: IEEE."},{"key":"49_CR9","first-page":"10502","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"L. Ye","year":"2019","unstructured":"Ye, L., Rochan, M., Liu, Z., & Wang, Y. (2019). Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 10502\u201310511). Piscataway: IEEE."},{"key":"49_CR10","first-page":"10031","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"G. Luo","year":"2020","unstructured":"Luo, G., Zhou, Y., Sun, X., Cao, L., Wu, C., Deng, C., et al. (2020). Multi-task collaborative network for joint referring expression comprehension and segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 10031\u201310040). Piscataway: IEEE."},{"key":"49_CR11","first-page":"1307","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"L. Yu","year":"2018","unstructured":"Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., et al. (2018). MAttNet: modular attention network for referring expression comprehension. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1307\u20131315). Piscataway: IEEE."},{"key":"49_CR12","doi-asserted-by":"publisher","first-page":"3657","DOI":"10.1109\/TMM.2022.3163578","volume":"25","author":"C. 
Liu","year":"2023","unstructured":"Liu, C., Jiang, X., & Ding, H. (2023). Instance-specific feature propagation for referring segmentation. IEEE Transactions on Multimedia, 25, 3657\u20133667.","journal-title":"IEEE Transactions on Multimedia"},{"key":"49_CR13","first-page":"5998","volume-title":"Proceedings of the 31st international conference on neural information processing systems","author":"A. Vaswani","year":"2017","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, et al. (Eds.), Proceedings of the 31st international conference on neural information processing systems (pp. 5998\u20136008). Red Hook: Curran Associates."},{"key":"49_CR14","first-page":"16301","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"H. Ding","year":"2021","unstructured":"Ding, H., Liu, C., Wang, S., & Jiang, X. (2021). Vision-language transformer and query generation for referring segmentation. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 16301\u201316310). Piscataway: IEEE."},{"issue":"6","key":"49_CR15","doi-asserted-by":"publisher","first-page":"7900","DOI":"10.1109\/TPAMI.2022.3217852","volume":"45","author":"H. Ding","year":"2023","unstructured":"Ding, H., Liu, C., Wang, S., & Jiang, X. (2023). VLT: vision-language transformer and query generation for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6), 7900\u20137916.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"49_CR16","first-page":"23592","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"C. Liu","year":"2023","unstructured":"Liu, C., Ding, H., & Jiang, X. (2023). GRES: generalized referring expression segmentation. 
In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 23592\u201323601). Piscataway: IEEE."},{"key":"49_CR17","first-page":"13128","volume-title":"Proceedings of the IEEE conference of computer vision and pattern recognition","author":"C. Liu","year":"2024","unstructured":"Liu, C., Li, X., & Ding, H. (2024). Referring image editing: object-level image editing via referring expressions. In Proceedings of the IEEE conference of computer vision and pattern recognition (pp. 13128\u201313138). Piscataway: IEEE."},{"key":"49_CR18","first-page":"18134","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Yang","year":"2022","unstructured":"Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., & Torr, P. H. S. (2022). LAVT: language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 18134\u201318144). Piscataway: IEEE."},{"key":"49_CR19","first-page":"15506","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"G. Feng","year":"2021","unstructured":"Feng, G., Hu, Z., Zhang, L., & Lu, H. (2021). Encoder fusion network with co-attention embedding for referring image segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 15506\u201315515). Piscataway: IEEE."},{"key":"49_CR20","doi-asserted-by":"publisher","first-page":"3054","DOI":"10.1109\/TIP.2023.3277791","volume":"32","author":"C. Liu","year":"2023","unstructured":"Liu, C., Ding, H., Zhang, Y., & Jiang, X. (2023). Multi-modal mutual attention and iterative interaction for referring image segmentation. 
IEEE Transactions on Image Processing, 32, 3054\u20133065.","journal-title":"IEEE Transactions on Image Processing"},{"key":"49_CR21","first-page":"11676","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Wang","year":"2022","unstructured":"Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., et al. (2022). CRIS: clip-driven referring image segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 11676\u201311685). Piscataway: IEEE."},{"key":"49_CR22","first-page":"19456","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"S. Yu","year":"2023","unstructured":"Yu, S., Seo, P. H., & Son, J. (2023). Zero-shot referring image segmentation with global-local context features. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 19456\u201319465). Piscataway: IEEE."},{"key":"49_CR23","first-page":"8748","volume-title":"Proceedings of the 38th international conference on machine learning","author":"A. Radford","year":"2021","unstructured":"Radford, A., Wook Kim, J., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learning transferable visual models from natural language supervision. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th international conference on machine learning (pp. 8748\u20138763). Retrieved June 1, 2024, from http:\/\/proceedings.mlr.press\/v139\/radford21a.html."},{"key":"49_CR24","unstructured":"Huang, M., Zhou, Y., Luo, G., Jiang, G., Zhuang, W., & Sun, X. (2023). Towards omni-supervised referring expression segmentation. arXiv preprint. arXiv:2311.00397."},{"key":"49_CR25","first-page":"598","volume-title":"Proceedings of the 17th European conference of computer vision","author":"C. Zhu","year":"2022","unstructured":"Zhu, C., Zhou, Y., Shen, Y., Luo, G., Pan, X., Lin, M., et al. (2022). 
SeqTR: a simple yet universal network for visual grounding. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, et al. (Eds.), Proceedings of the 17th European conference of computer vision (pp. 598\u2013615). Cham: Springer."},{"key":"49_CR26","first-page":"18124","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"N. Kim","year":"2022","unstructured":"Kim, N., Kim, D., Kwak, S., Lan, C., & Zeng, W. (2022). ReSTR: convolution-free referring image segmentation using transformers. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 18124\u201318133). Piscataway: IEEE."},{"key":"49_CR27","first-page":"18653","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"J. Liu","year":"2023","unstructured":"Liu, J., Ding, H., Cai, Z., Zhang, Y., Kumar Satzoda, R., Mahadevan, V., et al. (2023). Polyformer: referring image segmentation as sequential polygon generation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 18653\u201318663). Piscataway: IEEE."},{"key":"49_CR28","first-page":"23570","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"J. Tang","year":"2023","unstructured":"Tang, J., Zheng, G., Shi, C., & Yang, S. (2023). Contrastive grouping with transformer for referring image segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 23570\u201323580). Piscataway: IEEE."},{"key":"49_CR29","first-page":"19478","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"L. Xu","year":"2023","unstructured":"Xu, L., Huang, M. H., Shang, X., Yuan, Z., Sun, Y., & Liu, J. (2023). Meta compositional referring expression segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 19478\u201319487). 
Piscataway: IEEE."},{"key":"49_CR30","first-page":"3021","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"M. Qu","year":"2023","unstructured":"Qu, M., Wu, Y., Wei, Y., Liu, W., Liang, X., & Zhao, Y. (2023). Learning to segment every referring object point by point. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 3021\u20133030). Piscataway: IEEE."},{"key":"49_CR31","first-page":"17457","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Z. Xu","year":"2023","unstructured":"Xu, Z., Chen, Z., Zhang, Y., Song, Y., Wan, X., & Li, G. (2023). Bridging vision and language encoders: parameter-efficient tuning for referring image segmentation. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 17457\u201317466). Piscataway: IEEE."},{"key":"49_CR32","unstructured":"Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., et al. (2023). LISA: reasoning segmentation via large language model. arXiv preprint. arXiv:2308.00692."},{"key":"49_CR33","first-page":"1","volume-title":"Proceedings of the 31st international conference on neural information processing systems","author":"X. Zou","year":"2023","unstructured":"Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., et al. (2023). Segment everything everywhere all at once. In A. Oh, T. Naumann, A. Globerson, et al. (Eds.), Proceedings of the 31st international conference on neural information processing systems. (pp. 1\u201314). Red Hook: Curran Associates."},{"key":"49_CR34","first-page":"15116","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"X. Zou","year":"2023","unstructured":"Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., et al. (2023). Generalized decoding for pixel, image, and language. 
In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 15116\u201315127). Piscataway: IEEE."},{"key":"49_CR35","unstructured":"Rasheed, H. A., Maaz, M., Mullappilly, S. S., Shaker, A. M., Khan, S. H., Cholakkal, H., et al. (2023). GLaMM: pixel grounding large multimodal model. arXiv preprint. arXiv:2311.03356."},{"key":"49_CR36","first-page":"2393","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"H. Ding","year":"2018","unstructured":"Ding, H., Jiang, X., Shuai, B., Liu, A. Q., & Wang, G. (2018). Context contrasted feature and gated multi-scale aggregation for scene segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 2393\u20132402). Piscataway: IEEE."},{"key":"49_CR37","first-page":"8885","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"H. Ding","year":"2019","unstructured":"Ding, H., Jiang, X., Shuai, B., Liu, A. Q., & Wang, G. (2019). Semantic correlation promoted shape-variant context for segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 8885\u20138894). Piscataway: IEEE."},{"key":"49_CR38","doi-asserted-by":"publisher","first-page":"3520","DOI":"10.1109\/TIP.2019.2962685","volume":"29","author":"H. Ding","year":"2020","unstructured":"Ding, H., Jiang, X., Shuai, B., Liu, A. Q., & Wang, G. (2020). Semantic segmentation with context encoding and multi-path decoding. IEEE Transactions on Image Processing, 29, 3520\u20133533.","journal-title":"IEEE Transactions on Image Processing"},{"key":"49_CR39","first-page":"6818","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"H. Ding","year":"2019","unstructured":"Ding, H., Jiang, X., Liu, A. Q., Thalmann, N. M., & Wang, G. (2019). Boundary-aware feature propagation for scene segmentation. 
In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 6818\u20136828). Piscataway: IEEE."},{"key":"49_CR40","unstructured":"Li, X., Ding, H., Zhang, W., Yuan, H., Pang, J., Cheng, G., et al. (2023). Transformer-based visual segmentation: a survey. arXiv preprint. arXiv:2304.09854."},{"key":"49_CR41","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2022.108835","volume":"131","author":"J. Mei","year":"2022","unstructured":"Mei, J., Jiang, X., & Ding, H. (2022). Spatial feature mapping for 6DoF object pose estimation. Pattern Recognition, 131, 108835.","journal-title":"Pattern Recognition"},{"key":"49_CR42","volume-title":"Towards open vocabulary learning: a survey","author":"J. Wu","year":"2023","unstructured":"Wu, J., Li, X., Xu, S., Yuan, H., Ding, H., Yang, Y., et al. (2023). Towards open vocabulary learning: a survey. arXiv preprint. arXiv:2306.15880."},{"key":"49_CR43","doi-asserted-by":"crossref","unstructured":"Wei, X.-S., Xu, Y.-Y., Zhang, C.-L., Xia, G.-S., & Peng, Y.-X. (2023). CAT: a coarse-to-fine attention tree for semantic change detection. Visual Intelligence, 1(1), Article No. 3.","DOI":"10.1007\/s44267-023-00004-z"},{"key":"49_CR44","unstructured":"He, S., & Ding, H. (2024). Decoupling static and hierarchical motion perception for referring video segmentation. arXiv preprint. arXiv:2404.03645."},{"key":"49_CR45","first-page":"2694","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"H. Ding","year":"2023","unstructured":"Ding, H., Liu, C., He, S., Jiang, X., & Loy, C. C. (2023). MeViS: a large-scale benchmark for video segmentation with motion expressions. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 2694\u20132703). Piscataway: IEEE."},{"key":"49_CR46","first-page":"20167","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"H. 
Ding","year":"2023","unstructured":"Ding, H., Liu, C., He, S., Jiang, X., Torr, P. H. S., & Bai, S. (2023). MOSE: a new dataset for video object segmentation in complex scenes. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 20167\u201320177). Piscataway: IEEE."},{"key":"49_CR47","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2021.108075","volume":"120","author":"X. Wang","year":"2021","unstructured":"Wang, X., Jiang, X., Ding, H., Zhao, Y., & Liu, J. (2021). Knowledge-aware deep framework for collaborative skin lesion segmentation and melanoma recognition. Pattern Recognition, 120, 108075.","journal-title":"Pattern Recognition"},{"key":"49_CR48","doi-asserted-by":"crossref","unstructured":"Fan, D.-P., Ji, G.-P., Xu, P., Cheng, M.-M., Sakaridis, C., & Van Gool, L. (2023). Advances in deep concealed scene understanding. Visual Intelligence, 1(1), Article No. 16.","DOI":"10.1007\/s44267-023-00019-6"},{"key":"49_CR49","first-page":"11238","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"S. He","year":"2023","unstructured":"He, S., Ding, H., & Jiang, W. (2023). Primitive generation and semantic-related alignment for universal zero-shot segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 11238\u201311247). Piscataway: IEEE."},{"key":"49_CR50","first-page":"6954","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"H. Zhang","year":"2021","unstructured":"Zhang, H., & Ding, H. (2021). Prototypical matching and open set rejection for zero-shot semantic segmentation. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 6954\u20136963). Piscataway: IEEE."},{"key":"49_CR51","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2022.109018","volume":"133","author":"H. Ding","year":"2023","unstructured":"Ding, H., Zhang, H., & Jiang, X. (2023). 
Self-regularized prototypical network for few-shot semantic segmentation. Pattern Recognition, 133, 109018.","journal-title":"Pattern Recognition"},{"key":"49_CR52","first-page":"466","volume-title":"Proceedings of the 33rd international conference on neural information processing systems","author":"M. Bucher","year":"2019","unstructured":"Bucher, M., Vu, T.-H., Cord, M., & P\u00e9rez, P. (2019). Zero-shot semantic segmentation. In H. M. Wallach, H. Larochelle, A. Beygelzimer, et al. (Eds.), Proceedings of the 33rd international conference on neural information processing systems (pp. 466\u2013477). Red Hook: Curran Associates."},{"key":"49_CR53","first-page":"1","volume":"60","author":"G. Cheng","year":"2022","unstructured":"Cheng, G., Cai, L., Lang, C., Yao, X., Chen, J., Guo, L., et al. (2022). SPNet: Siamese-prototype network for few-shot remote sensing image scene classification. IEEE Transactions on Geoscience and Remote Sensing, 60, 1\u201311.","journal-title":"IEEE Transactions on Geoscience and Remote Sensing"},{"key":"49_CR54","first-page":"18113","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"J. Xu","year":"2022","unstructured":"Xu, J., de Mello, S., Liu, S., Byeon, W., Breuel, T., Kautz, J., et al. (2022). GroupViT: semantic segmentation emerges from text supervision. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 18113\u201318123). Piscataway: IEEE."},{"key":"49_CR55","first-page":"7794","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"X. Wang","year":"2018","unstructured":"Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 7794\u20137803). 
Piscataway: IEEE."},{"key":"49_CR56","first-page":"787","volume-title":"Proceedings of the empirical methods in natural language processing","author":"S. Kazemzadeh","year":"2014","unstructured":"Kazemzadeh, S., Ordonez, V., Matten, M., & Berg, T. L. (2014). ReferItGame: referring to objects in photographs of natural scenes. In Proceedings of the empirical methods in natural language processing (pp. 787\u2013798). Stroudsburg: ACL."},{"key":"49_CR57","first-page":"11","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"J. Mao","year":"2016","unstructured":"Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., & Murphy, K. (2016). Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 11\u201320). Piscataway: IEEE."},{"key":"49_CR58","first-page":"792","volume-title":"Proceedings of the European conference of computer vision","author":"V.K. Nagaraja","year":"2016","unstructured":"Nagaraja, V.K., Morariu, V. I., & Davis, L. S. (2016). Modeling context between objects for referring expression understanding. In B. Leibe, J. Matas, N. Sebe, et al. (Eds.), Proceedings of the European conference of computer vision (pp. 792\u2013807). Cham: Springer."},{"key":"49_CR59","first-page":"4171","volume-title":"Proceedings of the conference of the North American Chapter of the Association for computational linguistics: human language technologies","author":"J. Devlin","year":"2019","unstructured":"Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the conference of the North American Chapter of the Association for computational linguistics: human language technologies (pp. 4171\u20134186). Stroudsburg: ACL."},{"key":"49_CR60","first-page":"1280","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"B. 
Cheng","year":"2022","unstructured":"Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1280\u20131289). Piscataway: IEEE."},{"key":"49_CR61","unstructured":"Yu, L. (2016). Refcoco dataset. Retrieved June 1, 2024, from https:\/\/github.com\/lichengunc\/refer."},{"key":"49_CR62","first-page":"656","volume-title":"Proceedings of the 15th European conference of computer vision","author":"E. Margffoy-Tuay","year":"2018","unstructured":"Margffoy-Tuay, E., P\u00e9rez, J. C., Botero, E., & Arbel\u00e1ez, P. (2018). Dynamic multimodal instance segmentation guided by natural language queries. In V. Ferrari, M. Hebert, C. Sminchisescu, et al. (Eds.), Proceedings of the 15th European conference of computer vision (pp. 656\u2013672). Cham: Springer."},{"key":"49_CR63","first-page":"4423","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Hu","year":"2020","unstructured":"Hu, Z., Feng, G., Sun, J., Zhang, L., & Lu, H. (2020). Bi-directional relationship inferring network for referring image segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 4423\u20134432). Piscataway: IEEE."},{"key":"49_CR64","first-page":"10485","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"S. Huang","year":"2020","unstructured":"Huang, S., Hui, T., Liu, S., Li, G., Wei, Y., Han, J., et al. (2020). Referring image segmentation via cross-modal progressive comprehension. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 10485\u201310494). Piscataway: IEEE."},{"key":"49_CR65","first-page":"59","volume-title":"Proceedings of the 16th European conference of computer vision","author":"T. 
Hui","year":"2020","unstructured":"Hui, T., Liu, S., Huang, S., Li, G., Yu, S., Zhang, F., et al. (2020). Linguistic structure guided context modeling for referring image segmentation. In A. Vedaldi, H. Bischof, T. Brox, et al. (Eds.), Proceedings of the 16th European conference of computer vision (pp. 59\u201375). Cham: Springer."},{"key":"49_CR66","doi-asserted-by":"publisher","first-page":"1274","DOI":"10.1145\/3394171.3414006","volume-title":"Proceedings of the 28th ACM international conference on multimedia","author":"G. Luo","year":"2020","unstructured":"Luo, G., Zhou, Y., Ji, R., Sun, X., Su, J., Lin, C.-W., et al. (2020). Cascade grouped attention network for referring expression segmentation. In Proceedings of the 28th ACM international conference on multimedia (pp. 1274\u20131282). New York: ACM."},{"key":"49_CR67","first-page":"9858","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Jing","year":"2021","unstructured":"Jing, Y., Kong, T., Wang, W., Wang, L., Li, L., & Tan, T. (2021). Locate then segment: a strong pipeline for referring image segmentation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 9858\u20139867). 
Piscataway: IEEE."}],"container-title":["Visual Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-024-00049-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44267-024-00049-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-024-00049-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,27]],"date-time":"2024-06-27T13:09:44Z","timestamp":1719493784000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44267-024-00049-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,27]]},"references-count":67,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["49"],"URL":"https:\/\/doi.org\/10.1007\/s44267-024-00049-8","relation":{},"ISSN":["2731-9008"],"issn-type":[{"value":"2731-9008","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,27]]},"assertion":[{"value":"9 February 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 June 2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 June 2024","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 June 2024","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Henghui Ding is an Associate Editor at Visual Intelligence and was not involved in the editorial review of 
this article or the decision to publish it. The authors declare that they have no other competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"16"}}