{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,18]],"date-time":"2026-01-18T14:23:39Z","timestamp":1768746219712,"version":"3.49.0"},"reference-count":53,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2024,2,22]],"date-time":"2024-02-22T00:00:00Z","timestamp":1708560000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,22]],"date-time":"2024-02-22T00:00:00Z","timestamp":1708560000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Process Lett"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Referring image segmentation aims to segment object in an image based on a referring expression. Its difficulty lies in aligning expression semantics with visual instances. The existing methods based on semantic reasoning are limited by the performance of external syntax parser and do not explicitly explore the relationships between visual instances. This article proposes an end-to-end method for referring image segmentation by aligning \u2019linguistic relationship\u2019 with \u2019visual relationships\u2019. This method does not rely on external syntax parser for expression parsing. In this paper, the expression is adaptively and structurally parsed into three components: \u2019subject\u2019, \u2019object\u2019, and \u2019linguistic relationship\u2019 by the Semantic Component Parser (SCP) in a learnable manner. Instances Activation Map Module (IAM) locates multiple visual instances based on the subject and object. 
In addition, the Relationship Based Visual Localization Module (RBVL) first enables each instance in the image to learn global knowledge, then decodes the visual relationships between these instances, and finally aligns the visual relationships with the linguistic relationship to accurately locate the target object. Experimental results show that the proposed method improves performance by 4\u20139% compared with the baseline method on multiple referring image segmentation datasets.<\/jats:p>","DOI":"10.1007\/s11063-024-11487-2","type":"journal-article","created":{"date-parts":[[2024,2,22]],"date-time":"2024-02-22T14:02:59Z","timestamp":1708610579000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Text-Vision Relationship Alignment for Referring Image Segmentation"],"prefix":"10.1007","volume":"56","author":[{"given":"Mingxing","family":"Pu","sequence":"first","affiliation":[]},{"given":"Bing","family":"Luo","sequence":"additional","affiliation":[]},{"given":"Chao","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Li","family":"Xu","sequence":"additional","affiliation":[]},{"given":"Fayou","family":"Xu","sequence":"additional","affiliation":[]},{"given":"Mingming","family":"Kong","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,2,22]]},"reference":[{"key":"11487_CR1","unstructured":"Khan E (2012) Natural language based human computer interaction: a necessity for mobile devices. https:\/\/api.semanticscholar.org\/CorpusID:15641099"},{"key":"11487_CR2","doi-asserted-by":"crossref","unstructured":"Chen J, Shen Y, Gao J, Liu J, Liu X (2017) Language-based image editing with recurrent attentive models. 
In: 2018 IEEE\/CVF conference on computer vision and pattern recognition, pp 8721\u20138729","DOI":"10.1109\/CVPR.2018.00909"},{"key":"11487_CR3","doi-asserted-by":"crossref","unstructured":"Liu D, Zhang H, Zha Z, Wu F (2018) Learning to assemble neural module tree networks for visual grounding. In: 2019 IEEE\/CVF international conference on computer vision (ICCV), pp 4672\u20134681","DOI":"10.1109\/ICCV.2019.00477"},{"key":"11487_CR4","doi-asserted-by":"crossref","unstructured":"Hui T, Liu S, Huang S, Li G, Yu S, Zhang F, Han J (2020) Linguistic structure guided context modeling for referring image segmentation. In: European conference on computer vision","DOI":"10.1007\/978-3-030-58607-2_4"},{"key":"11487_CR5","doi-asserted-by":"crossref","unstructured":"Luo G, Zhou Y, Sun X, Cao L, Wu C, Deng C, Ji R (2020) Multi-task collaborative network for joint referring expression comprehension and segmentation. In: 2020 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 10031\u201310040","DOI":"10.1109\/CVPR42600.2020.01005"},{"key":"11487_CR6","doi-asserted-by":"crossref","unstructured":"Yang S, Xia M, Li G, Zhou H-Y, Yu Y (2021) Bottom-up shift and reasoning for referring image segmentation. 2021 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 11261\u201311270","DOI":"10.1109\/CVPR46437.2021.01111"},{"key":"11487_CR7","doi-asserted-by":"publisher","first-page":"1922","DOI":"10.1109\/TMM.2021.3074008","volume":"24","author":"L Lin","year":"2022","unstructured":"Lin L, Yan P, Xu X, Yang S, Zeng K, Li G (2022) Structured attention network for referring image segmentation. IEEE Trans Multimed 24:1922\u20131932","journal-title":"IEEE Trans Multimed"},{"key":"11487_CR8","doi-asserted-by":"crossref","unstructured":"Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D (2014) The stanford corenlp natural language processing toolkit. 
In: Annual meeting of the association for computational linguistics","DOI":"10.3115\/v1\/P14-5010"},{"key":"11487_CR9","first-page":"4761","volume":"44","author":"S Liu","year":"2021","unstructured":"Liu S, Hui T, Huang S, Wei Y, Li B, Li G (2021) Cross-modal progressive comprehension for referring segmentation. IEEE Trans Pattern Anal Mach Intell 44:4761\u20134775","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"11487_CR10","unstructured":"Vaswani A, Shazeer NM, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: NIPS"},{"key":"11487_CR11","doi-asserted-by":"crossref","unstructured":"Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE\/CVF international conference on computer vision (ICCV), pp 9992\u201310002","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"11487_CR12","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929"},{"key":"11487_CR13","doi-asserted-by":"crossref","unstructured":"Ding H, Liu C, Wang S, Jiang X (2021) Vision-language transformer and query generation for referring segmentation. In: 2021 IEEE\/CVF international conference on computer vision (ICCV), pp 16301\u201316310","DOI":"10.1109\/ICCV48922.2021.01601"},{"key":"11487_CR14","doi-asserted-by":"crossref","unstructured":"Kim NH, Kim D, Lan C, Zeng W, Kwak S (2022) Restr: Convolution-free referring image segmentation using transformers. 
In: 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 18124\u201318133","DOI":"10.1109\/CVPR52688.2022.01761"},{"key":"11487_CR15","doi-asserted-by":"crossref","unstructured":"Yang Z, Wang J, Tang Y, Chen K, Zhao H, Torr PHS (2021) Lavt: Language-aware vision transformer for referring image segmentation. In: 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 18134\u201318144","DOI":"10.1109\/CVPR52688.2022.01762"},{"key":"11487_CR16","doi-asserted-by":"crossref","unstructured":"Wang W, Zhou T, Yu F, Dai J, Konukoglu E, Gool LV (2021) Exploring cross-image pixel contrast for semantic segmentation. In: 2021 IEEE\/CVF international conference on computer vision (ICCV), pp 7283\u20137293","DOI":"10.1109\/ICCV48922.2021.00721"},{"key":"11487_CR17","unstructured":"Li B, Weinberger KQ, Belongie SJ, Koltun V, Ranftl R (2022) Language-driven semantic segmentation. abs\/2201.03546"},{"key":"11487_CR18","doi-asserted-by":"crossref","unstructured":"Zhou T, Wang W, Konukoglu E, Gool LV (2022) Rethinking semantic segmentation: a prototype view. In: 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 2572\u20132583","DOI":"10.1109\/CVPR52688.2022.00261"},{"key":"11487_CR19","doi-asserted-by":"crossref","unstructured":"Hu R, Rohrbach M, Darrell T (2016) Segmentation from natural language expressions. arXiv:1603.06180","DOI":"10.1007\/978-3-319-46448-0_7"},{"key":"11487_CR20","doi-asserted-by":"crossref","unstructured":"Liu C, Lin ZL, Shen X, Yang J, Lu X, Yuille AL (2017) Recurrent multimodal interaction for referring image segmentation. In: 2017 IEEE international conference on computer vision (ICCV), pp 1280\u20131289","DOI":"10.1109\/ICCV.2017.143"},{"key":"11487_CR21","doi-asserted-by":"crossref","unstructured":"Shi H, Li H, Meng F, Wu Q (2018) Key-word-aware network for referring expression image segmentation. 
In: European conference on computer vision","DOI":"10.1007\/978-3-030-01231-1_3"},{"key":"11487_CR22","doi-asserted-by":"crossref","unstructured":"Ye L, Rochan M, Liu Z, Wang Y (2019) Cross-modal self-attention network for referring image segmentation. In: 2019 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 10494\u201310503","DOI":"10.1109\/CVPR.2019.01075"},{"key":"11487_CR23","doi-asserted-by":"crossref","unstructured":"Jain K, Gandhi V (2021) Comprehensive multi-modal interactions for referring image segmentation. arXiv:2104.10412","DOI":"10.18653\/v1\/2022.findings-acl.270"},{"key":"11487_CR24","doi-asserted-by":"crossref","unstructured":"Feng G, Hu Z, Zhang L, Lu H (2021) Encoder fusion network with co-attention embedding for referring image segmentation. In: 2021 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 15501\u201315510","DOI":"10.1109\/CVPR46437.2021.01525"},{"key":"11487_CR25","doi-asserted-by":"crossref","unstructured":"Li R, Li K, Kuo Y-C, Shu M, Qi X, Shen X, Jia J (2018) Referring image segmentation via recurrent refinement networks. In: 2018 IEEE\/CVF conference on computer vision and pattern recognition, pp 5745\u20135753","DOI":"10.1109\/CVPR.2018.00602"},{"key":"11487_CR26","doi-asserted-by":"crossref","unstructured":"Margffoy-Tuay E, P\u00e9rez J, Botero E, Arbel\u00e1ez P (2018) Dynamic multimodal instance segmentation guided by natural language queries. arXiv:1807.02257","DOI":"10.1007\/978-3-030-01252-6_39"},{"key":"11487_CR27","doi-asserted-by":"publisher","first-page":"3224","DOI":"10.1109\/TMM.2020.2971171","volume":"22","author":"L Ye","year":"2020","unstructured":"Ye L, Liu Z, Wang Y (2020) Dual convolutional LSTM network for referring image segmentation. 
IEEE Trans Multimed 22:3224\u20133235","journal-title":"IEEE Trans Multimed"},{"key":"11487_CR28","unstructured":"Chen Y-W, Tsai Y-H, Wang T, Lin Y-Y, Yang M-H (2019) Referring expression object segmentation with caption-aware consistency. arXiv:1910.04748"},{"key":"11487_CR29","doi-asserted-by":"publisher","first-page":"995","DOI":"10.1109\/TMM.2020.2991504","volume":"23","author":"H Shi","year":"2021","unstructured":"Shi H, Li H, Wu Q, Ngan KN (2021) Query reconstruction network for referring expression image segmentation. IEEE Trans Multimed 23:995\u20131007","journal-title":"IEEE Trans Multimed"},{"key":"11487_CR30","doi-asserted-by":"crossref","unstructured":"Wang Z, Lu Y, Li Q, Tao X, Guo Y, Gong M, Liu T (2021) CRIS: Clip-driven referring image segmentation. In: 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 11676\u201311685","DOI":"10.1109\/CVPR52688.2022.01139"},{"key":"11487_CR31","unstructured":"Kim S, Kang M, Park J (2023) Risclip: Referring image segmentation framework using clip. arXiv:2306.08498"},{"key":"11487_CR32","unstructured":"Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning"},{"key":"11487_CR33","doi-asserted-by":"publisher","first-page":"1333","DOI":"10.1109\/TMM.2019.2942480","volume":"22","author":"S Qiu","year":"2020","unstructured":"Qiu S, Zhao Y, Jiao J, Wei Y, Wei S (2020) Referring image segmentation by generative adversarial learning. IEEE Trans Multimed 22:1333\u20131344","journal-title":"IEEE Trans Multimed"},{"key":"11487_CR34","doi-asserted-by":"crossref","unstructured":"Liu C, Jiang X, Ding H (2022) Instance-specific feature propagation for referring segmentation. 
arXiv:2204.12109","DOI":"10.1109\/TMM.2022.3163578"},{"key":"11487_CR35","doi-asserted-by":"crossref","unstructured":"Jiao Y, Jie Z, Luo W, Chen J, Jiang Y-G, Wei X, Ma L (2021) Two-stage visual cues enhancement network for referring image segmentation. In: Proceedings of the 29th ACM international conference on multimedia","DOI":"10.1145\/3474085.3475222"},{"key":"11487_CR36","doi-asserted-by":"publisher","first-page":"10055","DOI":"10.1109\/TPAMI.2023.3262578","volume":"45","author":"C Liang","year":"2022","unstructured":"Liang C, Wang W, Zhou T, Miao J, Luo Y, Yang Y (2022) Local-global context aware transformer for language-guided video segmentation. IEEE Trans Pattern Anal Mach Intell 45:10055\u201310069","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"11487_CR37","doi-asserted-by":"crossref","unstructured":"Zhao W, Wang K, Chu X, Xue F, Wang X, You Y (2022) Modeling motion with multi-modal features for text-based video segmentation. 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 11727\u201311736","DOI":"10.1109\/CVPR52688.2022.01144"},{"key":"11487_CR38","doi-asserted-by":"crossref","unstructured":"Wu D, Dong X, Shao L, Shen J (2022) Multi-level representation learning with semantic alignment for referring video object segmentation. 2022 IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 4986\u20134995","DOI":"10.1109\/CVPR52688.2022.00494"},{"key":"11487_CR39","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"11487_CR40","doi-asserted-by":"crossref","unstructured":"Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. 
arXiv:1802.05365","DOI":"10.18653\/v1\/N18-1202"},{"key":"11487_CR41","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735\u20131780","journal-title":"Neural Comput"},{"key":"11487_CR42","unstructured":"Mikolov T, Yih W-t, Zweig G (2013) Linguistic regularities in continuous space word representations. In: North American chapter of the association for computational linguistics"},{"key":"11487_CR43","doi-asserted-by":"crossref","unstructured":"Yu L, Poirson P, Yang S, Berg AC, Berg TL (2016) Modeling context in referring expressions. arXiv:1608.00272","DOI":"10.1007\/978-3-319-46475-6_5"},{"key":"11487_CR44","doi-asserted-by":"crossref","unstructured":"Mao J, Huang J, Toshev A, Camburu O-M, Yuille AL, Murphy KP (2015) Generation and comprehension of unambiguous object descriptions. 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 11\u201320","DOI":"10.1109\/CVPR.2016.9"},{"key":"11487_CR45","doi-asserted-by":"crossref","unstructured":"Kazemzadeh S, Ordonez V, Matten M, Berg TL (2014) Referitgame: referring to objects in photographs of natural scenes. In: Conference on empirical methods in natural language processing","DOI":"10.3115\/v1\/D14-1086"},{"key":"11487_CR46","doi-asserted-by":"publisher","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","volume":"40","author":"L-C Chen","year":"2016","unstructured":"Chen L-C, Papandreou G, Kokkinos I, Murphy KP, Yuille AL (2016) Deeplab: Semantic image segmentation with deep convolutional nets, Atrous convolution, and fully connected CRFS. 
IEEE Trans Pattern Anal Mach Intell 40:834\u2013848","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"11487_CR47","doi-asserted-by":"publisher","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","volume":"88","author":"M Everingham","year":"2010","unstructured":"Everingham M, Gool LV, Williams CKI, Winn JM, Zisserman A (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vision 88:303\u2013338","journal-title":"Int J Comput Vision"},{"key":"11487_CR48","unstructured":"Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. CoRR arXiv:1412.6980"},{"key":"11487_CR49","doi-asserted-by":"crossref","unstructured":"Yu L, Lin ZL, Shen X, Yang J, Lu X, Bansal M, Berg TL (2018) Mattnet: modular attention network for referring expression comprehension. In: 2018 IEEE\/CVF conference on computer vision and pattern recognition, pp 1307\u20131315","DOI":"10.1109\/CVPR.2018.00142"},{"key":"11487_CR50","doi-asserted-by":"publisher","first-page":"386","DOI":"10.1109\/TPAMI.2018.2844175","volume":"42","author":"K He","year":"2017","unstructured":"He K, Gkioxari G, Doll\u00e1r P, Girshick RB (2017) Mask r-CNN. IEEE Trans Pattern Anal Mach Intell 42:386\u2013397","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"11487_CR51","unstructured":"Chen Y, Li J, Xiao H, Jin X, Yan S, Feng J (2017) Dual path networks. In: NIPS. https:\/\/api.semanticscholar.org\/CorpusID:35602767"},{"key":"11487_CR52","doi-asserted-by":"crossref","unstructured":"Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Conference on empirical methods in natural language processing. https:\/\/api.semanticscholar.org\/CorpusID:1957433","DOI":"10.3115\/v1\/D14-1162"},{"key":"11487_CR53","unstructured":"Kr\u00e4henb\u00fchl P, Koltun V (2011) Efficient inference in fully connected CRFS with gaussian edge potentials. In: NIPS. 
https:\/\/api.semanticscholar.org\/CorpusID:5574079"}],"container-title":["Neural Processing Letters"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11487-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11063-024-11487-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11487-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,16]],"date-time":"2024-05-16T20:24:59Z","timestamp":1715891099000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11063-024-11487-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,22]]},"references-count":53,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2024,4]]}},"alternative-id":["11487"],"URL":"https:\/\/doi.org\/10.1007\/s11063-024-11487-2","relation":{},"ISSN":["1573-773X"],"issn-type":[{"value":"1573-773X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,22]]},"assertion":[{"value":"27 November 2023","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 February 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"64"}}