{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T15:18:49Z","timestamp":1778080729332,"version":"3.51.4"},"publisher-location":"California","reference-count":0,"publisher":"International Joint Conferences on Artificial Intelligence Organization","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:p>Referring image segmentation aims to segment an object out of an image via a specific language expression. The main concept is establishing global visual-linguistic relationships to locate the object and identify boundaries using details of the image. Recently, various Transformer-based techniques have been proposed to efficiently leverage long-range cross-modal dependencies, enhancing performance for referring segmentation. However, existing methods consider visual feature extraction and cross-modal fusion separately, resulting in insufficient visual-linguistic alignment in semantic space. In addition, they employ sequential structures and hence lack multi-scale information interaction. To address these limitations, we propose a Scale-Wise Language-Guided Vision Transformer (SLViT) with two appealing designs: (1) Language-Guided Multi-Scale Fusion Attention, a novel attention mechanism module for extracting rich local visual information and modeling global visual-linguistic relationships in an integrated manner. (2) An Uncertain Region Cross-Scale Enhancement module that can identify regions of high uncertainty using linguistic features and refine them via aggregated multi-scale features. We have evaluated our method on three benchmark datasets. The experimental results demonstrate that SLViT surpasses state-of-the-art methods with lower computational cost. The code is publicly available at: https:\/\/github.com\/NaturalKnight\/SLViT.<\/jats:p>","DOI":"10.24963\/ijcai.2023\/144","type":"proceedings-article","created":{"date-parts":[[2023,8,11]],"date-time":"2023-08-11T08:31:30Z","timestamp":1691742690000},"page":"1294-1302","source":"Crossref","is-referenced-by-count":22,"title":["SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation"],"prefix":"10.24963","author":[{"given":"Shuyi","family":"Ouyang","sequence":"first","affiliation":[{"name":"Zhejiang University"}]},{"given":"Hongyi","family":"Wang","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]},{"given":"Shiao","family":"Xie","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]},{"given":"Ziwei","family":"Niu","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]},{"given":"Ruofeng","family":"Tong","sequence":"additional","affiliation":[{"name":"Zhejiang University"},{"name":"Zhejiang Lab"}]},{"given":"Yen-Wei","family":"Chen","sequence":"additional","affiliation":[{"name":"Ritsumeikan University"}]},{"given":"Lanfen","family":"Lin","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]}],"member":"10584","event":{"name":"Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}","theme":"Artificial Intelligence","location":"Macau, SAR China","acronym":"IJCAI-2023","number":"32","sponsor":["International Joint Conferences on Artificial Intelligence Organization (IJCAI)"],"start":{"date-parts":[[2023,8,19]]},"end":{"date-parts":[[2023,8,25]]}},"container-title":["Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence"],"original-title":[],"deposited":{"date-parts":[[2023,8,11]],"date-time":"2023-08-11T08:37:26Z","timestamp":1691743046000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.ijcai.org\/proceedings\/2023\/144"}},"subtitle":[],"proceedings-subject":"Artificial Intelligence Research Articles","short-title":[],"issued":{"date-parts":[[2023,8]]},"references-count":0,"URL":"https:\/\/doi.org\/10.24963\/ijcai.2023\/144","relation":{},"subject":[],"published":{"date-parts":[[2023,8]]}}}