{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T16:41:07Z","timestamp":1781628067641,"version":"3.54.5"},"reference-count":68,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62276072"],"award-info":[{"award-number":["62276072"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Guangxi Natural Science Foundation","award":["2025GXNSFDA069017"],"award-info":[{"award-number":["2025GXNSFDA069017"]}]},{"name":"Guangxi Science and Technology Base and Talent Special","award":["AD25069071"],"award-info":[{"award-number":["AD25069071"]}]},{"DOI":"10.13039\/501100015749","name":"Communication University of China","doi-asserted-by":"crossref","award":["SKLMCC2023KF005"],"award-info":[{"award-number":["SKLMCC2023KF005"]}],"id":[{"id":"10.13039\/501100015749","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>\n                    Referring Expression Comprehension (REC) aims to achieve fine-grained cross-modal content alignment. The traditional two-stage approaches, by decomposing REC into localization (region proposal) and comprehension (expression-based ranking), lead to the isolation of continuous image information and heavily rely on the quality of the proposals. In this article, we propose a point-based two-stage framework for REC to quickly achieve localization by inserting a language-modulated auto-focus module into the locked vision model. Specifically, we redefine REC as two processes: point-based cross-modal comprehension and point-based instance localization. 
For the comprehension stage, we reconstruct the raw annotations into soft masks at the feature point level as a metric of cross-modal correlation. With this indirect metric, REC can be approximated as a binary classification problem, which fundamentally avoids the impact of isolated regions. Remarkably, soft masks are shape-independent, which means our method is extremely general. By switching different vision models, different types of predictions (e.g., localization and segmentation) can be obtained. Experiments on multiple benchmarks demonstrate the feasibility and potential of our point-based paradigm. Our code will be public at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/VILAN-Lab\/PBREC-AF\">https:\/\/github.com\/VILAN-Lab\/PBREC-AF<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1145\/3777449","type":"journal-article","created":{"date-parts":[[2025,11,27]],"date-time":"2025-11-27T09:19:54Z","timestamp":1764235194000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Implement Referring Expression Comprehension by Extending Auto-focus Lens to Locked Vision Model"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-0101-6597","authenticated-orcid":false,"given":"Shiyi","family":"Zheng","sequence":"first","affiliation":[{"name":"School of Electrical Engineering, Guangxi University, Nanning, China and College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-0853-5880","authenticated-orcid":false,"given":"Peizhi","family":"Zhao","sequence":"additional","affiliation":[{"name":"School of Electrical Engineering, Guangxi University, Nanning, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7691-347X","authenticated-orcid":false,"given":"Qingbao","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Electrical Engineering, Guangxi University, Nanning, China, and Guangxi Key Laboratory of Multimedia Communications and Network Technology, Nanning, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-0521-4021","authenticated-orcid":false,"given":"Yi","family":"Cai","sequence":"additional","affiliation":[{"name":"School of Software Engineering, South China University of Technology, Guangzhou, China and Key Laboratory of Big Data and Intelligent Robot (SCUT), Ministry of Education, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3407-4318","authenticated-orcid":false,"given":"Haonan","family":"Cheng","sequence":"additional","affiliation":[{"name":"The State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3631-256X","authenticated-orcid":false,"given":"Qi","family":"Wu","sequence":"additional","affiliation":[{"name":"The University of Adelaide, Adelaide, Australia"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,2,10]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.593"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_1_4_2","first-page":"1036","volume-title":"35th AAAI Conference on Artificial Intelligence (AAAI \u201921)","author":"Chen Long","year":"2021","unstructured":"Long Chen, Wenbo Ma, Jun Xiao, Hanwang Zhang, and Shih-Fu Chang. 2021. Ref-NMS: Breaking proposal bottlenecks in two-stage referring expression grounding. 
In 35th AAAI Conference on Artificial Intelligence (AAAI \u201921), 1036\u20131044."},{"issue":"3","key":"e_1_3_1_5_2","doi-asserted-by":"crossref","first-page":"1670","DOI":"10.1109\/TPAMI.2020.3023438","article-title":"Visual grounding via accumulated attention","volume":"44","author":"Deng Chaorui","year":"2022","unstructured":"Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, and Mingkui Tan. 2022. Visual grounding via accumulated attention. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 3 (2022), 1670\u20131684.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_6_2","first-page":"1749","volume-title":"2021 IEEE\/CVF International Conference on Computer Vision (ICCV \u201921)","author":"Deng Jiajun","year":"2021","unstructured":"Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. 2021. TransVG: End-to-end visual grounding with transformers. In 2021 IEEE\/CVF International Conference on Computer Vision (ICCV \u201921), 1749\u20131759."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3296823"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2022.3217852"},{"key":"e_1_3_1_9_2","first-page":"1","volume-title":"IEEE International Conference on Multimedia and Expo (ICME \u201922)","author":"Du Ye","year":"2022","unstructured":"Ye Du, Zehua Fu, Qingjie Liu, and Yunhong Wang. 2022. Visual grounding with transformers. In IEEE International Conference on Multimedia and Expo (ICME \u201922), 1\u20136."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2911066"},{"key":"e_1_3_1_12_2","volume-title":"The 10th International Conference on Learning Representations (ICLR \u201922)","author":"Hu Edward J.","year":"2022","unstructured":"Edward J. 
Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In The 10th International Conference on Learning Representations (ICLR \u201922)."},{"key":"e_1_3_1_13_2","first-page":"4418","volume-title":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201917)","author":"Hu Ronghang","year":"2017","unstructured":"Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. 2017. Modeling relationships in referential expressions with compositional modular networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201917), 4418\u20134427."},{"key":"e_1_3_1_14_2","first-page":"108","volume-title":"14th European Conference on Computer Vision (ECCV \u201916)","author":"Hu Ronghang","year":"2016","unstructured":"Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. 2016. Segmentation from natural language expressions. In 14th European Conference on Computer Vision (ECCV \u201916), 108\u2013124."},{"key":"e_1_3_1_15_2","first-page":"4044","volume-title":"2023 IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Hu Yutao","year":"2023","unstructured":"Yutao Hu, Qixiong Wang, Wenqi Shao, Enze Xie, Zhenguo Li, Jungong Han, and Ping Luo. 2023. Beyond one-to-one: Rethinking the referring image segmentation. In 2023 IEEE\/CVF International Conference on Computer Vision (ICCV), 4044\u20134054."},{"key":"e_1_3_1_16_2","first-page":"16888","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201921)","author":"Huang Binbin","year":"2021","unstructured":"Binbin Huang, Dongze Lian, Weixin Luo, and Shenghua Gao. 2021. Look before you leap: Learning landmark features for one-stage visual grounding. 
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201921), 16888\u201316897."},{"key":"e_1_3_1_17_2","first-page":"998","volume-title":"36th AAAI Conference on Artificial Intelligence (AAAI \u201922)","author":"Huang Jianqiang","year":"2022","unstructured":"Jianqiang Huang, Yu Qin, Jiaxin Qi, Qianru Sun, and Hanwang Zhang. 2022. Deconfounded visual grounding. In 36th AAAI Conference on Artificial Intelligence (AAAI \u201922), 998\u20131006."},{"key":"e_1_3_1_18_2","first-page":"9858","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201921)","author":"Jing Ya","year":"2021","unstructured":"Ya Jing, Tao Kong, Wei Wang, Liang Wang, Lei Li, and Tieniu Tan. 2021. Locate then segment: A strong pipeline for referring image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201921), 9858\u20139867."},{"key":"e_1_3_1_19_2","doi-asserted-by":"crossref","first-page":"787","DOI":"10.3115\/v1\/D14-1086","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP \u201914)","author":"Kazemzadeh Sahar","year":"2014","unstructured":"Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. 2014. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP \u201914), 787\u2013798."},{"key":"e_1_3_1_20_2","first-page":"18124","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922)","author":"Kim Namyup","year":"2022","unstructured":"Namyup Kim, Dongwon Kim, Suha Kwak, Cuiling Lan, and Wenjun Zeng. 2022. ReSTR: Convolution-free referring image segmentation using transformers. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922). 
IEEE, 18124\u201318133."},{"key":"e_1_3_1_21_2","first-page":"1213","volume-title":"37th AAAI Conference on Artificial Intelligence","author":"Lan Xiaohan","year":"2023","unstructured":"Xiaohan Lan, Yitian Yuan, Hong Chen, Xin Wang, Zequn Jie, Lin Ma, Zhi Wang, and Wenwu Zhu. 2023. Curriculum multi-negative augmentation for debiased video grounding. In 37th AAAI Conference on Artificial Intelligence. AAAI Press, 1213\u20131221."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3565573"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3532626"},{"key":"e_1_3_1_24_2","first-page":"19652","volume-title":"Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021 (NeurIPS \u201921)","author":"Li Muchen","year":"2021","unstructured":"Muchen Li and Leonid Sigal. 2021. Referring transformer: A one-step approach to multi-task visual grounding. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021 (NeurIPS \u201921), 19652\u201319664."},{"key":"e_1_3_1_25_2","first-page":"10877","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201920)","author":"Liao Yue","year":"2020","unstructured":"Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, and Bo Li. 2020. A real-time cross-modality correlation filtering method for referring expression comprehension. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201920), 10877\u201310886."},{"key":"e_1_3_1_26_2","doi-asserted-by":"crossref","first-page":"4266","DOI":"10.1109\/TIP.2022.3181516","article-title":"Progressive language-customized visual feature learning for one-stage visual grounding","volume":"31","author":"Liao Yue","year":"2022","unstructured":"Yue Liao, Aixi Zhang, Zhiyuan Chen, Tianrui Hui, and Si Liu. 2022. 
Progressive language-customized visual feature learning for one-stage visual grounding. IEEE Transactions on Image Processing 31 (2022), 4266\u20134277.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_1_28_2","first-page":"1280","volume-title":"IEEE International Conference on Computer Vision (ICCV \u201917)","author":"Liu Chenxi","year":"2017","unstructured":"Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan L. Yuille. 2017. Recurrent multimodal interaction for referring image segmentation. In IEEE International Conference on Computer Vision (ICCV \u201917), 1280\u20131289."},{"key":"e_1_3_1_29_2","first-page":"4672","volume-title":"2019 IEEE\/CVF International Conference on Computer Vision (ICCV \u201919)","author":"Liu Daqing","year":"2019","unstructured":"Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, and Feng Wu. 2019. Learning to assemble neural module tree networks for visual grounding. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV \u201919), 4672\u20134681."},{"key":"e_1_3_1_30_2","first-page":"1950","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201919)","author":"Liu Xihui","year":"2019","unstructured":"Xihui Liu, Zihao Wang, Jing Shao, Xiaogang Wang, and Hongsheng Li. 2019. Improving referring expression grounding with cross-modal attention-guided erasing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201919), 1950\u20131959."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_1_32_2","first-page":"3431","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201915)","author":"Long Jonathan","year":"2015","unstructured":"Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. 
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201915), 3431\u20133440."},{"key":"e_1_3_1_33_2","volume-title":"5th International Conference on Learning Representations (ICLR \u201917)","author":"Loshchilov Ilya","year":"2017","unstructured":"Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations (ICLR \u201917)."},{"key":"e_1_3_1_34_2","volume-title":"7th International Conference on Learning Representations (ICLR \u201919)","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations (ICLR \u201919)."},{"key":"e_1_3_1_35_2","first-page":"10031","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201920)","author":"Luo Gen","year":"2020","unstructured":"Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. 2020. Multi-task collaborative network for joint referring expression comprehension and segmentation. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201920), 10031\u201310040."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-023-01871-1"},{"key":"e_1_3_1_37_2","first-page":"11","volume-title":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201916)","author":"Mao Junhua","year":"2016","unstructured":"Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. 
In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201916), 11\u201320."},{"key":"e_1_3_1_38_2","doi-asserted-by":"crossref","first-page":"4426","DOI":"10.1109\/TMM.2020.3042066","article-title":"Referring expression comprehension: A survey of methods and datasets","volume":"23","author":"Qiao Yanyuan","year":"2021","unstructured":"Yanyuan Qiao, Chaorui Deng, and Qi Wu. 2021. Referring expression comprehension: A survey of methods and datasets. IEEE Transactions on Multimedia 23 (2021), 4426\u20134440.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_39_2","first-page":"91","volume-title":"Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, 91\u201399."},{"key":"e_1_3_1_40_2","first-page":"4693","volume-title":"2019 IEEE\/CVF International Conference on Computer Vision (ICCV \u201919)","author":"Sadhu Arka","year":"2019","unstructured":"Arka Sadhu, Kan Chen, and Ram Nevatia. 2019. Zero-shot grounding of objects from natural language queries. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV \u201919), 4693\u20134702."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3231964"},{"key":"e_1_3_1_42_2","first-page":"14060","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201921)","author":"Sun Mingjie","year":"2021","unstructured":"Mingjie Sun, Jimin Xiao, and Eng Gee Lim. 2021. Iterative shrinking for referring expression grounding using deep reinforcement learning. 
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201921), 14060\u201314069."},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2021.3139467"},{"key":"e_1_3_1_44_2","first-page":"9626","volume-title":"2019 IEEE\/CVF International Conference on Computer Vision (ICCV \u201919)","author":"Tian Zhi","year":"2019","unstructured":"Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. 2019. FCOS: Fully convolutional one-stage object detection. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV \u201919). IEEE, 9626\u20139635."},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3604557"},{"key":"e_1_3_1_46_2","first-page":"1960","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201919)","author":"Wang Peng","year":"2019","unstructured":"Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van den Hengel. 2019. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201919), 1960\u20131968."},{"key":"e_1_3_1_47_2","first-page":"649","volume-title":"Proceedings of the 16th European Conference on Computer Vision (ECCV \u201920), Part XVIII","volume":"12363","author":"Wang Xinlong","year":"2020","unstructured":"Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, and Lei Li. 2020. SOLO: Segmenting objects by locations. In Proceedings of the 16th European Conference on Computer Vision (ECCV \u201920), Part XVIII. Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.), Lecture Notes in Computer Science, Vol. 12363, Springer, 649\u2013665."},{"key":"e_1_3_1_48_2","first-page":"11676","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922)","author":"Wang Zhaoqing","year":"2022","unstructured":"Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. 2022. 
CRIS: CLIP-driven referring image segmentation. In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922). IEEE, 11676\u201311685."},{"key":"e_1_3_1_49_2","doi-asserted-by":"crossref","first-page":"111115","DOI":"10.1016\/j.engappai.2025.111115","article-title":"A codebook-driven approach for low-light image enhancement","volume":"156","author":"Wu Xu","year":"2025","unstructured":"Xu Wu, Xianxu Hou, Zhihui Lai, Jie Zhou, Ya-Nan Zhang, Witold Pedrycz, and Linlin Shen. 2025. A codebook-driven approach for low-light image enhancement. Engineering Applications of Artificial Intelligence 156 (2025), 111115.","journal-title":"Engineering Applications of Artificial Intelligence"},{"key":"e_1_3_1_50_2","unstructured":"Xu Wu, Zhihui Lai, Xianxu Hou, Jie Zhou, Ya-Nan Zhang, and Linlin Shen. 2025. LightQANet: Quantized and adaptive feature learning for low-light image enhancement. arXiv:2510.14753. Retrieved from https:\/\/arxiv.org\/abs\/2510.14753"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3665498"},{"key":"e_1_3_1_52_2","first-page":"17503","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Xu Zunnan","year":"2023","unstructured":"Zunnan Xu, Zhihong Chen, Yong Zhang, Yibing Song, Xiang Wan, and Guanbin Li. 2023. Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), 17503\u201317512."},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01471"},{"key":"e_1_3_1_54_2","first-page":"9489","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922)","author":"Yang Li","year":"2022","unstructured":"Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, and Weiming Hu. 2022. Improving visual grounding with Visual-Linguistic verification and iterative reasoning. 
In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922), 9489\u20139498."},{"key":"e_1_3_1_55_2","first-page":"9949","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201920)","author":"Yang Sibei","year":"2020","unstructured":"Sibei Yang, Guanbin Li, and Yizhou Yu. 2020. Graph-structured referring expression reasoning in the wild. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201920), 9949\u20139958."},{"key":"e_1_3_1_56_2","first-page":"387","volume-title":"16th European Conference on Computer Vision (ECCV \u201920)","author":"Yang Zhengyuan","year":"2020","unstructured":"Zhengyuan Yang, Tianlang Chen, Liwei Wang, and Jiebo Luo. 2020. Improving one-stage visual grounding by recursive Sub-query construction. In 16th European Conference on Computer Vision (ECCV \u201920), 387\u2013404."},{"key":"e_1_3_1_57_2","first-page":"4682","volume-title":"2019 IEEE\/CVF International Conference on Computer Vision (ICCV \u201919)","author":"Yang Zhengyuan","year":"2019","unstructured":"Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, and Jiebo Luo. 2019. A fast and accurate one-stage approach to visual grounding. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV \u201919), 4682\u20134692."},{"key":"e_1_3_1_58_2","first-page":"15481","volume-title":"IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922)","author":"Ye Jiabo","year":"2022","unstructured":"Jiabo Ye, Junfeng Tian, Ming Yan, Xiaoshan Yang, Xuwu Wang, Ji Zhang, Liang He, and Xin Lin. 2022. Shifting more attention to visual backbone: Query-modulated refinement networks for end-to-end visual grounding. 
In IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201922), 15481\u201315491."},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00255"},{"key":"e_1_3_1_60_2","first-page":"1307","volume-title":"2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201918)","author":"Yu Licheng","year":"2018","unstructured":"Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. 2018. MAttNet: Modular attention network for referring expression comprehension. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR \u201918), 1307\u20131315."},{"key":"e_1_3_1_61_2","first-page":"69","volume-title":"14th European Conference on Computer Vision (ECCV \u201916)","author":"Yu Licheng","year":"2016","unstructured":"Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. 2016. Modeling context in referring expressions. In 14th European Conference on Computer Vision (ECCV \u201916), 69\u201385."},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3656045"},{"key":"e_1_3_1_63_2","first-page":"9756","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201920)","author":"Zhang Shifeng","year":"2020","unstructured":"Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z. Li. 2020. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR \u201920), 9756\u20139765."},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2022.3183827"},{"key":"e_1_3_1_65_2","first-page":"7487","volume-title":"38th AAAI Conference on Artificial Intelligence","author":"Zhao Peizhi","year":"2024","unstructured":"Peizhi Zhao, Shiyi Zheng, Wenye Zhao, Dongsheng Xu, Pijian Li, Yi Cai, and Qingbao Huang. 2024. 
Rethinking two-stage referring expression comprehension: A novel grounding and segmentation method modulated by point. In 38th AAAI Conference on Artificial Intelligence, 7487\u20137495."},{"key":"e_1_3_1_66_2","doi-asserted-by":"crossref","first-page":"284","DOI":"10.1007\/978-3-031-78383-8_19","volume-title":"Pattern Recognition","author":"Zhao Wenjie","year":"2025","unstructured":"Wenjie Zhao and Qiuming Luo. 2025. Leveraging computer vision for automatic modulation classification: Insights from spectrum and constellation diagram analysis. In Pattern Recognition. Apostolos Antonacopoulos, Subhasis Chaudhuri, Rama Chellappa, Cheng-Lin Liu, Saumik Bhattacharya, and Umapada Pal (Eds.). Springer Nature Switzerland, Cham, 284\u2013294."},{"key":"e_1_3_1_67_2","first-page":"1656","volume-title":"Proceedings of the 39th AAAI Conference on Artificial Intelligence","author":"Zheng Shiyi","year":"2025","unstructured":"Shiyi Zheng, Peizhi Zhao, Zhilong Zheng, Peihang He, Haonan Cheng, Yi Cai, and Qingbao Huang. 2025. Look around before locating: Considering content and structure information for visual grounding. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, 1656\u20131664."},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2021.3090426"},{"key":"e_1_3_1_69_2","first-page":"598","volume-title":"17th European Conference on Computer Vision (ECCV \u201922)","author":"Zhu Chaoyang","year":"2022","unstructured":"Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. 2022. SeqTR: A simple yet universal network for visual grounding. 
In 17th European Conference on Computer Vision (ECCV \u201922), 598\u2013615."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3777449","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T12:14:48Z","timestamp":1770725688000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3777449"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,10]]},"references-count":68,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3777449"],"URL":"https:\/\/doi.org\/10.1145\/3777449","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,10]]},"assertion":[{"value":"2024-07-14","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-10-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}