{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T04:44:30Z","timestamp":1769575470263,"version":"3.49.0"},"reference-count":78,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,2,6]],"date-time":"2023-02-06T00:00:00Z","timestamp":1675641600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100004663","name":"Ministry of Science and Technology of Taiwan","doi-asserted-by":"crossref","award":["MOST-109-2223-E-009-002-MY3, MOST-110-2218-E-A49-018, and MOST-111-2634-F-007-002"],"award-info":[{"award-number":["MOST-109-2223-E-009-002-MY3, MOST-110-2218-E-A49-018, and MOST-111-2634-F-007-002"]}],"id":[{"id":"10.13039\/501100004663","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,5,31]]},"abstract":"<jats:p>Referring expression comprehension aims to localize a specific object in an image according to a given language description. It is still challenging to comprehend and mitigate the gap between various types of information in the visual and textual domains. Generally, it needs to extract the salient features from a given expression and match the features of expression to an image. One challenge in referring expression comprehension is the number of region proposals generated by object detection methods is far more than the number of entities in the corresponding language description. Remarkably, the candidate regions without described by the expression will bring a severe impact on referring expression comprehension. 
To tackle this problem, we first propose a novel framework, Enhanced Cross-modal Graph Attention Networks (ECMGANs), that boosts the matching between the expression and the entity positions in an image. Then, an effective strategy named Graph Node Erase (GNE) is proposed to assist ECMGANs in eliminating the effect of irrelevant objects on the target object. Experiments on three public referring expression comprehension datasets show unambiguously that our ECMGANs framework achieves better performance than other state-of-the-art methods. Moreover, GNE effectively yields higher visual-expression matching accuracy.<\/jats:p>","DOI":"10.1145\/3548688","type":"journal-article","created":{"date-parts":[[2022,7,15]],"date-time":"2022-07-15T11:33:45Z","timestamp":1657884825000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Referring Expression Comprehension Via Enhanced Cross-modal Graph Attention Networks"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0998-251X","authenticated-orcid":false,"given":"Jia","family":"Wang","sequence":"first","affiliation":[{"name":"National Yang Ming Chiao Tung University, Hsinchu, Taiwan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2262-6261","authenticated-orcid":false,"given":"Jingcheng","family":"Ke","sequence":"additional","affiliation":[{"name":"National Tsing Hua University, Hsinchu, Taiwan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2216-077X","authenticated-orcid":false,"given":"Hong-Han","family":"Shuai","sequence":"additional","affiliation":[{"name":"National Yang Ming Chiao Tung University, Hsinchu, Taiwan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0475-3689","authenticated-orcid":false,"given":"Yung-Hui","family":"Li","sequence":"additional","affiliation":[{"name":"Hon Hai Research Institute, Hsinchu, 
Taiwan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4662-7875","authenticated-orcid":false,"given":"Wen-Huang","family":"Cheng","sequence":"additional","affiliation":[{"name":"National Yang Ming Chiao Tung University, Hsinchu, Taiwan"}]}],"member":"320","published-online":{"date-parts":[[2023,2,6]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"2874","article-title":"Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks","author":"Bell Sean","year":"2016","unstructured":"Sean Bell, C. Lawrence Zitnick, Kavita Bala, and Ross B. Girshick. 2016. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916). 2874\u20132883.","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)"},{"key":"e_1_3_1_3_2","volume-title":"Proceedings of the 3rd International Conference on Networking, Information Systems Security (NISS\u201920)","author":"Berhich Asmae","year":"2020","unstructured":"Asmae Berhich, Fatima-Zahra Belouadha, and Mohammed Issam Kabbaj. 2020. LSTM-based models for earthquake prediction. In Proceedings of the 3rd International Conference on Networking, Information Systems Security (NISS\u201920). Association for Computing Machinery, New York, NY, Article 46, 7 pages."},{"key":"e_1_3_1_4_2","first-page":"12576","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921)","author":"Chen Chaoqi","year":"2021","unstructured":"Chaoqi Chen, Zebiao Zheng, Yue Huang, Xinghao Ding, and Yizhou Yu. 2021. I3Net: Implicit instance-invariant network for adapting one-stage object detectors. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921). 
12576\u201312585."},{"key":"e_1_3_1_5_2","first-page":"1036","volume-title":"Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI\u201921), the 33rd Conference on Innovative Applications of Artificial Intelligence (IAAI\u201921), and the 11th Symposium on Educational Advances in Artificial Intelligence (EAAI\u201921)","author":"Chen Long","year":"2021","unstructured":"Long Chen, Wenbo Ma, Jun Xiao, Hanwang Zhang, and Shih-Fu Chang. 2021. Ref-NMS: Breaking proposal bottlenecks in two-stage referring expression grounding. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI\u201921), the 33rd Conference on Innovative Applications of Artificial Intelligence (IAAI\u201921), and the 11th Symposium on Educational Advances in Artificial Intelligence (EAAI\u201921). AAAI Press, 1036\u20131044."},{"key":"e_1_3_1_6_2","first-page":"104","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201920)","author":"Chen Yen-Chun","year":"2020","unstructured":"Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal image-text representation learning. In Proceedings of the European Conference on Computer Vision (ECCV\u201920), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer International Publishing, Cham, 104\u2013120."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447239"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475251"},{"key":"e_1_3_1_9_2","first-page":"7746","article-title":"Visual grounding via accumulated attention","author":"Deng Chaorui","year":"2018","unstructured":"Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, and Mingkui Tan. 2018. Visual grounding via accumulated attention. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
7746\u20137755.","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"e_1_3_1_10_2","first-page":"1769","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201921)","author":"Deng Jiajun","year":"2021","unstructured":"Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. 2021. TransVG: End-to-end visual grounding with transformers. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201921). 1769\u20131779."},{"issue":"3","key":"e_1_3_1_11_2","first-page":"78","article-title":"Recurrent attention network with reinforced generator for visual dialog","volume":"16","author":"Fan Hehe","year":"2020","unstructured":"Hehe Fan, Linchao Zhu, Yi Yang, and Fei Wu. 2020. Recurrent attention network with reinforced generator for visual dialog. ACM Trans. Multimedia Comput. Commun. Appl. 16, 3, Article 78 (July 2020), 16 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"issue":"4","key":"e_1_3_1_12_2","first-page":"129","article-title":"Evaluation of information comprehension in concurrent speech-based designs","volume":"16","author":"Fazal Muhammad Abu Ul","year":"2020","unstructured":"Muhammad Abu Ul Fazal, Sam Ferguson, and Andrew Johnston. 2020. Evaluation of information comprehension in concurrent speech-based designs. ACM Trans. Multimedia Comput. Commun. Appl. 16, 4, Article 129 (Dec. 2020), 19 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_13_2","doi-asserted-by":"crossref","unstructured":"Peng Gao Pan Lu Hongsheng Li Shuang Li Yikang Li Steven C. H. Hoi and Xiaogang Wang. 2018. Question-guided hybrid convolution for visual question answering. 
Retrieved from https:\/\/arxiv.org\/abs\/1808.02632.","DOI":"10.1007\/978-3-030-01246-5_29"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1177\/1461444819858691"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00325"},{"key":"e_1_3_1_16_2","first-page":"2344","volume-title":"TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding","author":"He Dailan","year":"2021","unstructured":"Dailan He, Yusheng Zhao, Junyu Luo, Tianrui Hui, Shaofei Huang, Aixi Zhang, and Si Liu. 2021. TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding. Association for Computing Machinery, New York, NY, 2344\u20132352."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.5555\/3326943.3326994"},{"key":"e_1_3_1_19_2","first-page":"804","volume-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201917)","author":"Hu R.","year":"2017","unstructured":"R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201917). IEEE Computer Society, Los Alamitos, CA, 804\u2013813."},{"key":"e_1_3_1_20_2","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917)","author":"Hu Ronghang","year":"2017","unstructured":"Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. 2017. Modeling relationships in referential expressions with compositional modular networks. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917)."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.493"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2016.2640288"},{"issue":"3","key":"e_1_3_1_23_2","first-page":"79","article-title":"Attention-based modality-gated networks for image-text sentiment analysis","volume":"16","author":"Huang Feiran","year":"2020","unstructured":"Feiran Huang, Kaimin Wei, Jian Weng, and Zhoujun Li. 2020. Attention-based modality-gated networks for image-text sentiment analysis. ACM Trans. Multimedia Comput. Commun. Appl. 16, 3, Article 79 (July 2020), 19 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imu.2020.100412"},{"key":"e_1_3_1_25_2","first-page":"4041","volume-title":"Visual-Semantic Graph Matching for Visual Grounding","author":"Jing Chenchen","year":"2020","unstructured":"Chenchen Jing, Yuwei Wu, Mingtao Pei, Yao Hu, Yunde Jia, and Qi Wu. 2020. Visual-Semantic Graph Matching for Visual Grounding. Association for Computing Machinery, New York, NY, 4041\u20134050."},{"key":"e_1_3_1_26_2","first-page":"1780","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201921)","author":"Kamath Aishwarya","year":"2021","unstructured":"Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. 2021. MDETR\u2014Modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201921). 1780\u20131790."},{"key":"e_1_3_1_27_2","first-page":"787","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914)","author":"Kazemzadeh Sahar","year":"2014","unstructured":"Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. 
ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914). Association for Computational Linguistics, Doha, Qatar, 787\u2013798."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_1_29_2","first-page":"1104","volume-title":"Robust ECG R-Peak Detection Using LSTM","author":"Laitala Juho","year":"2020","unstructured":"Juho Laitala, Mingzhe Jiang, Elise Syrj\u00e4l\u00e4, Emad Kasaeyan Naeini, Antti Airola, Amir M. Rahmani, Nikil D. Dutt, and Pasi Liljeberg. 2020. Robust ECG R-Peak Detection Using LSTM. Association for Computing Machinery, New York, NY, 1104\u20131111."},{"issue":"1","key":"e_1_3_1_30_2","first-page":"29","article-title":"Multi-human parsing with a graph-based generative adversarial model","volume":"17","author":"Li Jianshu","year":"2021","unstructured":"Jianshu Li, Jian Zhao, Congyan Lang, Yidong Li, Yunchao Wei, Guodong Guo, Terence Sim, Shuicheng Yan, and Jiashi Feng. 2021. Multi-human parsing with a graph-based generative adversarial model. ACM Trans. Multimedia Comput. Commun. Appl. 17, 1, Article 29 (Apr. 2021), 21 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2751140"},{"issue":"3","key":"e_1_3_1_32_2","first-page":"97","article-title":"A hierarchical CNN-RNN approach for visual emotion classification","volume":"15","author":"Li Liang","year":"2019","unstructured":"Liang Li, Xinge Zhu, Yiming Hao, Shuhui Wang, Xingyu Gao, and Qingming Huang. 2019. A hierarchical CNN-RNN approach for visual emotion classification. ACM Trans. Multimedia Comput. Commun. Appl. 15, 3s, Article 97 (Dec. 2019), 17 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. 
Appl."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2021.09.066"},{"key":"e_1_3_1_34_2","first-page":"121","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201920)","author":"Li Xiujun","year":"2020","unstructured":"Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision (ECCV\u201920), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer International Publishing, Cham, 121\u2013137."},{"key":"e_1_3_1_35_2","first-page":"1316","volume-title":"Proceedings of the IEEE 22nd International Conference on High Performance Computing and Communications, the IEEE 18th International Conference on Smart City, and the IEEE 6th International Conference on Data Science and Systems (HPCC\/SmartCity\/DSS\u201920)","author":"Liang Shengbin","year":"2020","unstructured":"Shengbin Liang, Bin Zhu, Yuying Zhang, Suying Cheng, and Jiangyong Jin. 2020. A double channel CNN-LSTM model for text classification. In Proceedings of the IEEE 22nd International Conference on High Performance Computing and Communications, the IEEE 18th International Conference on Smart City, and the IEEE 6th International Conference on Data Science and Systems (HPCC\/SmartCity\/DSS\u201920). 1316\u20131321."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01089"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.520"},{"key":"e_1_3_1_38_2","first-page":"1950","article-title":"Improving referring expression grounding with cross-modal attention-guided erasing","author":"Liu Xihui","year":"2019","unstructured":"Xihui Liu, Zihao Wang, Jing Shao, Xiaogang Wang, and Hongsheng Li. 2019. 
Improving referring expression grounding with cross-modal attention-guided erasing. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919). 1950\u20131959.","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)"},{"issue":"4","key":"e_1_3_1_39_2","first-page":"107","article-title":"AB-LSTM: Attention-based bidirectional LSTM model for scene text detection","volume":"15","author":"Liu Zhandong","year":"2019","unstructured":"Zhandong Liu, Wengang Zhou, and Houqiang Li. 2019. AB-LSTM: Attention-based bidirectional LSTM model for scene text detection. ACM Trans. Multimedia Comput. Commun. Appl. 15, 4, Article 107 (Dec. 2019), 23 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.3390\/electronics10030287"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME51207.2021.9428120"},{"key":"e_1_3_1_42_2","volume-title":"Advances in Neural Information Processing Systems","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch\u00e9-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates. Retrieved from https:\/\/proceedings.neurips.cc\/paper\/2019\/file\/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf."},{"key":"e_1_3_1_43_2","first-page":"10031","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Luo G.","year":"2020","unstructured":"G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji. 2020. Multi-task collaborative network for joint referring expression comprehension and segmentation. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920). IEEE Computer Society, Los Alamitos, CA, 10031\u201310040."},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.333"},{"issue":"2","key":"e_1_3_1_45_2","first-page":"48","article-title":"FIN: Feature integrated network for object detection","volume":"16","author":"Luo Xiaofan","year":"2020","unstructured":"Xiaofan Luo, Fukoeng Wong, and Haifeng Hu. 2020. FIN: Feature integrated network for object detection. ACM Trans. Multimedia Comput. Commun. Appl. 16, 2, Article 48 (May 2020), 18 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2021.3050059"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-5010"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.9"},{"key":"e_1_3_1_49_2","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201916)","author":"Nagaraja Varun K.","year":"2016","unstructured":"Varun K. Nagaraja, Vlad I. Morariu, and Larry S. Davis. 2016. Modeling context between objects for referring expression understanding. In Proceedings of the European Conference on Computer Vision (ECCV\u201916)."},{"key":"e_1_3_1_50_2","article-title":"Im2text: Describing images using 1 million captioned photographs","volume":"24","author":"Ordonez Vicente","year":"2011","unstructured":"Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2text: Describing images using 1 million captioned photographs. Adv. Neural Info. Process. Syst. 24 (2011).","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_1_51_2","first-page":"752","article-title":"Accurate single stage detector using recurrent rolling convolution","author":"Ren Jimmy S. J.","year":"2017","unstructured":"Jimmy S. J. 
Ren, Xiaohao Chen, Jianbo Liu, Wenxiu Sun, Jiahao Pang, Qiong Yan, Yu-Wing Tai, and Li Xu. 2017. Accurate single stage detector using recurrent rolling convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917), 752\u2013760.","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917)"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"e_1_3_1_53_2","first-page":"817","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201916)","author":"Rohrbach Anna","year":"2016","unstructured":"Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2016. Grounding of textual phrases in images by reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV\u201916), Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.). Springer International Publishing, Cham, 817\u2013834."},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2015.12.006"},{"issue":"5","key":"e_1_3_1_55_2","first-page":"49","article-title":"Large scale datasets for image and video captioning in italian","volume":"2","author":"Scaiella Antonio","year":"2019","unstructured":"Antonio Scaiella, Danilo Croce, and Roberto Basili. 2019. Large scale datasets for image and video captioning in italian. Ital. J. Comput. Ling. 2, 5 (2019), 49\u201360.","journal-title":"Ital. J. Comput. Ling."},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.physd.2019.132306"},{"key":"e_1_3_1_57_2","first-page":"1948","volume-title":"S2SiamFC: Self-Supervised Fully Convolutional Siamese Network for Visual Tracking","author":"Sio Chon Hou","year":"2020","unstructured":"Chon Hou Sio, Yu-Jen Ma, Hong-Han Shuai, Jun-Cheng Chen, and Wen-Huang Cheng. 2020. S2SiamFC: Self-Supervised Fully Convolutional Siamese Network for Visual Tracking. 
Association for Computing Machinery, New York, NY, 1948\u20131957."},{"key":"e_1_3_1_58_2","first-page":"1346","article-title":"Co-grounding networks with semantic attention for referring expression comprehension in videos","author":"Song Sijie","year":"2021","unstructured":"Sijie Song, Xudong Lin, Jiaying Liu, Zongming Guo, and Shih-Fu Chang. 2021. Co-grounding networks with semantic attention for referring expression comprehension in videos. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921). 1346\u20131355.","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201921)"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1145\/3387920"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2021\/143"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2014.2298982"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2017.2733623"},{"key":"e_1_3_1_63_2","first-page":"1960","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Wang P.","year":"2019","unstructured":"P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. Hengel. 2019. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919). IEEE Computer Society, Los Alamitos, CA, 1960\u20131968."},{"key":"e_1_3_1_64_2","first-page":"9567","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201921)","author":"Wu Aming","year":"2021","unstructured":"Aming Wu, Yahong Han, Linchao Zhu, and Yi Yang. 2021. Universal-prototype enhancing for few-shot object detection. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201921). 
9567\u20139576."},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIE.2020.2970790"},{"key":"e_1_3_1_66_2","first-page":"2871","volume-title":"AU-Assisted Graph Attention Convolutional Network for Micro-Expression Recognition","author":"Xie Hong-Xia","year":"2020","unstructured":"Hong-Xia Xie, Ling Lo, Hong-Han Shuai, and Wen-Huang Cheng. 2020. AU-Assisted Graph Attention Convolutional Network for Micro-Expression Recognition. Association for Computing Machinery, New York, NY, 2871\u20132880."},{"issue":"3","key":"e_1_3_1_67_2","first-page":"107","article-title":"WTRPNet: An explainable graph feature convolutional neural network for epileptic EEG classification","volume":"17","author":"Xin Qi","year":"2022","unstructured":"Qi Xin, Shaohao Hu, Shuaiqi Liu, Ling Zhao, and Shuihua Wang. 2022. WTRPNet: An explainable graph feature convolutional neural network for epileptic EEG classification. ACM Trans. Multimedia Comput. Commun. Appl. 17, 3s, Article 107 (Dec. 2022), 18 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"issue":"4","key":"e_1_3_1_68_2","first-page":"120","article-title":"Dual-stream structured graph convolution network for skeleton-based action recognition","volume":"17","author":"Xu Chunyan","year":"2021","unstructured":"Chunyan Xu, Rong Liu, Tong Zhang, Zhen Cui, Jian Yang, and Chunlong Hu. 2021. Dual-stream structured graph convolution network for skeleton-based action recognition. ACM Trans. Multimedia Comput. Commun. Appl. 17, 4, Article 120 (Nov. 2021), 22 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458281"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_41"},{"key":"e_1_3_1_71_2","first-page":"4140","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Yang S.","year":"2019","unstructured":"S. Yang, G. 
Li, and Y. Yu. 2019. Cross-modal relationship inference for grounding referring expressions. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919). 4140\u20134149."},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00474"},{"key":"e_1_3_1_73_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Yang Sibei","year":"2020","unstructured":"Sibei Yang, Guanbin Li, and Yizhou Yu. 2020. Graph-structured referring expression reasoning in the wild. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)."},{"key":"e_1_3_1_74_2","first-page":"1307","article-title":"MAttNet: Modular attention network for referring expression comprehension","author":"Yu Licheng","year":"2018","unstructured":"Licheng Yu, Zhe L. Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. 2018. MAttNet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 1307\u20131315.","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"e_1_3_1_75_2","first-page":"69","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201916)","author":"Yu Licheng","year":"2016","unstructured":"Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. 2016. Modeling context in referring expressions. In Proceedings of the European Conference on Computer Vision (ECCV\u201916). Springer International Publishing, Cham, 69\u201385."},{"key":"e_1_3_1_76_2","first-page":"3521","article-title":"A joint speaker-listener-reinforcer model for referring expressions","author":"Yu Licheng","year":"2017","unstructured":"Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L. Berg. 2017. 
A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917). 3521\u20133529.","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917)"},{"key":"e_1_3_1_77_2","first-page":"4158","article-title":"Grounding referring expressions in images by variational context","author":"Zhang Hanwang","year":"2018","unstructured":"Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. 2018. Grounding referring expressions in images by variational context. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 4158\u20134166.","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2935678"},{"key":"e_1_3_1_79_2","doi-asserted-by":"crossref","first-page":"4252","DOI":"10.1109\/CVPR.2018.00447","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201918)","author":"Zhuang Bohan","year":"2018","unstructured":"Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, and Anton van den Hengel. 2018. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201918). 
IEEE, 4252\u20134261."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3548688","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3548688","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:50:53Z","timestamp":1750182653000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3548688"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,6]]},"references-count":78,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,5,31]]}},"alternative-id":["10.1145\/3548688"],"URL":"https:\/\/doi.org\/10.1145\/3548688","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,6]]},"assertion":[{"value":"2022-01-04","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-06-27","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}