{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:23:40Z","timestamp":1750220620032,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":56,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T00:00:00Z","timestamp":1602460800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2020AAA0140000"],"award-info":[{"award-number":["2020AAA0140000"]}]},{"name":"National Natural Science Foundation of China","award":["61876177"],"award-info":[{"award-number":["61876177"]}]},{"name":"Beijing Natural Science Foundation","award":["L182013,4202034"],"award-info":[{"award-number":["L182013,4202034"]}]},{"name":"Fundamental Research Funds for the Central Universities, Zhejiang Lab","award":["2019KD0AB04"],"award-info":[{"award-number":["2019KD0AB04"]}]},{"name":"Tencent Open Fund"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,10,12]]},"DOI":"10.1145\/3394171.3413846","type":"proceedings-article","created":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T13:12:00Z","timestamp":1602508320000},"page":"1725-1734","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Cross-Modal Omni Interaction Modeling for Phrase Grounding"],"prefix":"10.1145","author":[{"given":"Tianyu","family":"Yu","sequence":"first","affiliation":[{"name":"Beihang University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tianrui","family":"Hui","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences & University of Chinese Academy of Sciences, 
Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhihao","family":"Yu","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yue","family":"Liao","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sansi","family":"Yu","sequence":"additional","affiliation":[{"name":"Tencent Marketing Solution, Shen Zhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Faxi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tencent Marketing Solution, Shen Zhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Si","family":"Liu","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,10,12]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"VQA: Visual Question Answering. arXiv:1505.00468 [cs] (Oct.","author":"Agrawal Aishwarya","year":"2016","unstructured":"Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. 2016. VQA: Visual Question Answering. arXiv:1505.00468 [cs] (Oct. 2016). http:\/\/arxiv.org\/abs\/1505.00468 arXiv: 1505.00468."},
{"key":"e_1_3_2_2_2_1","volume-title":"Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding. arXiv:1811.11683 [cs, eess] (May","author":"Akbari Hassan","year":"2019","unstructured":"Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. 2019. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding. arXiv:1811.11683 [cs, eess] (May 2019). http:\/\/arxiv.org\/abs\/1811.11683 arXiv: 1811.11683."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00438"},{"key":"e_1_3_2_2_4_1","volume-title":"Knowledge Aided Consistency for Weakly Supervised Phrase Grounding. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. IEEE","author":"Chen Kan","year":"2018","unstructured":"Kan Chen, Jiyang Gao, and Ram Nevatia. 2018. Knowledge Aided Consistency for Weakly Supervised Phrase Grounding. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Salt Lake City, UT. https:\/\/doi.org\/10.1109\/CVPR.2018.00425"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/s13735-017-0139-6"},{"key":"e_1_3_2_2_6_1","volume-title":"Query-Guided Regression Network with Context Policy for Phrase Grounding. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, Venice. https:\/\/doi.org\/10","author":"Chen Kan","year":"2017","unstructured":"Kan Chen, Rama Kovvuri, and Ram Nevatia. 2017. 
Query-Guided Regression Network with Context Policy for Phrase Grounding. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, Venice. https:\/\/doi.org\/10.1109\/ICCV.2017.95"},{"key":"e_1_3_2_2_7_1","volume-title":"Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment. arXiv:1903.11649 [cs] (Oct","author":"Datta Samyak","year":"2019","unstructured":"Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, and Ajay Divakaran. 2019. Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment. arXiv:1903.11649 [cs] (Oct. 2019). http:\/\/arxiv.org\/abs\/1903.11649 arXiv: 1903.11649."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00808"},{"key":"e_1_3_2_2_9_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]"},{"key":"e_1_3_2_2_10_1","volume-title":"Neural Sequential Phrase Grounding (SeqGROUND). arXiv:1903.07669 [cs] (March","author":"Dogan Pelin","year":"2019","unstructured":"
Pelin Dogan, Leonid Sigal, and Markus Gross. 2019. Neural Sequential Phrase Grounding (SeqGROUND). arXiv:1903.07669 [cs] (March 2019). http:\/\/arxiv.org\/abs\/1903.07669 arXiv: 1903.07669."},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2009.03.008"},{"key":"e_1_3_2_2_12_1","volume-title":"Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach.","author":"Fukui Akira","year":"2016","unstructured":"Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. arXiv:1606.01847 [cs] (Sept. 2016). http:\/\/arxiv.org\/abs\/1606.01847 arXiv: 1606.01847."},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00572"},{"key":"e_1_3_2_2_14_1","volume-title":"Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524 [cs] (Oct","author":"Girshick Ross","year":"2014","unstructured":"Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524 [cs] (Oct. 2014). http:\/\/arxiv.org\/abs\/1311.2524 arXiv: 1311.2524."},{"key":"e_1_3_2_2_15_1","volume-title":"Deep Image Retrieval: Learning global representations for image search. 
arXiv:1604.01325 [cs] (July","author":"Gordo Albert","year":"2016","unstructured":"Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. 2016. Deep Image Retrieval: Learning global representations for image search. arXiv:1604.01325 [cs] (July 2016). http:\/\/arxiv.org\/abs\/1604.01325 arXiv: 1604.01325."},{"key":"e_1_3_2_2_16_1","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs.CV]"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1281"},{"key":"e_1_3_2_2_18_1","volume-title":"Long Short-Term Memory. Neural Computation 9, 8","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997). https:\/\/doi.org\/10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_2_19_1","volume-title":"Natural Language Object Retrieval. arXiv:1511.04164 [cs] (April","author":"Hu Ronghang","year":"2016","unstructured":"Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. 
2016. Natural Language Object Retrieval. arXiv:1511.04164 [cs] (April 2016). http:\/\/arxiv.org\/abs\/1511.04164 arXiv: 1511.04164."},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"crossref","unstructured":"Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, and Bo Li. 2020. Referring Image Segmentation via Cross-Modal Progressive Comprehension. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01050"},{"key":"e_1_3_2_2_21_1","unstructured":"Tianrui Hui, Si Liu, Shaofei Huang, Guanbin Li, Sansi Yu, Faxi Zhang, and Jizhong Han. 2020. Linguistic Structure Guided Context Modeling for Referring Image Segmentation. In ECCV."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00524"},{"key":"e_1_3_2_2_23_1","volume-title":"DenseCap: Fully Convolutional Localization Networks for Dense Captioning. arXiv:1511.07571 [cs] (Nov","author":"Johnson Justin","year":"2015","unstructured":"Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2015. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. arXiv:1511.07571 [cs] (Nov. 2015). http:\/\/arxiv.org\/abs\/1511.07571 arXiv: 1511.07571."},{"key":"e_1_3_2_2_24_1","unstructured":"Andrej Karpathy and Li Fei-Fei. [n.d.]. Deep Visual-Semantic Alignments for Generating Image Descriptions. ([n. 
d.])."},{"key":"e_1_3_2_2_25_1","volume-title":"January","author":"Karpathy Andrej","year":"2014","unstructured":"Andrej Karpathy, Armand Joulin, and Fei Fei Li. 2014. Deep fragment embeddings for bidirectional image sentence mapping. Advances in Neural Information Processing Systems 3, January (2014). arXiv:1406.5679"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1086"},{"key":"e_1_3_2_2_27_1","volume-title":"Bilinear Attention Networks. arXiv:1805.07932 [cs] (Oct","author":"Kim Jin-Hwa","year":"2018","unstructured":"Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear Attention Networks. arXiv:1805.07932 [cs] (Oct. 2018). http:\/\/arxiv.org\/abs\/1805.07932 arXiv: 1805.07932."},{"key":"e_1_3_2_2_28_1","volume-title":"Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. https:\/\/arxiv.org\/abs\/1602.07332","author":"Krishna Ranjay","year":"2016","unstructured":"Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. 
https:\/\/arxiv.org\/abs\/1602.07332"},{"key":"e_1_3_2_2_29_1","volume-title":"Contextual Grounding of Natural Language Entities in Images. arXiv:1911.02133 [cs] (Nov","author":"Lai Farley","year":"2019","unstructured":"Farley Lai, Ning Xie, Derek Doran, and Asim Kadav. 2019. Contextual Grounding of Natural Language Entities in Images. arXiv:1911.02133 [cs] (Nov. 2019). http:\/\/arxiv.org\/abs\/1911.02133 arXiv: 1911.02133."},{"key":"e_1_3_2_2_30_1","unstructured":"Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. [n.d.]. VisualBERT: A Simple and Performant Baseline for Vision and Language. ([n. d.])."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"crossref","unstructured":"Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, and Bo Li. 2020. A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01089"},{"key":"e_1_3_2_2_32_1","volume-title":"PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection. In CVPR.","author":"Liao Yue","year":"2020","unstructured":"Yue Liao, Si Liu, Fei Wang, Yanjie Chen, Chen Qian, and Jiashi Feng. 2020. 
PPDM: Parallel Point Detection and Matching for Real-Time Human-Object Interaction Detection. In CVPR."},{"key":"e_1_3_2_2_33_1","unstructured":"Chenxi Liu, Junhua Mao, Fei Sha, and Alan Yuille. [n.d.]. Attention Correctness in Neural Image Captioning. ([n. d.])."},{"key":"e_1_3_2_2_34_1","volume-title":"Learning to Assemble Neural Module Tree Networks for Visual Grounding. arXiv:1812.03299 [cs] (Oct","author":"Liu Daqing","year":"2019","unstructured":"Daqing Liu, Hanwang Zhang, Feng Wu, and Zheng-Jun Zha. 2019. Learning to Assemble Neural Module Tree Networks for Visual Grounding. arXiv:1812.03299 [cs] (Oct. 2019). http:\/\/arxiv.org\/abs\/1812.03299 arXiv: 1812.03299."},{"key":"e_1_3_2_2_35_1","unstructured":"Jonathan Long, Evan Shelhamer, and Trevor Darrell. [n.d.]. Fully Convolutional Networks for Semantic Segmentation. ([n. d.])."},{"key":"e_1_3_2_2_36_1","volume-title":"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv:1908.02265 [cs] (Aug","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv:1908.02265 [cs] (Aug. 2019). 
http:\/\/arxiv.org\/abs\/1908.02265 arXiv: 1908.02265."},{"key":"e_1_3_2_2_37_1","volume-title":"Comprehension-Guided Referring Expressions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE","author":"Luo Ruotian","year":"2017","unstructured":"Ruotian Luo and Gregory Shakhnarovich. 2017. Comprehension-Guided Referring Expressions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Honolulu, HI. https:\/\/doi.org\/10.1109\/CVPR.2017.333"},{"key":"e_1_3_2_2_38_1","volume-title":"Computer Vision -- ECCV","author":"Plummer Bryan A.","year":"2018","unstructured":"Bryan A. Plummer, Paige Kordas, M. Hadi Kiapour, Shuai Zheng, Robinson Piramuthu, and Svetlana Lazebnik. 2018. Conditional Image-Text Embedding Networks. In Computer Vision -- ECCV 2018, Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). Vol. 11216. Springer International Publishing, Cham. https:\/\/doi.org\/10.1007\/978--3-030-01258--8_16 Series Title: Lecture Notes in Computer Science."},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"crossref","unstructured":"Bryan A. Plummer, Arun Mallya, Christopher M. Cervantes, Julia Hockenmaier, and Svetlana Lazebnik. 2016. 
Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues. arXiv:1611.06641 [cs.CV]","DOI":"10.1109\/ICCV.2017.213"},{"key":"e_1_3_2_2_40_1","volume-title":"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. arXiv:1505.04870 [cs] (Sept","author":"Plummer Bryan A.","year":"2016","unstructured":"Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2016. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. arXiv:1505.04870 [cs] (Sept. 2016). http:\/\/arxiv.org\/abs\/1505.04870 arXiv: 1505.04870."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/124"},{"key":"e_1_3_2_2_42_1","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [cs.CV]"},
{"key":"e_1_3_2_2_43_1","series-title":"Lecture Notes in Computer Science","volume-title":"Grounding of Textual Phrases in Images by Reconstruction","author":"Rohrbach Anna","year":"2016","unstructured":"Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2016. Grounding of Textual Phrases in Images by Reconstruction. Lecture Notes in Computer Science (2016), 817--834. https:\/\/doi.org\/10.1007\/978--3--319--46448-0_49"},{"key":"e_1_3_2_2_44_1","volume-title":"Grounding of Textual Phrases in Images by Reconstruction. arXiv:1511.03745 [cs] 9905","author":"Rohrbach Anna","year":"2016","unstructured":"Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2016. Grounding of Textual Phrases in Images by Reconstruction. arXiv:1511.03745 [cs] 9905 (2016). https:\/\/doi.org\/10.1007\/978--3--319--46448-0_49 arXiv: 1511.03745."},{"key":"e_1_3_2_2_45_1","volume-title":"VL-BERT: Pre-training of Generic Visual-Linguistic Representations. arXiv:1908.08530 [cs] (Feb","author":"Su Weijie","year":"2020","unstructured":"Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. arXiv:1908.08530 [cs] (Feb. 2020). 
http:\/\/arxiv.org\/abs\/1908.08530 arXiv:1908.08530."},{"key":"e_1_3_2_2_46_1","volume-title":"Berg","author":"Tommasi Tatiana","year":"2016","unstructured":"Tatiana Tommasi, Arun Mallya, Bryan Plummer, Svetlana Lazebnik, Alexander C. Berg, and Tamara L. Berg. 2016. Solving Visual Madlibs with Multiple Cues. arXiv:1608.03410 [cs] (Aug. 2016). http:\/\/arxiv.org\/abs\/1608.03410 arXiv: 1608.03410."},{"key":"e_1_3_2_2_47_1","volume-title":"Learning Two-Branch Neural Networks for Image-Text Matching Tasks. arXiv:1704.03470 [cs] (May","author":"Wang Liwei","year":"2018","unstructured":"Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik. 2018. Learning Two-Branch Neural Networks for Image-Text Matching Tasks. arXiv:1704.03470 [cs] (May 2018). http:\/\/arxiv.org\/abs\/1704.03470 arXiv: 1704.03470."},{"key":"e_1_3_2_2_48_1","volume-title":"Computer Vision -- ECCV","author":"Wang Mingzhe","year":"2016","unstructured":"Mingzhe Wang, Mahmoud Azab, Noriyuki Kojima, Rada Mihalcea, and Jia Deng. 2016. Structured Matching for Phrase Localization. 
In Computer Vision -- ECCV 2016, Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.). Vol. 9912. Springer International Publishing, Cham. https:\/\/doi.org\/10.1007\/978--3--319--46484--8_42 Series Title: Lecture Notes in Computer Science."},{"key":"e_1_3_2_2_49_1","volume-title":"HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv abs\/1910.03771","author":"Debut Lysandre","year":"2019","unstructured":"Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R\u00e9mi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv abs\/1910.03771 (2019)."},{"key":"e_1_3_2_2_50_1","volume-title":"Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044 [cs] (April","author":"Xu Kelvin","year":"2016","unstructured":"Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2016. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044 [cs] (April 2016). 
http:\/\/arxiv.org\/abs\/1502.03044 arXiv: 1502.03044."},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00478"},{"key":"e_1_3_2_2_52_1","volume-title":"Schwing","author":"Yeh Raymond A.","year":"2018","unstructured":"Raymond A. Yeh, Jinjun Xiong, Wen-mei W. Hwu, Minh N. Do, and Alexander G. Schwing. 2018. Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts. arXiv:1803.11209 [cs] (March 2018). http:\/\/arxiv.org\/abs\/1803.11209 arXiv: 1803.11209."},{"key":"e_1_3_2_2_53_1","volume-title":"Context and Attribute Grounded Dense Captioning. In 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE","author":"Yin Guojun","year":"2019","unstructured":"Guojun Yin, Lu Sheng, Bin Liu, Nenghai Yu, Xiaogang Wang, and Jing Shao. 2019. Context and Attribute Grounded Dense Captioning. In 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Long Beach, CA, USA. https:\/\/doi.org\/10.1109\/CVPR.2019.00640"},{"key":"e_1_3_2_2_54_1","volume-title":"arXiv:1703.06114 [cs, stat] (April","author":"Zaheer Manzil","year":"2018","unstructured":"
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. 2018. Deep Sets. arXiv:1703.06114 [cs, stat] (April 2018). http:\/\/arxiv.org\/abs\/1703.06114 arXiv: 1703.06114."},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3351063"},{"key":"e_1_3_2_2_56_1","volume-title":"Unified Vision-Language Pre-Training for Image Captioning and VQA. arXiv:1909.11059 [cs] (Dec","author":"Zhou Luowei","year":"2019","unstructured":"Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2019. Unified Vision-Language Pre-Training for Image Captioning and VQA. arXiv:1909.11059 [cs] (Dec. 2019). http:\/\/arxiv.org\/abs\/1909.11059 arXiv: 1909.11059."}],"event":{"name":"MM '20: The 28th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Seattle WA USA","acronym":"MM '20"},"container-title":["Proceedings of the 28th ACM International Conference on 
Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413846","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3394171.3413846","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:01:18Z","timestamp":1750197678000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413846"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,12]]},"references-count":56,"alternative-id":["10.1145\/3394171.3413846","10.1145\/3394171"],"URL":"https:\/\/doi.org\/10.1145\/3394171.3413846","relation":{},"subject":[],"published":{"date-parts":[[2020,10,12]]},"assertion":[{"value":"2020-10-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}