{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:18:51Z","timestamp":1750220331544,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":36,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,8,21]],"date-time":"2021-08-21T00:00:00Z","timestamp":1629504000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,8,21]]},"DOI":"10.1145\/3463945.3469055","type":"proceedings-article","created":{"date-parts":[[2021,8,27]],"date-time":"2021-08-27T14:29:53Z","timestamp":1630074593000},"page":"14-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Language-Conditioned Region Proposal and Retrieval Network for Referring Expression Comprehension"],"prefix":"10.1145","author":[{"given":"Yanwei","family":"Xie","sequence":"first","affiliation":[{"name":"University of Science and Technology of China, HeFei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daqing","family":"Liu","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, HeFei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xuejin","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, HeFei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zheng-Jun","family":"Zha","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, HeFei, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,8,27]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00522"},{"key":"e_1_3_2_1_2_1","volume-title":"Real-time referring expression comprehension by single-stage grounding network. arXiv preprint arXiv:1812.03426","author":"Chen Xinpeng","year":"2018","unstructured":"Xinpeng Chen, Lin Ma, Jingyuan Chen, Zequn Jie, Wei Liu, and Jiebo Luo. 2018. Real-time referring expression comprehension by single-stage grounding network. arXiv preprint arXiv:1812.03426 (2018)."},{"key":"e_1_3_2_1_3_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"32","author":"Cirik Volkan","year":"2018","unstructured":"Volkan Cirik, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. 2018. Using syntax to ground referring expressions in natural images. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.89"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00667"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.322"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_8_1","unstructured":"Ronghang Hu Daniel Fried Anna Rohrbach Dan Klein Trevor Darrell and Kate Saenko. 2019. Are you looking? grounding to multiple modalities in vision-and-language navigation. In ACL . 
"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.470"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.493"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1086"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"crossref","unstructured":"Ranjay Krishna Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen Yannis Kalantidis Li-Jia Li David A Shamma et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision Vol. 123 1 (2017) 32--73.","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01089"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.324"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240632"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00477"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6833"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.9"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_48"},{"key":"e_1_3_2_1_21_1","volume-title":"Yolov3: An incremental improvement. 
arXiv preprint arXiv:1804.02767","author":"Redmon Joseph","year":"2018","unstructured":"Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)."},{"key":"e_1_3_2_1_22_1","volume-title":"Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015)."},{"key":"e_1_3_2_1_23_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7026--7035","author":"Wang Hao","year":"2021","unstructured":"Hao Wang, Zheng-Jun Zha, Liang Li, Dong Liu, and Jiebo Luo. 2021. Structured Multi-Level Interaction Network for Video Moment Localization via Language Query. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7026--7035."},{"volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 1960--1968","author":"Wang Peng","key":"e_1_3_2_1_24_1","unstructured":"Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van den Hengel. 2019. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 1960--1968."},
{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00474"},{"key":"e_1_3_2_1_26_1","volume-title":"Improving one-stage visual grounding by recursive sub-query construction. arXiv preprint arXiv:2008.01059","author":"Yang Zhengyuan","year":"2020","unstructured":"Zhengyuan Yang, Tianlang Chen, Liwei Wang, and Jiebo Luo. 2020. Improving one-stage visual grounding by recursive sub-query construction. arXiv preprint arXiv:2008.01059 (2020)."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00478"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00255"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00142"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46475-6_5"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.375"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2909864"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01329"},{"key":"e_1_3_2_1_34_1","volume-title":"Objects as points. arXiv preprint arXiv:1904.07850","author":"Zhou Xingyi","year":"2019","unstructured":"
Xingyi Zhou, Dequan Wang, and Philipp Kr\u00e4henb\u00fchl. 2019. Objects as points. arXiv preprint arXiv:1904.07850 (2019)."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00483"},{"key":"e_1_3_2_1_36_1","volume-title":"European conference on computer vision. Springer, 391--405","author":"Lawrence Zitnick C","year":"2014","unstructured":"C Lawrence Zitnick and Piotr Doll\u00e1r. 2014. Edge boxes: Locating object proposals from edges. In European conference on computer vision. Springer, 391--405."}],"event":{"name":"ICMR '21: International Conference on Multimedia Retrieval","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Taipei Taiwan","acronym":"ICMR '21"},"container-title":["Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3463945.3469055","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3463945.3469055","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:12:15Z","timestamp":1750191135000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3463945.3469055"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,21]]},"references-count":36,"alternative-id":["10.1145\/3463945.3469055","10.1145\/3463945"],"URL":"https:\/\/doi.org\/10.1145\/3463945.3469055","relation":{},"subject":[],"published":{"date-parts":[[2021,8,21]]},"assertion":[{"value":"2021-08-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}