{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T22:11:26Z","timestamp":1778278286440,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":60,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,17]]},"DOI":"10.1145\/3474085.3475354","type":"proceedings-article","created":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T20:56:12Z","timestamp":1634590572000},"page":"1966-1975","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Distributed Attention for Grounded Image Captioning"],"prefix":"10.1145","author":[{"given":"Nenglun","family":"Chen","sequence":"first","affiliation":[{"name":"University of Hong Kong, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xingjia","family":"Pan","sequence":"additional","affiliation":[{"name":"Youtu Lab, Tencent, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Runnan","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Hong Kong, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lei","family":"Yang","sequence":"additional","affiliation":[{"name":"University of Hong Kong, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhiwen","family":"Lin","sequence":"additional","affiliation":[{"name":"Youtu Lab, Tencent, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuqiang","family":"Ren","sequence":"additional","affiliation":[{"name":"Youtu Lab, Tencent, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Haolei","family":"Yuan","sequence":"additional","affiliation":[{"name":"Youtu Lab, Tencent, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaowei","family":"Guo","sequence":"additional","affiliation":[{"name":"Youtu Lab, Tencent, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Feiyue","family":"Huang","sequence":"additional","affiliation":[{"name":"Youtu Lab, Tencent, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenping","family":"Wang","sequence":"additional","affiliation":[{"name":"University of Hong Kong, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,10,17]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_1_3_1","volume-title":"Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099","author":"Bengio Samy","year":"2015","unstructured":"Samy Bengio , Oriol Vinyals , Navdeep Jaitly , and Noam Shazeer . 2015. Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099 ( 2015 ). Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099 (2015)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298711"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.311"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00425"},{"key":"e_1_3_2_1_7_1","volume-title":"Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325","author":"Chen Xinlei","year":"2015","unstructured":"Xinlei Chen , Hao Fang , Tsung-Yi Lin , Ramakrishna Vedantam , Saurabh Gupta , Pi-otr Doll\u00e1r, and C Lawrence Zitnick . 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 ( 2015 ). Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Pi-otr Doll\u00e1r, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)."},{"key":"e_1_3_2_1_8_1","volume-title":"Attention-based dropout layer for weakly supervised object localization","author":"Choe Junsuk","unstructured":"Junsuk Choe and Hyunjung Shim . 2019. Attention-based dropout layer for weakly supervised object localization . In IEEE CVPR. 2219--2228. Junsuk Choe and Hyunjung Shim. 2019. Attention-based dropout layer for weakly supervised object localization. In IEEE CVPR. 2219--2228."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.340"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3348"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00430"},{"key":"e_1_3_2_1_13_1","volume-title":"TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization. arXiv preprint arXiv:2103.14862","author":"Gao Wei","year":"2021","unstructured":"Wei Gao , Fang Wan , Xingjia Pan , Zhiliang Peng , Qi Tian , Zhenjun Han , Bolei Zhou , and Qixiang Ye. 2021. TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization. arXiv preprint arXiv:2103.14862 ( 2021 ). Wei Gao, Fang Wan, Xingjia Pan, Zhiliang Peng, Qi Tian, Zhenjun Han, Bolei Zhou, and Qixiang Ye. 2021. TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization. arXiv preprint arXiv:2103.14862 (2021)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.309"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01034"},{"key":"e_1_3_2_1_16_1","volume-title":"Contrastive learning for weakly supervised phrase grounding. arXiv preprint arXiv:2006.09920","author":"Gupta Tanmay","year":"2020","unstructured":"Tanmay Gupta , Arash Vahdat , Gal Chechik , Xiaodong Yang , Jan Kautz , and Derek Hoiem . 2020. Contrastive learning for weakly supervised phrase grounding. arXiv preprint arXiv:2006.09920 ( 2020 ). Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, and Derek Hoiem. 2020. Contrastive learning for weakly supervised phrase grounding. arXiv preprint arXiv:2006.09920 (2020)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00473"},{"key":"e_1_3_2_1_19_1","volume-title":"European Conference on Computer Vision. Springer, 350--365","author":"Kantorov Vadim","year":"2016","unstructured":"Vadim Kantorov , Maxime Oquab , Minsu Cho , and Ivan Laptev . 2016 . Contextloc-net: Context-aware deep network models for weakly supervised localization . In European Conference on Computer Vision. Springer, 350--365 . Vadim Kantorov, Maxime Oquab, Minsu Cho, and Ivan Laptev. 2016. Contextloc-net: Context-aware deep network models for weakly supervised localization. In European Conference on Computer Vision. Springer, 350--365."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_1_21_1","volume-title":"Adam: A method for stochastic opti-mization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic opti-mization. arXiv preprint arXiv:1412.6980 (2014). Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic opti-mization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.162"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/3298023.3298174"},{"key":"e_1_3_2_1_26_1","volume-title":"Prophet Attention: Predicting Attention with Future Attention. Advances in Neural Information Processing Systems 33","author":"Liu Fenglin","year":"2020","unstructured":"Fenglin Liu , Xuancheng Ren , Xian Wu , Shen Ge , Wei Fan , Yuexian Zou , and Xu Sun . 2020 . Prophet Attention: Predicting Attention with Future Attention. Advances in Neural Information Processing Systems 33 (2020). Fenglin Liu, Xuancheng Ren, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou, and Xu Sun. 2020. Prophet Attention: Predicting Attention with Future Attention. Advances in Neural Information Processing Systems 33 (2020)."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3351074"},{"key":"e_1_3_2_1_28_1","volume-title":"Relation-aware In-stance Refinement for Weakly Supervised Visual Grounding. arXiv preprint arXiv:2103.12989","author":"Liu Yongfei","year":"2021","unstructured":"Yongfei Liu , Bo Wan , Lin Ma , and Xuming He. 2021. Relation-aware In-stance Refinement for Weakly Supervised Visual Grounding. arXiv preprint arXiv:2103.12989 ( 2021 ). Yongfei Liu, Bo Wan, Lin Ma, and Xuming He. 2021. Relation-aware In-stance Refinement for Weakly Supervised Visual Grounding. arXiv preprint arXiv:2103.12989 (2021)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6833"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58523-5_21"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.5555\/2380816.2380907"},{"key":"e_1_3_2_1_33_1","volume-title":"Unveiling the Potential of Structure-Preserving for Weakly Supervised Object Localization. arXiv preprint arXiv:2103.04523","author":"Pan Xingjia","year":"2021","unstructured":"Xingjia Pan , Yingguo Gao , Zhiwen Lin , Fan Tang , Weiming Dong , Haolei Yuan , Feiyue Huang , and Changsheng Xu. 2021. Unveiling the Potential of Structure-Preserving for Weakly Supervised Object Localization. arXiv preprint arXiv:2103.04523 ( 2021 ). Xingjia Pan, Yingguo Gao, Zhiwen Lin, Fan Tang, Weiming Dong, Haolei Yuan, Feiyue Huang, and Changsheng Xu. 2021. Unveiling the Potential of Structure-Preserving for Weakly Supervised Object Localization. arXiv preprint arXiv:2103.04523 (2021)."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01258-8_16"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.303"},{"key":"e_1_3_2_1_37_1","volume-title":"Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross Girshick , and Jian Sun . 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 ( 2015 ). Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015)."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_49"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2933735"},{"key":"e_1_3_2_1_41_1","volume-title":"Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization","author":"Singh Krishna Kumar","year":"2017","unstructured":"Krishna Kumar Singh and Yong Jae Lee . 2017 . Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization . In IEEE ICCV. 3544--3553. Krishna Kumar Singh and Yong Jae Lee. 2017. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In IEEE ICCV. 3544--3553."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.5555\/3044805.3045072"},{"key":"e_1_3_2_1_43_1","unstructured":"Eu Wern Teh Mrigank Rochan and Yang Wang. 2016. Attention Networks for Weakly Supervised Object Localization.. In BMVC. 1--11.  Eu Wern Teh Mrigank Rochan and Yang Wang. 2016. Attention Networks for Weakly Supervised Object Localization.. In BMVC. 1--11."},{"key":"e_1_3_2_1_44_1","volume-title":"Attention is all you need. arXiv preprint arXiv:1706.03762","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , Lukasz Kaiser , and Illia Polosukhin . 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 ( 2017 ). Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_1_47_1","volume-title":"Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation. arXiv preprint arXiv:2007.01951","author":"Wang Liwei","year":"2020","unstructured":"Liwei Wang , Jing Huang , Yin Li , Kun Xu , Zhengyuan Yang , and Dong Yu. 2020. Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation. arXiv preprint arXiv:2007.01951 ( 2020 ). Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, and Dong Yu. 2020. Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation. arXiv preprint arXiv:2007.01951 (2020)."},{"key":"e_1_3_2_1_48_1","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV). 434--450","author":"Wei Yunchao","year":"2018","unstructured":"Yunchao Wei , Zhiqiang Shen , Bowen Cheng , Honghui Shi , Jinjun Xiong , Jiashi Feng , and Thomas Huang . 2018 . Ts2c: Tight box mining with surrounding segmentation context for weakly supervised object detection . In Proceedings of the European Conference on Computer Vision (ECCV). 434--450 . Yunchao Wei, Zhiqiang Shen, Bowen Cheng, Honghui Shi, Jinjun Xiong, Jiashi Feng, and Thomas Huang. 2018. Ts2c: Tight box mining with surrounding segmentation context for weakly supervised object detection. In Proceedings of the European Conference on Computer Vision (ECCV). 434--450."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00142"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.648"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00612"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413746"},{"key":"e_1_3_2_1_55_1","volume-title":"Adversarial complementary learning for weakly supervised object localization","author":"Zhang Xiaolin","unstructured":"Xiaolin Zhang , Yunchao Wei , Jiashi Feng , Yi Yang , and Thomas S Huang . 2018. Adversarial complementary learning for weakly supervised object localization . In IEEE CVPR. 1325--1334. Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas S Huang. 2018. Adversarial complementary learning for weakly supervised object localization. In IEEE CVPR. 1325--1334."},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"crossref","unstructured":"Xiaolin Zhang Yunchao Wei Guoliang Kang Yi Yang and Thomas Huang. 2018. Self-produced guidance for weakly-supervised object localization. In ECCV. 597--613.  Xiaolin Zhang Yunchao Wei Guoliang Kang Yi Yang and Thomas Huang. 2018. Self-produced guidance for weakly-supervised object localization. In ECCV. 597--613.","DOI":"10.1007\/978-3-030-01258-8_37"},{"key":"e_1_3_2_1_57_1","first-page":"271","article-title":"Inter-image communication for weakly supervised localization","volume":"12364","author":"Zhang Xiaolin","year":"2020","unstructured":"Xiaolin Zhang , Yunchao Wei , and Yi Yang . 2020 . Inter-image communication for weakly supervised localization . In ECCV , Vol. 12364. 271 -- 287 . Xiaolin Zhang, Yunchao Wei, and Yi Yang. 2020. Inter-image communication for weakly supervised localization. In ECCV, Vol. 12364. 271--287.","journal-title":"ECCV"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.319"},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00674"},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00483"}],"event":{"name":"MM '21: ACM Multimedia Conference","location":"Virtual Event China","acronym":"MM '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 29th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475354","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3474085.3475354","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:49:19Z","timestamp":1750193359000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475354"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,17]]},"references-count":60,"alternative-id":["10.1145\/3474085.3475354","10.1145\/3474085"],"URL":"https:\/\/doi.org\/10.1145\/3474085.3475354","relation":{},"subject":[],"published":{"date-parts":[[2021,10,17]]},"assertion":[{"value":"2021-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}