{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,16]],"date-time":"2026-05-16T16:13:41Z","timestamp":1778948021913,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":24,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,8,21]],"date-time":"2021-08-21T00:00:00Z","timestamp":1629504000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,8,21]]},"DOI":"10.1145\/3463945.3469054","type":"proceedings-article","created":{"date-parts":[[2021,8,27]],"date-time":"2021-08-27T14:29:53Z","timestamp":1630074593000},"page":"4-13","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Be Specific, Be Clear: Bridging Machine and Human Captions by Scene-Guided Transformer"],"prefix":"10.1145","author":[{"given":"Yupan","family":"Huang","sequence":"first","affiliation":[{"name":"Sun Yat-sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhaoyang","family":"Zeng","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yutong","family":"Lu","sequence":"additional","affiliation":[{"name":"Sun Yat-sen University, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,8,27]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_2_3_1","volume-title":"Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific 
contexts","author":"Fu Kun","year":"2016","unstructured":"Kun Fu, Junqi Jin, Runpeng Cui, Fei Sha, and Changshui Zhang. 2016. Aligning where to see and what to tell: Image captioning with region-based attention and scene-specific contexts. IEEE transactions on pattern analysis and machine intelligence, Vol. 39, 12 (2016), 2321--2334."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2894139"},{"key":"e_1_3_2_2_5_1","volume-title":"2019 a. Show, Tell and Polish: Ruminant Decoding for Image Captioning","author":"Guo Longteng","year":"2019","unstructured":"Longteng Guo, Jing Liu, Shichen Lu, and Hanqing Lu. 2019 a. Show, Tell and Polish: Ruminant Decoding for Image Captioning. IEEE Transactions on Multimedia (2019)."},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350943"},{"key":"e_1_3_2_2_7_1","volume-title":"Image Captioning: Transforming Objects into Words. In Advances in Neural Information Processing Systems. 11135--11145.","author":"Herdade Simao","year":"2019","unstructured":"Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. Image Captioning: Transforming Objects into Words. In Advances in Neural Information Processing Systems. 
11135--11145."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00473"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00898"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00902"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2896516"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_2_14_1","unstructured":"Fenglin Liu, Yuanxin Liu, Xuancheng Ren, Xiaodong He, and Xu Sun. 2019. Aligning visual regions and textual concepts for semantic-grounded image representations. In Advances in Neural Information Processing Systems. 6847--6857."},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_21"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00754"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_2_2_19_1","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 
5998--6008."},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00435"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.524"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.2976552"}],"event":{"name":"ICMR '21: International Conference on Multimedia Retrieval","location":"Taipei Taiwan","acronym":"ICMR '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3463945.3469054","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3463945.3469054","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:12:15Z","timestamp":1750191135000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3463945.3469054"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,21]]},"references-count":24,"alternative-id":["10.1145\/3463945.3469054","10.1145\/3463945"],"URL":"https:\/\/doi.org\/10.1145\/3463945.3469054","relation":{},"subject":[],"published":{"date-parts":[[2021,8,21]]},"assertion":[{"value":"2021-08-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}