{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T23:01:55Z","timestamp":1773270115144,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":46,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T00:00:00Z","timestamp":1602460800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Natural Science Foundation of China","award":["61925204"],"award-info":[{"award-number":["61925204"]}]},{"name":"National Key Research and Development Program of China","award":["2018AAA0102002"],"award-info":[{"award-number":["2018AAA0102002"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,10,12]]},"DOI":"10.1145\/3394171.3413753","type":"proceedings-article","created":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T13:10:18Z","timestamp":1602508218000},"page":"4337-4345","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":44,"title":["Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning"],"prefix":"10.1145","author":[{"given":"Jing","family":"Wang","sequence":"first","affiliation":[{"name":"Nanjing University of Science and Technology, Nanjing, China"}]},{"given":"Jinhui","family":"Tang","sequence":"additional","affiliation":[{"name":"Nanjing University of Science and Technology, Nanjing, China"}]},{"given":"Jiebo","family":"Luo","sequence":"additional","affiliation":[{"name":"University of Rochester, Rochester, NY, USA"}]}],"member":"320","published-online":{"date-parts":[[2020,10,12]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Word spotting and recognition with embedded attributes","author":"Almaz\u00e1n 
Jon","year":"2014","unstructured":"Jon Almaz\u00e1n, Albert Gordo, Alicia Forn\u00e9s, and Ernest Valveny. 2014. Word spotting and recognition with embedded attributes. IEEE transactions on pattern analysis and machine intelligence 36, 12 (2014), 2552--2566."},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_3_2_2_3_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6077--6086","author":"Anderson Peter","year":"2018","unstructured":"Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and VQA. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6077--6086."},{"key":"e_1_3_2_2_4_1","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization. 65--72","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. 
METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization. 65--72."},{"key":"e_1_3_2_2_5_1","volume-title":"Icdar 2019 competition on scene text visual question answering. arXiv preprint arXiv:1907.00490","author":"Biten Ali Furkan","year":"2019","unstructured":"Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Mar\u00e7al Rusinol, Minesh Mathew, CV Jawahar, Ernest Valveny, and Dimosthenis Karatzas. 2019. Icdar 2019 competition on scene text visual question answering. arXiv preprint arXiv:1907.00490 (2019)."},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00439"},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219861"},{"key":"e_1_3_2_2_9_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. 
arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-017-1738-7"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.254"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00527"},{"key":"e_1_3_2_2_13_1","volume-title":"Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA. arXiv preprint arXiv:1911.06258","author":"Hu Ronghang","year":"2019","unstructured":"Ronghang Hu, Amanpreet Singh, Trevor Darrell, and Marcus Rohrbach. 2019. Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA. arXiv preprint arXiv:1911.06258 (2019)."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00473"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0823-z"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01216-8_31"},{"key":"e_1_3_2_2_17_1","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Kingma Diederik P","year":"2015","unstructured":"Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. 
Proceedings of the International Conference on Learning Representations (2015)."},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00902"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.560"},{"key":"e_1_3_2_2_20_1","volume-title":"Deep collaborative embedding for social image understanding","author":"Li Zechao","year":"2018","unstructured":"Zechao Li, Jinhui Tang, and Tao Mei. 2018. Deep collaborative embedding for social image understanding. IEEE transactions on pattern analysis and machine intelligence 41, 9 (2018), 2070--2083."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.11196"},{"key":"e_1_3_2_2_22_1","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshop: Text Summarization Branches Out","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshop: Text Summarization Branches Out 2004. 74--81."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00595"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2019.00156"},{"key":"e_1_3_2_2_25_1","volume-title":"Proceedings of the Association for Computational Linguistics. 311--318","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. 
BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics. 311--318."},{"key":"e_1_3_2_2_26_1","volume-title":"Phrase Localization and Visual Relationship Detection With Comprehensive Image-Language Cues. In The IEEE International Conference on Computer Vision (ICCV).","author":"Plummer Bryan A.","year":"2017","unstructured":"Bryan A. Plummer, Arun Mallya, Christopher M. Cervantes, Julia Hockenmaier, and Svetlana Lazebnik. 2017. Phrase Localization and Visual Relationship Detection With Comprehensive Image-Language Cues. In The IEEE International Conference on Computer Vision (ICCV)."},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00856"},{"key":"e_1_3_2_2_28_1","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91--99."},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_2_2_30_1","volume-title":"TextCaps: a Dataset for Image Captioning with Reading Comprehension. 
arXiv preprint arXiv:2003.12462","author":"Sidorov Oleksii","year":"2020","unstructured":"Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. 2020. TextCaps: a Dataset for Image Captioning with Reading Comprehension. arXiv preprint arXiv:2003.12462 (2020)."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"crossref","unstructured":"Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8317--8326.","DOI":"10.1109\/CVPR.2019.00851"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2007.4376991"},{"key":"e_1_3_2_2_33_1","volume-title":"Social anchor-unit graph regularized tensor completion for large-scale image retagging","author":"Tang Jinhui","year":"2019","unstructured":"Jinhui Tang, Xiangbo Shu, Zechao Li, Yu-Gang Jiang, and Qi Tian. 2019. Social anchor-unit graph regularized tensor completion for large-scale image retagging. 
IEEE transactions on pattern analysis and machine intelligence 41, 8 (2019), 2027--2034."},{"key":"e_1_3_2_2_34_1","volume-title":"Tri-clustered tensor completion for social-aware image tag refinement","author":"Tang Jinhui","year":"2016","unstructured":"Jinhui Tang, Xiangbo Shu, Guo-Jun Qi, Zechao Li, Meng Wang, Shuicheng Yan, and Ramesh Jain. 2016. Tri-clustered tensor completion for social-aware image tag refinement. IEEE transactions on pattern analysis and machine intelligence 39, 8 (2016), 1662--1674."},{"key":"e_1_3_2_2_35_1","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008."},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_2_38_1","volume-title":"Thirty-Second AAAI Conference on Artificial Intelligence. 7396--7403","author":"Wang Jing","year":"2018","unstructured":"Jing Wang, Jianlong Fu, Jinhui Tang, Zechao Li, and Tao Mei. 2018. 
Show, reward and tell: Automatic generation of narrative paragraph from photo stream by adversarial training. In Thirty-Second AAAI Conference on Artificial Intelligence. 7396--7403."},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2019\/132"},{"key":"e_1_3_2_2_40_1","volume-title":"Proceedings of the International Conference on Machine Learning. 2048--2057","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. 2048--2057."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00427"},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_3_2_2_43_1","unstructured":"Zhilin Yang, Ye Yuan, Yuexin Wu, William W Cohen, and Ruslan R Salakhutdinov. 2016. Review networks for caption generation. In Advances in Neural Information Processing Systems. 
2361--2369."},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00271"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"}],"event":{"name":"MM '20: The 28th ACM International Conference on Multimedia","location":"Seattle WA USA","acronym":"MM '20","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 28th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413753","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3394171.3413753","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:01:16Z","timestamp":1750197676000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413753"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,12]]},"references-count":46,"alternative-id":["10.1145\/3394171.3413753","10.1145\/3394171"],"URL":"https:\/\/doi.org\/10.1145\/3394171.3413753","relation":{},"subject":[],"published":{"date-parts":[[2020,10,12]]},"assertion":[{"value":"2020-10-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}