{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,27]],"date-time":"2025-06-27T04:13:26Z","timestamp":1750997606758,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":25,"publisher":"ACM","license":[{"start":{"date-parts":[[2017,10,23]],"date-time":"2017-10-23T00:00:00Z","timestamp":1508716800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Basic Research Program of China","award":["2015CB352302"],"award-info":[{"award-number":["2015CB352302"]}]},{"name":"Chinese Knowledge Center of Engineering Science and Technology","award":["124001- D01206"],"award-info":[{"award-number":["124001- D01206"]}]},{"name":"Qianjiang Talents Program of Zhejiang Province 2015","award":["2015C01027"],"award-info":[{"award-number":["2015C01027"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["U1611461, U1509206, 61625107"],"award-info":[{"award-number":["U1611461, U1509206, 61625107"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2017,10,23]]},"DOI":"10.1145\/3126686.3126715","type":"proceedings-article","created":{"date-parts":[[2017,10,23]],"date-time":"2017-10-23T19:20:32Z","timestamp":1508786432000},"page":"271-279","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Learning Deep Contextual Attention Network for Narrative Photo Stream Captioning"],"prefix":"10.1145","author":[{"given":"Hanqi","family":"Wang","sequence":"first","affiliation":[{"name":"Zhejiang University, Hangzhou, 
China"}]},{"given":"Siliang","family":"Tang","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"given":"Yin","family":"Zhang","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"given":"Tao","family":"Mei","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, Beijing, China"}]},{"given":"Yueting","family":"Zhuang","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"given":"Fei","family":"Wu","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2017,10,23]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1179"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. 2015. Language Models for Image Captioning: The Quirks and What Works. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. 100--105.","DOI":"10.3115\/v1\/P15-2017"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"crossref","unstructured":"Desmond Elliott and Frank Keller. 2013. Image Description using Visual Dependency Representations. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1292--1302.","DOI":"10.18653\/v1\/D13-1128"},
{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"e_1_3_2_1_5_1","volume-title":"Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth.","author":"Farhadi Ali","year":"2010","unstructured":"Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every Picture Tells a Story: Generating Sentences from Images. Proceedings of the 11th European Conference on Computer Vision. 15--29."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. 2014. Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections. 529--545.","DOI":"10.1007\/978-3-319-10593-2_35"},
{"key":"e_1_3_2_1_7_1","volume-title":"Visual Storytelling Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1233--1239","author":"Huang Ting-Hao","year":"2016","unstructured":"Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual Storytelling. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1233--1239."},{"key":"e_1_3_2_1_8_1","volume-title":"DenseCap: Fully Convolutional Localization Networks for Dense Captioning Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 4565--4574","author":"Johnson Justin","year":"2016","unstructured":"Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 4565--4574."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.162"},
{"volume-title":"Let Your Photos Talk: Generating Narrative Paragraph for Photo Stream via Bidirectional Attention Recurrent Neural Networks Proceedings of the 31st AAAI Conference on Artificial Intelligence","author":"Liu Yu","key":"e_1_3_2_1_10_1","unstructured":"Yu Liu, Jianlong Fu, Tao Mei, and Chang Wen Chen. 2017. Let Your Photos Talk: Generating Narrative Paragraph for Photo Stream via Bidirectional Attention Recurrent Neural Networks. Proceedings of the 31st AAAI Conference on Artificial Intelligence."},{"key":"e_1_3_2_1_11_1","volume-title":"Nonparametric Method for Data-driven Image Captioning Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 592--598","author":"Mason Rebecca","year":"2014","unstructured":"Rebecca Mason and Eugene Charniak. 2014. Nonparametric Method for Data-driven Image Captioning. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 592--598."},{"key":"e_1_3_2_1_12_1","volume-title":"Berg","author":"Ordonez Vicente","year":"2011","unstructured":"Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing Images Using 1 Million Captioned Photographs. Advances in Neural Information Processing Systems 24. 1143--1151."},
{"key":"e_1_3_2_1_13_1","unstructured":"Cesc Chunseong Park and Gunhee Kim. 2015. Expressing an Image Stream with a Sequence of Natural Sentences. Advances in Neural Information Processing Systems 28. 73--81."},{"key":"e_1_3_2_1_14_1","first-page":"91","article-title":"Faster R-CNN","volume":"28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Advances in Neural Information Processing Systems 28. 91--99.","journal-title":"Towards Real-Time Object Detection with Region Proposal Networks. Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Mohit Sharma, Jiayu Zhou, Junling Hu, and George Karypis. 2015. Feature-based factorized Bilinear Similarity Model for Cold-Start Top-n Item Recommendation. Proceedings of the 2015 SIAM International Conference on Data Mining. 190--198.","DOI":"10.1137\/1.9781611974010.22"},
{"key":"e_1_3_2_1_16_1","unstructured":"Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-To-End Memory Networks. Advances in Neural Information Processing Systems 28. 2440--2448."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.306"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.515"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"crossref","unstructured":"Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1494--1504.","DOI":"10.3115\/v1\/N15-1173"},
{"key":"e_1_3_2_1_20_1","volume-title":"Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. 3156--3164","author":"Vinyals Oriol","year":"2016","unstructured":"Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2016. Show and Tell: A Neural Image Caption Generator. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. 3156--3164."},{"key":"e_1_3_2_1_21_1","volume-title":"Attend and Tell: Neural Image Caption Generation with Visual Attention Proceedings of The 32nd International Conference on Machine Learning. 2048--2057","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of The 32nd International Conference on Machine Learning. 2048--2057."},{"key":"e_1_3_2_1_22_1","volume-title":"Salakhutdinov","author":"Yang Zhilin","year":"2016","unstructured":"Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, and Ruslan R. Salakhutdinov. 2016. Review Networks for Caption Generation. In Advances in Neural Information Processing Systems 29. 2361--2369."},
{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.512"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_3_2_1_25_1","volume-title":"Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 4584--4593","author":"Yu Haonan","year":"2016","unstructured":"Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 4584--4593."}],
"event":{"name":"MM '17: ACM Multimedia Conference","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Mountain View California USA","acronym":"MM '17"},"container-title":["Proceedings of the on Thematic Workshops of ACM Multimedia 2017"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3126686.3126715","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3126686.3126715","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,26]],"date-time":"2025-06-26T17:48:25Z","timestamp":1750960105000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3126686.3126715"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,10,23]]},"references-count":25,"alternative-id":["10.1145\/3126686.3126715","10.1145\/3126686"],"URL":"https:\/\/doi.org\/10.1145\/3126686.3126715","relation":{},"subject":[],"published":{"date-parts":[[2017,10,23]]},"assertion":[{"value":"2017-10-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}