{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T16:15:01Z","timestamp":1759335301308,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":50,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NIFA","award":["2020-67021-32799"],"award-info":[{"award-number":["2020-67021-32799"]}]},{"name":"NSF","award":["1718221 2008387 2045586 2106825"],"award-info":[{"award-number":["1718221 2008387 2045586 2106825"]}]},{"name":"MRI","award":["1725729"],"award-info":[{"award-number":["1725729"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548161","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:12Z","timestamp":1665416592000},"page":"3310-3318","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Ordered Attention for Coherent Visual Storytelling"],"prefix":"10.1145","author":[{"given":"Tom","family":"Braude","sequence":"first","affiliation":[{"name":"Reichman University & Microsoft, Hertzliya, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Idan","family":"Schwartz","sequence":"additional","affiliation":[{"name":"Technion, Haifa, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alex","family":"Schwing","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Champaign, IL, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ariel","family":"Shamir","sequence":"additional","affiliation":[{"name":"Reichman 
University, Hertzliya, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"crossref","unstructured":"Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2017. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR. Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2017. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_2_2_1","volume-title":"METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In ACL.","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie . 2005 . METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In ACL. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In ACL."},{"key":"e_1_3_2_2_3_1","volume-title":"David A. Forsyth, Nando de Freitas, David M. Blei, and Michael I. Jordan.","author":"Barnard Kobus","year":"2003","unstructured":"Kobus Barnard , Pinar Duygulu Sahin , David A. Forsyth, Nando de Freitas, David M. Blei, and Michael I. Jordan. 2003 . Matching Words and Pictures. JMLR ( 2003). Kobus Barnard, Pinar Duygulu Sahin, David A. Forsyth, Nando de Freitas, David M. Blei, and Michael I. Jordan. 2003. Matching Words and Pictures. JMLR (2003)."},{"key":"e_1_3_2_2_4_1","volume-title":"Mutan: Multimodal tucker fusion for visual question answering. In ICCV.","author":"Ben-Younes Hedi","year":"2017","unstructured":"Hedi Ben-Younes , R\u00e9mi Cadene , Matthieu Cord , and Nicolas Thome . 2017 . Mutan: Multimodal tucker fusion for visual question answering. In ICCV. 
Hedi Ben-Younes, R\u00e9mi Cadene, Matthieu Cord, and Nicolas Thome. 2017. Mutan: Multimodal tucker fusion for visual question answering. In ICCV."},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"crossref","unstructured":"Xinlei Chen and C. Lawrence Zitnick. 2015. Mind's eye: A recurrent visual representation for image caption generation. In CVPR. Xinlei Chen and C. Lawrence Zitnick. 2015. Mind's eye: A recurrent visual representation for image caption generation. In CVPR.","DOI":"10.1109\/CVPR.2015.7298856"},{"key":"e_1_3_2_2_6_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL (2019). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL (2019)."},{"key":"e_1_3_2_2_7_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly etal 2021. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR (2021). Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR (2021)."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"crossref","unstructured":"Ruichao Fan Hanli Wang Jinjing Gu and Xianhui Liu. 2021. Visual Storytelling with Hierarchical BERT Semantic Guidance. In ACM Multimedia Asia. 1--7. Ruichao Fan Hanli Wang Jinjing Gu and Xianhui Liu. 2021. Visual Storytelling with Hierarchical BERT Semantic Guidance. In ACM Multimedia Asia. 
1--7.","DOI":"10.1145\/3469877.3490604"},{"key":"e_1_3_2_2_9_1","volume-title":"Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach.","author":"Fukui Akira","year":"2016","unstructured":"Akira Fukui , Dong Huk Park , Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016 . Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP. Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP."},{"key":"e_1_3_2_2_10_1","volume-title":"Xiaogang Wang, and Hongsheng Li.","author":"Gao Peng","year":"2019","unstructured":"Peng Gao , Zhengkai Jiang , Haoxuan You , Pan Lu , Steven CH Hoi , Xiaogang Wang, and Hongsheng Li. 2019 . Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In CVPR. Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li. 2019. Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In CVPR."},{"key":"e_1_3_2_2_11_1","volume-title":"Show and Tell: A Neural Visual Storyteller. In Storytelling Workshop, NAACL.","author":"Gonzalez-Rico Diana","year":"2018","unstructured":"Diana Gonzalez-Rico and Gibran Fuentes Pineda . 2018 . Contextualize , Show and Tell: A Neural Visual Storyteller. In Storytelling Workshop, NAACL. Diana Gonzalez-Rico and Gibran Fuentes Pineda. 2018. Contextualize, Show and Tell: A Neural Visual Storyteller. In Storytelling Workshop, NAACL."},{"key":"e_1_3_2_2_12_1","unstructured":"Longteng Guo Jing Liu Xinxin Zhu Peng Yao Shichen Lu and Hanqing Lu. 2020. Normalized and geometry-aware self-attention network for image captioning. In CVPR. Longteng Guo Jing Liu Xinxin Zhu Peng Yao Shichen Lu and Hanqing Lu. 2020. Normalized and geometry-aware self-attention network for image captioning. 
In CVPR."},{"key":"e_1_3_2_2_13_1","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2015. Deep Residual Learning for Image Recognition. In CVPR. Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2015. Deep Residual Learning for Image Recognition. In CVPR."},{"key":"e_1_3_2_2_14_1","unstructured":"Ari Holtzman Jan Buys Li Du Maxwell Forbes and Yejin Choi. 2020. The curious case of neural text degeneration. In ICLR. Ari Holtzman Jan Buys Li Du Maxwell Forbes and Yejin Choi. 2020. The curious case of neural text degeneration. In ICLR."},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"crossref","unstructured":"Xudong Hong Rakshith Shetty Asad Sayeed Khushboo Mehra Vera Demberg and Bernt Schiele. 2020. Diverse and Relevant Visual Storytelling with Scene Graph Embeddings. In CONLL. Xudong Hong Rakshith Shetty Asad Sayeed Khushboo Mehra Vera Demberg and Bernt Schiele. 2020. Diverse and Relevant Visual Storytelling with Scene Graph Embeddings. In CONLL.","DOI":"10.18653\/v1\/2020.conll-1.34"},{"key":"e_1_3_2_2_16_1","volume-title":"Ting- Hao Huang, and Lun-Wei Ku","author":"Hsu Chao-Chun","year":"2020","unstructured":"Chao-Chun Hsu , Zi-Yuan Chen , Chi-Yang Hsu , Chih-Chia Li , Tzu-Yuan Lin , Ting- Hao Huang, and Lun-Wei Ku . 2020 . Knowledge-Enriched Visual Storytelling. In AAAI. Chao-Chun Hsu, Zi-Yuan Chen, Chi-Yang Hsu, Chih-Chia Li, Tzu-Yuan Lin, Ting- Hao Huang, and Lun-Wei Ku. 2020. Knowledge-Enriched Visual Storytelling. In AAAI."},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"crossref","unstructured":"Qiuyuan Huang Zhe Gan Asli \u00c7elikyilmaz Dapeng Wu Jianfeng Wang and Xiaodong He. 2018. Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation. In AAAI. Qiuyuan Huang Zhe Gan Asli \u00c7elikyilmaz Dapeng Wu Jianfeng Wang and Xiaodong He. 2018. Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation. 
In AAAI.","DOI":"10.1609\/aaai.v33i01.33018465"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"crossref","unstructured":"Ting-Hao Huang Francis Ferraro Nasrin Mostafazadeh Ishan Misra Aishwarya Agrawal Jacob Devlin Ross B. Girshick Xiaodong He Pushmeet Kohli Dhruv Batra C. Lawrence Zitnick Devi Parikh Lucy Vanderwende Michel Galley and Margaret Mitchell. 2016. Visual Storytelling. In NAACL. Ting-Hao Huang Francis Ferraro Nasrin Mostafazadeh Ishan Misra Aishwarya Agrawal Jacob Devlin Ross B. Girshick Xiaodong He Pushmeet Kohli Dhruv Batra C. Lawrence Zitnick Devi Parikh Lucy Vanderwende Michel Galley and Margaret Mitchell. 2016. Visual Storytelling. In NAACL.","DOI":"10.18653\/v1\/N16-1147"},{"key":"e_1_3_2_2_19_1","unstructured":"Jin-Hwa Kim Jaehyun Jun and Byoung-Tak Zhang. 2018. Bilinear attention networks. In NeurIPS. Jin-Hwa Kim Jaehyun Jun and Byoung-Tak Zhang. 2018. Bilinear attention networks. In NeurIPS."},{"key":"e_1_3_2_2_20_1","unstructured":"Jin-Hwa Kim Kyoung-Woon On Woosang Lim Jeonghee Kim Jung-Woo Ha and Byoung-Tak Zhang. 2017. Hadamard product for low-rank bilinear pooling. In ICLR. Jin-Hwa Kim Kyoung-Woon On Woosang Lim Jeonghee Kim Jung-Woo Ha and Byoung-Tak Zhang. 2017. Hadamard product for low-rank bilinear pooling. In ICLR."},{"key":"e_1_3_2_2_21_1","unstructured":"Taehyeong Kim Min-Oh Heo Seonil Son Kyoung-Wha Park and Byoung-Tak Zhang. 2018. GLAC Net: GLocal Attention Cascading Networks for Multi-image Cued Story Generation. In CoRR. Taehyeong Kim Min-Oh Heo Seonil Son Kyoung-Wha Park and Byoung-Tak Zhang. 2018. GLAC Net: GLocal Attention Cascading Networks for Multi-image Cued Story Generation. In CoRR."},{"key":"e_1_3_2_2_22_1","unstructured":"Jiacheng Li Haizhou Shi Siliang Tang Fei Wu and Yueting Zhuang. 2019. Informative Visual Storytelling with Cross-modal Rules. In MM. Jiacheng Li Haizhou Shi Siliang Tang Fei Wu and Yueting Zhuang. 2019. Informative Visual Storytelling with Cross-modal Rules. 
In MM."},{"key":"e_1_3_2_2_23_1","volume-title":"Oscar: Objectsemantics aligned pre-training for vision-language tasks. In ECCV.","author":"Li Xiujun","year":"2020","unstructured":"Xiujun Li , Xi Yin , Chunyuan Li , Pengchuan Zhang , Xiaowei Hu , Lei Zhang , Lijuan Wang , Houdong Hu , Li Dong , Furu Wei , 2020 . Oscar: Objectsemantics aligned pre-training for vision-language tasks. In ECCV. Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Objectsemantics aligned pre-training for vision-language tasks. In ECCV."},{"key":"e_1_3_2_2_24_1","volume-title":"ROUGE: A Package For Automatic Evaluation Of Summaries. In ACL.","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin . 2004 . ROUGE: A Package For Automatic Evaluation Of Summaries. In ACL. Chin-Yew Lin. 2004. ROUGE: A Package For Automatic Evaluation Of Summaries. In ACL."},{"key":"e_1_3_2_2_25_1","unstructured":"Jiasen Lu Jianwei Yang Dhruv Batra and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In NeurIPS. Jiasen Lu Jianwei Yang Dhruv Batra and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In NeurIPS."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-1805"},{"key":"e_1_3_2_2_27_1","unstructured":"Yingwei Pan Ting Yao Yehao Li and Tao Mei. 2020. X-linear attention networks for image captioning. In CVPR. Yingwei Pan Ting Yao Yehao Li and Tao Mei. 2020. X-linear attention networks for image captioning. In CVPR."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"crossref","unstructured":"Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu. 2001. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL. Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu. 2001. Bleu: a Method for Automatic Evaluation of Machine Translation. 
In ACL.","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_2_2_29_1","volume-title":"Park and Gunhee Kim","author":"Cesc","year":"2015","unstructured":"Cesc C. Park and Gunhee Kim . 2015 . Expressing an Image Stream with a Sequence of Natural Sentences. In NeurIPS. Cesc C. Park and Gunhee Kim. 2015. Expressing an Image Stream with a Sequence of Natural Sentences. In NeurIPS."},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475236"},{"key":"e_1_3_2_2_31_1","volume-title":"Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al.","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong Wook Kim , Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021 . Learning transferable visual models from natural language supervision. In ICML. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML."},{"key":"e_1_3_2_2_32_1","unstructured":"Idan Schwartz Alexander G. Schwing and Tamir Hazan. 2017. High-Order Attention Models for Visual Question Answering. In NeurIPS. Idan Schwartz Alexander G. Schwing and Tamir Hazan. 2017. High-Order Attention Models for Visual Question Answering. In NeurIPS."},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"crossref","unstructured":"Idan Schwartz Seunghak Yu Tamir Hazan and Alexander G Schwing. 2019. Factor graph attention. In CVPR. Idan Schwartz Seunghak Yu Tamir Hazan and Alexander G Schwing. 2019. Factor graph attention. 
In CVPR.","DOI":"10.1109\/CVPR.2019.00214"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2020.101169"},{"key":"e_1_3_2_2_35_1","volume-title":"Le","author":"Sutskever Ilya","year":"2014","unstructured":"Ilya Sutskever , Oriol Vinyals , and Quoc V . Le . 2014 . Sequence to Sequence Learning with Neural Networks. In NeurIPS. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NeurIPS."},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"crossref","unstructured":"Christian Szegedy Vincent Vanhoucke Sergey Ioffe Jonathon Shlens and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. In CVPR. Christian Szegedy Vincent Vanhoucke Sergey Ioffe Jonathon Shlens and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. In CVPR.","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_2_2_37_1","volume-title":"Zero-Shot Imageto- Text Generation for Visual-Semantic Arithmetic. CVPR","author":"Tewel Yoad","year":"2022","unstructured":"Yoad Tewel , Yoav Shalev , Idan Schwartz , and Lior Wolf . 2022. Zero-Shot Imageto- Text Generation for Visual-Semantic Arithmetic. CVPR ( 2022 ). Yoad Tewel, Yoav Shalev, Idan Schwartz, and Lior Wolf. 2022. Zero-Shot Imageto- Text Generation for Visual-Semantic Arithmetic. CVPR (2022)."},{"key":"e_1_3_2_2_38_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS. Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS."},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"crossref","unstructured":"Ramakrishna Vedantam C. Lawrence Zitnick and Devi Parikh. 2014. CIDEr: Consensus-based image description evaluation. In CVPR. Ramakrishna Vedantam C. Lawrence Zitnick and Devi Parikh. 2014. 
CIDEr: Consensus-based image description evaluation. In CVPR.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"crossref","unstructured":"Oriol Vinyals Alexander Toshev Samy Bengio and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. In CVPR. Oriol Vinyals Alexander Toshev Samy Bengio and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. In CVPR.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"crossref","unstructured":"Ruize Wang Zhongyu Wei Piji Li Qi Zhang and Xuanjing Huang. 2019. Storytelling from an Image Stream Using Scene Graphs. In AAAI. Ruize Wang Zhongyu Wei Piji Li Qi Zhang and Xuanjing Huang. 2019. Storytelling from an Image Stream Using Scene Graphs. In AAAI.","DOI":"10.1609\/aaai.v34i05.6455"},{"key":"e_1_3_2_2_42_1","volume-title":"Yuan fang Wang, and William Yang Wang","author":"Wang Xin","year":"2018","unstructured":"Xin Wang , Wenhu Chen , Yuan fang Wang, and William Yang Wang . 2018 . No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling. In ACL. Xin Wang, Wenhu Chen, Yuan fang Wang, and William Yang Wang. 2018. No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling. In ACL."},{"key":"e_1_3_2_2_43_1","unstructured":"Huijuan Xu and Kate Saenko. 2016. Ask attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV. Huijuan Xu and Kate Saenko. 2016. Ask attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV."},{"key":"e_1_3_2_2_44_1","unstructured":"Kelvin Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C. Courville Ruslan Salakhutdinov Richard S. Zemel and Yoshua Bengio. 2015. Show Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML. Kelvin Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C. Courville Ruslan Salakhutdinov Richard S. Zemel and Yoshua Bengio. 2015. 
Show Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML."},{"key":"e_1_3_2_2_45_1","volume-title":"Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling. In IJCAI.","author":"Yang Pengcheng","year":"2019","unstructured":"Pengcheng Yang , Fuli Luo , Peng Chen , Lei Li , Zhiyi Yin , Xiaodong He , and Xu Sun . 2019 . Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling. In IJCAI. Pengcheng Yang, Fuli Luo, Peng Chen, Lei Li, Zhiyi Yin, Xiaodong He, and Xu Sun. 2019. Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling. In IJCAI."},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"crossref","unstructured":"Zichao Yang Xiaodong He Jianfeng Gao Li Deng and Alexander J Smola. 2015. Stacked attention networks for image question answering. Zichao Yang Xiaodong He Jianfeng Gao Li Deng and Alexander J Smola. 2015. Stacked attention networks for image question answering.","DOI":"10.1109\/CVPR.2016.10"},{"key":"e_1_3_2_2_47_1","unstructured":"Youngjae Yu Jiwan Chung Heeseung Yun Jongseok Kim and Gunhee Kim. 2021. Transitional Adaptation of Pretrained Models for Visual Storytelling. In CVPR. Youngjae Yu Jiwan Chung Heeseung Yun Jongseok Kim and Gunhee Kim. 2021. Transitional Adaptation of Pretrained Models for Visual Storytelling. In CVPR."},{"key":"e_1_3_2_2_48_1","unstructured":"Zhou Yu Jun Yu Chenchao Xiang Jianping Fan and Dacheng Tao. 2018. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. In NeurIPS. Zhou Yu Jun Yu Chenchao Xiang Jianping Fan and Dacheng Tao. 2018. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. In NeurIPS."},{"key":"e_1_3_2_2_49_1","volume-title":"ICCV. Workshop on Closing the Loop Between Vision and Language.","author":"Zhang Bowen","year":"2020","unstructured":"Bowen Zhang , Hexiang Hu , and Fei Sha . 2020 . 
Visual Storytelling via Predicting AnchorWord Embeddings in the Stories . In ICCV. Workshop on Closing the Loop Between Vision and Language. Bowen Zhang, Hexiang Hu, and Fei Sha. 2020. Visual Storytelling via Predicting AnchorWord Embeddings in the Stories. In ICCV. Workshop on Closing the Loop Between Vision and Language."},{"key":"e_1_3_2_2_50_1","volume-title":"Vinvl: Revisiting visual representations in vision-language models. In CVPR.","author":"Zhang Pengchuan","year":"2021","unstructured":"Pengchuan Zhang , Xiujun Li , Xiaowei Hu , Jianwei Yang , Lei Zhang , LijuanWang, Yejin Choi , and Jianfeng Gao . 2021 . Vinvl: Revisiting visual representations in vision-language models. In CVPR. Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, LijuanWang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In CVPR."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Lisboa Portugal","acronym":"MM '22"},"container-title":["Proceedings of the 30th ACM International Conference on 
Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548161","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548161","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548161","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:19Z","timestamp":1750186819000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548161"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":50,"alternative-id":["10.1145\/3503161.3548161","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548161","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}