{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,10]],"date-time":"2025-11-10T21:14:56Z","timestamp":1762809296564,"version":"3.41.0"},"reference-count":68,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,2,6]],"date-time":"2023-02-06T00:00:00Z","timestamp":1675641600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62172256 and 61872428"],"award-info":[{"award-number":["62172256 and 61872428"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Shandong Province Key Research and Development Program","award":["2019JZZY010127"],"award-info":[{"award-number":["2019JZZY010127"]}]},{"DOI":"10.13039\/501100007129","name":"Natural Science Foundation of Shandong Province","doi-asserted-by":"crossref","award":["ZR2019ZD06"],"award-info":[{"award-number":["ZR2019ZD06"]}],"id":[{"id":"10.13039\/501100007129","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Major Program of the National Natural Science Foundation of China","award":["61991411"],"award-info":[{"award-number":["61991411"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,5,31]]},"abstract":"<jats:p>\n            Video captioning, which bridges vision and language, is a fundamental yet challenging task in computer vision. To generate accurate and comprehensive sentences, both visual and semantic information is quite important. However, most existing methods simply concatenate different types of features and ignore the interactions between them. In addition, there is a large semantic gap between visual feature space and semantic embedding space, making the task very challenging. To address these issues, we propose a framework named semantic embedding guided attention with\n            <jats:bold>Explicit visual Feature Fusion for vidEo CapTioning, EFFECT<\/jats:bold>\n            for short, in which we design an\n            <jats:bold>explicit visual-feature fusion (EVF)<\/jats:bold>\n            scheme to capture the pairwise interactions between multiple visual modalities and fuse multimodal visual features of videos in an explicit way. Furthermore, we propose a novel attention mechanism called\n            <jats:bold>semantic embedding guided attention (SEGA<\/jats:bold>\n            ), which cooperates with the temporal attention to generate a joint attention map. Specifically, in SEGA, the semantic word embedding information is leveraged to guide the model to pay more attention to the most correlated visual features at each decoding stage. In this way, the semantic gap between visual and semantic space is alleviated to some extent. To evaluate the proposed model, we conduct extensive experiments on two widely used datasets, i.e., MSVD and MSR-VTT. 
The experimental results demonstrate that our approach achieves state-of-the-art results in terms of four evaluation metrics.\n          <\/jats:p>","DOI":"10.1145\/3550276","type":"journal-article","created":{"date-parts":[[2022,7,22]],"date-time":"2022-07-22T11:13:27Z","timestamp":1658488407000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2500-9488","authenticated-orcid":false,"given":"Shanshan","family":"Dong","sequence":"first","affiliation":[{"name":"School of Software, Shandong University, Jinan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7389-5883","authenticated-orcid":false,"given":"Tianzi","family":"Niu","sequence":"additional","affiliation":[{"name":"School of Software, Shandong University, Jinan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6901-5476","authenticated-orcid":false,"given":"Xin","family":"Luo","sequence":"additional","affiliation":[{"name":"School of Software, Shandong University, Jinan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1633-7575","authenticated-orcid":false,"given":"Wu","family":"Liu","sequence":"additional","affiliation":[{"name":"JD AI Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9972-7370","authenticated-orcid":false,"given":"Xinshun","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Software, Shandong University, Jinan, China"}]}],"member":"320","published-online":{"date-parts":[[2023,2,6]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"12487","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Aafaq Nayyer","year":"2019","unstructured":"Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, and Ajmal Mian. 2019. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.12487\u201312496."},{"key":"e_1_3_1_3_2","first-page":"2664","volume-title":"Proc. Conf. Empirical Methods Natural Lang. Process.","author":"Behnke Maximiliana","year":"2020","unstructured":"Maximiliana Behnke and Kenneth Heafield. 2020. Losing heads in the lottery: Pruning transformer attention in neural machine translation. In Proc. Conf. Empirical Methods Natural Lang. Process.2664\u20132674."},{"key":"e_1_3_1_4_2","first-page":"76","volume-title":"Proc. CHI Conf. Hum. Factors Comput. Syst.","author":"Bennett Cynthia L.","year":"2018","unstructured":"Cynthia L. Bennett, Jane E. Martez, E. Mott, Edward Cutrell, and Meredith Ringel Morris. 2018. How teens with visual impairments take, edit, and share photos on social media. In Proc. CHI Conf. Hum. Factors Comput. Syst.76."},{"key":"e_1_3_1_5_2","first-page":"4724","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Carreira Jo\u00e3o","year":"2017","unstructured":"Jo\u00e3o Carreira and Andrew Zisserman. 2017. Quo Vadis, action recognition? A new model and the kinetics dataset. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.4724\u20134733."},{"key":"e_1_3_1_6_2","first-page":"190","volume-title":"Proc. Annu. Meeting Assoc. Comput. Linguistics","author":"Chen David L.","year":"2011","unstructured":"David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proc. Annu. Meeting Assoc. Comput. Linguistics. 
190\u2013200."},{"key":"e_1_3_1_7_2","first-page":"333","volume-title":"Proc. Eur. Conf. Comput. Vis.","author":"Chen Shaoxiang","year":"2020","unstructured":"Shaoxiang Chen, Wenhao Jiang, Wei Liu, and Yu-Gang Jiang. 2020. Learning modality interaction for temporal sentence localization and event captioning in videos. In Proc. Eur. Conf. Comput. Vis.333\u2013351."},{"key":"e_1_3_1_8_2","first-page":"8191","volume-title":"Proc. AAAI Conf. Artif. Intell.","author":"Chen Shaoxiang","year":"2019","unstructured":"Shaoxiang Chen and Yu-Gang Jiang. 2019. Motion guided spatial attention for video captioning. In Proc. AAAI Conf. Artif. Intell.8191\u20138198."},{"key":"e_1_3_1_9_2","first-page":"1523","volume-title":"Proc. IEEE Int. Conf. Comput. Vis.","author":"Chen Shaoxiang","year":"2021","unstructured":"Shaoxiang Chen and Yu-Gang Jiang. 2021. Motion guided region message passing for video captioning. In Proc. IEEE Int. Conf. Comput. Vis.1523\u20131532."},{"key":"e_1_3_1_10_2","first-page":"8425","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Chen Shaoxiang","year":"2021","unstructured":"Shaoxiang Chen and Yu-Gang Jiang. 2021. Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.8425\u20138435."},{"key":"e_1_3_1_11_2","unstructured":"Xinlei Chen Hao Fang Tsung-Yi Lin Ramakrishna Vedantam Saurabh Gupta Piotr Doll\u00e1r and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325. http:\/\/arxiv.org\/abs\/1504.00325."},{"key":"e_1_3_1_12_2","first-page":"367","volume-title":"Proc. Eur. Conf. Comput. Vis.","author":"Chen Yangyu","year":"2018","unstructured":"Yangyu Chen, Shuhui Wang, Weigang Zhang, and Qingming Huang. 2018. Less is more: Picking informative frames for video captioning. In Proc. Eur. Conf. Comput. Vis.367\u2013384."},{"key":"e_1_3_1_13_2","first-page":"234","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Deng Chaorui","year":"2021","unstructured":"Chaorui Deng, Shizhe Chen, Da Chen, Yuan He, and Qi Wu. 2021. Sketch, ground, and refine: Top-down dense video captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.234\u2013243."},{"key":"e_1_3_1_14_2","doi-asserted-by":"crossref","first-page":"376","DOI":"10.3115\/v1\/W14-3348","volume-title":"Proc. 9th Workshop Stat. Mach. Transl.","author":"Denkowski Michael J.","year":"2014","unstructured":"Michael J. Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proc. 9th Workshop Stat. Mach. Transl.376\u2013380."},{"key":"e_1_3_1_15_2","first-page":"457","volume-title":"Proc. Conf. Empirical Methods Natural Lang. Process.","author":"Fukui Akira","year":"2016","unstructured":"Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proc. Conf. Empirical Methods Natural Lang. Process.457\u2013468."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2729019"},{"key":"e_1_3_1_17_2","first-page":"2712","volume-title":"Proc. IEEE Int. Conf. Comput. Vis.","author":"Guadarrama Sergio","year":"2013","unstructured":"Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond J. Mooney, Trevor Darrell, and Kate Saenko. 2013. 
YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proc. IEEE Int. Conf. Comput. Vis.2712\u20132719."},{"key":"e_1_3_1_18_2","first-page":"729","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Guo Hao","year":"2019","unstructured":"Hao Guo, Kang Zheng, Xiaochuan Fan, Hongkai Yu, and Song Wang. 2019. Visual attention consistency under image transforms for multi-label image classification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.729\u2013739."},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3436494"},{"key":"e_1_3_1_20_2","first-page":"4203","volume-title":"Proc. IEEE Int. Conf. Comput. Vis.","author":"Hori Chiori","year":"2017","unstructured":"Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks, and Kazuhiko Sumi. 2017. Attention-based multimodal fusion for video description. In Proc. IEEE Int. Conf. Comput. Vis.4203\u20134212."},{"key":"e_1_3_1_21_2","first-page":"8917","volume-title":"Proc. IEEE Int. Conf. Comput. Vis.","author":"Hou Jingyi","year":"2019","unstructured":"Jingyi Hou, Xinxiao Wu, Wentian Zhao, Jiebo Luo, and Yunde Jia. 2019. Joint syntax representation learning and visual cue translation for video captioning. In Proc. IEEE Int. Conf. Comput. Vis.8917\u20138926."},{"key":"e_1_3_1_22_2","first-page":"774","volume-title":"Proc. ACM Multimedia Conf.","author":"Hu Yaosi","year":"2019","unstructured":"Yaosi Hu, Zhenzhong Chen, Zheng-Jun Zha, and Feng Wu. 2019. Hierarchical global-local temporal modeling for video captioning. In Proc. ACM Multimedia Conf.774\u2013783."},{"issue":"2","key":"e_1_3_1_23_2","first-page":"20:1\u201320:19","article-title":"V-JAUNE: A framework for joint action recognition and video summarization","volume":"13","author":"Hussein Fairouz","year":"2017","unstructured":"Fairouz Hussein and Massimo Piccardi. 2017. V-JAUNE: A framework for joint action recognition and video summarization. ACM Trans. Multim. Comput. Commun. Appl. 13, 2 (2017), 20:1\u201320:19.","journal-title":"ACM Trans. Multim. Comput. Commun. Appl."},{"key":"e_1_3_1_24_2","first-page":"2001","volume-title":"Proc. Conf. Empirical Methods Natural Lang. Process.","author":"Jin Tao","year":"2019","unstructured":"Tao Jin, Siyu Huang, Yingming Li, and Zhongfei Zhang. 2019. Low-rank HOCA: Efficient high-order cross-modal attention for video captioning. In Proc. Conf. Empirical Methods Natural Lang. Process.2001\u20132011."},{"key":"e_1_3_1_25_2","unstructured":"Will Kay Jo\u00e3o Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev Mustafa Suleyman and Andrew Zisserman. 2017. The kinetics human action video dataset. arXiv:1705.06950. http:\/\/arxiv.org\/abs\/1705.06950."},{"key":"e_1_3_1_26_2","first-page":"10312","volume-title":"Proc. IEEE Int. Conf. Comput. Vis.","author":"Li Linjie","year":"2019","unstructured":"Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Relation-aware graph attention network for visual question answering. In Proc. IEEE Int. Conf. Comput. Vis.10312\u201310321."},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2695887"},{"key":"e_1_3_1_28_2","first-page":"605","volume-title":"Proc. 42nd Annu. Meeting Assoc. Comput. Linguistics","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin and Franz Josef Och. 2004. 
Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proc. 42nd Annu. Meeting Assoc. Comput. Linguistics. 605\u2013612."},{"key":"e_1_3_1_29_2","first-page":"1425","volume-title":"Proc. ACM Multimedia Conf.","author":"Liu Sheng","year":"2018","unstructured":"Sheng Liu, Zhou Ren, and Junsong Yuan. 2018. SibNet: Sibling convolutional encoder for video captioning. In Proc. ACM Multimedia Conf.1425\u20131434."},{"key":"e_1_3_1_30_2","first-page":"10867","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Pan Boxiao","year":"2020","unstructured":"Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, and Juan Carlos Niebles. 2020. Spatio-temporal graph for video captioning with knowledge distillation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.10867\u201310876."},{"key":"e_1_3_1_31_2","first-page":"4594","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Pan Yingwei","year":"2016","unstructured":"Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.4594\u20134602."},{"key":"e_1_3_1_32_2","first-page":"10968","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Pan Yingwei","year":"2020","unstructured":"Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei. 2020. X-linear attention networks for image captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.10968\u201310977."},{"key":"e_1_3_1_33_2","first-page":"311","volume-title":"Proc. 40th Annu. Meeting Assoc. Comput. Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proc. 40th Annu. Meeting Assoc. Comput. Linguistics. 311\u2013318."},{"key":"e_1_3_1_34_2","first-page":"8347","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Pei Wenjie","year":"2019","unstructured":"Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, and Yu-Wing Tai. 2019. Memory-attended recurrent network for video captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.8347\u20138356."},{"key":"e_1_3_1_35_2","first-page":"1532","volume-title":"Proc. Conf. Empirical Methods Natural Lang. Process.","author":"Pennington Jeffrey","year":"2014","unstructured":"Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. Conf. Empirical Methods Natural Lang. Process.1532\u20131543."},{"key":"e_1_3_1_36_2","first-page":"433","volume-title":"Proc. IEEE Int. Conf. Comput. Vis.","author":"Rohrbach Marcus","year":"2013","unstructured":"Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. 2013. Translating video content to natural language descriptions. In Proc. IEEE Int. Conf. Comput. Vis.433\u2013440."},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_1_38_2","first-page":"2514","volume-title":"Proc. AAAI Conf. Artif. Intell.","author":"Ryu Hobin","year":"2021","unstructured":"Hobin Ryu, Sunghun Kang, Haeyong Kang, and Chang D. Yoo. 2021. Semantic grouping network for video captioning. In Proc. AAAI Conf. Artif. Intell.2514\u20132522."},{"key":"e_1_3_1_39_2","first-page":"2737","volume-title":"Proc. 26th Int. Joint Conf. Artif. 
Intell.","author":"Song Jingkuan","year":"2017","unstructured":"Jingkuan Song, Lianli Gao, Zhao Guo, Wu Liu, Dongxiang Zhang, and Heng Tao Shen. 2017. Hierarchical LSTM with adjusted temporal attention for video captioning. In Proc. 26th Int. Joint Conf. Artif. Intell.2737\u20132743."},{"key":"e_1_3_1_40_2","first-page":"11245","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Song Yuqing","year":"2021","unstructured":"Yuqing Song, Shizhe Chen, and Qin Jin. 2021. Towards diverse paragraph captioning for untrimmed videos. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.11245\u201311254."},{"key":"e_1_3_1_41_2","first-page":"4278","volume-title":"Proc. AAAI Conf. Artif. Intell.","author":"Szegedy Christian","year":"2017","unstructured":"Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, inception-ResNet and the impact of residual connections on learning. In Proc. AAAI Conf. Artif. Intell.4278\u20134284."},{"key":"e_1_3_1_42_2","first-page":"745","volume-title":"Proc. Int. Joint Conf. Artif. Intell.","author":"Tan Ganchao","year":"2020","unstructured":"Ganchao Tan, Daqing Liu, Meng Wang, and Zheng-Jun Zha. 2020. Learning to discretely compose reasoning module networks for video captioning. In Proc. Int. Joint Conf. Artif. Intell.745\u2013752."},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3303083"},{"key":"e_1_3_1_44_2","first-page":"1014","volume-title":"Proc. ACM Multimedia Conf.","author":"Tu Yunbin","year":"2017","unstructured":"Yunbin Tu, Xishan Zhang, Bingtao Liu, and Chenggang Yan. 2017. Video description with spatial-temporal attention. In Proc. ACM Multimedia Conf.1014\u20131022."},{"key":"e_1_3_1_45_2","first-page":"4566","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Vedantam Ramakrishna","year":"2015","unstructured":"Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.4566\u20134575."},{"key":"e_1_3_1_46_2","first-page":"1961","volume-title":"Proc. Conf. Empirical Methods Natural Lang. Process.","author":"Venugopalan Subhashini","year":"2016","unstructured":"Subhashini Venugopalan, Lisa Anne Hendricks, Raymond J. Mooney, and Kate Saenko. 2016. Improving LSTM-based video description with linguistic knowledge mined from text. In Proc. Conf. Empirical Methods Natural Lang. Process.1961\u20131966."},{"key":"e_1_3_1_47_2","first-page":"4534","volume-title":"Proc. IEEE Int. Conf. Comput. Vis.","author":"Venugopalan Subhashini","year":"2015","unstructured":"Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond J. Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence - video to text. In Proc. IEEE Int. Conf. Comput. Vis.4534\u20134542."},{"key":"e_1_3_1_48_2","first-page":"1494","volume-title":"Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol.","author":"Venugopalan Subhashini","year":"2015","unstructured":"Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol.1494\u20131504."},{"key":"e_1_3_1_49_2","first-page":"2641","volume-title":"Proc. IEEE Int. Conf. Comput. 
Vis.","author":"Wang Bairui","year":"2019","unstructured":"Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, and Wei Liu. 2019. Controllable video captioning with POS sequence guidance based on gated fusion network. In Proc. IEEE Int. Conf. Comput. Vis.2641\u20132650."},{"key":"e_1_3_1_50_2","first-page":"7622","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Wang Bairui","year":"2018","unstructured":"Bairui Wang, Lin Ma, Wei Zhang, and Wei Liu. 2018. Reconstruction network for video captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.7622\u20137631."},{"key":"e_1_3_1_51_2","first-page":"1519","volume-title":"Proc. ACM Multimedia Conf.","author":"Wang Huiyun","year":"2018","unstructured":"Huiyun Wang, Youjiang Xu, and Yahong Han. 2018. Spotting and aggregating salient regions for video captioning. In Proc. ACM Multimedia Conf.1519\u20131526."},{"key":"e_1_3_1_52_2","first-page":"7512","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Wang Junbo","year":"2018","unstructured":"Junbo Wang, Wei Wang, Yan Huang, Liang Wang, and Tieniu Tan. 2018. M3: Multimodal memory modelling for video captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.7512\u20137520."},{"key":"e_1_3_1_53_2","first-page":"8198","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Wang Qi","year":"2019","unstructured":"Qi Wang, Junyu Gao, Wei Lin, and Yuan Yuan. 2019. Learning from synthetic data for crowd counting in the wild. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.8198\u20138207."},{"issue":"4","key":"e_1_3_1_54_2","first-page":"87:1\u201387:19","article-title":"Image captioning via semantic guidance attention and consensus selection strategy","volume":"14","author":"Wu Jie","year":"2018","unstructured":"Jie Wu, Haifeng Hu, and Yi Wu. 2018. Image captioning via semantic guidance attention and consensus selection strategy. ACM Trans. Multim. Comput. Commun. Appl. 14, 4 (2018), 87:1\u201387:19.","journal-title":"ACM Trans. Multim. Comput. Commun. Appl."},{"key":"e_1_3_1_55_2","first-page":"2397","volume-title":"Proc. Int. Conf. Mach. Learn.","author":"Xiong Caiming","year":"2016","unstructured":"Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In Proc. Int. Conf. Mach. Learn.2397\u20132406."},{"key":"e_1_3_1_56_2","first-page":"9062","volume-title":"Proc. AAAI Conf. Artif. Intell.","author":"Xu Huijuan","year":"2019","unstructured":"Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, and Kate Saenko. 2019. Multilevel language and vision integration for text-to-clip retrieval. In Proc. AAAI Conf. Artif. Intell.9062\u20139069."},{"key":"e_1_3_1_57_2","first-page":"5288","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Xu Jun","year":"2016","unstructured":"Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.5288\u20135296."},{"key":"e_1_3_1_58_2","first-page":"537","volume-title":"Proc. ACM Multimedia Conf.","author":"Xu Jun","year":"2017","unstructured":"Jun Xu, Ting Yao, Yongdong Zhang, and Tao Mei. 2017. Learning multimodal attention LSTM networks for video captioning. In Proc. ACM Multimedia Conf.537\u2013545."},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2846664"},{"key":"e_1_3_1_60_2","first-page":"21","volume-title":"Proc. IEEE Conf. Comput. Vis. 
Pattern Recognit.","author":"Yang Zichao","year":"2016","unstructured":"Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alexander J. Smola. 2016. Stacked attention networks for image question answering. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.21\u201329."},{"key":"e_1_3_1_61_2","first-page":"4507","volume-title":"Proc. IEEE Int. Conf. Comput. Vis.","author":"Yao Li","year":"2015","unstructured":"Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher J. Pal, Hugo Larochelle, and Aaron C. Courville. 2015. Describing videos by exploiting temporal structure. In Proc. IEEE Int. Conf. Comput. Vis.4507\u20134515."},{"key":"e_1_3_1_62_2","first-page":"1839","volume-title":"Proc. IEEE Int. Conf. Comput. Vis.","author":"Yu Zhou","year":"2017","unstructured":"Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. 2017. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proc. IEEE Int. Conf. Comput. Vis.1839\u20131848."},{"issue":"2","key":"e_1_3_1_63_2","first-page":"53:1\u201353:18","article-title":"Spatiotemporal-textual co-attention network for video question answering","volume":"15","author":"Zha Zheng-Jun","year":"2019","unstructured":"Zheng-Jun Zha, Jiawei Liu, Tianhao Yang, and Yongdong Zhang. 2019. Spatiotemporal-textual co-attention network for video question answering. ACM Trans. Multim. Comput. Commun. Appl. 15, 2s (2019), 53:1\u201353:18.","journal-title":"ACM Trans. Multim. Comput. Commun. Appl."},{"key":"e_1_3_1_64_2","first-page":"8327","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Zhang Junchao","year":"2019","unstructured":"Junchao Zhang and Yuxin Peng. 2019. Object-aware aggregation with bidirectional temporal graph for video captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.8327\u20138336."},{"key":"e_1_3_1_65_2","first-page":"6250","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Zhang Xishan","year":"2017","unstructured":"Xishan Zhang, Ke Gao, Yongdong Zhang, Dongming Zhang, Jintao Li, and Qi Tian. 2017. Task-driven dynamic fusion: Reducing ambiguity in video description. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.6250\u20136258."},{"key":"e_1_3_1_66_2","first-page":"13275","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Zhang Ziqi","year":"2020","unstructured":"Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, and Zheng-Jun Zha. 2020. Object relational graph with teacher-recommended learning for video captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.13275\u201313285."},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2916757"},{"key":"e_1_3_1_68_2","first-page":"13093","volume-title":"Proc. IEEE Conf. Comput. Vis. Pattern Recognit.","author":"Zheng Qi","year":"2020","unstructured":"Qi Zheng, Chaoyue Wang, and Dacheng Tao. 2020. Syntax-aware action targeting for video captioning. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.13093\u201313102."},{"key":"e_1_3_1_69_2","unstructured":"Bolei Zhou Yuandong Tian Sainbayar Sukhbaatar Arthur Szlam and Rob Fergus. 2015. Simple baseline for visual question answering. arXiv:1512.02167. 
http:\/\/arxiv.org\/abs\/1512.02167."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3550276","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3550276","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T18:43:23Z","timestamp":1750272203000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3550276"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,6]]},"references-count":68,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,5,31]]}},"alternative-id":["10.1145\/3550276"],"URL":"https:\/\/doi.org\/10.1145\/3550276","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2023,2,6]]},"assertion":[{"value":"2021-10-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-07-05","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
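
The abstract embedded in this record describes two components: explicit visual-feature fusion (EVF), which captures pairwise interactions between visual modalities, and semantic embedding guided attention (SEGA), in which the embedding of the previously generated word steers temporal attention toward the most correlated visual features, producing a joint attention map. The numpy sketch below illustrates only that joint-attention idea under stated assumptions: the projection matrices, tensor shapes, and the elementwise product used to combine the two score distributions are illustrative choices, not the paper's exact formulation.

    # Schematic sketch of semantic-embedding-guided joint attention as summarized
    # in the abstract above. All shapes, projections, and the combination rule
    # are assumptions for illustration, not the paper's formulation.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def joint_attention(frames, hidden, word_emb, W_t, W_s):
        """frames: (T, d_v) per-frame visual features (assumed output of EVF fusion).
        hidden: (d_h,) decoder hidden state; word_emb: (d_w,) last word's embedding.
        W_t: (d_v, d_h) temporal-attention projection; W_s: (d_v, d_w) semantic guidance.
        Returns the attended context vector (d_v,) and the joint attention map (T,)."""
        temporal_scores = frames @ (W_t @ hidden)    # relevance of each frame to the decoder state
        semantic_scores = frames @ (W_s @ word_emb)  # correlation of each frame with the last word
        joint = softmax(temporal_scores) * softmax(semantic_scores)  # assumed elementwise combination
        joint = joint / joint.sum()                  # renormalize into a joint attention map
        return joint @ frames, joint

    # Toy usage with random features.
    rng = np.random.default_rng(0)
    T, d_v, d_h, d_w = 8, 16, 12, 10
    ctx, weights = joint_attention(rng.normal(size=(T, d_v)), rng.normal(size=d_h),
                                   rng.normal(size=d_w), rng.normal(size=(d_v, d_h)),
                                   rng.normal(size=(d_v, d_w)))

The multiplicative combination is one plausible way for a semantic score to gate temporal attention (a frame must score well under both views to keep weight); the paper may combine the two signals differently.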
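Separately, since this section is itself the raw payload of the public Crossref REST API, a minimal sketch of how such a work record is fetched and read may help: the script below pulls the same object from api.crossref.org and extracts fields that appear in the record above (title, authors, DOI, container title, JATS-tagged abstract). It assumes only the third-party `requests` package; all field names are taken directly from the record.

    # Minimal sketch: fetch and read a Crossref work record like the one above.
    import re
    import requests

    DOI = "10.1145/3550276"
    resp = requests.get(f"https://api.crossref.org/works/{DOI}", timeout=10)
    resp.raise_for_status()
    work = resp.json()["message"]  # payload lives under "message", per the record above

    title = work["title"][0]                            # "title" is a list in Crossref records
    authors = ", ".join(f'{a["given"]} {a["family"]}' for a in work["author"])
    print(title)
    print(authors)
    print(work["DOI"], "-", work["container-title"][0])

    # The deposited abstract is JATS-tagged XML (<jats:p>, <jats:bold>), so strip
    # tags crudely before display; a real pipeline would use an XML parser.
    abstract = re.sub(r"<[^>]+>", " ", work.get("abstract", ""))
    print(" ".join(abstract.split()[:40]), "...")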