{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T09:23:25Z","timestamp":1772184205608,"version":"3.50.1"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2022,3,4]],"date-time":"2022-03-04T00:00:00Z","timestamp":1646352000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61832001"],"award-info":[{"award-number":["61832001"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Open Fund of Intelligent Terminal Key Laboratory of Sichuan Province","award":["SCITLAB-1016"],"award-info":[{"award-number":["SCITLAB-1016"]}]},{"name":"Zhejiang Lab\u2019s International Talent Fund for Young Professionals"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2022,11,30]]},"abstract":"<jats:p>Fully mining visual cues to aid in content understanding is crucial for video captioning. However, most state-of-the-art video captioning methods are limited to generating captions purely based on straightforward information while ignoring the scenario and context information. To fill the gap, we propose a novel, simple but effective scenario-aware recurrent transformer (SART) model to execute video captioning. Our model contains a \u201cscenario understanding\u201d module to obtain a global perspective across multiple frames, providing a specific scenario to guarantee a goal-directed description. Moreover, for the sake of achieving narrative continuity in the generated paragraph, a unified recurrent transformer is adopted. To demonstrate the effectiveness of our proposed SART, we have conducted comprehensive experiments on various large-scale video description datasets, including ActivityNet, YouCookII, and VideoStory. Additionally, we extend a story-oriented evaluation framework for assessing the quality of the generated caption more precisely. The superior performance has shown that SART has a strong ability to generate correct, deliberative, and narrative coherent video descriptions.<\/jats:p>","DOI":"10.1145\/3503927","type":"journal-article","created":{"date-parts":[[2022,3,4]],"date-time":"2022-03-04T10:31:58Z","timestamp":1646389918000},"page":"1-17","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":29,"title":["Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning"],"prefix":"10.1145","volume":"18","author":[{"given":"Xin","family":"Man","sequence":"first","affiliation":[{"name":"Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2259-886X","authenticated-orcid":false,"given":"Deqiang","family":"Ouyang","sequence":"additional","affiliation":[{"name":"College of Computer Science, Chongqing University, China and Intelligent Terminal Key Laboratory of Sichuan Province, Yibin, China"}]},{"given":"Xiangpeng","family":"Li","sequence":"additional","affiliation":[{"name":"Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Jingkuan","family":"Song","sequence":"additional","affiliation":[{"name":"Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2615-1555","authenticated-orcid":false,"given":"Jie","family":"Shao","sequence":"additional","affiliation":[{"name":"Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China"}]}],"member":"320","published-online":{"date-parts":[[2022,3,4]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3355390"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-5709"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00671"},{"key":"e_1_3_1_5_2","first-page":"65","volume-title":"Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. 65\u201372."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123420"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1285"},{"key":"e_1_3_1_9_2","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019. 4171\u20134186."},{"key":"e_1_3_1_10_2","first-page":"3063","volume-title":"Proceedings of the Annual Conference on Neural Information Processing Systems 2018","author":"Duan Xuguang","year":"2018","unstructured":"Xuguang Duan, Wen-bing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, and Junzhou Huang. 2018. Weakly supervised dense event captioning in videos. In Proceedings of the Annual Conference on Neural Information Processing Systems 2018. 3063\u20133073."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58539-6_31"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.108"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.127"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2729019"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1117"},{"key":"e_1_3_1_16_2","volume-title":"Proceedings of the Annual Conference on Neural Information Processing Systems 2020","author":"Ging Simon","year":"2020","unstructured":"Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, and Thomas Brox. 2020. COOT: Cooperative hierarchical transformer for video-text representation learning. In Proceedings of the Annual Conference on Neural Information Processing Systems 2020."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICMEW.2019.00134"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_38"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_3_1_20_2","first-page":"448","volume-title":"Proceedings of the 32nd International Conference on Machine Learning, ICML 2015","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015. 448\u2013456."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2984065"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3226036"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.233"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/TETCI.2019.2892755"},{"key":"e_1_3_1_26_2","first-page":"173","article-title":"Video captioning with multi-faceted attention","volume":"6","author":"Long Xiang","year":"2018","unstructured":"Xiang Long, Chuang Gan, and Gerard de Melo. 2018. Video captioning with multi-faceted attention. Transactions of the Association for Computational 6 (2018), 173\u2013184.","journal-title":"Transactions of the Association for Computational"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/2487268.2487269"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00675"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.117"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.111"},{"key":"e_1_3_1_31_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311\u2013318."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00676"},{"key":"e_1_3_1_33_2","first-page":"1310","volume-title":"Proceedings of the 30th International Conference on Machine Learning","author":"Pascanu Razvan","year":"2013","unstructured":"Razvan Pascanu, Tom\u00e1s Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning. 1310\u20131318."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.5555\/1113166.1644545"},{"key":"e_1_3_1_35_2","first-page":"4967","volume-title":"Proceedings of the Annual Conference on Neural Information Processing Systems","author":"Santoro Adam","year":"2017","unstructured":"Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Tim Lillicrap. 2017. A simple neural network module for relational reasoning. In Proceedings of the Annual Conference on Neural Information Processing Systems. 4967\u20134976."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00756"},{"key":"e_1_3_1_37_2","first-page":"3104","volume-title":"Proceedings of the Annual Conference on Neural Information Processing Systems","author":"Sutskever Ilya","year":"2014","unstructured":"Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems. 3104\u20133112."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3303083"},{"key":"e_1_3_1_40_2","first-page":"5998","volume-title":"Proceedings of the Annual Conference on Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems. 5998\u20136008."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3226037"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00443"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01252-6_29"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2924576"},{"key":"e_1_3_1_47_2","first-page":"5754","volume-title":"Proceedings of the Annual Conference on Neural Information Processing Systems","author":"Yang Zhilin","year":"2019","unstructured":"Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the Annual Conference on Neural Information Processing Systems. 5754\u20135764."},{"key":"e_1_3_1_48_2","volume-title":"Proceedings of the 8th International Conference on Learning Representations","author":"Yi Kexin","year":"2020","unstructured":"Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. 2020. CLEVRER: Collision events for video representation and reasoning. In Proceedings of the 8th International Conference on Learning Representations."},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01261-8_23"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00852"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2988435"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00674"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12342"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00911"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503927","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503927","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:32Z","timestamp":1750188632000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503927"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,4]]},"references-count":53,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2022,11,30]]}},"alternative-id":["10.1145\/3503927"],"URL":"https:\/\/doi.org\/10.1145\/3503927","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,3,4]]},"assertion":[{"value":"2021-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-03-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}