{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T05:59:32Z","timestamp":1772776772305,"version":"3.50.1"},"reference-count":70,"publisher":"Association for Computing Machinery (ACM)","issue":"10","license":[{"start":{"date-parts":[[2024,10,30]],"date-time":"2024-10-30T00:00:00Z","timestamp":1730246400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62271361"],"award-info":[{"award-number":["62271361"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100003819","name":"Natural Science Foundation of Hubei Province","doi-asserted-by":"crossref","award":["2023AFB206"],"award-info":[{"award-number":["2023AFB206"]}],"id":[{"id":"10.13039\/501100003819","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Hubei Institute of Education Science","award":["2022ZA41"],"award-info":[{"award-number":["2022ZA41"]}]},{"name":"Scientific Research Foundation of Hubei University of Education for Talent Introduction","award":["ESRC20230009"],"award-info":[{"award-number":["ESRC20230009"]}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"crossref","award":["WHUTIOT2023-006"],"award-info":[{"award-number":["WHUTIOT2023-006"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Hubei Provincial Collaborative Innovation Center for Basic Education Information Technology Services","award":["OFHUE202305"],"award-info":[{"award-number":["OFHUE202305"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,10,31]]},"abstract":"<jats:p>Non-autoregressive video captioning methods generate visual words in parallel but often overlook semantic correlations among them, especially regarding verbs, leading to lower caption quality. To address this, we integrate action information of highlighted objects to enhance semantic connections among visual words. Our proposed Action-aware Language Skeleton Optimization Network (ALSO-Net) tackles the challenge of extracting action information across frames, improving understanding of complex context-dependent video actions and reducing sentence inconsistencies. ALSO-Net incorporates a linguistic skeleton tag generator to refine semantic correlations and a video action predictor to enhance verb prediction accuracy in video captions. We also address issues of unsatisfactory caption length and quality by jointly optimizing different levels of motion prediction loss. Experimental evaluation on prominent video captioning datasets demonstrates that ALSO-Net outperforms baseline methods by a significant margin and achieves competitive performance compared to state-of-the-art autoregressive methods with smaller model complexity and faster inference time.<\/jats:p>","DOI":"10.1145\/3679203","type":"journal-article","created":{"date-parts":[[2024,7,20]],"date-time":"2024-07-20T15:14:41Z","timestamp":1721488481000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":19,"title":["Action-aware Linguistic Skeleton Optimization Network for Non-autoregressive Video Captioning"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6449-5063","authenticated-orcid":false,"given":"Shuqin","family":"Chen","sequence":"first","affiliation":[{"name":"School of Computer Science and Hubei Provincial Collaborative Innovation Center for Basic Education Information Technology Services, Hubei University of Education, Wuhan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5242-0467","authenticated-orcid":false,"given":"Xian","family":"Zhong","sequence":"additional","affiliation":[{"name":"Hubei Key Lab of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China and Rapid-Rich Object Search Lab, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-1050-163X","authenticated-orcid":false,"given":"Yi","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3871-663X","authenticated-orcid":false,"given":"Lei","family":"Zhu","sequence":"additional","affiliation":[{"name":"ROAS Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China and Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1503-0240","authenticated-orcid":false,"given":"Ping","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Computing and School of Design, The Hong Kong Polytechnic University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4029-3322","authenticated-orcid":false,"given":"Xiaokang","family":"Yang","sequence":"additional","affiliation":[{"name":"MOE Key Laboratory of AI, School of Electronic, Information, and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8678-2784","authenticated-orcid":false,"given":"Bin","family":"Sheng","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, School of Electronic, Information, and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2024,10,30]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475519"},{"key":"e_1_3_1_3_2","first-page":"65","volume-title":"Proc. Assoc. Comput. Linguist. Workshops","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proc. Assoc. Comput. Linguist. Workshops, 65\u201372."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3416291"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3539225"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58548-8_20"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00157"},{"key":"e_1_3_1_9_2","first-page":"656","article-title":"A Neural Compositional Paradigm for Image Captioning","author":"Dai Bo","year":"2018","unstructured":"Bo Dai, Sanja Fidler, and Dahua Lin. 2018. A Neural Compositional Paradigm for Image Captioning. In Adv. Neural Inf. Process. Syst, 656\u2013666.","journal-title":"Adv. Neural Inf. Process. Syst,"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1285"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3063423"},{"key":"e_1_3_1_12_2","first-page":"4171","volume-title":"Proc. North Am. Chapter Assoc. Comput. Linguist,","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proc. North Am. Chapter Assoc. Comput. Linguist, 4171\u20134186."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3550276"},{"key":"e_1_3_1_14_2","unstructured":"Zhengcong Fei. 2019. Fast Image Caption Generation with Position Alignment. arXiv:1912.06365. Retrieved from https:\/\/arxiv.org\/abs\/1912.06365"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16219"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3120867"},{"issue":"5","key":"e_1_3_1_17_2","first-page":"1112","article-title":"Hierarchical LSTMs with Adaptive Attention for Visual Captioning","volume":"42","author":"Gao Lianli","year":"2020","unstructured":"Lianli Gao, Xiangpeng Li, Jingkuan Song, and Heng Tao Shen. 2020. Hierarchical LSTMs with Adaptive Attention for Visual Captioning. IEEE Trans. Pattern Anal. Mach. Intell. 42, 5 (2020), 1112\u20131131.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1633"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.337"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2017.373"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2969330"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3102504"},{"key":"e_1_3_1_23_2","first-page":"630","article-title":"SBAT: Video Captioning with Sparse Boundary-Aware Transformer","author":"Jin Tao","year":"2020","unstructured":"Tao Jin, Siyu Huang, Ming Chen, Yingming Li, and Zhongfei Zhang. 2020. SBAT: Video Captioning with Sparse Boundary-Aware Transformer, In Proc. Int. Joint Conf. Artif. Intell, 630\u2013636.","journal-title":"Proc. Int. Joint Conf. Artif. Intell"},{"key":"e_1_3_1_24_2","unstructured":"Will Kay Jo\u00e3o Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev Mustafa Suleyman and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. arXiv:1705.06950. Retrieved from https:\/\/arxiv.org\/abs\/1705.06950"},{"key":"e_1_3_1_25_2","volume-title":"Proc. Int. Conf. Learn. Represent","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proc. Int. Conf. Learn. Represent. 1\u201315."},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3251097"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.233"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3158546"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2020.3045735"},{"key":"e_1_3_1_31_2","first-page":"74","article-title":"Rouge: A package for automatic evaluation of summaries","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 74\u201381.","journal-title":"Text Summarization Branches Out"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.24"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2940007"},{"issue":"3","key":"e_1_3_1_35_2","first-page":"3003","article-title":"Entity-Enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding","volume":"45","author":"Liu Xuejing","year":"2023","unstructured":"Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Zechao Li, Qi Tian, and Qingming Huang. 2023. Entity-Enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding. IEEE Trans. Pattern Anal. Mach. Intell. 45, 3 (2023), 3003\u20133018.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3409388"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01088"},{"key":"e_1_3_1_38_2","first-page":"311","volume-title":"Proc. Assoc. Comput. Linguist,","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proc. Assoc. Comput. Linguist, 311\u2013318."},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00854"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.277"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i15.17618"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i3.16353"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2021.3079311"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3546828"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6413"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01109"},{"key":"e_1_3_1_47_2","unstructured":"Khurram Soomro Amir Roshan Zamir and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv:1212.0402. Retrieved from https:\/\/arxiv.org\/abs\/1212.0402"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME52920.2022.9859743"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2023.102043"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2023.3268004"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00263"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107702"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00795"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00468"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33015377"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3169894"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.29"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.571"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i3.25412"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2924576"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i4.16421"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01329"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01311"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i3.25484"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME52920.2022.9859882"},{"key":"e_1_3_1_68_2","volume-title":"Proc. Int. Conf. Learn.","author":"Zhou Chunting","year":"2020","unstructured":"Chunting Zhou, Jiatao Gu, and Graham Neubig. 2020. Understanding Knowledge Distillation in Non-autoregressive Machine Translation. In Proc. Int. Conf. Learn. Represent."},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12342"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2022.3146004"},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2022.3218656"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3679203","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3679203","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:58:15Z","timestamp":1750294695000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3679203"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,30]]},"references-count":70,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2024,10,31]]}},"alternative-id":["10.1145\/3679203"],"URL":"https:\/\/doi.org\/10.1145\/3679203","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,10,30]]},"assertion":[{"value":"2023-08-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-07-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-10-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}