{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T15:05:36Z","timestamp":1771599936213,"version":"3.50.1"},"reference-count":57,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2019,5,31]],"date-time":"2019-05-31T00:00:00Z","timestamp":1559260800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100011002","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61622115, 61472281"],"award-info":[{"award-number":["61622115, 61472281"]}],"id":[{"id":"10.13039\/501100011002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100013285","name":"Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning","doi-asserted-by":"crossref","award":["GZ2015005"],"award-info":[{"award-number":["GZ2015005"]}],"id":[{"id":"10.13039\/501100013285","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing","award":["17DZ2251600"],"award-info":[{"award-number":["17DZ2251600"]}]},{"name":"IBM Shared University Research Awards Program"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,5,31]]},"abstract":"<jats:p>It is interesting and challenging to translate a video to natural description sentences based on the video content. In this work, an advanced framework is built to generate sentences with coherence and rich semantic expressions for video captioning. 
A long short term memory\u00a0(LSTM) network with an improved factored way is first developed, which takes the inspiration of LSTM with a conventional factored way and a common practice to feed multi-modal features into LSTM at the first time step for visual description. Then, the incorporation of the LSTM network with the proposed improved factored way and un-factored way is exploited, and a voting strategy is utilized to predict candidate words. In addition, for robust and abstract visual and language representation, residuals are employed to enhance the gradient signals that are learned from the residual network\u00a0(ResNet), and a deeper LSTM network is constructed. Furthermore, three convolutional neural network based features extracted from GoogLeNet, ResNet101, and ResNet152, are fused to catch more comprehensive and complementary visual information. Experiments are conducted on two benchmark datasets, including MSVD and MSR-VTT2016, and competitive performances are obtained by the proposed techniques as compared to other state-of-the-art methods.<\/jats:p>","DOI":"10.1145\/3303083","type":"journal-article","created":{"date-parts":[[2019,6,6]],"date-time":"2019-06-06T12:28:42Z","timestamp":1559824122000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":21,"title":["Rich Visual and Language Representation with Complementary Semantics for Video Captioning"],"prefix":"10.1145","volume":"15","author":[{"given":"Pengjie","family":"Tang","sequence":"first","affiliation":[{"name":"Tongji University, P. R. China and Jinggangshan University, Shanghai, P. R. China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9999-4871","authenticated-orcid":false,"given":"Hanli","family":"Wang","sequence":"additional","affiliation":[{"name":"Tongji University, Shanghai, P. R. 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Qinyu","family":"Li","sequence":"additional","affiliation":[{"name":"Tongji University, P. R. China and Lanzhou City University, P. R. China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,6,5]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the International Conference on Learning Representations.","author":"Ballas Nicolas","year":"2015"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the Meeting of the Association for Computational Linguistics Workshop. 65\u201372","author":"Banerjee Satanjeev","year":"2005"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.339"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2018.2831447"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the Meeting of the Association for Computational Linguistics. 
190\u2013200","author":"David"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01261-8_22"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2015.2477044"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2984064"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2018.8486437"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.127"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2729019"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.337"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2984065"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the International Conference on Machine Learning. 595\u2013603","author":"Kiros Ryan","year":"2014"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence. 541\u2013547","author":"Krishnamoorthy Niveda","year":"2013"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the Conference on Neural Information Processing Systems. 1097\u20131105","author":"Krizhevsky Alex"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.3115\/1218955.1219032"},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the European Conference on Computer Vision. 
740\u2013755","author":"Lin Tsung-Yi"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1273496.1273577"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299101"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.117"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.497"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.111"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_2_1_30_1","volume-title":"Zhe Gan, and Lawrence Carin.","author":"Pu Yunchen","year":"2016"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2984066"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.548"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2984062"},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the International Conference on Learning Representations.","author":"Simonyan Karen","year":"2014"},{"key":"e_1_2_1_35_1","unstructured":"Jingkuan Song Yuyu Guo Lianli Gao Xuelong Li Alan Hanjalic and Heng Tao Shen. 2017. From deterministic to generative: Multimodal stochastic RNNs for video captioning. arXiv preprint arXiv: 1708.02478."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3127895"},{"key":"e_1_2_1_38_1","volume-title":"Proceedings of the International Conference on Computational Linguistics. 
1218\u20131227","author":"Thomason Jesse"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.515"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/N15-1173"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00795"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00521"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00784"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2009.145"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.29"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.571"},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the International Conference on Machine Learning. 
2048\u20132057","author":"Xu Kelvin","year":"2015"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2017.8019408"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2855422"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.512"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.524"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.496"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2855415"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01216-8_43"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3303083","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3303083","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:53:39Z","timestamp":1750204419000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3303083"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,5,31]]},"references-count":57,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2019,5,31]]}},"alternative-id":["10.1145\/3303083"],"URL":"https:\/\/doi.org\/10.1145\/3303083","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,5,31]]},"assertion":[{"value":"2018-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2018-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-06-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}