{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T15:23:31Z","timestamp":1761060211601,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":27,"publisher":"ACM","license":[{"start":{"date-parts":[[2016,10,1]],"date-time":"2016-10-01T00:00:00Z","timestamp":1475280000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"the Fundamental Research Funds for the Central Universities","award":["ZYGX2014J063"],"award-info":[{"award-number":["ZYGX2014J063"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61502080"],"award-info":[{"award-number":["61502080"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2016,10]]},"DOI":"10.1145\/2964284.2967242","type":"proceedings-article","created":{"date-parts":[[2016,9,29]],"date-time":"2016-09-29T19:17:32Z","timestamp":1475176652000},"page":"357-361","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":46,"title":["Attention-based LSTM with Semantic Consistency for Videos Captioning"],"prefix":"10.1145","author":[{"given":"Zhao","family":"Guo","sequence":"first","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Lianli","family":"Gao","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Jingkuan","family":"Song","sequence":"additional","affiliation":[{"name":"Columbia University, NEW YORK, NY, USA"}]},{"given":"Xing","family":"Xu","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Jie","family":"Shao","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Heng Tao","family":"Shen","sequence":"additional","affiliation":[{"name":"The University of Queensland, Brisbane, Australia"}]}],"member":"320","published-online":{"date-parts":[[2016,10]]},"reference":[{"key":"e_1_3_2_1_1_1","first-page":"65","volume-title":"Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization","volume":"29","author":"Banerjee S.","year":"2005","unstructured":"S. Banerjee and A. Lavie . Meteor: An automatic metric for mt evaluation with improved correlation with human judgments . In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization , volume 29 , pages 65 -- 72 , 2005 . S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization, volume 29, pages 65--72, 2005."},{"key":"e_1_3_2_1_2_1","first-page":"190","volume-title":"ACL","author":"Chen D. L.","year":"2011","unstructured":"D. L. Chen and W. B. Dolan . Collecting highly parallel data for paraphrase evaluation . In ACL , pages 190 -- 200 . Association for Computational Linguistics , 2011 . D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, pages 190--200. Association for Computational Linguistics, 2011."},{"key":"e_1_3_2_1_3_1","volume-title":"Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654","author":"Chen X.","year":"2014","unstructured":"X. Chen and C. L. Zitnick . Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654 , 2014 . X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014."},{"issue":"11","key":"e_1_3_2_1_4_1","first-page":"1875","article-title":"Describing multimedia content using attention-based encoder-decoder networks. Multimedia","volume":"17","author":"Cho K.","year":"2015","unstructured":"K. Cho , A. Courville , and Y. Bengio . Describing multimedia content using attention-based encoder-decoder networks. Multimedia , IEEE Transactions on , 17 ( 11 ): 1875 -- 1886 , 2015 . K. Cho, A. Courville, and Y. Bengio. Describing multimedia content using attention-based encoder-decoder networks. Multimedia, IEEE Transactions on, 17(11):1875--1886, 2015.","journal-title":"IEEE Transactions on"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299066"},{"key":"e_1_3_2_1_8_1","first-page":"1188","volume-title":"AAAI","author":"Gao L.","year":"2016","unstructured":"L. Gao , J. Song , F. Nie , F. Zou , N. Sebe , and H. T. Shen . Graph-without-cut: An ideal graph learning for image segmentation . In AAAI , pages 1188 -- 1194 , 2016 . L. Gao, J. Song, F. Nie, F. Zou, N. Sebe, and H. T. Shen. Graph-without-cut: An ideal graph learning for image segmentation. In AAAI, pages 1188--1194, 2016."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.277"},{"key":"e_1_3_2_1_10_1","first-page":"1889","volume-title":"NIPS","author":"Karpathy A.","year":"2014","unstructured":"A. Karpathy , A. Joulin , and F. F. F. Li . Deep fragment embeddings for bidirectional image sentence mapping . In NIPS , pages 1889 -- 1897 , 2014 . A. Karpathy, A. Joulin, and F. F. F. Li. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, pages 1889--1897, 2014."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806314"},{"key":"e_1_3_2_1_12_1","volume-title":"Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090","author":"Mao J.","year":"2014","unstructured":"J. Mao , W. Xu , Y. Yang , J. Wang , and A. L. Yuille . Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090 , 2014 . J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2013.2271746"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465274"},{"key":"e_1_3_2_1_16_1","volume-title":"Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402","author":"Soomro K.","year":"2012","unstructured":"K. Soomro , A. R. Zamir , and M. Shah . Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 , 2012 . K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_3_2_1_18_1","volume-title":"Learning spatiotemporal features with 3d convolutional networks. arXiv preprint arXiv:1412.0767","author":"Tran D.","year":"2014","unstructured":"D. Tran , L. Bourdev , R. Fergus , L. Torresani , and M. Paluri . Learning spatiotemporal features with 3d convolutional networks. arXiv preprint arXiv:1412.0767 , 2014 . D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. arXiv preprint arXiv:1412.0767, 2014."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.515"},{"key":"e_1_3_2_1_21_1","volume-title":"Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729","author":"Venugopalan S.","year":"2014","unstructured":"S. Venugopalan , H. Xu , J. Donahue , M. Rohrbach , R. Mooney , and K. Saenko . Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729 , 2014 . S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, 2014."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_1_23_1","volume-title":"L. Liu, and A. Dick. Image captioning with an intermediate attributes layer. arXiv preprint arXiv:1506.01144","author":"Wu Q.","year":"2015","unstructured":"Q. Wu , C. Shen , A. v. d. Hengel , L. Liu, and A. Dick. Image captioning with an intermediate attributes layer. arXiv preprint arXiv:1506.01144 , 2015 . Q. Wu, C. Shen, A. v. d. Hengel, L. Liu, and A. Dick. Image captioning with an intermediate attributes layer. arXiv preprint arXiv:1506.01144, 2015."},{"key":"e_1_3_2_1_24_1","volume-title":"Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044","author":"Xu K.","year":"2015","unstructured":"K. Xu , J. Ba , R. Kiros , A. Courville , R. Salakhutdinov , R. Zemel , and Y. Bengio . Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044 , 2015 . K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015."},{"key":"e_1_3_2_1_25_1","first-page":"2346","volume-title":"AAAI","author":"Xu R.","year":"2015","unstructured":"R. Xu , C. Xiong , W. Chen , and J. J. Corso . Jointly modeling deep video and compositional text to bridge vision and language in a unified framework . In AAAI , pages 2346 -- 2352 . Citeseer , 2015 . R. Xu, C. Xiong, W. Chen, and J. J. Corso. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In AAAI, pages 2346--2352. Citeseer, 2015."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.512"},{"key":"e_1_3_2_1_27_1","volume-title":"an adaptive learning rate method. arXiv preprint arXiv:1212.5701","author":"Zeiler M. D.","year":"2012","unstructured":"M. D. Zeiler . Adadelta : an adaptive learning rate method. arXiv preprint arXiv:1212.5701 , 2012 . M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012."}],"event":{"name":"MM '16: ACM Multimedia Conference","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Amsterdam The Netherlands","acronym":"MM '16"},"container-title":["Proceedings of the 24th ACM international conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2964284.2967242","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2964284.2967242","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:39:55Z","timestamp":1750217995000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2964284.2967242"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,10]]},"references-count":27,"alternative-id":["10.1145\/2964284.2967242","10.1145\/2964284"],"URL":"https:\/\/doi.org\/10.1145\/2964284.2967242","relation":{},"subject":[],"published":{"date-parts":[[2016,10]]},"assertion":[{"value":"2016-10-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}