{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T15:42:18Z","timestamp":1761061338070,"version":"build-2065373602"},"reference-count":51,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2021,2,23]],"date-time":"2021-02-23T00:00:00Z","timestamp":1614038400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["No. 61572306"],"award-info":[{"award-number":["No. 61572306"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["NO. 2017YFB0701600"],"award-info":[{"award-number":["NO. 2017YFB0701600"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003399","name":"Science and Technology Commission of Shanghai Municipality","doi-asserted-by":"publisher","award":["No. 19511121002"],"award-info":[{"award-number":["No. 19511121002"]}],"id":[{"id":"10.13039\/501100003399","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>Video captioning is a popular task which automatically generates a natural-language sentence to describe video content. Previous video captioning works mainly use the encoder\u2013decoder framework and exploit special techniques such as attention mechanisms to improve the quality of generated sentences. In addition, most attention mechanisms focus on global features and spatial features. However, global features are usually fully connected features. Recurrent convolution networks (RCNs) receive 3-dimensional features as input at each time step, but the temporal structure of each channel at each time step has been ignored, which provide temporal relation information of each channel. In this paper, a video captioning model based on channel soft attention and semantic reconstructor is proposed, which considers the global information for each channel. In a video feature map sequence, the same channel of every time step is generated by the same convolutional kernel. We selectively collect the features generated by each convolutional kernel and then input the weighted sum of each channel to RCN at each time step to encode video representation. Furthermore, a semantic reconstructor is proposed to rebuild semantic vectors to ensure the integrity of semantic information in the training process, which takes advantage of both forward (semantic to sentence) and backward (sentence to semantic) flows. 
Experimental results on the popular MSVD and MSR-VTT datasets demonstrate the effectiveness and feasibility of our model.<\/jats:p>","DOI":"10.3390\/fi13020055","type":"journal-article","created":{"date-parts":[[2021,2,23]],"date-time":"2021-02-23T12:40:16Z","timestamp":1614084016000},"page":"55","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":18,"title":["Video Captioning Based on Channel Soft Attention and Semantic Reconstructor"],"prefix":"10.3390","volume":"13","author":[{"given":"Zhou","family":"Lei","sequence":"first","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5977-4786","authenticated-orcid":false,"given":"Yiyong","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China"}]}],"member":"1968","published-online":{"date-parts":[[2021,2,23]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Wang, H., Klaser, A., Schmid, C., and Liu, C.L. (2011, June 20\u201325). Action Recognition by Dense Trajectories. Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA.","DOI":"10.1109\/CVPR.2011.5995407"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Wang, H., and Schmid, C. (2013, December 1\u20138). Action Recognition with Improved Trajectories. Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV), Sydney, Australia.","DOI":"10.1109\/ICCV.2013.441"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Jegou, H., Schmid, C., Douze, M., and Perez, P. (2010, June 13\u201318). Aggregating local descriptors into a compact image representation. Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA.","DOI":"10.1109\/CVPR.2010.5540039"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"222","DOI":"10.1007\/s11263-013-0636-x","article-title":"Image Classification with the Fisher Vector: Theory and Practice","volume":"105","author":"Perronnin","year":"2013","journal-title":"Int. J. Comput. Vision"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, June 7\u201312). Going deeper with convolutions. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"4933","DOI":"10.1109\/TIP.2018.2846664","article-title":"Sequential Video VLAD: Training the Aggregation Locally and Temporally","volume":"27","author":"Xu","year":"2018","journal-title":"IEEE Trans. Image Process."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.Y., and Kweon, I.S. (2018, September 8\u201314). CBAM: Convolutional Block Attention Module. Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01249-6"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Luong, T., Pham, H., and Manning, C.D. (2015, September 17\u201321). Effective Approaches to Attention-based Neural Machine Translation. 
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal.","DOI":"10.18653\/v1\/D15-1166"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"You, Q., Jin, H., Wang, Z., Fang, C., and Luo, J. (2016, June 26\u2013July 1). Image Captioning with Semantic Attention. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.503"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wu, Q., Shen, C., Liu, L., Dick, A., and Van Den Hengel, A. (2016, June 26\u2013July 1). What Value Do Explicit High Level Concepts Have in Vision to Language Problems?. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.29"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., Carin, L., and Deng, L. (2017, July 21\u201326). Semantic Compositional Networks for Visual Captioning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.127"},{"key":"ref_12","first-page":"190","article-title":"Collecting Highly Parallel Data for Paraphrase Evaluation","volume":"Volume 1","author":"Chen","year":"2011","journal-title":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Xu, J., Mei, T., Yao, T., and Rui, Y. (2016, June 26\u2013July 1). MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.571"},{"key":"ref_14","unstructured":"Lebret, R., Pinheiro, P.O., and Collobert, R. (2015, July 6\u201311). Phrase-Based Image Captioning. Proceedings of the 32nd International Conference on Machine Learning, Lille, France."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., and Schiele, B. (2013, December 1\u20138). Translating Video Content to Natural Language Descriptions. Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV), Sydney, Australia.","DOI":"10.1109\/ICCV.2013.61"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, June 7\u201312). Show and tell: A neural image caption generator. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"677","DOI":"10.1109\/TPAMI.2016.2599174","article-title":"Long-Term Recurrent Convolutional Networks for Visual Recognition and Description","volume":"39","author":"Donahue","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., and Saenko, K. (2014). Translating Videos to Natural Language Using Deep Recurrent Neural Networks. 
arXiv.","DOI":"10.3115\/v1\/N15-1173"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"ImageNet Large Scale Visual Recognition Challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vision"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., and Saenko, K. (2015, January 11\u201318). Sequence to Sequence\u2014Video to Text. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.515"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., and Courville, A. (2015, January 11\u201318). Describing Videos by Exploiting Temporal Structure. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.512"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Pan, Y., Mei, T., Yao, T., Li, H., and Rui, Y. (July, January 26). Jointly Modeling Embedding and Translation to Bridge Video and Language. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.497"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Pan, Y., Yao, T., Li, H., and Mei, T. (2017, January 21\u201326). Video Captioning with Transferred Semantic Attributes. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.111"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Wang, B., Ma, L., Zhang, W., and Liu, W. (2018, January 18\u201322). Reconstruction Network for Video Captioning. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00795"},{"key":"ref_25","unstructured":"Stollenga, M.F., Masci, J., Gomez, F., and Schmidhuber, J. (2014, January 8\u201311). Deep Networks with Internal Selective Attention through Feedback Connections. Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada."},{"key":"ref_26","unstructured":"Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015, January 7\u201310). Attention-Based Models for Speech Recognition. Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Pan, Y., Yao, T., Li, Y., and Mei, T. (2020, January 14\u201315). X-Linear Attention Networks for Image Captioning. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR42600.2020.01098"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"2011","DOI":"10.1109\/TPAMI.2019.2913372","article-title":"Squeeze-and-Excitation Networks","volume":"42","author":"Hu","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Ray, S., and Chandra, N. (2012). Domain Based Ontology and Automated Text Categorization Based on Improved Term Frequency\u2014Inverse Document Frequency. Int. J. Mod. Educ. Comput. 
Sci., 4.","DOI":"10.5815\/ijmecs.2012.04.04"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1618","DOI":"10.1016\/j.ipm.2019.05.003","article-title":"The impact of deep learning on document classification using semantically rich representations","volume":"56","author":"Kastrati","year":"2019","journal-title":"Inf. Process. Manag."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., and Saenko, K. (2013, January 1\u20133). YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition. Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV), Sydney, Australia.","DOI":"10.1109\/ICCV.2013.337"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Lavie, A., and Agarwal, A. (2007, January 23). Meteor: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. Proceedings of the Second Workshop on Statistical Machine Translation, Stroudsburg, PA, USA.","DOI":"10.3115\/1626355.1626389"},{"key":"ref_33","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2020, January 6\u201312). BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA."},{"key":"ref_34","unstructured":"Lin, C.Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out, Association for Computational Linguistics."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Zitnick, C.L., and Parikh, D. (2015, January 7\u201312). CIDEr: Consensus-based image description evaluation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_36","unstructured":"Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollar, P., and Zitnick, C.L. (2015). Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A.A. (2017, January 4\u20139). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.","DOI":"10.1609\/aaai.v31i1.11231"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning Spatiotemporal Features with 3D Convolutional Networks. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014, January 24\u201327). Large-Scale Video Classification with Convolutional Neural Networks. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.223"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014, January 25\u201329). Glove: Global Vectors for Word Representation. 
Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_41","unstructured":"Joulin, A., Grave, E., Bojanowski, P., Douze, M., J\u00e9gou, H., and Mikolov, T. (2016). FastText.zip: Compressing text classification models. arXiv."},{"key":"ref_42","first-page":"3111","article-title":"Distributed Representations of Words and Phrases and Their Compositionality","volume":"Volume 2","author":"Mikolov","year":"2013","journal-title":"Proceedings of the 26th International Conference on Neural Information Processing Systems"},{"key":"ref_43","unstructured":"Kingma, D., and Ba, J. (2015, May 7\u20139). Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Yu, H., Wang, J., Huang, Z., Yang, Y., and Xu, W. (2016, June 26\u2013July 1). Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.496"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Pan, P., Xu, Z., Yang, Y., Wu, F., and Zhuang, Y. (2016, June 26\u2013July 1). Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.117"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Chen, Y., Wang, S., Zhang, W., and Huang, Q. (2018, September 8\u201314). Less Is More: Picking Informative Frames for Video Captioning. Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01252-6"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Zhang, X., Gao, K., Zhang, Y., Zhang, D., Li, J., and Tian, Q. (2017, July 21\u201326). Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.662"},{"key":"ref_48","unstructured":"Chen, J., Pan, Y., Li, Y., Yao, T., Chao, H., and Mei, T. (2019, January 27\u2013February 1). Temporal deformable convolutional encoder-decoder networks for video captioning. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Aafaq, N., Akhtar, N., Liu, W., Gilani, S.Z., and Mian, A. (2019, June 16\u201320). Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01277"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Dong, J., Li, X., Lan, W., Huo, Y., and Snoek, C.G. (2016, October 15\u201319). Early Embedding and Late Reranking for Video Captioning. Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands.","DOI":"10.1145\/2964284.2984064"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Shetty, R., and Laaksonen, J. (2016, October 15\u201319). Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation. 
Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands.","DOI":"10.1145\/2964284.2984062"}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/13\/2\/55\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:27:16Z","timestamp":1760160436000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/13\/2\/55"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,2,23]]},"references-count":51,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2021,2]]}},"alternative-id":["fi13020055"],"URL":"https:\/\/doi.org\/10.3390\/fi13020055","relation":{},"ISSN":["1999-5903"],"issn-type":[{"type":"electronic","value":"1999-5903"}],"subject":[],"published":{"date-parts":[[2021,2,23]]}}}
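The abstract in this record describes two mechanisms: channel soft attention, which for each channel softly weights the features that the same convolutional kernel produced across time steps and feeds the weighted sum to the RCN encoder, and a semantic reconstructor, which maps the states produced while generating the sentence back to a semantic vector (the backward, sentence-to-semantic flow). The record does not contain the authors' implementation, so the PyTorch sketch below is only a minimal illustration of those two ideas; the module names, tensor shapes, mean pooling, and the MSE reconstruction loss are assumptions, not the paper's specification.

```python
# Minimal sketch (NOT the authors' code) of the two mechanisms named in the
# abstract: channel soft attention and a semantic reconstructor.
# Shapes, module names, and the MSE loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelSoftAttention(nn.Module):
    """For each channel c, softly weight the features the c-th convolutional
    kernel produced at every time step, then sum them over time."""

    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.score = nn.Linear(channels + hidden, channels)

    def forward(self, feats: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C) per-channel descriptors (e.g., spatially pooled
        #        feature maps of T sampled frames); h: (B, H) encoder state.
        B, T, _ = feats.shape
        ctx = torch.cat([feats, h.unsqueeze(1).expand(B, T, h.size(-1))], dim=-1)
        alpha = F.softmax(self.score(ctx), dim=1)  # normalize over time, per channel
        return (alpha * feats).sum(dim=1)          # (B, C) weighted sum per channel


class SemanticReconstructor(nn.Module):
    """Backward (sentence -> semantic) flow: rebuild the semantic vector from
    the decoder states produced while generating the caption."""

    def __init__(self, hidden: int, sem_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden, sem_dim)

    def forward(self, dec_states: torch.Tensor) -> torch.Tensor:
        # dec_states: (B, L, H) decoder hidden states; mean-pool then project.
        return torch.sigmoid(self.proj(dec_states.mean(dim=1)))


def semantic_reconstruction_loss(rebuilt: torch.Tensor,
                                 semantics: torch.Tensor) -> torch.Tensor:
    # Hypothetical auxiliary term added to the usual cross-entropy captioning
    # loss so that semantic information stays intact during training.
    return F.mse_loss(rebuilt, semantics)


if __name__ == "__main__":
    B, T, C, H, S, L = 2, 26, 512, 512, 300, 12   # assumed sizes
    attend = ChannelSoftAttention(C, H)
    encoded = attend(torch.randn(B, T, C), torch.zeros(B, H))  # fed to the RCN step
    rebuilt = SemanticReconstructor(H, S)(torch.randn(B, L, H))
    print(encoded.shape, semantic_reconstruction_loss(rebuilt, torch.rand(B, S)).item())
```

In this reading, the softmax runs over the time axis independently for each channel, so each kernel's activations compete across frames rather than across channels; other weighting schemes would be equally consistent with the abstract's wording.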