{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T06:52:22Z","timestamp":1777704742920,"version":"3.51.4"},"reference-count":24,"publisher":"SAGE Publications","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IFS"],"published-print":{"date-parts":[[2021,6,21]]},"abstract":"<jats:p>Recently many methods use encoder-decoder framework for video captioning, aiming to translate short videos into natural language. These methods usually use equal interval frame sampling. However, lacking a good efficiency in sampling, it has a high temporal and spatial redundancy, resulting in unnecessary computation cost. In addition, the existing approaches simply splice different visual features on the fully connection layer. Therefore, features cannot be effectively utilized. In order to solve the defects, we proposed filtration network (FN) to select key frames, which is trained by deep reinforcement learning algorithm actor-double-critic. According to behavior psychology, the core idea of actor-double-critic is that the behavior of agent is determined by both the external environment and the internal personality. It avoids the phenomenon of unclear reward and sparse feedback in training because it gives steady feedback after each action. The key frames are sent to combine codec network (CCN) to generate sentences. The operation of feature combination in CCN make fusion of visual features by complex number representation to make good semantic modeling. 
Experiments and comparisons with other methods on two datasets (MSVD\/MSR-VTT) show that our approach achieves better performance in terms of four metrics, BLEU-4, METEOR, ROUGE-L and CIDEr.<\/jats:p>","DOI":"10.3233\/jifs-202249","type":"journal-article","created":{"date-parts":[[2021,4,20]],"date-time":"2021-04-20T21:26:17Z","timestamp":1618953977000},"page":"11085-11097","source":"Crossref","is-referenced-by-count":4,"title":["Filtration network: A frame sampling strategy via deep reinforcement learning for video captioning"],"prefix":"10.1177","volume":"40","author":[{"given":"Tiancheng","family":"Qian","sequence":"first","affiliation":[{"name":"College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xue","family":"Mei","sequence":"additional","affiliation":[{"name":"College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pengxiang","family":"Xu","sequence":"additional","affiliation":[{"name":"College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kangqi","family":"Ge","sequence":"additional","affiliation":[{"name":"College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhelei","family":"Qiu","sequence":"additional","affiliation":[{"name":"College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","reference":[{"key":"10.3233\/JIFS-202249_ref1","doi-asserted-by":"crossref","unstructured":"Zolfaghari M. , Singh K. and Brox T. 
, ECO: Efficient convolutional network for online video understanding, The European Conference on Computer Vision (ECCV) (2018), 695\u2013712.","DOI":"10.1007\/978-3-030-01216-8_43"},{"key":"10.3233\/JIFS-202249_ref2","doi-asserted-by":"crossref","first-page":"424","DOI":"10.1016\/j.neucom.2018.11.038","article-title":"Fast and robust dynamic hand gesture recognition via key frames extraction and feature fusion","volume":"331","author":"Tang","year":"2019","journal-title":"Neurocomputing"},{"issue":"357","key":"10.3233\/JIFS-202249_ref3","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1016\/j.neucom.2019.05.027","article-title":"Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature","volume":"10","author":"Xu","year":"2019","journal-title":"Neurocomputing"},{"key":"10.3233\/JIFS-202249_ref4","doi-asserted-by":"crossref","unstructured":"Chen Y. , Wang S. , Zhang W. , et al., Less is more: Picking informative frames for video captioning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.","DOI":"10.1007\/978-3-030-01261-8_22"},{"issue":"04","key":"10.3233\/JIFS-202249_ref7","doi-asserted-by":"crossref","first-page":"1011","DOI":"10.1016\/j.cja.2018.12.018","article-title":"Online scheduling of image satellites based on neural networks and deep reinforcement learning","volume":"32","author":"Wang","year":"2019","journal-title":"Chinese Journal of Aeronautics"},{"key":"10.3233\/JIFS-202249_ref8","doi-asserted-by":"crossref","unstructured":"Zhang W. , Wang B. , Ma L. , et al., Reconstruct and represent video contents for captioning via reinforcement learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.","DOI":"10.1109\/TPAMI.2019.2920899"},{"key":"10.3233\/JIFS-202249_ref9","unstructured":"Agarwal R. , Liang C. , Schuurmans D. 
, et al., Learning to generalize from sparse and underspecified rewards, International Conference on Machine Learning (ICML), 2019."},{"key":"10.3233\/JIFS-202249_ref10","doi-asserted-by":"crossref","unstructured":"Wang B. , Ma L. , Zhang W. , et al., Controllable video captioning with pos sequence guidance based on gated fusion network, IEEE International Conference on Computer Vision (ICCV), 2019.","DOI":"10.1109\/ICCV.2019.00273"},{"key":"10.3233\/JIFS-202249_ref11","doi-asserted-by":"crossref","unstructured":"Venugopalan S. , Xu H. , Donahue J. , et al., Translating videos to natural language using deep recurrent neural networks, The Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (2015), 1494\u20131504.","DOI":"10.3115\/v1\/N15-1173"},{"issue":"22","key":"10.3233\/JIFS-202249_ref12","doi-asserted-by":"crossref","first-page":"3179","DOI":"10.1007\/s11042-019-08011-3","article-title":"Deep multimodal embedding for video captioning","volume":"78","author":"Lee","year":"2019","journal-title":"Multimedia Tools and Applications"},{"issue":"6","key":"10.3233\/JIFS-202249_ref13","doi-asserted-by":"crossref","first-page":"663","DOI":"10.1007\/s00530-018-0598-5","article-title":"Multi-guiding long short-term memory for video captioning","volume":"25","author":"Xu","year":"2019","journal-title":"Multimedia Systems"},{"key":"10.3233\/JIFS-202249_ref14","doi-asserted-by":"crossref","unstructured":"Aafaq N. , Akhtar N. , Liu W. , et al., Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2019), 12487\u201312496.","DOI":"10.1109\/CVPR.2019.01277"},{"key":"10.3233\/JIFS-202249_ref15","doi-asserted-by":"crossref","unstructured":"Liu S. , Ren Z. and Yuan J. 
, Sibnet: Sibling convolutional encoder for video captioning, ACM Multimedia Conference on Multimedia Conference (ACMMM), (2018), 1425\u20131434.","DOI":"10.1145\/3240508.3240667"},{"key":"10.3233\/JIFS-202249_ref16","doi-asserted-by":"crossref","unstructured":"Zhang J. and Peng Y. , Object-aware aggregation with bidirectional temporal graph for video captioning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2019), 8327\u20138336.","DOI":"10.1109\/CVPR.2019.00852"},{"key":"10.3233\/JIFS-202249_ref17","doi-asserted-by":"crossref","unstructured":"Zhao B. , Li X. and Lu X. , CAM-RNN: Co-attention model based RNN for video captioning, IEEE transactions on image processing: a publication of the IEEE Signal Processing Society 28(11) (2019).","DOI":"10.1109\/TIP.2019.2916757"},{"key":"10.3233\/JIFS-202249_ref18","doi-asserted-by":"crossref","unstructured":"Jin T. , Li Y. and Zhang Z. , Recurrent convolutional video captioning with global and local attention, Neurocomputing 370 (2019).","DOI":"10.1016\/j.neucom.2019.08.042"},{"key":"10.3233\/JIFS-202249_ref19","doi-asserted-by":"crossref","unstructured":"Wang J. , Wang W. , Huang Y. , et al., M3: Multimodal memory modelling for video captioning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018), 7512\u20137520.","DOI":"10.1109\/CVPR.2018.00784"},{"key":"10.3233\/JIFS-202249_ref20","doi-asserted-by":"crossref","unstructured":"Chen S. and Jiang Y. , Motion guided spatial attention for video captioning, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI), (2019), 8191\u20138198.","DOI":"10.1609\/aaai.v33i01.33018191"},{"key":"10.3233\/JIFS-202249_ref21","doi-asserted-by":"crossref","unstructured":"Pei W. , Zhang J. , Wang X. 
, et al., Memory-attended recurrent network for video captioning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2019), 8347\u20138356.","DOI":"10.1109\/CVPR.2019.00854"},{"key":"10.3233\/JIFS-202249_ref22","doi-asserted-by":"crossref","unstructured":"Zhang Z. , Shi Y. , Yuan C. , et al., Object relational graph with teacher-recommended learning for video captioning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.","DOI":"10.1109\/CVPR42600.2020.01329"},{"issue":"6","key":"10.3233\/JIFS-202249_ref23","first-page":"1291","article-title":"A survey of actor-critic reinforcement learning: Standard and natural policy gradients. Systems, Man, and Cybernetics, Part C: Applications and Reviews","volume":"42","author":"Grondman","year":"2012","journal-title":"IEEE Transactions on"},{"key":"10.3233\/JIFS-202249_ref28","unstructured":"Lin Z. , Feng M. , dos Santos C.N. , Yu M. , Xiang B. , Zhou B. and Bengio Y. , A Structured Self-attentive Sentence Embedding[C], International Conference on Learning Representations (ICLR), (2017)."},{"key":"10.3233\/JIFS-202249_ref29","unstructured":"Chen D.L. and Dolan W.B. , Collecting highly parallel data for paraphrase evaluation, The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, (2011), 190\u2013200."},{"key":"10.3233\/JIFS-202249_ref30","doi-asserted-by":"crossref","unstructured":"Xu J. , Mei T. , Yao T. 
, et al., MSR-VTT: A large video description dataset for bridging video and language, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 5288\u20135296.","DOI":"10.1109\/CVPR.2016.571"}],"container-title":["Journal of Intelligent & Fuzzy Systems"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/JIFS-202249","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T09:41:55Z","timestamp":1777455715000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/JIFS-202249"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,6,21]]},"references-count":24,"journal-issue":{"issue":"6"},"URL":"https:\/\/doi.org\/10.3233\/jifs-202249","relation":{},"ISSN":["1064-1246","1875-8967"],"issn-type":[{"value":"1064-1246","type":"print"},{"value":"1875-8967","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,6,21]]}}}