{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,18]],"date-time":"2026-02-18T00:08:50Z","timestamp":1771373330754,"version":"3.50.1"},"reference-count":59,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,2,10]],"date-time":"2022-02-10T00:00:00Z","timestamp":1644451200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,2,10]],"date-time":"2022-02-10T00:00:00Z","timestamp":1644451200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"published-print":{"date-parts":[[2022,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The massive addition of data to the internet in text, images, and videos made computer vision-based tasks challenging in the big data domain. Recent exploration of video data and progress in visual information captioning has been an arduous task in computer vision. Visual captioning is attributable to integrating visual information with natural language descriptions. This paper proposes an encoder-decoder framework with a 2D-Convolutional Neural Network (CNN) model and layered Long Short Term Memory (LSTM) as the encoder and an LSTM model integrated with an attention mechanism working as the decoder with a hybrid loss function. Visual feature vectors extracted from the video frames using a 2D-CNN model capture spatial features. Specifically, the visual feature vectors are fed into the layered LSTM to capture the temporal information. The attention mechanism enables the decoder to perceive and focus on relevant objects and correlate the visual context and language content for producing semantically correct captions. 
The visual features and GloVe word embeddings are input into the decoder to generate natural semantic descriptions for the videos. The performance of the proposed framework is evaluated on the video captioning benchmark dataset Microsoft Video Description (MSVD) using various well-known evaluation metrics. The experimental findings indicate that the suggested framework outperforms state-of-the-art techniques. Compared to the state-of-the-art research methods, the proposed model significantly increased all measures, B@1, B@2, B@3, B@4, METEOR, and CIDEr, with the score of 78.4, 64.8, 54.2, and 43.7, 32.3, and 70.7, respectively. The progression in all scores indicates a more excellent grasp of the context of the inputs, which results in more accurate caption prediction.<\/jats:p>","DOI":"10.1186\/s40537-022-00569-4","type":"journal-article","created":{"date-parts":[[2022,2,10]],"date-time":"2022-02-10T12:02:57Z","timestamp":1644494577000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Semantic context driven language descriptions of videos using deep neural network"],"prefix":"10.1186","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8989-6282","authenticated-orcid":false,"given":"Dinesh","family":"Naik","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"C. D.","family":"Jaidhar","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,2,10]]},"reference":[{"issue":"1","key":"569_CR1","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-021-00508-9","volume":"8","author":"E Suryawati","year":"2021","unstructured":"Suryawati E, Pardede HF, Zilvan V, Ramdan A, Krisnandi D, Heryana A, Yuwana RS, Kusumo R, Arisal A, Supianto AA. Unsupervised feature learning-based encoder and adversarial networks. J Big Data. 
2021;8(1):1\u201317. https:\/\/doi.org\/10.1186\/s40537-021-00508-9.","journal-title":"J Big Data"},{"issue":"1","key":"569_CR2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-021-00444-8","volume":"8","author":"L Alzubaidi","year":"2021","unstructured":"Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamar\u00eda J, Fadhel MA, Al-Amidie M, Farhan L. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data. 2021;8(1):1\u201374. https:\/\/doi.org\/10.1186\/s40537-021-00444-8.","journal-title":"J Big Data"},{"issue":"1","key":"569_CR3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-021-00414-0","volume":"8","author":"V Sampath","year":"2021","unstructured":"Sampath V, Maurtua I, Mart\u00edn JJA, Gutierrez A. A survey on generative adversarial networks for imbalance problems in computer vision tasks. J Big Data. 2021;8(1):1\u201359. https:\/\/doi.org\/10.1186\/s40537-021-00414-0.","journal-title":"J Big Data"},{"issue":"7","key":"569_CR4","doi-asserted-by":"publisher","first-page":"2631","DOI":"10.1109\/TCYB.2018.2831447","volume":"49","author":"Y Bin","year":"2019","unstructured":"Bin Y, Yang Y, Shen F, Xie N, Shen HT, Li X. Describing video with attention-based bidirectional LSTM. IEEE Trans Cybern. 2019;49(7):2631\u201341. https:\/\/doi.org\/10.1109\/TCYB.2018.2831447.","journal-title":"IEEE Trans Cybern"},{"key":"569_CR5","doi-asserted-by":"publisher","unstructured":"Olivastri S, Singh G, Cuzzolin F. End-to-end video captioning. In: 2019 IEEE\/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1474\u20131482, 2019. https:\/\/doi.org\/10.1109\/ICCVW.2019.00185.","DOI":"10.1109\/ICCVW.2019.00185"},{"issue":"11","key":"569_CR6","doi-asserted-by":"publisher","first-page":"5552","DOI":"10.1109\/TIP.2019.2916757","volume":"28","author":"B Zhao","year":"2019","unstructured":"Zhao B, Li X, Lu X. 
CAM-RNN: co-attention model based RNN for video captioning. IEEE Trans Image Process. 2019;28(11):5552\u201365. https:\/\/doi.org\/10.1109\/TIP.2019.2916757.","journal-title":"IEEE Trans Image Process"},{"issue":"9","key":"569_CR7","doi-asserted-by":"publisher","first-page":"2045","DOI":"10.1109\/TMM.2017.2729019","volume":"19","author":"L Gao","year":"2017","unstructured":"Gao L, Guo Z, Zhang H, Xu X, Shen HT. Video captioning with attention-based LSTM and semantic consistency. IEEE Trans Multimedia. 2017;19(9):2045\u201355. https:\/\/doi.org\/10.1109\/TMM.2017.2729019.","journal-title":"IEEE Trans Multimedia"},{"issue":"11","key":"569_CR8","doi-asserted-by":"publisher","first-page":"5600","DOI":"10.1109\/TIP.2018.2855422","volume":"27","author":"Y Yang","year":"2018","unstructured":"Yang Y, Zhou J, Ai J, Bin Y, Hanjalic A, Shen HT, Ji Y. Video captioning by adversarial LSTM. IEEE Trans Image Process. 2018;27(11):5600\u201311. https:\/\/doi.org\/10.1109\/TIP.2018.2855422.","journal-title":"IEEE Trans Image Process"},{"key":"569_CR9","doi-asserted-by":"publisher","unstructured":"Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney RJ, Saenko K. Translating videos to natural language using deep recurrent neural networks. CoRR 2014. https:\/\/doi.org\/10.3115\/v1\/N15-1173. arXiv:1412.4729.","DOI":"10.3115\/v1\/N15-1173"},{"issue":"1","key":"569_CR10","doi-asserted-by":"publisher","first-page":"229","DOI":"10.1109\/TMM.2019.2924576","volume":"22","author":"C Yan","year":"2020","unstructured":"Yan C, Tu Y, Wang X, Zhang Y, Hao X, Zhang Y, Dai Q. STAT: spatial-temporal attention mechanism for video captioning. IEEE Trans Multimedia. 2020;22(1):229\u201341. https:\/\/doi.org\/10.1109\/TMM.2019.2924576.","journal-title":"IEEE Trans Multimedia"},{"key":"569_CR11","doi-asserted-by":"publisher","unstructured":"Yu H, Wang J, Huang Z, Yang Y, Xu W. Video paragraph captioning using hierarchical recurrent neural networks. 
In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4584\u20134593, 2016. https:\/\/doi.org\/10.1109\/CVPR.2016.496.","DOI":"10.1109\/CVPR.2016.496"},{"key":"569_CR12","doi-asserted-by":"publisher","unstructured":"Pan P, Xu Z, Yang Y, Wu F, Zhuang Y. Hierarchical recurrent neural encoder for video representation with application to captioning. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1029\u20131038, 2016. https:\/\/doi.org\/10.1109\/CVPR.2016.117.","DOI":"10.1109\/CVPR.2016.117"},{"issue":"10","key":"569_CR13","doi-asserted-by":"publisher","first-page":"4933","DOI":"10.1109\/TIP.2018.2846664","volume":"27","author":"Y Xu","year":"2018","unstructured":"Xu Y, Han Y, Hong R, Tian Q. Sequential video VLAD: training the aggregation locally and temporally. IEEE Trans Image Process. 2018;27(10):4933\u201344. https:\/\/doi.org\/10.1109\/TIP.2018.2846664.","journal-title":"IEEE Trans Image Process"},{"key":"569_CR14","doi-asserted-by":"publisher","unstructured":"Amaresh M, Chitrakala S. Video captioning using deep learning: an overview of methods, datasets and metrics. In: 2019 International Conference on Communication and Signal Processing (ICCSP), pp. 0656\u20130661, 2019. https:\/\/doi.org\/10.1109\/ICCSP.2019.8698097.","DOI":"10.1109\/ICCSP.2019.8698097"},{"key":"569_CR15","unstructured":"Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. 2016. arXiv:1409.0473."},{"key":"569_CR16","doi-asserted-by":"publisher","unstructured":"Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532\u20131543. Association for Computational Linguistics, Doha, Qatar, 2014. 
https:\/\/doi.org\/10.3115\/v1\/D14-1162.","DOI":"10.3115\/v1\/D14-1162"},{"key":"569_CR17","doi-asserted-by":"publisher","first-page":"143584","DOI":"10.1109\/ACCESS.2020.3013321","volume":"8","author":"Y Jing","year":"2020","unstructured":"Jing Y, Zhiwei X, Guanglai G. Context-driven image caption with global semantic relations of the named entities. IEEE Access. 2020;8:143584\u201394. https:\/\/doi.org\/10.1109\/ACCESS.2020.3013321.","journal-title":"IEEE Access"},{"key":"569_CR18","doi-asserted-by":"publisher","unstructured":"Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: a neural image caption generator. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156\u20133164, 2015. https:\/\/doi.org\/10.1109\/CVPR.2015.7298935.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"569_CR19","doi-asserted-by":"crossref","unstructured":"You Q, Jin H, Wang Z, Fang C, Luo J. Image captioning with semantic attention. CoRR 2016. arXiv:1603.03925.","DOI":"10.1109\/CVPR.2016.503"},{"key":"569_CR20","doi-asserted-by":"publisher","unstructured":"Fang H, Gupta S, Iandola F, Srivastava RK, Deng L, Doll\u00e1r P, Gao J, He X, Mitchell M, Platt JC, Zitnick CL, Zweig G. From captions to visual concepts and back. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1473\u20131482, 2015. https:\/\/doi.org\/10.1109\/CVPR.2015.7298754.","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"569_CR21","doi-asserted-by":"publisher","unstructured":"Arowolo MO, Ogundokun RO, Misra S, Kadri AF, Aduragba TO. In: Garg, L., Chakraborty, C., Mahmoudi, S., Sohmen, V.S. (eds.) Machine learning approach using KPCA-SVMs for predicting COVID-19, pp. 193\u2013209. Springer, Cham, 2022. 
https:\/\/doi.org\/10.1007\/978-3-030-72752-9_10.","DOI":"10.1007\/978-3-030-72752-9_10"},{"issue":"17","key":"569_CR22","doi-asserted-by":"publisher","first-page":"9849","DOI":"10.48048\/wjst.2021.9849","volume":"18","author":"MO Arowolo","year":"2021","unstructured":"Arowolo MO, Adebiyi MO, Nnodim CT, Abdulsalam SO, Adebiyi AA. An adaptive genetic algorithm with recursive feature elimination approach for predicting malaria vector gene expression data classification using support vector machine kernels. Walailak J Sci Technol (WJST). 2021;18(17):9849.","journal-title":"Walailak J Sci Technol (WJST)"},{"issue":"2","key":"569_CR23","doi-asserted-by":"publisher","first-page":"1073","DOI":"10.11591\/ijeecs.v21.i2.pp1073-1081","volume":"21","author":"MO Arowolo","year":"2021","unstructured":"Arowolo MO, Adebiyi M, Adebiyi AA, OKesola J. Predicting RNA-Seq data using genetic algorithm and ensemble classification algorithms. Indonesian J Electr Eng Comput Sci. 2021;21(2):1073\u201381.","journal-title":"Indonesian J Electr Eng Comput Sci"},{"key":"569_CR24","doi-asserted-by":"crossref","unstructured":"Arowolo MO, Adebiyi MO, Adebiyi AA, Olugbara O. Optimized hybrid heuristic based dimensionality reduction methods for malaria vector using KNN classifier. 2020.","DOI":"10.21203\/rs.3.rs-107396\/v1"},{"issue":"9","key":"569_CR25","doi-asserted-by":"publisher","first-page":"2579","DOI":"10.17576\/jsm-2021-5009-07","volume":"50","author":"MO Arowolo","year":"2021","unstructured":"Arowolo MO, Adebiyi MO, Adebiyi AA. Enhanced dimensionality reduction methods for classifying malaria vector dataset using decision tree. Sains Malaysiana. 2021;50(9):2579\u201389.","journal-title":"Sains Malaysiana"},{"issue":"2","key":"569_CR26","first-page":"1071","volume":"10","author":"MO Adebiyi","year":"2021","unstructured":"Adebiyi MO, Arowolo MO, Olugbara O. A genetic algorithm for prediction of RNA-seq malaria vector gene expression data classification using SVM kernels. Bull Electr Eng Inf. 
2021;10(2):1071\u20139.","journal-title":"Bull Electr Eng Inf"},{"issue":"5","key":"569_CR27","doi-asserted-by":"publisher","first-page":"1112","DOI":"10.1109\/TPAMI.2019.2894139","volume":"42","author":"L Gao","year":"2020","unstructured":"Gao L, Li X, Song J, Shen HT. Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Trans Pattern Anal Mach Intell. 2020;42(5):1112\u201331. https:\/\/doi.org\/10.1109\/TPAMI.2019.2894139.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"569_CR28","doi-asserted-by":"publisher","unstructured":"Hossain MZ, Sohel F, Shiratuddin MF, Laga H, Bennamoun M. Bi-SAN-CAP: Bi-directional self-attention for image captioning. In: 2019 Digital Image Computing: Techniques and Applications (DICTA), pp. 1\u20137, 2019. https:\/\/doi.org\/10.1109\/DICTA47822.2019.8946003.","DOI":"10.1109\/DICTA47822.2019.8946003"},{"key":"569_CR29","doi-asserted-by":"publisher","unstructured":"Xu J, Yao T, Zhang Y, Mei T. Learning multimodal attention LSTM networks for video captioning. In: Proceedings of the 25th ACM International Conference on Multimedia. MM \u201917, pp. 537\u2013545. Association for Computing Machinery, New York. 2017. https:\/\/doi.org\/10.1145\/3123266.3123448.","DOI":"10.1145\/3123266.3123448"},{"key":"569_CR30","doi-asserted-by":"publisher","unstructured":"Khademi M, Schulte O. Image caption generation with hierarchical contextual visual spatial attention. In: 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2024\u201320248, 2018. https:\/\/doi.org\/10.1109\/CVPRW.2018.00260.","DOI":"10.1109\/CVPRW.2018.00260"},{"issue":"6","key":"569_CR31","doi-asserted-by":"publisher","first-page":"1709","DOI":"10.1109\/TCSVT.2019.2904996","volume":"30","author":"Z Ji","year":"2020","unstructured":"Ji Z, Xiong K, Pang Y, Li X. Video summarization with attention-based encoder-decoder networks. IEEE Trans Circuits Syst Video Technol. 2020;30(6):1709\u201317. 
https:\/\/doi.org\/10.1109\/TCSVT.2019.2904996.","journal-title":"IEEE Trans Circuits Syst Video Technol"},{"key":"569_CR32","doi-asserted-by":"publisher","unstructured":"Li L, Gong B. End-to-end video captioning with multitask reinforcement learning. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 339\u2013348, 2019. https:\/\/doi.org\/10.1109\/WACV.2019.00042.","DOI":"10.1109\/WACV.2019.00042"},{"issue":"4","key":"569_CR33","doi-asserted-by":"publisher","first-page":"297","DOI":"10.1109\/TETCI.2019.2892755","volume":"3","author":"S Li","year":"2019","unstructured":"Li S, Tao Z, Li K, Fu Y. Visual to text: survey of image and video captioning. IEEE Trans Emerg Topics Comput Intel. 2019;3(4):297\u2013312. https:\/\/doi.org\/10.1109\/TETCI.2019.2892755.","journal-title":"IEEE Trans Emerg Topics Comput Intel"},{"key":"569_CR34","unstructured":"Song J, Li X, Gao L, Shen HT. Hierarchical LSTMs with adaptive attention for visual captioning. CoRR 2018. arXiv:1812.11004."},{"issue":"5","key":"569_CR35","doi-asserted-by":"publisher","first-page":"3419","DOI":"10.1109\/JIOT.2017.2779865","volume":"5","author":"N Xu","year":"2018","unstructured":"Xu N, Liu A, Nie W, Su Y. Attention-in-attention networks for surveillance video understanding in internet of things. IEEE Internet Things J. 2018;5(5):3419\u201329. https:\/\/doi.org\/10.1109\/JIOT.2017.2779865.","journal-title":"IEEE Internet Things J"},{"key":"569_CR36","doi-asserted-by":"crossref","unstructured":"Zoph B, Vasudevan V, Shlens J, Le QV. Learning transferable architectures for scalable image recognition. CoRR 2017. arXiv:1707.07012.","DOI":"10.1109\/CVPR.2018.00907"},{"key":"569_CR37","doi-asserted-by":"publisher","unstructured":"Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818\u20132826, 2016. 
https:\/\/doi.org\/10.1109\/CVPR.2016.308.","DOI":"10.1109\/CVPR.2016.308"},{"key":"569_CR38","unstructured":"Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: The 3rd International Conference on Learning Representations (ICLR2015) 2015. arXiv:1409.1556."},{"issue":"8","key":"569_CR39","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735\u201380. https:\/\/doi.org\/10.1162\/neco.1997.9.8.1735.","journal-title":"Neural Comput"},{"key":"569_CR40","unstructured":"Sha L, Chang B, Sui Z, Li S. Reading and thinking: Re-read LSTM unit for textual entailment recognition. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2870\u20132879. The COLING 2016 Organizing Committee, Osaka, 2016."},{"key":"569_CR41","unstructured":"Chen D, Dolan WB. Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 190\u2013200, 2011."},{"key":"569_CR42","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1016\/j.procs.2018.08.153","volume":"135","author":"AG Salman","year":"2018","unstructured":"Salman AG, Heryadi Y, Abdurahman E, Suparta W. Single layer & multi-layer long short-term memory (LSTM) model with intermediate variables for weather forecasting. Proc Comput Sci. 2018;135:89\u201398.","journal-title":"Proc Comput Sci"},{"key":"569_CR43","doi-asserted-by":"publisher","unstructured":"Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311\u2013318. Association for Computational Linguistics, Philadelphia. 2002. 
https:\/\/doi.org\/10.3115\/1073083.1073135.","DOI":"10.3115\/1073083.1073135"},{"key":"569_CR44","unstructured":"Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation And\/or Summarization, pp. 65\u201372. Association for Computational Linguistics, Ann Arbor, Michigan. 2005. https:\/\/aclanthology.org\/W05-0909."},{"key":"569_CR45","doi-asserted-by":"crossref","unstructured":"Anderson P, Fernando B, Johnson M, Gould S. Spice: Semantic propositional image caption evaluation. In: European Conference on Computer Vision, pp. 382\u2013398, 2016. Springer.","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"569_CR46","doi-asserted-by":"crossref","unstructured":"Vedantam R, Zitnick CL, Parikh D. CIDEr: Consensus-based image description evaluation. CoRR 2014. arXiv:1411.5726.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"569_CR47","unstructured":"Lin CY. ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74\u201381. Association for Computational Linguistics, Barcelona. 2004. https:\/\/www.aclweb.org\/anthology\/W04-1013."},{"key":"569_CR48","doi-asserted-by":"publisher","unstructured":"Li G, Ma S, Han Y. Summarization-based video caption via deep neural networks. In: Proceedings of the 23rd ACM International Conference on Multimedia. MM \u201915, pp. 1191\u20131194. Association for Computing Machinery, New York. 2015. https:\/\/doi.org\/10.1145\/2733373.2806314.","DOI":"10.1145\/2733373.2806314"},{"key":"569_CR49","doi-asserted-by":"publisher","unstructured":"Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A. Describing videos by exploiting temporal structure. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4507\u20134515, 2015. 
https:\/\/doi.org\/10.1109\/ICCV.2015.512.","DOI":"10.1109\/ICCV.2015.512"},{"key":"569_CR50","doi-asserted-by":"crossref","unstructured":"Xu H, Venugopalan S, Ramanishka V, Rohrbach M, Saenko K. A multi-scale multiple instance video description network. CoRR 2015. arXiv:1505.05914.","DOI":"10.1145\/2964284.2984066"},{"key":"569_CR51","doi-asserted-by":"publisher","unstructured":"Pan Y, Mei T, Yao T, Li H, Rui Y. Jointly modeling embedding and translation to bridge video and language. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4594\u20134602, 2016. https:\/\/doi.org\/10.1109\/CVPR.2016.497.","DOI":"10.1109\/CVPR.2016.497"},{"key":"569_CR52","doi-asserted-by":"publisher","unstructured":"Baraldi L, Grana C, Cucchiara R. Hierarchical boundary-aware neural encoder for video captioning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3185\u20133194, 2017. https:\/\/doi.org\/10.1109\/CVPR.2017.339.","DOI":"10.1109\/CVPR.2017.339"},{"key":"569_CR53","doi-asserted-by":"publisher","first-page":"126","DOI":"10.1016\/j.cviu.2017.06.012","volume":"163","author":"F Nian","year":"2017","unstructured":"Nian F, Li T, Wang Y, Wu X, Ni B, Xu C. Learning explicit video attributes from mid-level representation for video captioning. Comput Vis Image Underst. 2017;163:126\u201338. https:\/\/doi.org\/10.1016\/j.cviu.2017.06.012. (Language in Vision).","journal-title":"Comput Vis Image Underst"},{"key":"569_CR54","doi-asserted-by":"publisher","unstructured":"Hao X, Zhou F, Li X. Scene-Edge GRU for Video Caption. In: 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), vol. 1, pp. 1290\u20131295, 2020. 
https:\/\/doi.org\/10.1109\/ITNEC48623.2020.9084781.","DOI":"10.1109\/ITNEC48623.2020.9084781"},{"key":"569_CR55","doi-asserted-by":"publisher","first-page":"102840","DOI":"10.1016\/j.cviu.2019.102840","volume":"190","author":"M Nabati","year":"2020","unstructured":"Nabati M, Behrad A. Video captioning using boosted and parallel Long Short-Term Memory networks. Comput Vis Image Underst. 2020;190:102840. https:\/\/doi.org\/10.1016\/j.cviu.2019.102840.","journal-title":"Comput Vis Image Underst"},{"issue":"1","key":"569_CR56","doi-asserted-by":"publisher","first-page":"147","DOI":"10.1007\/s10044-018-00770-3","volume":"23","author":"S Sah","year":"2020","unstructured":"Sah S, Nguyen T, Ptucha R. Understanding temporal structure for video captioning. Pattern Anal Appl. 2020;23(1):147\u201359.","journal-title":"Pattern Anal Appl"},{"key":"569_CR57","doi-asserted-by":"publisher","unstructured":"Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K. Sequence to Sequence\u2014Video to Text. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4534\u20134542, 2015. https:\/\/doi.org\/10.1109\/ICCV.2015.515.","DOI":"10.1109\/ICCV.2015.515"},{"key":"569_CR58","doi-asserted-by":"publisher","unstructured":"Tu Y, Zhang X, Liu B, Yan C. Video description with spatial-temporal attention. In: Proceedings of the 25th ACM International Conference on Multimedia. MM \u201917, pp. 1014\u20131022. Association for Computing Machinery, New York, 2017. https:\/\/doi.org\/10.1145\/3123266.3123354.","DOI":"10.1145\/3123266.3123354"},{"key":"569_CR59","doi-asserted-by":"publisher","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171\u20134186, 2019. 
https:\/\/doi.org\/10.18653\/v1\/N19-1423.","DOI":"10.18653\/v1\/N19-1423"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-022-00569-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-022-00569-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-022-00569-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,2,10]],"date-time":"2022-02-10T12:13:25Z","timestamp":1644495205000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-022-00569-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,10]]},"references-count":59,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,12]]}},"alternative-id":["569"],"URL":"https:\/\/doi.org\/10.1186\/s40537-022-00569-4","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,2,10]]},"assertion":[{"value":"19 September 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 January 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 February 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not 
applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"17"}}