{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,21]],"date-time":"2026-01-21T10:57:29Z","timestamp":1768993049228,"version":"3.49.0"},"reference-count":48,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,11,12]],"date-time":"2022-11-12T00:00:00Z","timestamp":1668211200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,11,12]],"date-time":"2022-11-12T00:00:00Z","timestamp":1668211200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The massive influx of text, images, and videos to the internet has recently increased the challenge of computer vision-based tasks in big data. Integrating visual data with natural language to generate video explanations has been a challenge for decades. However, recent experiments on image\/video captioning that employ Long-Short-Term-Memory (LSTM) have piqued the interest of researchers studying its possible application in video captioning. The proposed video captioning architecture combines the bidirectional multilayer LSTM (BiLSTM) encoder and unidirectional decoder. The innovative architecture also considers temporal relations when creating superior global video representations. In contrast to the majority of prior work, the most relevant features of a video are selected and utilized specifically for captioning purposes. Existing methods utilize a single-layer attention mechanism for linking visual input with phrase meaning. This approach employs LSTMs and a multilayer attention mechanism to extract characteristics from movies, construct links between multi-modal (words and visual material) representations, and generate sentences with rich semantic coherence. In addition, we evaluated the performance of the suggested system using a benchmark dataset for video captioning. The obtained results reveal superior performance relative to state-of-the-art works in METEOR and promising performance relative to the BLEU score. In terms of quantitative performance, the proposed approach outperforms most existing methodologies.<\/jats:p>","DOI":"10.1186\/s40537-022-00664-6","type":"journal-article","created":{"date-parts":[[2022,11,12]],"date-time":"2022-11-12T11:02:58Z","timestamp":1668250978000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":17,"title":["A novel Multi-Layer Attention Framework for visual description prediction using bidirectional LSTM"],"prefix":"10.1186","volume":"9","author":[{"given":"Dinesh","family":"Naik","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"C. D.","family":"Jaidhar","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,11,12]]},"reference":[{"issue":"1","key":"664_CR1","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-021-00492-0","volume":"8","author":"C Shorten","year":"2021","unstructured":"Shorten C, Khoshgoftaar TM, Furht B. Text data augmentation for deep learning. J Big Data. 2021;8(1):1\u201334.","journal-title":"J Big Data"},{"key":"664_CR2","first-page":"09151","volume":"1711","author":"J Aneja","year":"2017","unstructured":"Aneja J, Deshpande A, Schwing A. Convolutional image captioning. Comput Vis Pattern Recognit. 2017;1711:09151.","journal-title":"Comput Vis Pattern Recognit"},{"key":"664_CR3","unstructured":"Kiros R, Salakhutdinov R, Zemel RS. Unifying visual-semantic embeddings with multimodal neural language models. https:\/\/arxiv.org\/abs\/1411.2539"},{"key":"664_CR4","doi-asserted-by":"crossref","unstructured":"Krishna R, Hata K, Ren F, Fei-Fei L, Carlos\u00a0Niebles J. Dense-captioning events in videos. In: Proceedings of the IEEE International conference on computer vision. 2017. p. 706\u2013715.","DOI":"10.1109\/ICCV.2017.83"},{"key":"664_CR5","doi-asserted-by":"publisher","first-page":"218386","DOI":"10.1109\/ACCESS.2020.3042484","volume":"8","author":"S Amirian","year":"2020","unstructured":"Amirian S, Rasheed K, Taha TR, Arabnia HR. Automatic image and video caption generation with deep learning: a concise review and algorithmic overlap. IEEE Access. 2020;8:218386\u2013400.","journal-title":"IEEE Access"},{"issue":"1","key":"664_CR6","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-021-00444-8","volume":"8","author":"L Alzubaidi","year":"2021","unstructured":"Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamar\u00eda J, Fadhel MA, Al-Amidie M, Farhan L. Review of deep learning: concepts, cnn architectures, challenges, applications, future directions. J Big Data. 2021;8(1):1\u201374.","journal-title":"J Big Data"},{"issue":"8","key":"664_CR7","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735\u201380.","journal-title":"Neural Comput"},{"issue":"7","key":"664_CR8","doi-asserted-by":"publisher","first-page":"2631","DOI":"10.1109\/TCYB.2018.2831447","volume":"49","author":"Y Bin","year":"2018","unstructured":"Bin Y, Yang Y, Shen F, Xie N, Shen HT, Li X. Describing video with attention-based bidirectional lstm. IEEE Transact Cybern. 2018;49(7):2631\u201341.","journal-title":"IEEE Transact Cybern"},{"issue":"4","key":"664_CR9","doi-asserted-by":"publisher","first-page":"297","DOI":"10.1109\/TETCI.2019.2892755","volume":"3","author":"S Li","year":"2019","unstructured":"Li S, Tao Z, Li K, Fu Y. Visual to text: survey of image and video captioning. IEEE Transact Emerg Topics Comput Intell. 2019;3(4):297\u2013312. https:\/\/doi.org\/10.1109\/TETCI.2019.2892755.","journal-title":"IEEE Transact Emerg Topics Comput Intell"},{"issue":"11","key":"664_CR10","doi-asserted-by":"publisher","first-page":"5600","DOI":"10.1109\/TIP.2018.2855422","volume":"27","author":"Y Yang","year":"2018","unstructured":"Yang Y, Zhou J, Ai J, Bin Y, Hanjalic A, Shen HT, Ji Y. Video captioning by adversarial lstm. IEEE Transact Image Process. 2018;27(11):5600\u201311. https:\/\/doi.org\/10.1109\/TIP.2018.2855422.","journal-title":"IEEE Transact Image Process"},{"key":"664_CR11","unstructured":"Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint. 2014. https:\/\/arxiv.org\/abs\/1409.1556"},{"key":"664_CR12","doi-asserted-by":"crossref","unstructured":"Krishnamoorthy N, Malkarnenkar G, Mooney R, Saenko K, Guadarrama S. Generating natural-language video descriptions using text-mined knowledge. In: Twenty-Seventh AAAI Conference on Artificial Intelligence. 2013.","DOI":"10.1609\/aaai.v27i1.8679"},{"key":"664_CR13","doi-asserted-by":"crossref","unstructured":"Pan P, Xu Z, Yang Y, Wu F, Zhuang Y. Hierarchical recurrent neural encoder for video representation with application to captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 1029\u20131038.","DOI":"10.1109\/CVPR.2016.117"},{"key":"664_CR14","unstructured":"Thomason J, Venugopalan S, Guadarrama S, Saenko K, Mooney R. Integrating language and vision to generate natural language descriptions of videos in the wild. In: Proceedings of COLING 2014, the 25th International conference on computational linguistics: technical papers. 2014. p. 1218\u20131227."},{"key":"664_CR15","doi-asserted-by":"crossref","unstructured":"Xu R, Xiong C, Chen W, Corso J. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In: Proceedings of the AAAI conference on artificial intelligence. 2015; 29.","DOI":"10.1609\/aaai.v29i1.9512"},{"issue":"1","key":"664_CR16","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-021-00509-8","volume":"8","author":"J Jaafari","year":"2021","unstructured":"Jaafari J, Douzi S, Douzi K, Hssina B. Towards more efficient cnn-based surgical tools classification using transfer learning. J Big Data. 2021;8(1):1\u201315.","journal-title":"J Big Data"},{"key":"664_CR17","doi-asserted-by":"crossref","unstructured":"Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K. Sequence to sequence-video to text. In: Proceedings of the IEEE International conference on computer vision. 2015. p. 4534\u20134542.","DOI":"10.1109\/ICCV.2015.515"},{"key":"664_CR18","doi-asserted-by":"crossref","unstructured":"Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A. Describing videos by exploiting temporal structure. In: Proceedings of the IEEE International conference on computer vision. 2015. p. 4507\u20134515.","DOI":"10.1109\/ICCV.2015.512"},{"key":"664_CR19","doi-asserted-by":"crossref","unstructured":"Bin Y, Yang Y, Shen F, Xu X, Shen HT. Bidirectional long-short term memory for video description. In: Proceedings of the 24th ACM International conference on multimedia. 2016. p. 436\u2013440.","DOI":"10.1145\/2964284.2967258"},{"issue":"9","key":"664_CR20","doi-asserted-by":"publisher","first-page":"2045","DOI":"10.1109\/TMM.2017.2729019","volume":"19","author":"L Gao","year":"2017","unstructured":"Gao L, Guo Z, Zhang H, Xu X, Shen HT. Video captioning with attention-based lstm and semantic consistency. IEEE Transact Multimed. 2017;19(9):2045\u201355.","journal-title":"IEEE Transact Multimed"},{"issue":"1","key":"664_CR21","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1109\/TCSVT.2021.3058626","volume":"32","author":"Y Zheng","year":"2022","unstructured":"Zheng Y, Zhang Y, Feng R, Zhang T, Fan W. Stacked multimodal attention network for context-aware video captioning. IEEE Transact Circuit Syst Video Technol. 2022;32(1):31\u201342. https:\/\/doi.org\/10.1109\/TCSVT.2021.3058626.","journal-title":"IEEE Transact Circuit Syst Video Technol"},{"issue":"2","key":"664_CR22","doi-asserted-by":"publisher","first-page":"880","DOI":"10.1109\/TCSVT.2021.3063423","volume":"32","author":"J Deng","year":"2022","unstructured":"Deng J, Li L, Zhang B, Wang S, Zha Z, Huang Q. Syntax-guided hierarchical attention network for video captioning. IEEE Transact Circuit Syst Video Technol. 2022;32(2):880\u201392. https:\/\/doi.org\/10.1109\/TCSVT.2021.3063423.","journal-title":"IEEE Transact Circuit Syst Video Technol"},{"key":"664_CR23","doi-asserted-by":"publisher","first-page":"2004","DOI":"10.1109\/TIP.2022.3148868","volume":"31","author":"X Hua","year":"2022","unstructured":"Hua X, Wang X, Rui T, Shao F, Wang D. Adversarial reinforcement learning with object-scene relational graph for video captioning. IEEE Transact Image Process. 2022;31:2004\u201316.","journal-title":"IEEE Transact Image Process"},{"key":"664_CR24","doi-asserted-by":"crossref","unstructured":"Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K. Translating videos to natural language using deep recurrent neural networks. 2014. . http:\/\/arxiv.org\/abs\/1412.4729","DOI":"10.3115\/v1\/N15-1173"},{"issue":"11","key":"664_CR25","doi-asserted-by":"publisher","first-page":"5552","DOI":"10.1109\/TIP.2019.2916757","volume":"28","author":"B Zhao","year":"2019","unstructured":"Zhao B, Li X, Lu X. Cam-rnn: co-attention model based rnn for video captioning. IEEE Transact Image Process. 2019;28(11):5552\u201365. https:\/\/doi.org\/10.1109\/TIP.2019.2916757.","journal-title":"IEEE Transact Image Process"},{"issue":"5","key":"664_CR26","doi-asserted-by":"publisher","first-page":"1112","DOI":"10.1109\/TPAMI.2019.2894139","volume":"42","author":"L Gao","year":"2020","unstructured":"Gao L, Li X, Song J, Shen HT. Hierarchical lstms with adaptive attention for visual captioning. IEEE Transact Pattern Anal Mach Intell. 2020;42(5):1112\u201331. https:\/\/doi.org\/10.1109\/TPAMI.2019.2894139.","journal-title":"IEEE Transact Pattern Anal Mach Intell"},{"key":"664_CR27","doi-asserted-by":"publisher","unstructured":"Hossain MZ, Sohel F, Shiratuddin MF, Laga H, Bennamoun, M. Bi-san-cap: bi-directional self-attention for image captioning. In: 2019 Digital image computing: techniques and applications (DICTA). 2019. p. 1\u20137. https:\/\/doi.org\/10.1109\/DICTA47822.2019.8946003","DOI":"10.1109\/DICTA47822.2019.8946003"},{"key":"664_CR28","doi-asserted-by":"publisher","unstructured":"Xu J, Yao T, Zhang Y, Mei T. Learning multimodal attention lstm networks for video captioning. In: Proceedings of the 25th ACM International conference on multimedia. MM \u201917. New York: Association for computing machinery; 2017. p. 537\u2013545. https:\/\/doi.org\/10.1145\/3123266.3123448","DOI":"10.1145\/3123266.3123448"},{"key":"664_CR29","unstructured":"Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. 2014. p. 3104\u20133112."},{"issue":"11","key":"664_CR30","doi-asserted-by":"publisher","first-page":"2673","DOI":"10.1109\/78.650093","volume":"45","author":"M Schuster","year":"1997","unstructured":"Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Transact Signal Process. 1997;45(11):2673\u201381.","journal-title":"IEEE Transact Signal Process"},{"key":"664_CR31","doi-asserted-by":"crossref","unstructured":"Karpathy A, Fei-Fei L. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. p. 3128\u20133137.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"664_CR32","doi-asserted-by":"publisher","first-page":"1155","DOI":"10.1109\/ACCESS.2017.2778011","volume":"6","author":"A Ullah","year":"2017","unstructured":"Ullah A, Ahmad J, Muhammad K, Sajjad M, Baik SW. Action recognition in video sequences using deep bi-directional lstm with cnn features. IEEE Access. 2017;6:1155\u201366.","journal-title":"IEEE Access"},{"key":"664_CR33","unstructured":"Li J, Qiu H. Comparing attention-based neural architectures for video captioning. https:\/\/web.stanford.edu\/class\/archive\/cs\/cs224n\/cs224n.1194"},{"key":"664_CR34","unstructured":"Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y. Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning. New York: PMLR; 2015. p. 2048\u20132057."},{"key":"664_CR35","unstructured":"Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. 2014. arXiv. http:\/\/arxiv.org\/abs\/1409.0473"},{"key":"664_CR36","unstructured":"Chen D, Dolan W. Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. Portland: Association for Computational Linguistics. 2011. p. 190\u2013200."},{"key":"664_CR37","doi-asserted-by":"crossref","unstructured":"Xu J, Mei T, Yao T, Rui Y. Msr-vtt: A large video description dataset for bridging video and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 5288\u20135296 (2016)","DOI":"10.1109\/CVPR.2016.571"},{"key":"664_CR38","unstructured":"Zeiler MD. ADADELTA: an adaptive learning rate method. CoRR. 2012. https:\/\/arxiv.org\/abs\/1212.5701"},{"key":"664_CR39","doi-asserted-by":"publisher","unstructured":"Papineni K, Roukos S, Ward T, Zhu WJ. Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics. 2002; p. 311\u2013318 (2002). https:\/\/doi.org\/10.3115\/1073083.1073135","DOI":"10.3115\/1073083.1073135"},{"key":"664_CR40","unstructured":"Banerjee S, Lavie A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization. Ann Arbor: Association for Computational Linguistics. 2005. p. 65\u201372."},{"key":"664_CR41","unstructured":"Feinerer I, Hornik K. Wordnet: WordNet Interface. R package version 0.1-15. 2020. https:\/\/CRAN.R-project.org\/package=wordnet"},{"key":"664_CR42","doi-asserted-by":"crossref","unstructured":"Pan Y, Mei T, Yao T, Li H, Rui Y. Jointly modeling embedding and translation to bridge video and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 4594\u20134602.","DOI":"10.1109\/CVPR.2016.497"},{"key":"664_CR43","doi-asserted-by":"crossref","unstructured":"Xu H, Venugopalan S, Ramanishka V, Rohrbach M, Saenko K. A multi-scale multiple instance video description network. 2015. . http:\/\/arxiv.org\/abs\/1505.05914","DOI":"10.1145\/2964284.2984066"},{"key":"664_CR44","unstructured":"Venugopalan S, Hendricks LA, Mooney R, Saenko K. Improving lstm-based video description with linguistic knowledge mined from text. . http:\/\/arxiv.org\/abs\/1604.01729"},{"issue":"1","key":"664_CR45","doi-asserted-by":"publisher","first-page":"229","DOI":"10.1109\/TMM.2019.2924576","volume":"22","author":"C Yan","year":"2019","unstructured":"Yan C, Tu Y, Wang X, Zhang Y, Hao X, Zhang Y, Dai Q. Stat: Spatial-temporal attention mechanism for video captioning. IEEE Transact Multimed. 2019;22(1):229\u201341.","journal-title":"IEEE Transact Multimed"},{"key":"664_CR46","unstructured":"Aafaq N, Akhtar N, Liu W, Gilani SZ, Mian A. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition. 2016. p. 12487\u201312496."},{"issue":"1","key":"664_CR47","first-page":"1929","volume":"15","author":"N Srivastava","year":"2014","unstructured":"Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929\u201358.","journal-title":"J Mach Learn Res"},{"key":"664_CR48","doi-asserted-by":"crossref","unstructured":"Zoph B, Vasudevan V, Shlens J, Le QV. Learning transferable architectures for scalable image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition. 2018. p. 8697\u20138710.","DOI":"10.1109\/CVPR.2018.00907"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-022-00664-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-022-00664-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-022-00664-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,11,12]],"date-time":"2022-11-12T11:07:25Z","timestamp":1668251245000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-022-00664-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,12]]},"references-count":48,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2022,12]]}},"alternative-id":["664"],"URL":"https:\/\/doi.org\/10.1186\/s40537-022-00664-6","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,11,12]]},"assertion":[{"value":"5 February 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 October 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 November 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"104"}}