{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T05:53:21Z","timestamp":1770702801523,"version":"3.49.0"},"reference-count":23,"publisher":"SAGE Publications","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IFS"],"published-print":{"date-parts":[[2022,6,1]]},"abstract":"<jats:p>Caption generation using an encoder-decoder approach has recently been extensively studied and implemented in various domains, including image captioning and code captioning. In this research article, we propose an approach to the caption generation task using an \u201cattention\u201d-based sequence-to-sequence framework that, combined with a conventional encoder-decoder model, generates captions in an attention-driven manner. ResNet-152, a Convolutional Neural Network (CNN), serves as the encoder, generating a comprehensive representation of an input image and embedding it into a fixed-length vector. To predict the next word, the decoder uses a Long Short-Term Memory (LSTM) network, a variant of the Recurrent Neural Network (RNN), with an attention mechanism that selectively focuses on salient regions of the image. The number of training epochs is set to 69, which is sufficient for the model to generate informative descriptions, as the validation loss reaches its minimum and no longer decreases. We present the datasets and evaluation metrics, along with quantitative and qualitative analyses. Experiments on the MSCOCO and Flickr8k benchmark datasets illustrate the model\u2019s efficacy in comparison to baseline techniques. On MSCOCO, evaluation scores were BLEU-1 0.81, BLEU-2 0.61, BLEU-3 0.47, and METEOR 0.33; on Flickr8k, BLEU-1 0.68, BLEU-2 0.49, BLEU-3 0.41, METEOR 0.23, and SPICE 0.86.
The proposed approach is comparable with several state-of-the-art methods in terms of standard evaluation metrics, demonstrating that it can produce more accurate and richer captions.<\/jats:p>","DOI":"10.3233\/jifs-211907","type":"journal-article","created":{"date-parts":[[2022,4,8]],"date-time":"2022-04-08T12:44:42Z","timestamp":1649421882000},"page":"159-170","source":"Crossref","is-referenced-by-count":3,"title":["Attention based sequence-to-sequence framework for auto image caption generation"],"prefix":"10.1177","volume":"43","author":[{"given":"Rashid","family":"Khan","sequence":"first","affiliation":[{"name":"National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China"}]},{"given":"M. Shujah","family":"Islam","sequence":"additional","affiliation":[{"name":"Anhui Agriculture University, Hefei, Anhui, China"}]},{"given":"Khadija","family":"Kanwal","sequence":"additional","affiliation":[{"name":"National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China"}]},{"given":"Mansoor","family":"Iqbal","sequence":"additional","affiliation":[{"name":"National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China"}]},{"given":"Md. 
Imran","family":"Hossain","sequence":"additional","affiliation":[{"name":"National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China"}]},{"given":"Zhongfu","family":"Ye","sequence":"additional","affiliation":[{"name":"National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China"}]}],"member":"179","reference":[{"issue":"6","key":"10.3233\/JIFS-211907_ref5","doi-asserted-by":"crossref","first-page":"2743","DOI":"10.1109\/TIP.2018.2889922","article-title":"Topic-oriented image captioning based on order-embedding","volume":"28","author":"Yu","year":"2018","journal-title":"IEEE Transactions on Image Processing"},{"key":"10.3233\/JIFS-211907_ref8","doi-asserted-by":"crossref","first-page":"86","DOI":"10.1016\/j.neucom.2018.12.026","article-title":"Phrase-Based Image Caption Generator with Hierarchical Lstm Network,","volume":"333","author":"Tan","year":"2019","journal-title":"Neurocomputing"},{"key":"10.3233\/JIFS-211907_ref9","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1016\/j.jvcir.2018.05.008","article-title":"Deepdiary: Lifelogging Image Captioning and Summarization","volume":"55","author":"Fan","year":"2018","journal-title":"Journal of Visual Communication and Image Representation"},{"key":"10.3233\/JIFS-211907_ref10","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1016\/j.neucom.2018.10.059","article-title":"3g Structure for Image Caption Generation,","volume":"330","author":"Yuan","year":"2019","journal-title":"Neurocomputing"},{"key":"10.3233\/JIFS-211907_ref11","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1016\/j.patrec.2018.12.018","article-title":"Leveraging Unpaired out-of-Domain Data for Image Captioning,","volume":"132","author":"Chen","year":"2020","journal-title":"Pattern Recognition 
Letters"},{"key":"10.3233\/JIFS-211907_ref12","first-page":"141","article-title":"Repeated Review Based Image Captioning for Image Evidence Review,","volume":"63","author":"Guan","year":"2018","journal-title":"Signal Processing: Image Communication"},{"key":"10.3233\/JIFS-211907_ref13","doi-asserted-by":"crossref","first-page":"229","DOI":"10.1016\/j.patrec.2017.10.018","article-title":"Image Caption Generation with Part of Speech Guidance,","volume":"119","author":"He","year":"2019","journal-title":"Pattern Recognition Letters"},{"key":"10.3233\/JIFS-211907_ref14","doi-asserted-by":"crossref","first-page":"107075","DOI":"10.1016\/j.patcog.2019.107075","article-title":"Learning visual relationship and context-aware attention for image captioning,","volume":"98","author":"Wang","year":"2020","journal-title":"Pattern Recognition"},{"key":"10.3233\/JIFS-211907_ref15","unstructured":"Liu , Xihui , Hongsheng Li , Jing Shao , Dapeng Chen and Xiaogang Wang , Show, tell and discriminate: Image captioning by self-retrieval with partially labeled data, In Proceedings of the European Conference on Computer Vision (ECCV), pp. 338\u2013354, 2018, Springer, 2015, 133\u2013140."},{"key":"10.3233\/JIFS-211907_ref16","doi-asserted-by":"crossref","unstructured":"Lin T.Y. , Maire M. , Belongie S. , Hays J. , Perona P. , Ramanan D. , Doll\u00e1r P. and Zitnick C.L. , Microsoft coco: common objects in context, In: European conference on computer vision, Springer, pp. 
740\u2013755, (2014).","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"10.3233\/JIFS-211907_ref17","first-page":"253","article-title":"Intelligent skin cancer detection mobile application using convolution neural network, (7(SI))","volume":"11","author":"Goyal","year":"2019","journal-title":"Journal of Advanced Research in Dynamical and Control Systems (JARCDS)"},{"issue":"12","key":"10.3233\/JIFS-211907_ref19","doi-asserted-by":"crossref","first-page":"2891","DOI":"10.1109\/TPAMI.2012.162","article-title":"Babytalk: Understanding and generating simple image descriptions","volume":"35","author":"Kulkarni","year":"2013","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"issue":"33","key":"10.3233\/JIFS-211907_ref24","doi-asserted-by":"crossref","first-page":"24429","DOI":"10.1007\/s11042-020-09128-6","article-title":"Image captions: global-local and joint signals attention model (GL-JSAM)","volume":"79","author":"Naqvi","year":"2020","journal-title":"Multimedia Tools and Applications"},{"key":"10.3233\/JIFS-211907_ref28","doi-asserted-by":"crossref","unstructured":"Xu, Huijuan and Kate Saenko , Ask, attend and answer: Exploring question-guided spatial attention for visual question answering, In European Conference on Computer Vision, pp. 451\u2013466. 
Springer, Cham, 2016.","DOI":"10.1007\/978-3-319-46478-7_28"},{"issue":"2","key":"10.3233\/JIFS-211907_ref29","doi-asserted-by":"crossref","first-page":"1377","DOI":"10.1007\/s00500-019-03973-w","article-title":"Novel model to integrate word embeddings and syntactic trees for automatic caption generation from images","volume":"24","author":"Zhang","year":"2020","journal-title":"Soft Computing"},{"issue":"12","key":"10.3233\/JIFS-211907_ref34","doi-asserted-by":"crossref","first-page":"2891","DOI":"10.1109\/TPAMI.2012.162","article-title":"Babytalk: Understanding and generating simple image descriptions","volume":"35","author":"Kulkarni","year":"2013","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"10.3233\/JIFS-211907_ref38","unstructured":"Lin C.Y. , Rouge: A package for automatic evaluation of summaries, in, Text summarization branches out: Proceedings of the ACL-04 workshop, 2004, vol. 8."},{"issue":"4","key":"10.3233\/JIFS-211907_ref41","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1093\/ijl\/3.4.235","article-title":"Introduction to WordNet: An on-line lexical database","volume":"3","author":"Miller","year":"1990","journal-title":"International Journal of Lexicography"},{"issue":"2","key":"10.3233\/JIFS-211907_ref46","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3432246","article-title":"A Hindi Image Caption Generation Framework Using Deep Learning","volume":"20","author":"Mishra","year":"2021","journal-title":"Transactions on Asian and Low-Resource Language Information Processing"},{"issue":"8","key":"10.3233\/JIFS-211907_ref49","doi-asserted-by":"crossref","first-page":"1053","DOI":"10.3390\/e23081053","article-title":"Application of Euler Neural Networks with Soft Computing Paradigm to Solve Nonlinear Problems Arising in Heat 
Transfer","volume":"23","author":"Khan","year":"2021","journal-title":"Entropy"},{"issue":"12","key":"10.3233\/JIFS-211907_ref50","doi-asserted-by":"crossref","first-page":"2891","DOI":"10.1109\/TPAMI.2012.162","article-title":"Babytalk: Understanding and generating simple image descriptions","volume":"35","author":"Kulkarni","year":"2013","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"issue":"9","key":"10.3233\/JIFS-211907_ref51","doi-asserted-by":"crossref","first-page":"e6157","DOI":"10.1002\/cpe.6157","article-title":"Principal component analysis, hidden Markov model, and artificial neural network inspired techniques to recognize faces","volume":"33","author":"Aggarwal","year":"2021","journal-title":"Concurrency and Computation: Practice and Experience"},{"issue":"1","key":"10.3233\/JIFS-211907_ref52","doi-asserted-by":"crossref","first-page":"1289","DOI":"10.1007\/s11042-020-09520-2","article-title":"Image surface texture analysis and classification using deep learning","volume":"80","author":"Aggarwal","year":"2021","journal-title":"Multimedia Tools and Applications"}],"container-title":["Journal of Intelligent &amp; Fuzzy Systems"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/JIFS-211907","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,30]],"date-time":"2026-01-30T12:49:35Z","timestamp":1769777375000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/JIFS-211907"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,1]]},"references-count":23,"journal-issue":{"issue":"1"},"URL":"https:\/\/doi.org\/10.3233\/jifs-211907","relation":{},"ISSN":["1064-1246","1875-8967"],"issn-type":[{"value":"1064-1246","type":"print"},{"value":"1875-8967","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,1]]}}}