{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T19:41:24Z","timestamp":1774381284597,"version":"3.50.1"},"reference-count":44,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2019,11,15]],"date-time":"2019-11-15T00:00:00Z","timestamp":1573776000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>With the rapid growth of deep learning technologies, automatic image description generation has become an interesting problem in computer vision and natural language generation. It improves access to photo collections on social media and provides guidance for visually impaired people. Deep neural networks currently play a vital role in computer vision and natural language processing tasks. The main objective of this work is to generate a grammatically correct description of an image using the semantics of the trained captions. An encoder-decoder framework based on deep neural networks implements the image description generation task: the encoder is an image parsing module, and the decoder is a surface realization module. The framework uses a Densely Connected Convolutional Network (DenseNet) for image encoding and a Bidirectional Long Short-Term Memory (BLSTM) network for language modeling; the outputs are fed to the BLSTM caption generator, which is trained to optimize the log-likelihood of the target description of the image. Most existing image captioning works use RNNs and LSTMs for language modeling. RNNs are computationally expensive and have limited memory, and an LSTM processes its input in only one direction; a BLSTM avoids both of these problems. 
In this work, the best combination of words during caption generation is selected using beam search and a game-theoretic search; the results show that the game-theoretic search outperforms beam search. The model was evaluated on the standard benchmark dataset Flickr8k, with the Bilingual Evaluation Understudy (BLEU) score taken as the evaluation measure of the system. A new evaluation measure called GCorrect was used to check the grammatical correctness of the descriptions. The proposed model achieves clear improvements over previous methods on the Flickr8k dataset, producing grammatically correct sentences for images with a GCorrect of 0.040625 and a BLEU score of 69.96%.<\/jats:p>","DOI":"10.3390\/info10110354","type":"journal-article","created":{"date-parts":[[2019,11,15]],"date-time":"2019-11-15T11:25:56Z","timestamp":1573817156000},"page":"354","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["Dense Model for Automatic Image Description Generation with Game Theoretic Optimization"],"prefix":"10.3390","volume":"10","author":[{"given":"Sreela","family":"S R","sequence":"first","affiliation":[{"name":"Department of Computer Science, Cochin University of Science and Technology, Kochi, Kerala 682022, India"}]},{"given":"Sumam Mary","family":"Idicula","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Cochin University of Science and Technology, Kochi, Kerala 682022, India"}]}],"member":"1968","published-online":{"date-parts":[[2019,11,15]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Mikolov, T., Karafi\u00e1t, M., Burget, L., \u010cernock\u00fd, J., and Khudanpur, S. (2010, January 26\u201330). Recurrent neural network based language model. Proceedings of the Conference of the International Speech Communication Association, Makuhari, Chiba, Japan. 
DBLP.","DOI":"10.21437\/Interspeech.2010-343"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_4","unstructured":"Karpathy, A., Joulin, A., and Fei-Fei, L. (2014). Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. Adv. Neural Inf. Process. Syst."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"409","DOI":"10.1613\/jair.4900","article-title":"Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures","volume":"55","author":"Bernardi","year":"2016","journal-title":"J. Artif. Intell. Res. (JAIR)"},{"key":"ref_6","unstructured":"Mitchell, M., Dodge, J., Goyal, A., Yamaguchi, K., Stratos, K., Mensch, A., Berg, A., Han, X., Berg, T., and Health, O. (2012, January 23\u201327). Midge: Generating Image Descriptions From Computer Vision Detections. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"2891","DOI":"10.1109\/TPAMI.2012.162","article-title":"Baby talk: Understanding and generating simple image descriptions","volume":"35","author":"Kulkarni","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_8","unstructured":"Ordonez, V., Kulkarni, G., and Berg, T.L. (2011). Im2text: Describing images using 1 million captioned photographs. Adv. 
Neural Inf., 1143\u20131151."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"853","DOI":"10.1613\/jair.3994","article-title":"Framing image description as a ranking task: Data, models and evaluation metrics","volume":"47","author":"Hodosh","year":"2013","journal-title":"J. Artif. Intell. Res."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1162\/tacl_a_00177","article-title":"Grounded Compositional Semantics for Finding and Describing Images with Sentences","volume":"2","author":"Socher","year":"2014","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Lu, J., Xiong, C., Parikh, D., and Socher, R. (2017, January 21\u201326). Knowing when to look: Adaptive attention via a visual sentinel for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.345"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. (2010). Every picture tells a story: Generating sentences from images. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer. 6314 LNCS (PART 4).","DOI":"10.1007\/978-3-642-15561-1_2"},{"key":"ref_13","unstructured":"Kiros, R., Salakhutdinov, R., and Zemel, R. (2014, January 21\u201326). Multimodal neural language models. Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, China."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Gong, Y., Wang, L., Hodosh, M., Hockenmaier, J., and Lazebnik, S. (2014). Improving image-sentence embeddings using large weakly annotated photo collections. 
European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-10593-2_35"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Karpathy, A., and Li, F.-F. (2015, January 7\u201312). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"ref_16","unstructured":"Donnelly, C. (2016). Image Caption Generation with Recursive Neural Networks, Department of Electrical Engineering, Stanford University."},{"key":"ref_17","unstructured":"Soh, M. (2016). Learning CNN-LSTM Architectures for Image Caption Generation, Dept. Comput. Sci., Stanford Univ."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wang, C., Yang, H., Bartz, C., and Meinel, C. (2016, January 6\u20139). Image captioning with deep bidirectional LSTMs. Proceedings of the 2016 ACM on Multimedia Conference, New York, NY, USA.","DOI":"10.1145\/2964284.2964299"},{"key":"ref_19","unstructured":"You, Q., Jin, H., Wang, Z., Fang, C., and Luo, J. (July, January 26). Image captioning with semantic attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, January 18\u201322). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Poghosyan, A., and Sarukhanyan, H. (2017, January 25\u201329). Short-term memory with read-only unit in neural image caption generator. 
Proceedings of the 2017 Computer Science and Information Technologies (CSIT), Yerevan, Armenia.","DOI":"10.1109\/CSITechnol.2017.8312163"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Aneja, J., Deshpande, A., and Schwing, A.G. (2018, January 18\u201322). Convolutional image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00583"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Chen, F., Ji, R., Sun, X., Wu, Y., and Su, J. (2018, January 18\u201322). Groupcap: Group-based image captioning with structured relevance and diversity constraints. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00146"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Tan, Y.H., and Chan, C.S. (2017). phi-LSTM: A Phrase-Based Hierarchical LSTM Model for Image Captioning, Springer International Publishing.","DOI":"10.1007\/978-3-319-54193-8_7"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"6143","DOI":"10.1007\/s10586-018-1885-9","article-title":"Fast image captioning using LSTM","volume":"22","author":"Han","year":"2019","journal-title":"Cluster Comput."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1007\/s11063-018-9807-7","article-title":"Image captioning with text-based visual attention","volume":"49","author":"He","year":"2019","journal-title":"Neural Process. Lett."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Zeiler, M.D., and Rob, F. (2014). Visualizing and understanding convolutional networks. 
European Conference on Computer Vision, Springer International Publishing.","DOI":"10.1007\/978-3-319-10590-1_53"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_29","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G. (2012, January 3\u20138). ImageNet Classification with Deep Convolutional Neural Networks. Proceedings of the NIPS, Lake Tahoe, NV, USA."},{"key":"ref_30","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very Deep Convolutional Networks for Large-Scale Image Recognition. Proceedings of the ICLR, San Diego, CA, USA."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201313). Going deeper with convolutions. Proceedings of the CVPR, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_32","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2014, January 23\u201328). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA."},{"key":"ref_33","unstructured":"Srivastava, R.K., Greff, K., and Schmidhuber, J. (2015). Highway networks. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2016). Densely connected convolutional networks. arXiv.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.C. (2018, January 18\u201322). Mobilenetv2: Inverted residuals and linear bottlenecks. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00474"},{"key":"ref_36","unstructured":"Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., and Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv."},{"key":"ref_37","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014, January 25\u201329). Glove: Global vectors for word representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_39","unstructured":"Joulin, A., Grave, E., Bojanowski, P., Douze, M., J\u00e9gou, H., and Mikolov, T. (2016). Fasttext.zip: Compressing text classification models. arXiv."},{"key":"ref_40","unstructured":"Von Neumann, J., and Morgenstern, O. (1953). Theory of Games and Economic Behavior, Princeton University Press. Copyright 1944."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"86","DOI":"10.1016\/j.neucom.2012.05.001","article-title":"Using cooperative game theory to optimize the feature selection problem","volume":"97","author":"Sun","year":"2012","journal-title":"Neurocomputing"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 7\u201312). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_43","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). 
Show, attend and tell: Neural image caption generation with visual attention. Int. Conf. Mach. Learn."},{"key":"ref_44","unstructured":"Tan, Y.H., and Chan, C.S. (2017). Phrase-based Image Captioning with Hierarchical LSTM Model. arXiv."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/10\/11\/354\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:34:55Z","timestamp":1760189695000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/10\/11\/354"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,15]]},"references-count":44,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2019,11]]}},"alternative-id":["info10110354"],"URL":"https:\/\/doi.org\/10.3390\/info10110354","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,11,15]]}}}