{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:25:19Z","timestamp":1760232319488,"version":"build-2065373602"},"reference-count":57,"publisher":"MDPI AG","issue":"21","license":[{"start":{"date-parts":[[2022,11,1]],"date-time":"2022-11-01T00:00:00Z","timestamp":1667260800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>In image captioning models, the main challenge in describing an image is identifying all the objects, precisely capturing the relationships between them, and producing varied captions. Over the past few years, many methods have been proposed, ranging from attribute-to-attribute comparison approaches to techniques that handle semantics and the relationships among them. Despite these improvements, the existing techniques still inadequately capture the positional and geometrical attributes of objects. The reason is that most of the abovementioned approaches depend on Convolutional Neural Networks (CNNs) for object detection, and CNNs are notorious for failing to model equivariance and rotational relationships between objects. Moreover, the pooling layers in CNNs discard valuable spatial information. Inspired by recent successful approaches, this paper introduces a novel framework for extracting meaningful descriptions based on a parallelized capsule network that describes the content of an image through a high-level understanding of its semantic content. The main contribution of this paper is a new method that not only overcomes the limitations of CNNs but also generates descriptions with a wide vocabulary by using Wikipedia. 
In our framework, capsules focus on the generation of meaningful descriptions with more detailed spatial and geometrical attributes for a given set of images by considering the position of the entities as well as their relationships. Qualitative experiments on the benchmark dataset MS-COCO show that our framework outperforms state-of-the-art image captioning models when describing the semantic content of the images.<\/jats:p>","DOI":"10.3390\/s22218376","type":"journal-article","created":{"date-parts":[[2022,11,2]],"date-time":"2022-11-02T08:15:12Z","timestamp":1667376912000},"page":"8376","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Caps Captioning: A Modern Image Captioning Approach Based on Improved Capsule Network"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3027-5895","authenticated-orcid":false,"given":"Shima","family":"Javanmardi","sequence":"first","affiliation":[{"name":"Computer Engineering Department, Yazd University, Yazd P.O. Box 8915818411, Iran"},{"name":"Section Imaging and Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands"}]},{"given":"Ali","family":"Latif","sequence":"additional","affiliation":[{"name":"Computer Engineering Department, Yazd University, Yazd P.O. Box 8915818411, Iran"}]},{"given":"Mohammad","family":"Sadeghi","sequence":"additional","affiliation":[{"name":"Electrical Engineering Department, Yazd University, Yazd P.O. 
Box 89195741, Iran"}]},{"given":"Mehrdad","family":"Jahanbanifard","sequence":"additional","affiliation":[{"name":"Section Imaging and Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3746-3618","authenticated-orcid":false,"given":"Marcello","family":"Bonsangue","sequence":"additional","affiliation":[{"name":"Section Imaging and Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands"}]},{"given":"Fons","family":"Verbeek","sequence":"additional","affiliation":[{"name":"Section Imaging and Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands"}]}],"member":"1968","published-online":{"date-parts":[[2022,11,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1016\/j.neucom.2019.12.073","article-title":"Multi-Attention Generative Adversarial Network for image captioning","volume":"387","author":"Wei","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_2","first-page":"4","article-title":"Caption recommendation system","volume":"2","author":"Asawa","year":"2021","journal-title":"United Int. J. Res. Technol."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"103334","DOI":"10.1016\/j.autcon.2020.103334","article-title":"Manifesting construction activity scenes via image captioning","volume":"119","author":"Liu","year":"2020","journal-title":"Autom. 
Constr."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"107075","DOI":"10.1016\/j.patcog.2019.107075","article-title":"Learning visual relationship and context-aware attention for image captioning","volume":"98","author":"Wang","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3295748","article-title":"A Comprehensive Survey of Deep Learning for Image Captioning","volume":"51","author":"Hossain","year":"2019","journal-title":"ACM Comput. Surv."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"664","DOI":"10.1109\/TPAMI.2016.2598339","article-title":"Deep Visual-Semantic Alignments for Generating Image Descriptions","volume":"39","author":"Karpathy","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"291","DOI":"10.1016\/j.neucom.2018.05.080","article-title":"A survey on automatic image caption generation","volume":"311","author":"Bai","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_8","first-page":"123","article-title":"A survey of evolution of image captioning techniques","volume":"14","author":"Kumar","year":"2018","journal-title":"Int. J. Hybrid Intell. Syst."},{"key":"ref_9","unstructured":"Sabour, S., Frosst, N., and Hinton, G.E. (2017, January 4\u20139). Dynamic routing between capsules. Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1865","DOI":"10.1007\/s40747-021-00347-4","article-title":"ResCaps: An improved capsule network and its application in ultrasonic image classification of thyroid papillary carcinoma","volume":"8","author":"Ai","year":"2022","journal-title":"Complex Intell. Syst."},{"key":"ref_11","unstructured":"Hinton, G.E., Sabour, S., and Frosst, N. (May, January 30). Matrix capsules with EM routing. 
Proceedings of the ICLR 2018: 6th International Conference on Learning Representations, Vancouver, BC, Canada."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002, January 7\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_13","unstructured":"Lin, C.-Y. (2004). Text Summarization Branches Out, Association for Computational Linguistics."},{"key":"ref_14","unstructured":"Banerjee, S., and Lavie, A. (2005, January 29). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Ann Arbor, MI, USA."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014, January 6\u201312). Microsoft coco: Common objects in context. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"853","DOI":"10.1613\/jair.3994","article-title":"Framing image description as a ranking task: Data, models and evaluation metrics","volume":"47","author":"Hodosh","year":"2013","journal-title":"J. Artif. Intell. Res."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1162\/tacl_a_00166","article-title":"From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions","volume":"2","author":"Young","year":"2014","journal-title":"Trans. Assoc. Comput. 
Linguistics"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. (2010, January 5\u201311). Every picture tells a story: Generating sentences from images. Proceedings of the European Conference on Computer Vision, Heraklion, Crete, Greece.","DOI":"10.1007\/978-3-642-15561-1_2"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"2891","DOI":"10.1109\/TPAMI.2012.162","article-title":"BabyTalk: Understanding and Generating Simple Image Descriptions","volume":"35","author":"Kulkarni","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_20","unstructured":"Li, S., Kulkarni, G., Berg, T., Berg, A., and Choi, Y. (2011, January 23\u201324). Composing simple image descriptions using web-scale n-grams. Proceedings of the Fifteenth Conference on Computational Natural Language Learning, Portland, OR, USA."},{"key":"ref_21","unstructured":"Jin, J., Fu, K., Cui, R., Sha, F., and Zhang, C. (2015). Aligning where to see and what to tell: Image caption with region-based attention and scene factorization. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1162\/tacl_a_00188","article-title":"TreeTalk: Composition and Compression of Trees for Image Descriptions","volume":"2","author":"Kuznetsova","year":"2014","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_23","unstructured":"Kuznetsova, P., Ordonez, V., Berg, A., Berg, T., and Choi, Y. (2013, January 4\u20139). Generalizing image captions for image-text parallel corpus. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria."},{"key":"ref_24","first-page":"1143","article-title":"Im2text: Describing images using 1 million captioned photographs","volume":"24","author":"Ordonez","year":"2011","journal-title":"Adv. Neural Inf. 
Process Syst."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1367","DOI":"10.1109\/TPAMI.2017.2708709","article-title":"Image Captioning and Visual Question Answering Based on Attributes and External Knowledge","volume":"40","author":"Wu","year":"2018","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. (2017, January 7\u201312). Self-critical sequence training for image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2017.131"},{"key":"ref_27","unstructured":"Kiros, R., Salakhutdinov, R., and Zemel, R. (2014, January 21\u201326). Multimodal neural language models. Proceedings of the International Conference on Machine Learning, Beijing, China."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Mason, R., and Charniak, E. (2014, January 22\u201327). Nonparametric method for data-driven image captioning. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, MD, USA.","DOI":"10.3115\/v1\/P14-2097"},{"key":"ref_29","unstructured":"Devlin, J., Gupta, S., Girshick, R., Mitchell, M., and Zitnick, C.L. (2015). Exploring nearest neighbor approaches for image captioning. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_31","first-page":"2085","article-title":"Phrase-based image captioning","volume":"37","author":"Lebret","year":"2015","journal-title":"Int. Conf. Mach. 
Learn."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"You, Q., Jin, H., Wang, Z., Fang, C., and Luo, J. (2016, January 7\u201312). Image captioning with semantic attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2016.503"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Johnson, J., Karpathy, A., and Fei-Fei, L. (2016, January 7\u201312). Densecap: Fully convolutional localization networks for dense captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2016.494"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"50565","DOI":"10.1109\/ACCESS.2020.2980578","article-title":"ATT-BM-SOM: A Framework of Effectively Choosing Image Information and Optimizing Syntax for Image Captioning","volume":"8","author":"Yang","year":"2020","journal-title":"IEEE Access"},{"key":"ref_35","unstructured":"Martens, D., and Provost, F. (2011). Pseudo-Social Network Targeting from Consumer Transaction Data, University of Antwerp."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"64918","DOI":"10.1109\/ACCESS.2021.3075579","article-title":"Text to Image Synthesis for Improved Image Captioning","volume":"9","author":"Hossain","year":"2021","journal-title":"IEEE Access"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"5241","DOI":"10.1109\/TIP.2019.2917229","article-title":"Self-Guiding Multimodal LSTM\u2014When We Do Not Have a Perfect Training Dataset for Image Captioning","volume":"28","author":"Xian","year":"2019","journal-title":"IEEE Trans. 
Image Process"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"107329","DOI":"10.1016\/j.sigpro.2019.107329","article-title":"Image captioning via hierarchical attention mechanism and policy gradient optimization","volume":"167","author":"Yan","year":"2019","journal-title":"Signal Process"},{"key":"ref_39","first-page":"1295","article-title":"Capsule networks\u2013a survey","volume":"34","author":"Patrick","year":"2022","journal-title":"J. King Saud Univ. Inf. Sci."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"100380","DOI":"10.1109\/ACCESS.2021.3096550","article-title":"Detection of Mulberry Ripeness Stages Using Deep Learning Models","volume":"9","author":"Ashtiani","year":"2021","journal-title":"IEEE Access"},{"key":"ref_42","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Mandal, B., Ghosh, S., Sarkhel, R., Das, N., and Nasipuri, M. (2019, January 25\u201328). Using dynamic routing to extract intermediate features for developing scalable capsule networks. Proceedings of the 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), Sikkim, India.","DOI":"10.1109\/ICACCP.2019.8883020"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Albawi, S., Mohammed, T.A., and Al-Zawi, S. (2017, January 21\u201323). Understanding of a convolutional neural network. 
Proceedings of the 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey.","DOI":"10.1109\/ICEngTechnol.2017.8308186"},{"key":"ref_45","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Karpathy, A., and Fei-Fei, L. (2015, January 7\u201312). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Aneja, J., Deshpande, A., and Schwing, A.G. (2018, January 7\u201312). Convolutional image captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2018.00583"},{"key":"ref_48","unstructured":"Tan, J.H., Chan, C.S., and Chuah, J.H. (2019). Image Captioning with Sparse Recurrent Neural Network. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"16267","DOI":"10.1007\/s11042-020-08832-7","article-title":"Tell and guess: Cooperative learning for natural image caption generation with hierarchical refined attention","volume":"80","author":"Zhang","year":"2020","journal-title":"Multimedia Tools Appl."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"4467","DOI":"10.1109\/TCSVT.2019.2947482","article-title":"Multimodal Transformer with Multi-View Visual Representation for Image Captioning","volume":"30","author":"Yu","year":"2019","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Lu, J., Xiong, C., Parikh, D., and Socher, R. (2017, January 7\u201312). Knowing when to look: Adaptive attention via a visual sentinel for image captioning. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2017.345"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, January 7\u201312). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Jiang, W., Ma, L., Jiang, Y.-G., Liu, W., and Zhang, T. (2018, January 8\u201314). Recurrent fusion network for image captioning. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01216-8_31"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Choi, W.-H., and Choi, Y.-S. (2022). Effective Pre-Training Method and Its Compositional Intelligence for Image Captioning. Sensors, 22.","DOI":"10.3390\/s22093433"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Brazdil, P., van Rijn, J.N., Soares, C., and Vanschoren, J. (2022). Metalearning, Springer.","DOI":"10.1007\/978-3-030-67024-5"},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"105246","DOI":"10.1016\/j.jobe.2022.105246","article-title":"Vision-based concrete crack detection using a hybrid framework considering noise effect","volume":"61","author":"Yu","year":"2022","journal-title":"J. Build. Eng."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Wang, Q., and Chan, A.B. (2019, January 7\u201312). Describing like humans: On diversity in image captioning. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2019.00432"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/21\/8376\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:07:15Z","timestamp":1760144835000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/21\/8376"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,1]]},"references-count":57,"journal-issue":{"issue":"21","published-online":{"date-parts":[[2022,11]]}},"alternative-id":["s22218376"],"URL":"https:\/\/doi.org\/10.3390\/s22218376","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2022,11,1]]}}}