{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T23:50:06Z","timestamp":1774655406035,"version":"3.50.1"},"reference-count":42,"publisher":"Springer Science and Business Media LLC","issue":"26","license":[{"start":{"date-parts":[[2023,6,17]],"date-time":"2023-06-17T00:00:00Z","timestamp":1686960000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,6,17]],"date-time":"2023-06-17T00:00:00Z","timestamp":1686960000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Kafr El Shiekh University"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput &amp; Applic"],"published-print":{"date-parts":[[2023,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Automatic captioning of images contributes to identifying features of multimedia content and helps in the detection of interesting patterns, trends, and occurrences. English image captioning has recently made incredible progress; however, Arabic image captioning is still lagging. In the field of machine learning, Arabic image-caption generation is generally a very difficult problem. This paper presents a more accurate model for Arabic image captioning by using transformer models in both the encoder and decoder phases as feature extractors from images in the encoder phase and a pre-trained word embedding model in the decoder phase. The models are demonstrated, and all of them are implemented, trained, and tested on Arabic Flickr8k datasets. 
For the image feature extraction subsystem, we compared three individual vision models (SWIN, XCIT, and ConvNexT) and their concatenations to obtain the most expressive feature vector for an image, and for the caption generation subsystem, we tested four pre-trained language embedding models (ARABERT, ARAELECTRA, MARBERTv2, and CamelBERT) to select the most accurate one. Our experiments showed that an Arabic image captioning system that concatenates the three transformer-based models (ConvNexT, SWIN, and XCIT) as the image feature extractor and uses the CamelBERT language embedding model produces the best BLEU-1 result among all combinations, scoring 0.5980, while ConvNexT concatenated with SWIN together with the ARAELECTRA language embedding model scores 0.1664 with BLEU-4; both are higher than the previously reported values of 0.443 and 0.157.<\/jats:p>","DOI":"10.1007\/s00521-023-08744-1","type":"journal-article","created":{"date-parts":[[2023,6,17]],"date-time":"2023-06-17T17:01:36Z","timestamp":1687021296000},"page":"19051-19067","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["Improved Arabic image captioning model using feature concatenation with pre-trained word embedding"],"prefix":"10.1007","volume":"35","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2187-0174","authenticated-orcid":false,"given":"Samar","family":"Elbedwehy","sequence":"first","affiliation":[]},{"given":"T.","family":"Medhat","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,6,17]]},"reference":[{"key":"8744_CR1","doi-asserted-by":"publisher","DOI":"10.1109\/TITS.2022.3212921","author":"A Amirkhani","year":"2022","unstructured":"Amirkhani A, Barshooi AH (2022) DeepCar 5.0: vehicle make and model recognition 
under challenging conditions. IEEE Trans Intell Transp Syst. https:\/\/doi.org\/10.1109\/TITS.2022.3212921","journal-title":"IEEE Trans Intell Transp Syst"},{"key":"8744_CR2","doi-asserted-by":"publisher","first-page":"103326","DOI":"10.1016\/j.bspc.2021.103326","volume":"72","author":"AH Barshooi","year":"2022","unstructured":"Barshooi AH, Amirkhani A (2022) A novel data augmentation based on Gabor filter and convolutional deep learning for improving the classification of COVID-19 chest X-Ray images. Biomed Signal Process Control 72:103326","journal-title":"Biomed Signal Process Control"},{"key":"8744_CR3","doi-asserted-by":"crossref","unstructured":"ElJundi O, Dhaybi M, Mokadam K, Hajj HM and Asmar DC (2020) Resources and end-to-end neural network models for arabic image captioning In: VISIGRAPP (5: VISAPP), pp. 233\u2013241","DOI":"10.5220\/0008881202330241"},{"key":"8744_CR4","doi-asserted-by":"crossref","unstructured":"Attai A and Elnagar A (2020) A survey on arabic image captioning systems using deep learning models In: 14th international conference on innovations in information technology (IIT), pp. 114\u2013119.","DOI":"10.1109\/IIT50501.2020.9299027"},{"key":"8744_CR5","volume-title":"Arabic image captioning using deep learning with attention","author":"S Monaf","year":"2021","unstructured":"Monaf S (2021) Arabic image captioning using deep learning with attention. University of Georgia, Georgia."},{"key":"8744_CR6","unstructured":"Tan M and Le Q (2019) Efficientnet: Rethinking model scaling for convolutional neural networks In: International conference on machine learning. PMLR, pp. 6105\u20136114."},{"key":"8744_CR7","doi-asserted-by":"crossref","unstructured":"Sandler M, Howard A, Zhu M, Zhmoginov A and Chen L-C (2018) Mobilenetv2: Inverted residuals and linear bottlenecks In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 
4510\u20134520.","DOI":"10.1109\/CVPR.2018.00474"},{"key":"8744_CR8","unstructured":"Bahdanau D, Cho K and Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473."},{"key":"8744_CR9","doi-asserted-by":"crossref","unstructured":"Luong M-T, Pham H and Manning CD (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.","DOI":"10.18653\/v1\/D15-1166"},{"key":"8744_CR10","unstructured":"Davydova O (2018) Text preprocessing in Python: Steps, tools, and examples. Data Monsters."},{"key":"8744_CR11","first-page":"1","volume":"36","author":"W Saad","year":"2021","unstructured":"Saad W, Shalaby WA, Shokair M, El-Samie FA, Dessouky M, Abdellatef E (2021) COVID-19 classification using deep feature concatenation technique. J Ambient Intell Humaniz Comput 36:1\u201319","journal-title":"J Ambient Intell Humaniz Comput"},{"key":"8744_CR12","first-page":"20014","volume":"34","author":"A Alaaeldin","year":"2021","unstructured":"Alaaeldin A, Touvron H, Caron M, Bojanowski P, Douze M, Joulin A, Laptev I et al (2021) Xcit: cross-covariance image transformers. Adv Neural Inf Process Syst 34:20014\u201320027","journal-title":"Adv Neural Inf Process Syst"},{"key":"8744_CR13","doi-asserted-by":"crossref","unstructured":"Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S and Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp. 10012\u201310022.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"8744_CR14","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929."},{"key":"8744_CR15","doi-asserted-by":"crossref","unstructured":"Liu Z, Mao H, Wu C-Y, Feichtenhofer C, Darrell T and Xie S (2022) A convnet for the 2020s In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 11976\u201311986.","DOI":"10.1109\/CVPR52688.2022.01167"},{"key":"8744_CR16","doi-asserted-by":"crossref","unstructured":"Tarj\u00e1n B, Szasz\u00e1k G, Fegy\u00f3 T and Mihajlik P (2019) Investigation on N-gram approximated RNNLMs for recognition of morphologically rich speech In: International conference on statistical language and speech processing. Springer, Cham, pp. 223\u2013234.","DOI":"10.1007\/978-3-030-31372-2_19"},{"key":"8744_CR17","doi-asserted-by":"crossref","unstructured":"Vinyals O, Toshev A, Bengio S and Erhan D (2015) Show and tell: A neural image caption generator In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156\u20133164.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"8744_CR18","unstructured":"Antoun W, Baly F and Hajj H (2020) Arabert: Transformer-based model for arabic language understanding. arXiv preprint arXiv:2003.00104."},{"key":"8744_CR19","unstructured":"Antoun W, Baly F and Hajj H (2020) AraELECTRA: pre-training text discriminators for Arabic language understanding. arXiv preprint arXiv:2012.15516."},{"key":"8744_CR20","doi-asserted-by":"crossref","unstructured":"Abdul-Mageed M, Elmadany A, and Nagoudi EMB (2020) ARBERT & MARBERT: deep bidirectional transformers for Arabic. arXiv preprint arXiv:2101.01785 (2020).","DOI":"10.18653\/v1\/2021.acl-long.551"},{"key":"8744_CR21","unstructured":"Inoue G, Alhafni B, Baimukan N, Bouamor H and Habash N (2021) The interplay of variant, size, and task type in Arabic pre-trained language models. 
arXiv preprint arXiv:2103.06678."},{"issue":"5","key":"8744_CR22","first-page":"2313","volume":"44","author":"Xu Yang","year":"2020","unstructured":"Yang Xu, Zhang H, Cai J (2020) Auto-encoding and distilling scene graphs for image captioning. IEEE Trans Pattern Anal Mach Intell 44(5):2313\u20132327","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"8744_CR23","doi-asserted-by":"crossref","unstructured":"Li Z, Tran Q, Mai L, Lin Z and Yuille AL (2020) Context-aware group captioning via self-attention and contrastive features In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 3440\u20133450.","DOI":"10.1109\/CVPR42600.2020.00350"},{"key":"8744_CR24","doi-asserted-by":"crossref","unstructured":"Cornia M, Stefanini M, Baraldi L and Cucchiara R (2020) Meshed-memory transformer for image captioning In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp. 10.578\u201310.587.","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"8744_CR25","unstructured":"Common objects in context, Retrieved from. https:\/\/cocodataset.org\/."},{"key":"8744_CR26","unstructured":"Hu X, Yin X, Lin K, Wang L, Zhang L, Gao J and Liu Z (2020) Vivo: Surpassing human performance in novel object captioning with visual vocabulary pre-training. arXivpreprint arXiv:2009.13682."},{"key":"8744_CR27","doi-asserted-by":"publisher","first-page":"1775","DOI":"10.1109\/TMM.2021.3072479","volume":"24","author":"L Yu","year":"2021","unstructured":"Yu L, Zhang J, Qiang Wu (2021) Dual attention on pyramid feature maps for image captioning. IEEE Trans Multim 24:1775\u20131786","journal-title":"IEEE Trans Multim"},{"key":"8744_CR28","unstructured":"Chen Q, Deng C and Wu Q (2022) Learning distinct and representative modes for image captioning. 
arXiv preprint arXiv:2209.08231."},{"key":"8744_CR29","doi-asserted-by":"crossref","unstructured":"Y\u0131lmaz BD, Demir AE, S\u00f6nmez EB and Y\u0131ld\u0131z T (2019) Image captioning in turkish language In: 2019 innovations in intelligent systems and applications conference (ASYU), pp. 1\u20135. IEEE.","DOI":"10.1109\/ASYU48272.2019.8946358"},{"key":"8744_CR30","doi-asserted-by":"crossref","unstructured":"Zhang B, Zhou L, Song S, Chen L, Jiang Z and Zhang J (2020) Image captioning in chinese and its application for children with autism spectrum disorder In: Proceedings of the 2020 12th international conference on machine learning and computing, pp. 426\u2013432.","DOI":"10.1145\/3383972.3384072"},{"issue":"2","key":"8744_CR31","doi-asserted-by":"publisher","first-page":"2375","DOI":"10.1145\/3432246","volume":"20","author":"SK Mishra","year":"2021","unstructured":"Mishra SK, Dhir R, Saha S, Bhattacharyya P (2021) A hindi image caption generation framework using deep learning. ACM Trans Asian Low Resour Lang Inf Process 20(2):2375\u20134699. https:\/\/doi.org\/10.1145\/3432246","journal-title":"ACM Trans Asian Low Resour Lang Inf Process"},{"key":"8744_CR32","unstructured":"Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R and Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention In: International conference on machine learning, pp. 2048\u20132057"},{"issue":"1","key":"8744_CR33","first-page":"1","volume":"17","author":"H Lu","year":"2021","unstructured":"Lu H, Yang R, Deng Z, Zhang Y, Gao G, Lan R (2021) Chinese image captioning via fuzzy attention-based DenseNet-BiLSTM. 
ACM Trans Multim Comput Commun Appl TOMM 17(1):1\u201318","journal-title":"ACM Trans Multim Comput Commun Appl TOMM"},{"key":"8744_CR34","doi-asserted-by":"publisher","unstructured":"Wu J, Zheng H, Zhao B, Li Y, Yan B, Liang R, Wang W, Zhou S, Lin G, Fu Y, Wang Y and Wang Y (2017) Ai challenger: A large-scale dataset for going deeper in image understanding. https:\/\/doi.org\/10.1109\/ICME.2019.00256","DOI":"10.1109\/ICME.2019.00256"},{"key":"8744_CR35","doi-asserted-by":"crossref","unstructured":"Jindal V (2017) A deep learning approach for arabic caption generation using roots-words In: Proceedings of the AAAI Conference on Artificial Intelligence 31: 2374\u20133468.","DOI":"10.1609\/aaai.v31i1.11090"},{"key":"8744_CR36","doi-asserted-by":"crossref","unstructured":"Jindal V (2018) Generating image captions in arabic using root-word based recurrent neural networks and deep neural networks In: Proceedings of the AAAI conference on artificial intelligence 32: 2374\u20133468.","DOI":"10.1609\/aaai.v32i1.12179"},{"issue":"6","key":"8744_CR37","first-page":"7","volume":"9","author":"HA Al-Muzaini","year":"2018","unstructured":"Al-Muzaini HA, Al-Yahya TN, Benhidour H (2018) Automatic arabic image captioning using rnn-lstm-based language model and cnn. Int J Adv Comput Sci Appl 9(6):7","journal-title":"Int J Adv Comput Sci Appl"},{"key":"8744_CR38","unstructured":"Emami J, Nugues P, Elnagar A and Afyouni I (2022) Arabic image captioning using pre-training of deep bidirectional transformers In: Proceedings of the 15th international conference on natural language generation, pp. 40\u201351."},{"issue":"7","key":"8744_CR39","first-page":"11","volume":"13","author":"MT Lasheen","year":"2022","unstructured":"Lasheen MT, Barakat NH (2022) Arabic image captioning: the effect of text pre-processing on the attention weights and the BLEU-N scores. 
Int J Adv Comput Sci Appl 13(7):11","journal-title":"Int J Adv Comput Sci Appl"},{"key":"8744_CR40","unstructured":"Hodosh M, Young P and Hockenmaier J (2021) Flickr8k dataset."},{"key":"8744_CR41","doi-asserted-by":"crossref","unstructured":"Kilickaya M, Erdem A, Ikizler-Cinbis N and Erdem E (2017) Re-evaluating automatic metrics for image captioning In: EACL.","DOI":"10.18653\/v1\/E17-1019"},{"key":"8744_CR42","doi-asserted-by":"crossref","unstructured":"Anderson P, Fernando B, Johnson M and Gould S (2016) Spice: semantic propositional image caption evaluation In: European conference on computer vision. Springer, Cham, pp. 382\u2013398.","DOI":"10.1007\/978-3-319-46454-1_24"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-023-08744-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-023-08744-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-023-08744-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,8,14]],"date-time":"2023-08-14T15:21:59Z","timestamp":1692026519000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-023-08744-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,17]]},"references-count":42,"journal-issue":{"issue":"26","published-print":{"date-parts":[[2023,9]]}},"alternative-id":["8744"],"URL":"https:\/\/doi.org\/10.1007\/s00521-023-08744-1","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"value":"0941-0643","type":"print"},{"value":"1433-3058","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,17]]},"assertion":[{"value
":"24 September 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"31 May 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 June 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflicts of interest to report regarding the present study.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}