{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T17:14:07Z","timestamp":1740158047823,"version":"3.37.3"},"reference-count":55,"publisher":"Springer Science and Business Media LLC","issue":"9","license":[{"start":{"date-parts":[[2024,7,2]],"date-time":"2024-07-02T00:00:00Z","timestamp":1719878400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,2]],"date-time":"2024-07-02T00:00:00Z","timestamp":1719878400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Ambient Intell Human Comput"],"published-print":{"date-parts":[[2024,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Image caption generation has emerged as a remarkable development that bridges the gap between Natural Language Processing (NLP) and Computer Vision (CV). It lies at the intersection of these fields and presents unique challenges, particularly when dealing with low-resource languages such as Urdu. Limited research on basic Urdu language understanding necessitates further exploration in this domain. In this study, we propose three Seq2Seq-based architectures specifically tailored for Urdu image caption generation. Our approach involves leveraging transformer models to generate captions in Urdu, a significantly more challenging task than English. To facilitate the training and evaluation of our models, we created an Urdu-translated subset of the flickr8k dataset, which contains images featuring dogs in action accompanied by corresponding Urdu captions. Our designed models encompassed a deep learning-based approach, utilizing three different architectures: Convolutional Neural Network (CNN) + Long Short-term Memory (LSTM) with Soft attention employing word2Vec embeddings, CNN+Transformer, and Vit+Roberta models. Experimental results demonstrate that our proposed model outperforms existing state-of-the-art approaches, achieving 86 BLEU-1 and 90 BERT-F1 scores. The generated Urdu image captions exhibit syntactic, contextual, and semantic correctness. Our study highlights the inherent challenges associated with retraining models on low-resource languages. Our findings highlight the potential of pre-trained models for facilitating the development of NLP and CV applications in low-resource language settings.<\/jats:p>","DOI":"10.1007\/s12652-024-04824-9","type":"journal-article","created":{"date-parts":[[2024,7,2]],"date-time":"2024-07-02T13:02:19Z","timestamp":1719925339000},"page":"3441-3457","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["A transformer-based Urdu image caption generation"],"prefix":"10.1007","volume":"15","author":[{"given":"Muhammad","family":"Hadi","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Iqra","family":"Safder","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hajra","family":"Waheed","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Farooq","family":"Zaman","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Naif Radi","family":"Aljohani","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Raheel","family":"Nawaz","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Saeed Ul","family":"Hassan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0640-807X","authenticated-orcid":false,"given":"Raheem","family":"Sarwar","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,7,2]]},"reference":[{"issue":"6","key":"4824_CR1","doi-asserted-by":"publisher","first-page":"7719","DOI":"10.1007\/s12652-023-04584-y","volume":"14","author":"MK Afzal","year":"2023","unstructured":"Afzal MK, Shardlow M, Tuarob S et al (2023) Generative image captioning in Urdu using deep learning. J Ambient Intell Humaniz Comput 14(6):7719\u201331","journal-title":"J Ambient Intell Humaniz Comput"},{"key":"4824_CR2","doi-asserted-by":"crossref","unstructured":"Aneja J, Deshpande A, Schwing AG (2018) Convolutional image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5561\u20135570","DOI":"10.1109\/CVPR.2018.00583"},{"key":"4824_CR3","doi-asserted-by":"crossref","unstructured":"Antol S, Agrawal A, Lu J et\u00a0al (2015) Vqa: Visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425\u20132433","DOI":"10.1109\/ICCV.2015.279"},{"key":"4824_CR4","first-page":"39","volume":"1","author":"A Bakar","year":"2023","unstructured":"Bakar A, Sarwar R, Hassan SU et al (2023) Extracting algorithmic complexity in scientific literature for advance searching. J Comput Appl Linguist 1:39\u201365","journal-title":"J Comput Appl Linguist"},{"key":"4824_CR5","doi-asserted-by":"crossref","unstructured":"Bouchard C, Omhover Jf, Mougenot C, et\u00a0al (2008) Trends: a content-based information retrieval system for designers. In: Design Computing and Cognition\u201908: Proceedings of the Third International Conference on Design Computing and Cognition. Springer, pp 593\u2013611","DOI":"10.1007\/978-1-4020-8728-8_31"},{"key":"4824_CR6","doi-asserted-by":"crossref","unstructured":"Chen X, Lawrence\u00a0Zitnick C (2015) Mind\u2019s eye: A recurrent visual representation for image caption generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2422\u20132431","DOI":"10.1109\/CVPR.2015.7298856"},{"key":"4824_CR7","unstructured":"Chen W, Lucchi A, Hofmann T (2016) A semi-supervised framework for image captioning. arXiv preprint arXiv:1611.05321"},{"key":"4824_CR8","doi-asserted-by":"crossref","unstructured":"Cornia M, Stefanini M, Baraldi L, et\u00a0al (2020) Meshed-memory transformer for image captioning. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 10578\u201310587","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"4824_CR9","doi-asserted-by":"crossref","unstructured":"Dai B, Fidler S, Urtasun R, et\u00a0al (2017) Towards diverse and natural image descriptions via a conditional gan. In: Proceedings of the IEEE international conference on computer vision, pp 2970\u20132979","DOI":"10.1109\/ICCV.2017.323"},{"key":"4824_CR10","doi-asserted-by":"crossref","unstructured":"Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR\u201905), IEEE, pp 886\u2013893","DOI":"10.1109\/CVPR.2005.177"},{"key":"4824_CR11","first-page":"16736","volume":"33","author":"R Del Chiaro","year":"2020","unstructured":"Del Chiaro R, Twardowski B, Bagdanov A et al (2020) Ratt: Recurrent attention to transient tasks for continual image captioning. Adv Neural Inf Process Syst 33:16736\u201316748","journal-title":"Adv Neural Inf Process Syst"},{"key":"4824_CR12","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A et\u00a0al (2021) An image is worth 16 x 16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations"},{"key":"4824_CR13","doi-asserted-by":"publisher","first-page":"812","DOI":"10.1016\/j.ins.2022.12.018","volume":"623","author":"S Dubey","year":"2023","unstructured":"Dubey S, Olimov F, Rafique MA et al (2023) Label-attention transformer with geometrically coherent objects for image captioning. Inf Sci 623:812\u2013831","journal-title":"Inf Sci"},{"key":"4824_CR14","first-page":"63","volume":"3","author":"AA Goodrum","year":"2000","unstructured":"Goodrum AA (2000) Image information retrieval: an overview of current research. Inf Sci 3:63","journal-title":"Inf Sci"},{"issue":"4","key":"4824_CR15","doi-asserted-by":"publisher","first-page":"e15407","DOI":"10.1016\/j.heliyon.2023.e15407","volume":"9","author":"MU Hassan","year":"2023","unstructured":"Hassan MU, Alaliyat S, Sarwar R et al (2023) Leveraging deep learning and big data to enhance computing curriculum for industry-relevant skills: a Norwegian case study. Heliyon 9(4):e15407","journal-title":"Heliyon"},{"issue":"5","key":"4824_CR16","doi-asserted-by":"publisher","first-page":"1229","DOI":"10.1177\/01655515211043713","volume":"49","author":"SU Hassan","year":"2023","unstructured":"Hassan SU, Aljohani NR, Tarar UI et al (2023) Exploiting tweet sentiments in altmetrics large-scale data. J Inf Sci 49(5):1229\u20131245","journal-title":"J Inf Sci"},{"key":"4824_CR17","doi-asserted-by":"crossref","unstructured":"He S, Liao W, Tavakoli HR et\u00a0al (2020) Image captioning through image transformer. In: Proceedings of the Asian conference on computer vision","DOI":"10.1007\/978-3-030-69538-5_10"},{"key":"4824_CR18","unstructured":"Herdade S, Kappeler A, Boakye K et\u00a0al (2019) Image captioning: transforming objects into words. Advances in neural information processing systems 32"},{"key":"4824_CR19","doi-asserted-by":"publisher","first-page":"853","DOI":"10.1613\/jair.3994","volume":"47","author":"M Hodosh","year":"2013","unstructured":"Hodosh M, Young P, Hockenmaier J (2013) Framing image description as a ranking task: data, models and evaluation metrics. J Artif Intell Res 47:853\u2013899","journal-title":"J Artif Intell Res"},{"key":"4824_CR20","unstructured":"Ilahi I, Zia HMA, Ahsan MA et\u00a0al (2020) Efficient urdu caption generation using attention based lstm. arXiv preprint arXiv:2008.01663"},{"key":"4824_CR21","unstructured":"Jawaid B, Kamran A, Bojar O (2014) A tagged corpus and a tagger for Urdu. In: LREC. pp 2938\u20132943"},{"key":"4824_CR22","unstructured":"Karpathy A, Joulin A, Fei-Fei LF (2014) Deep fragment embeddings for bidirectional image sentence mapping. Advances in neural information processing systems 27"},{"issue":"1","key":"4824_CR23","doi-asserted-by":"publisher","first-page":"40","DOI":"10.1109\/MIC.2020.3037034","volume":"25","author":"MU Khan","year":"2020","unstructured":"Khan MU, Abbas A, Rehman A et al (2020) Hateclassify: a service framework for hate speech identification on social media. IEEE Internet Comput 25(1):40\u201349","journal-title":"IEEE Internet Comput"},{"key":"4824_CR24","unstructured":"Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539"},{"key":"4824_CR25","unstructured":"Li LH, Yatskar M, Yin D et\u00a0al (2019) Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557"},{"key":"4824_CR26","doi-asserted-by":"crossref","unstructured":"Li X, Yin X, Li C, et\u00a0al (2020) Oscar: Object-semantics aligned pre-training for vision-language tasks. In: Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Springer, pp 121\u2013137","DOI":"10.1007\/978-3-030-58577-8_8"},{"key":"4824_CR27","unstructured":"Li J, Li D, Xiong C et\u00a0al (2022) Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, PMLR, pp 12888\u201312900"},{"key":"4824_CR28","doi-asserted-by":"crossref","unstructured":"Limkonchotiwat P, Phatthiyaphaibun W, Sarwar R et\u00a0al (2020) Domain adaptation of Thai word segmentation models using stacked ensemble. Association for Computational Linguistics","DOI":"10.18653\/v1\/2020.emnlp-main.315"},{"key":"4824_CR29","doi-asserted-by":"crossref","unstructured":"Lin TY, Maire M, Belongie S et\u00a0al (2014) Microsoft coco: Common objects in context. In: Computer Vision\u2013ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6\u201312, 2014, Proceedings, Part V 13, Springer, pp 740\u2013755","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"4824_CR30","doi-asserted-by":"crossref","unstructured":"Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE international conference on computer vision, IEEE, pp 1150\u20131157","DOI":"10.1109\/ICCV.1999.790410"},{"key":"4824_CR31","doi-asserted-by":"crossref","unstructured":"Luo J, Li Y, Pan Y et\u00a0al (2023) Semantic-conditional diffusion networks for image captioning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 23359\u201323368","DOI":"10.1109\/CVPR52729.2023.02237"},{"key":"4824_CR32","doi-asserted-by":"crossref","unstructured":"Mao J, Wei X, Yang Y et\u00a0al (2015) Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In: Proceedings of the IEEE international conference on computer vision, pp 2533\u20132541","DOI":"10.1109\/ICCV.2015.291"},{"issue":"3","key":"4824_CR33","doi-asserted-by":"publisher","first-page":"830","DOI":"10.1093\/llc\/fqab092","volume":"37","author":"E Mohamed","year":"2022","unstructured":"Mohamed E, Sarwar R (2022) Linguistic features evaluation for hadith authenticity through automatic machine learning. Digit Scholarsh Humanit 37(3):830\u2013843","journal-title":"Digit Scholarsh Humanit"},{"issue":"2","key":"4824_CR34","doi-asserted-by":"publisher","first-page":"658","DOI":"10.1093\/llc\/fqac054","volume":"38","author":"E Mohamed","year":"2023","unstructured":"Mohamed E, Sarwar R, Mostafa S (2023) Translator attribution for Arabic using machine learning. Digit Scholarsh Humanit 38(2):658\u2013666","journal-title":"Digit Scholarsh Humanit"},{"key":"4824_CR35","doi-asserted-by":"crossref","unstructured":"Mohammad S, Khan MU, Ali M, et\u00a0al (2019) Bot detection using a single post on social media. In: 2019 third world conference on smart trends in systems security and sustainability (WorldS4), IEEE, pp 215\u2013220","DOI":"10.1109\/WorldS4.2019.8903989"},{"key":"4824_CR36","doi-asserted-by":"crossref","unstructured":"Ojala T, Pietik\u00e4inen M, M\u00e4enp\u00e4\u00e4 T (2000) Gray scale and rotation invariant texture classification with local binary patterns. In: Computer Vision-ECCV 2000: 6th European Conference on Computer Vision Dublin, Ireland, June 26\u2013July 1, 2000 Proceedings, Part I 6, Springer, pp 404\u2013420","DOI":"10.1007\/3-540-45054-8_27"},{"key":"4824_CR37","unstructured":"Ordonez V, Kulkarni G, Berg T (2011) Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems 24"},{"key":"4824_CR38","doi-asserted-by":"crossref","unstructured":"Ramos R, Martins B, Elliott D et\u00a0al (2023) Smallcap: lightweight image captioning prompted with retrieval augmentation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 2840\u20132849","DOI":"10.1109\/CVPR52729.2023.00278"},{"key":"4824_CR39","doi-asserted-by":"publisher","DOI":"10.46298\/jdmdh.8990","author":"H Saadany","year":"2023","unstructured":"Saadany H, Mohamed E, Sarwar R (2023) Towards a better understanding of tarajem: Creating topological networks for Arabic biographical dictionaries. J Data Min Digit Humanit. https:\/\/doi.org\/10.46298\/jdmdh.8990","journal-title":"J Data Min Digit Humanit"},{"key":"4824_CR40","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2023.122874","volume":"243","author":"F Sabah","year":"2023","unstructured":"Sabah F, Chen Y, Yang Z et al (2023) Model optimization techniques in personalized federated learning: a survey. Expert Syst Appl 243:122874","journal-title":"Expert Syst Appl"},{"issue":"8","key":"4824_CR41","doi-asserted-by":"publisher","DOI":"10.1111\/exsy.12751","volume":"38","author":"I Safder","year":"2021","unstructured":"Safder I, Mahmood Z, Sarwar R et al (2021) Sentiment analysis for Urdu online reviews using deep learning models. Expert Syst 38(8):e12751","journal-title":"Expert Syst"},{"key":"4824_CR42","doi-asserted-by":"crossref","unstructured":"Sap M, Shwartz V, Bosselut A, et\u00a0al (2020) Commonsense reasoning for natural language processing. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pp 27\u201333","DOI":"10.18653\/v1\/2020.acl-tutorials.7"},{"key":"4824_CR43","doi-asserted-by":"crossref","unstructured":"Sarwar R (2022) Author gender identification for urdu articles. In: International Conference on Computational and Corpus-Based Phraseology, Springer, pp 221\u2013235","DOI":"10.1007\/978-3-031-15925-1_16"},{"issue":"2","key":"4824_CR44","first-page":"1","volume":"21","author":"R Sarwar","year":"2021","unstructured":"Sarwar R, Hassan SU (2021) Urduai: Writeprints for Urdu authorship identification. Trans Asian Low-Resour Lang Inf Process 21(2):1\u201318","journal-title":"Trans Asian Low-Resour Lang Inf Process"},{"key":"4824_CR45","doi-asserted-by":"crossref","unstructured":"Sharma P, Ding N, Goodman S, et\u00a0al (2018) Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 2556\u20132565","DOI":"10.18653\/v1\/P18-1238"},{"key":"4824_CR46","doi-asserted-by":"crossref","unstructured":"Shetty R, Rohrbach M, Anne\u00a0Hendricks L et\u00a0al (2017) Speaking the same language: Matching machine to human captions by adversarial training. In: Proceedings of the IEEE International Conference on Computer Vision, pp 4135\u20134144","DOI":"10.1109\/ICCV.2017.445"},{"key":"4824_CR47","doi-asserted-by":"crossref","unstructured":"Silva K, Can B, Blain F et\u00a0al (2023) Authorship attribution of late 19th century novels using gan-bert. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pp 310\u2013320","DOI":"10.18653\/v1\/2023.acl-srw.44"},{"key":"4824_CR48","unstructured":"Vaswani A, Shazeer N, Parmar N et\u00a0al (2017) Attention is all you need. Advances in neural information processing systems 30"},{"key":"4824_CR49","unstructured":"Wang P, Yang A, Men R et\u00a0al (2022) Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In: International Conference on Machine Learning, PMLR, pp 23318\u201323340"},{"key":"4824_CR50","unstructured":"Xu K, Ba J, Kiros R et\u00a0al (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, PMLR, pp 2048\u20132057"},{"key":"4824_CR51","doi-asserted-by":"publisher","DOI":"10.1016\/j.sigpro.2019.107329","volume":"167","author":"S Yan","year":"2020","unstructured":"Yan S, Xie Y, Wu F et al (2020) Image captioning via hierarchical attention mechanism and policy gradient optimization. Signal Process 167:107329","journal-title":"Signal Process"},{"key":"4824_CR52","doi-asserted-by":"crossref","unstructured":"Yang X, Zhang H, Jin D et\u00a0al (2020) Fashion captioning: Towards generating accurate descriptions with semantic rewards. In: Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part XIII 16, Springer, pp 1\u201317","DOI":"10.1007\/978-3-030-58601-0_1"},{"issue":"8","key":"4824_CR53","doi-asserted-by":"publisher","first-page":"1485","DOI":"10.1109\/JPROC.2010.2050411","volume":"98","author":"BZ Yao","year":"2010","unstructured":"Yao BZ, Yang X, Lin L et al (2010) I2t: Image parsing to text description. Proc IEEE 98(8):1485\u20131508","journal-title":"Proc IEEE"},{"issue":"6","key":"4824_CR54","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2020.102351","volume":"57","author":"F Zaman","year":"2020","unstructured":"Zaman F, Shardlow M, Hassan SU et al (2020) Htss: A novel hybrid text summarisation and simplification architecture. Inf Process Manag 57(6):102351","journal-title":"Inf Process Manag"},{"key":"4824_CR55","doi-asserted-by":"crossref","unstructured":"Zhong Y, Wang L, Chen J et\u00a0al (2020) Comprehensive image captioning via scene graph decomposition. In: Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part XIV 16, Springer, pp 211\u2013229","DOI":"10.1007\/978-3-030-58568-6_13"}],"container-title":["Journal of Ambient Intelligence and Humanized Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s12652-024-04824-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s12652-024-04824-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s12652-024-04824-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,19]],"date-time":"2024-08-19T16:15:46Z","timestamp":1724084146000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s12652-024-04824-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,2]]},"references-count":55,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2024,9]]}},"alternative-id":["4824"],"URL":"https:\/\/doi.org\/10.1007\/s12652-024-04824-9","relation":{},"ISSN":["1868-5137","1868-5145"],"issn-type":[{"type":"print","value":"1868-5137"},{"type":"electronic","value":"1868-5145"}],"subject":[],"published":{"date-parts":[[2024,7,2]]},"assertion":[{"value":"14 June 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 June 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 July 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}