{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:04:30Z","timestamp":1750309470185,"version":"3.41.0"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2025,3,23]],"date-time":"2025-03-23T00:00:00Z","timestamp":1742688000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2025,4,30]]},"abstract":"<jats:p>Research on image caption generation has predominantly focused on resource-rich languages like English, leaving resource-poor languages (like Assamese and several others) largely understudied. In this context, this paper leverages both visual and semantic attribute-based features for generating captions in the Assamese language. Semantic attributes refer to the significant words that represent higher-level knowledge about the image content. This work contributes through the effective use of features derived from semantic words in the low-resource Assamese language. The second contribution is the proposal of a Visual-Semantic Self-Attention (VSSA) module for the combination of features derived from images and semantic attributes. The VSSA module enables the image captioning model to dynamically attend to relevant regions of the image as well as the important semantic attributes, thereby leading to more contextually relevant and linguistically accurate Assamese captions. Moreover, the VSSA module is incorporated into a Transformer model to leverage the stacked attention for performance improvement. The model is trained using both cross-entropy loss optimization and a reinforcement learning approach. 
The effectiveness of the proposed model is evaluated through both qualitative and quantitative analyses (using BLEU-n and CIDEr metrics). The proposed model shows significant performance improvement in Assamese caption synthesis compared to previous methods, achieving a 93.7% CIDEr score on the COCO-Assamese Caption (COCO-AC) dataset.<\/jats:p>","DOI":"10.1145\/3717612","type":"journal-article","created":{"date-parts":[[2025,2,14]],"date-time":"2025-02-14T11:08:17Z","timestamp":1739531297000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Exploring Semantic Attributes for Image Caption Synthesis in Low-Resource Assamese Language"],"prefix":"10.1145","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-1159-3118","authenticated-orcid":false,"given":"Pankaj","family":"Choudhury","sequence":"first","affiliation":[{"name":"Center for Linguistics Science and Technology, Indian Institute of Technology Guwahati, Guwahati, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2885-0026","authenticated-orcid":false,"given":"Prithwijit","family":"Guha","sequence":"additional","affiliation":[{"name":"Electronics &amp; Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5869-1057","authenticated-orcid":false,"given":"Sukumar","family":"Nandi","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, India"}]}],"member":"320","published-online":{"date-parts":[[2025,3,23]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2018.05.080"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_3_2_5_2","article-title":"Census of India","author":"Chandramouli 
C.","year":"2011","unstructured":"C. Chandramouli and Registrar General. 2011. Census of India. Rural Urban Distribution of Population, Provisional Population Total. New Delhi: Office of the Registrar General and Census Commissioner, India (2011).","journal-title":"Rural Urban Distribution of Population, Provisional Population Total. New Delhi: Office of the Registrar General and Census Commissioner, India"},{"key":"e_1_3_2_6_2","first-page":"743","volume-title":"Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation","author":"Choudhury Pankaj","year":"2023","unstructured":"Pankaj Choudhury, Prithwijit Guha, and Sukumar Nandi. 2023. Image caption synthesis for low resource Assamese language using Bi-LSTM with bilinear attention. In Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation. 743\u2013752."},{"key":"e_1_3_2_7_2","article-title":"Impact of language-specific training on image caption synthesis: A case study on low-resource Assamese language","author":"Choudhury Pankaj","year":"2024","unstructured":"Pankaj Choudhury, Prithwijit Guha, and Sukumar Nandi. 2024. Impact of language-specific training on image caption synthesis: A case study on low-resource Assamese language. International Journal of Asian Language Processing (2024).","journal-title":"International Journal of Asian Language Processing"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-022-12042-8"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_11_2","article-title":"Ball*-tree: Efficient spatial indexing for constrained nearest-neighbor search in metric spaces","volume":"1511","author":"Dolatshah Mohamad","year":"2015","unstructured":"Mohamad Dolatshah, Ali Hadian, and Behrouz Minaei-Bidgoli. 2015. 
Ball*-tree: Efficient spatial indexing for constrained nearest-neighbor search in metric spaces. ArXiv abs\/1511.00628 (2015). https:\/\/api.semanticscholar.org\/CorpusID:14162909","journal-title":"ArXiv"},{"key":"e_1_3_2_12_2","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1007\/978-3-642-15561-1_2","volume-title":"Computer Vision\u2013ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5\u201311, 2010, Proceedings, Part IV 11","author":"Farhadi Ali","year":"2010","unstructured":"Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Computer Vision\u2013ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5\u201311, 2010, Proceedings, Part IV 11. Springer, 15\u201329."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.127"},{"key":"e_1_3_2_14_2","first-page":"429","volume-title":"The Indo-Aryan Languages","author":"Goswami Golok Chandra","year":"2007","unstructured":"Golok Chandra Goswami and Jyotiprakash Tamuli. 2007. Asamiya. In The Indo-Aryan Languages. Routledge, 429\u2013484."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3292058"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_17_2","article-title":"Image captioning: Transforming objects into words","volume":"32","author":"Herdade Simao","year":"2019","unstructured":"Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. Image captioning: Transforming objects into words. 
Advances in Neural Information Processing Systems 32 (2019).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295748"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00473"},{"key":"e_1_3_2_20_2","doi-asserted-by":"crossref","first-page":"822","DOI":"10.1109\/DASA51403.2020.9317108","volume-title":"2020 International Conference on Decision Aid Sciences and Application (DASA\u201920)","author":"Kamal Abrar Hasin","year":"2020","unstructured":"Abrar Hasin Kamal, Md. Asifuzzaman Jishan, and Nafees Mansoor. 2020. TextMage: The automated Bangla caption generator based on deep learning. In 2020 International Conference on Decision Aid Sciences and Application (DASA\u201920). IEEE, 822\u2013826."},{"key":"e_1_3_2_21_2","article-title":"Deep fragment embeddings for bidirectional image sentence mapping","volume":"27","author":"Karpathy Andrej","year":"2014","unstructured":"Andrej Karpathy, Armand Joulin, and Li F. Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. Advances in Neural Information Processing Systems 27 (2014).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.162"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.5555\/2018936.2018962"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"issue":"2","key":"e_1_3_2_26_2","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1017\/S0025100312000096","article-title":"Assamese","volume":"42","author":"Mahanta Shakuntala","year":"2012","unstructured":"Shakuntala Mahanta. 2012. Assamese. 
Journal of the International Phonetic Association 42, 2 (2012), 217\u2013224.","journal-title":"Journal of the International Phonetic Association"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3432246"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.compeleceng.2021.107114"},{"issue":"3","key":"e_1_3_2_29_2","first-page":"1","article-title":"Efficient channel attention based encoder\u2013decoder approach for image captioning in Hindi","volume":"21","author":"Mishra Santosh Kumar","year":"2021","unstructured":"Santosh Kumar Mishra, Gaurav Rai, Sriparna Saha, and Pushpak Bhattacharyya. 2021. Efficient channel attention based encoder\u2013decoder approach for image captioning in Hindi. Transactions on Asian and Low-Resource Language Information Processing 21, 3 (2021), 1\u201317.","journal-title":"Transactions on Asian and Low-Resource Language Information Processing"},{"key":"e_1_3_2_30_2","first-page":"792","volume-title":"Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation","author":"Mishra Santosh Kumar","year":"2022","unstructured":"Santosh Kumar Mishra, Sushant Sinha, Sriparna Saha, and Pushpak Bhattacharyya. 2022. A deep learning based framework for image paragraph generation in Hindi. In Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation. 792\u2013800."},{"key":"e_1_3_2_31_2","first-page":"747","volume-title":"Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics","author":"Mitchell Margaret","year":"2012","unstructured":"Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han, Alyssa Mensch, Alexander Berg, Tamara Berg, and Hal Daum\u00e9 III. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. 
747\u2013756."},{"key":"e_1_3_2_32_2","first-page":"1","article-title":"Bornon: Bengali image captioning with transformer-based deep learning approach","volume":"3","author":"Shah Faisal Muhammad","year":"2022","unstructured":"Faisal Muhammad Shah, Mayeesha Humaira, Md. Abidur Rahman Khan Jim, Amit Saha Ami, and Shimul Paul. 2022. Bornon: Bengali image captioning with transformer-based deep learning approach. SN Computer Science 3 (2022), 1\u201316.","journal-title":"SN Computer Science"},{"key":"e_1_3_2_33_2","first-page":"263","volume-title":"Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING\u201922)","author":"Nath Prachurya","year":"2022","unstructured":"Prachurya Nath, Prottay Kumar Adhikary, Pankaj Dadure, Partha Pakray, Riyanka Manna, and Sivaji Bandyopadhyay. 2022. Image caption generation for low-resource Assamese language. In Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING\u201922). 263\u2013272."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01098"},{"key":"e_1_3_2_35_2","first-page":"1","volume-title":"2022 IEEE\/ACS 19th International Conference on Computer Systems and Applications (AICCSA\u201922)","author":"Pathak Dhrubajyoti","year":"2022","unstructured":"Dhrubajyoti Pathak, Sukumar Nandi, and Priyankoo Sarmah. 2022. AsPOS: Assamese part of speech tagger using deep learning approach. In 2022 IEEE\/ACS 19th International Conference on Computer Systems and Applications (AICCSA\u201922). IEEE, 1\u20138."},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2019.06.100"},{"key":"e_1_3_2_37_2","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. 
Advances in Neural Information Processing Systems 28 (2015).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_2_39_2","article-title":"Sequence to sequence learning with neural networks","volume":"27","author":"Sutskever Ilya","year":"2014","unstructured":"Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27 (2014).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_40_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.29"},{"key":"e_1_3_2_43_2","first-page":"2048","volume-title":"International Conference on Machine Learning","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. 
2048\u20132057."},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.524"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_3_2_46_2","article-title":"Places: A 10 million image database for scene recognition","author":"Zhou Bolei","year":"2017","unstructured":"Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3717612","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3717612","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:17:15Z","timestamp":1750295835000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3717612"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,23]]},"references-count":45,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,4,30]]}},"alternative-id":["10.1145\/3717612"],"URL":"https:\/\/doi.org\/10.1145\/3717612","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2025,3,23]]},"assertion":[{"value":"2024-09-02","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-08","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2025-03-23","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}