{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,29]],"date-time":"2026-01-29T14:42:56Z","timestamp":1769697776879,"version":"3.49.0"},"reference-count":64,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2025,8,8]],"date-time":"2025-08-08T00:00:00Z","timestamp":1754611200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ajman University"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computation"],"abstract":"<jats:p>Neural machine translation (NMT) models combining textual and visual inputs generate more accurate translations compared with unimodal models. Moreover, translation models with an under-resourced target language benefit from multisource inputs (source sentences are provided in different languages). Building MultiModal MutliSource NMT (M3S-NMT) systems require significant efforts to curate datasets suitable for such a multifaceted task. This work uses image caption translation as an example of multimodal translation and presents a novel public dataset for translating captions from multiple European languages (viz., English, German, French, and Czech) into the distant and under-resourced Arabic language. Moreover, it presents multitask learning models trained and tested on this dataset to serve as solid baselines to help further research in this area. These models involve two parts: one for learning the visual representations of the input images, and the other for translating the textual input based on these representations. The translations are produced from a framework of attention-based encoder\u2013decoder architectures. The visual features are learned from a pretrained convolutional neural network (CNN). 
These features are then integrated with textual features learned through the very basic yet well-known recurrent neural networks (RNNs) with GloVe or BERT word embeddings. Despite the challenges associated with the task at hand, the results of these systems are very promising, reaching 34.57 and 42.52 METEOR scores.<\/jats:p>","DOI":"10.3390\/computation13080194","type":"journal-article","created":{"date-parts":[[2025,8,8]],"date-time":"2025-08-08T15:30:52Z","timestamp":1754667052000},"page":"194","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Multimodal Multisource Neural Machine Translation: Building Resources for Image Caption Translation from European Languages into Arabic"],"prefix":"10.3390","volume":"13","author":[{"given":"Roweida","family":"Mohammed","sequence":"first","affiliation":[{"name":"Computer Engineering Department, Jordan University of Science and Technology, Irbid 22110, Jordan"}]},{"given":"Inad","family":"Aljarrah","sequence":"additional","affiliation":[{"name":"Computer Engineering Department, Jordan University of Science and Technology, Irbid 22110, Jordan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9372-9076","authenticated-orcid":false,"given":"Mahmoud","family":"Al-Ayyoub","sequence":"additional","affiliation":[{"name":"AI Research Center, College of Engineering and IT, Ajman University, Ajman 346, United Arab Emirates"},{"name":"Computer Science Department, Jordan University of Science and Technology, Irbid 22110, Jordan"}]},{"given":"Ali","family":"Fadel","sequence":"additional","affiliation":[{"name":"Computer Science Department, Jordan University of Science and Technology, Irbid 22110, Jordan"}]}],"member":"1968","published-online":{"date-parts":[[2025,8,8]]},"reference":[{"key":"ref_1","first-page":"79","article-title":"A statistical approach to machine translation","volume":"16","author":"Brown","year":"1990","journal-title":"Comput. 
Linguist."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Kalchbrenner, N., and Blunsom, P. (2013, January 18\u201321). Recurrent continuous translation models. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA.","DOI":"10.18653\/v1\/D13-1176"},{"key":"ref_3","unstructured":"Choi, G.H., Shin, J.H., and Kim, Y.K. (2017). Improving a multi-source neural machine translation model with corpus extension for low-resource languages. arXiv."},{"key":"ref_4","unstructured":"Barrault, L., Bougares, F., Specia, L., Lala, C., Elliott, D., and Frank, S. (November, January 31). Findings of the third shared task on multimodal machine translation. Proceedings of the Third Conference on Machine Translation (WMT18), Brussels, Belgium."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Elliott, D., Frank, S., Barrault, L., Bougares, F., and Specia, L. (2017). Findings of the second shared task on multimodal machine translation and multilingual image description. arXiv.","DOI":"10.18653\/v1\/W17-4718"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Specia, L., Frank, S., Sima\u2019An, K., and Elliott, D. (2016, January 11\u201312). A shared task on multimodal machine translation and crosslingual image description. Proceedings of the First Conference on Machine Translation, Berlin, Germany.","DOI":"10.18653\/v1\/W16-2346"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Kocmi, T., Avramidis, E., Bawden, R., Bojar, O., Dvorkovich, A., Federmann, C., Fishel, M., Freitag, M., Gowda, T., and Grundkiewicz, R. (2024, January 15\u201316). Findings of the WMT24 general machine translation shared task: The LLM era is here but mt is not solved yet. Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA.","DOI":"10.18653\/v1\/2024.wmt-1.1"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. 
(2002, January 6\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_9","unstructured":"Imankulova, A., Kaneko, M., Hirasawa, T., and Komachi, M. (2020). Towards multimodal simultaneous neural machine translation. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1550","DOI":"10.1109\/5.58337","article-title":"Backpropagation through time: What it does and how to do it","volume":"78","author":"Werbos","year":"1990","journal-title":"Proc. IEEE"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"533","DOI":"10.1038\/323533a0","article-title":"Learning representations by back-propagating errors","volume":"323","author":"Rumelhart","year":"1986","journal-title":"Nature"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Mikolov, T., Kombrink, S., Burget, L., \u010cernock\u00fd, J., and Khudanpur, S. (2011, January 22\u201327). Extensions of recurrent neural network language model. Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic.","DOI":"10.1109\/ICASSP.2011.5947611"},{"key":"ref_13","unstructured":"Sutskever, I., Martens, J., and Hinton, G.E. (July, January 28). Generating text with recurrent neural networks. Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Liu, S., Yang, N., Li, M., and Zhou, M. (2014, January 22\u201327). A recursive recurrent neural network for statistical machine translation. 
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, MD, USA.","DOI":"10.3115\/v1\/P14-1140"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Auli, M., Galley, M., Quirk, C., and Zweig, G. (2013, January 18\u201321). Joint language and translation modeling with recurrent neural networks. Proceedings of the EMNLP, Seattle, WA, USA.","DOI":"10.18653\/v1\/D13-1106"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"2673","DOI":"10.1109\/78.650093","article-title":"Bidirectional recurrent neural networks","volume":"45","author":"Schuster","year":"1997","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_17","unstructured":"Peris, A., and Casacuberta, F. (2015). A bidirectional recurrent neural language model for machine translation. Procesamiento del Lenguaje Natural 55, 109\u2013116."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Cho, K., Van Merri\u00ebnboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv.","DOI":"10.3115\/v1\/W14-4012"},{"key":"ref_19","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_20","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. Adv. Neural Inf. Process. Syst., 30."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C.D. (2014, January 25\u201329). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_22","unstructured":"Makarenkov, V., Shapira, B., and Rokach, L. (2016). Language models with pre-trained (GloVe) word embeddings. 
arXiv."},{"key":"ref_23","unstructured":"Hirasawa, T., and Komachi, M. (2019). Debiasing word embeddings improves multimodal machine translation. arXiv."},{"key":"ref_24","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January 3\u20135). Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA."},{"key":"ref_25","unstructured":"Zhu, J., Xia, Y., Wu, L., He, D., Qin, T., Zhou, W., Li, H., and Liu, T.Y. (2020). Incorporating bert into neural machine translation. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Clinchant, S., Jung, K.W., and Nikoulina, V. (2019). On the use of BERT for neural machine translation. arXiv.","DOI":"10.18653\/v1\/D19-5611"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Caglayan, O., Bardet, A., Bougares, F., Barrault, L., Wang, K., Masana, M., Herranz, L., and van de Weijer, J. (2018). LIUM-CVC submissions for WMT18 multimodal translation task. arXiv.","DOI":"10.18653\/v1\/W18-6438"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Gr\u00f6nroos, S.A., Huet, B., Kurimo, M., Laaksonen, J., Merialdo, B., Pham, P., Sj\u00f6berg, M., Sulubacak, U., Tiedemann, J., and Troncy, R. (2018). The MeMAD submission to the WMT18 multimodal translation task. arXiv.","DOI":"10.18653\/v1\/W18-6439"},{"key":"ref_29","unstructured":"Gwinnup, J., Sandvick, J., Hutt, M., Erdmann, G., Duselis, J., and Davis, J. (November, January 31). The AFRL-Ohio State WMT18 multimodal system: Combining visual with traditional. Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Brussels, Belgium."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Helcl, J., Libovick\u1ef3, J., and Vari\u0161, D. (2018). 
CUNI system for the WMT18 multimodal translation task. arXiv.","DOI":"10.18653\/v1\/W18-6441"},{"key":"ref_31","unstructured":"Lala, C., Madhyastha, P.S., Scarton, C., and Specia, L. (November, January 31). Sheffield submissions for WMT18 multimodal translation shared task. Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Brussels, Belgium."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Zheng, R., Yang, Y., Ma, M., and Huang, L. (2018). Ensemble sequence level training for multimodal mt: OSU-Baidu WMT18 multimodal machine translation system report. arXiv.","DOI":"10.18653\/v1\/W18-6443"},{"key":"ref_33","unstructured":"Antoun, W., Baly, F., and Hajj, H. (2020). AraBERT: Transformer-based model for Arabic language understanding. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Gashaw, I., and Shashirekha, H. (2019). Amharic-Arabic Neural Machine Translation. arXiv.","DOI":"10.5121\/csit.2019.91606"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Aqlan, F., Fan, X., Alqwbani, A., and Al-Mansoub, A. (2019). Improved Arabic\u2013Chinese machine translation with linguistic input features. Future Internet, 11.","DOI":"10.3390\/fi11010022"},{"key":"ref_36","first-page":"6629","article-title":"A systematic review of text classification research based on deep learning models in arabic language","volume":"10","author":"Wahdan","year":"2020","journal-title":"Int. J. Electr. Comput. Eng."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Dashtipour, K., Gogate, M., Cambria, E., and Hussain, A. (2021). A novel context-aware multimodal framework for persian sentiment analysis. arXiv.","DOI":"10.1016\/j.neucom.2021.02.020"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Bensalah, N., Ayad, H., Adib, A., and El Farouk, A.I. (2021). Arabic Machine Translation Based on the Combination of Word Embedding Techniques. 
Intelligent Systems in Big Data, Semantic Web and Machine Learning, Springer.","DOI":"10.1007\/978-3-030-72588-4_17"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Alsudais, A. (2020, January 9). Extending ImageNet to Arabic using Arabic WordNet. Proceedings of the First Workshop on Advances in Language and Vision Research, Online.","DOI":"10.18653\/v1\/2020.alvr-1.1"},{"key":"ref_40","first-page":"1","article-title":"AraTraditions10k bridging cultures with a comprehensive dataset for enhanced cross lingual image annotation retrieval and tagging","volume":"15","author":"Wang","year":"2025","journal-title":"Sci. Rep."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Elliott, D., Frank, S., Sima\u2019an, K., and Specia, L. (2016). Multi30k: Multilingual english-german image descriptions. arXiv.","DOI":"10.18653\/v1\/W16-3210"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1162\/tacl_a_00166","article-title":"From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions","volume":"2","author":"Young","year":"2014","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"393","DOI":"10.1017\/S1351324918000074","article-title":"Assessing multilingual multimodal image description: Studies of native speaker preferences and translator choices","volume":"24","author":"Frank","year":"2018","journal-title":"Nat. Lang. Eng."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A.M. (2017). OpenNMT: Open-source toolkit for neural machine translation. arXiv.","DOI":"10.18653\/v1\/P17-4012"},{"key":"ref_45","unstructured":"Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017, January 4\u20139). Automatic differentiation in pytorch. 
Proceedings of the NIPS, Long Beach, CA, USA."},{"key":"ref_46","unstructured":"Meng, Y., Ren, X., Sun, Z., Li, X., Yuan, A., Wu, F., and Li, J. (2019). Large-scale pretraining for neural machine translation with tens of billions of sentence pairs. arXiv."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Senellart, J., Zhang, D., Wang, B., Klein, G., Ramatchandirin, J.P., Crego, J.M., and Rush, A.M. (2018, January 20). OpenNMT system description for WNMT 2018: 800 words\/sec on a single-core CPU. Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, Melbourne, Australia.","DOI":"10.18653\/v1\/W18-2715"},{"key":"ref_48","unstructured":"Gwinnup, J., Anderson, T., Erdmann, G., and Young, K. (November, January 31). The AFRL WMT18 systems: Ensembling, continuation and combination. Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Brussels, Belgium."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"125","DOI":"10.2478\/pralin-2018-0011","article-title":"Open source toolkit for speech to text translation","volume":"111","author":"Zenkel","year":"2018","journal-title":"Prague Bull. Math. Linguist."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Parida, S., and Motlicek, P. (2019, January 3\u20137). Abstract text summarization: A low resource challenge. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1616"},{"key":"ref_51","unstructured":"Dixit, S. (2025, June 21). Summarization on SParC. Available online: https:\/\/yale-lily.github.io\/public\/shreya_s2019.pdf."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Sennrich, R., Haddow, B., and Birch, A. (2015). Neural machine translation of rare words with subword units. 
arXiv.","DOI":"10.18653\/v1\/P16-1162"},{"key":"ref_53","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (July, January 26). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_54","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_55","unstructured":"Banerjee, S., and Lavie, A. (2005, January 29). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Ann Arbor, MI, USA."},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2015, January 7\u201313). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.123"},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching word vectors with subword information","volume":"5","author":"Bojanowski","year":"2017","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_58","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv."},{"key":"ref_59","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 11\u201317). 
Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_61","unstructured":"Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., and Wang, X. (2024). Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv."},{"key":"ref_62","first-page":"107547","article-title":"xlstm: Extended long short-term memory","volume":"37","author":"Beck","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_63","unstructured":"Alkin, B., Beck, M., P\u00f6ppel, K., Hochreiter, S., and Brandstetter, J. (2024). Vision-LSTM: xLSTM as generic vision backbone. arXiv."},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Xiao, B., Wu, H., Xu, W., Dai, X., Hu, H., Lu, Y., Zeng, M., Liu, C., and Yuan, L. (2024, January 17\u201318). Florence-2: Advancing a unified representation for a variety of vision tasks. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.00461"}],"container-title":["Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2079-3197\/13\/8\/194\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:26:49Z","timestamp":1760034409000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2079-3197\/13\/8\/194"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,8]]},"references-count":64,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2025,8]]}},"alternative-id":["computation13080194"],"URL":"https:\/\/doi.org\/10.3390\/computation13080194","relation":{},"ISSN":["2079-3197"],"issn-type":[{"value":"2079-3197","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,8]]}}}