{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T08:20:13Z","timestamp":1774945213553,"version":"3.50.1"},"reference-count":41,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2025,12,12]],"date-time":"2025-12-12T00:00:00Z","timestamp":1765497600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>The development of Natural Language Processing applications tailored for diverse Arabic-speaking users requires specialized Arabic corpora, which are currently lacking in existing Arabic linguistic resources. Therefore, in this study, a multidialectal parallel Arabic corpus is built, focusing on the travel and tourism domain. By leveraging the text generation and dialectal transformation capabilities of Large Language Models, an initial set of approximately 100,000 parallel sentences was generated. Following a rigorous multi-stage deduplication process, 50,010 unique parallel sentences were obtained from Modern Standard Arabic (MSA) and five major Arabic dialects\u2014Saudi, Egyptian, Iraqi, Levantine, and Moroccan. This study presents the detailed methodology of corpus generation and refinement, describes the characteristics of the generated corpus, and provides a comprehensive statistical analysis highlighting the corpus size, lexical diversity, and linguistic overlap between MSA and the five dialects. 
This corpus represents a valuable resource for researchers and developers in Arabic dialect processing and AI applications that require nuanced contextual understanding.<\/jats:p>","DOI":"10.3390\/data10120208","type":"journal-article","created":{"date-parts":[[2025,12,12]],"date-time":"2025-12-12T11:13:33Z","timestamp":1765538013000},"page":"208","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Automated Building of a Multidialectal Parallel Arabic Corpus Using Large Language Models"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8856-0958","authenticated-orcid":false,"given":"Khalid","family":"Almeman","sequence":"first","affiliation":[{"name":"Unit of Scientific Research, Applied College, Qassim University, Buraydah 52571, Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,12,12]]},"reference":[{"key":"ref_1","unstructured":"(2025, October 04). United Nations Arabic Language Day. Available online: https:\/\/www.un.org\/en\/observances\/arabiclanguageday?utm_source=chatgpt.com."},{"key":"ref_2","unstructured":"Hirst, G. (2010). Introduction to Arabic Natural Language Processing, Morgan & Claypool."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"102770","DOI":"10.1016\/j.ipm.2021.102770","article-title":"Similarities between Arabic Dialects: Investigating Geographical Proximity","volume":"59","author":"Alsudais","year":"2022","journal-title":"Inf. Process Manag."},{"key":"ref_4","first-page":"23","article-title":"Diglossia in Arabic A Comparative Study of the Modern Standard Arabic and Egyptian Colloquial Arabic","volume":"12","author":"Jabbari","year":"2012","journal-title":"Glob. J. Hum.-Soc. Sci."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Bouamor, H., Alshikhabobakr, H., Mohit, B., and Oflazer, K. (2014, January 25\u201329). 
A Human Judgement Corpus and a Metric for Arabic MT Evaluation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1026"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Abdelali, A., Darwish, K., Durrani, N., and Mubarak, H. (2016, January 12\u201317). Farasa: A Fast and Furious Segmenter for Arabic. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, CA, USA.","DOI":"10.18653\/v1\/N16-3003"},{"key":"ref_7","unstructured":"Parker, R., Graff, D., Chen, K., Kong, J., and Maeda, K. (2011). Arabic Gigaword, Abacus Data Network. [5th ed.]."},{"key":"ref_8","unstructured":"Canavan, A., Zipperlen, G., and Graff, D. (1997). CALLHOME Egyptian Arabic Speech, Linguistic Data Consortium."},{"key":"ref_9","unstructured":"Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., Obeid, O., Khalifa, S., Eryani, F., and Erdmann, A. (2018, January 7\u201312). MADAR: A Large-Scale Multi-Arabic Dialect Applications and Resources Project. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan."},{"key":"ref_10","unstructured":"(2025, October 05). LDC Linguistic Data Consortium. Available online: https:\/\/www.ldc.upenn.edu\/."},{"key":"ref_11","unstructured":"(2025, October 05). ELRA ELRA Catalogue of Language Resources. Available online: https:\/\/catalogue.elra.info\/en-us\/."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1016\/j.specom.2024.103110","article-title":"Arabic Automatic Speech Recognition: Challenges and Progress","volume":"163","author":"Besdouri","year":"2024","journal-title":"Speech Commun."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Morin, C., and Marttinen Larsson, M. (2025). 
Large Corpora and Large Language Models: A Replicable Method for Automating Grammatical Annotation. Linguist. Vanguard.","DOI":"10.1515\/lingvan-2024-0228"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"100089","DOI":"10.1016\/j.acorp.2024.100089","article-title":"Using Early LLMs for Corpus Linguistics: Examining ChatGPT\u2019s Potential and Limitations","volume":"4","author":"Uchida","year":"2024","journal-title":"Appl. Corpus Linguist."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"101988","DOI":"10.1016\/j.giq.2024.101988","article-title":"Exploiting GPT for Synthetic Data Generation: An Empirical Study","volume":"42","author":"Busker","year":"2025","journal-title":"Gov. Inf. Q."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Perea-Trigo, M., Botella-L\u00f3pez, C., Mart\u00ednez-del-Amor, M.\u00c1., \u00c1lvarez-Garc\u00eda, J.A., Soria-Morillo, L.M., and Vegas-Olmos, J.J. (2024). Synthetic Corpus Generation for Deep Learning-Based Translation of Spanish Sign Language. Sensors, 24.","DOI":"10.3390\/s24051472"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1145\/3664930","article-title":"A Bibliometric Review of Large Language Models Research from 2017 to 2023","volume":"15","author":"Fan","year":"2024","journal-title":"ACM Trans. Intell. Syst. Technol."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"106","DOI":"10.1145\/3744746","article-title":"A Comprehensive Overview of Large Language Models","volume":"16","author":"Naveed","year":"2025","journal-title":"ACM Trans. Intell. Syst. Technol."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1145\/3341726","article-title":"Filtered Pseudo-Parallel Corpus Improves Low-Resource Neural Machine Translation","volume":"19","author":"Imankulova","year":"2020","journal-title":"ACM Trans. Asian Low-Resour. Lang. Inf. 
Process."},{"key":"ref_20","first-page":"92","article-title":"Morphologically Motivated Input Variations and Data Augmentation in Turkish-English Neural Machine Translation","volume":"22","year":"2023","journal-title":"ACM Trans. Asian Low-Resour. Lang. Inf. Process."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"12404","DOI":"10.18653\/v1\/2024.acl-long.671","article-title":"Investigating Cultural Alignment of Large Language Models","volume":"Volume 1","author":"AlKhamissi","year":"2024","journal-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Al-Shenaifi, N., Azmi, A.M., and Hosny, M. (2024). Advancing AI-Driven Linguistic Analysis: Developing and Annotating Comprehensive Arabic Dialect Corpora for Gulf Countries and Saudi Arabia. Mathematics, 12.","DOI":"10.3390\/math12193120"},{"key":"ref_23","unstructured":"El Haff, K., Jarrar, M., Hammouda, T., and Zaraket, F. (2022, January 20\u201325). Curras + Baladi: Towards a Levantine Corpus. Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022), Marseille, France."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1038\/s41586-024-07566-y","article-title":"AI Models Collapse When Trained on Recursively Generated Data","volume":"631","author":"Shumailov","year":"2024","journal-title":"Nature"},{"key":"ref_25","unstructured":"(2025, October 02). Google Gemini. Available online: https:\/\/gemini.google.com\/."},{"key":"ref_26","unstructured":"Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., and Millican, K. (2023). Gemini: A Family of Highly Capable Multimodal Models. arXiv."},{"key":"ref_27","unstructured":"Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., and Lenc, C. (2024). 
Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1145\/3743047","article-title":"AraEventCoref: An Arabic Event Coreference Dataset and LLM Benchmarks","volume":"24","author":"Aldawsari","year":"2025","journal-title":"ACM Trans. Asian Low-Resour. Lang. Inf. Process."},{"key":"ref_29","unstructured":"Daoud, M.A., Abouzahir, C., Kharouf, L., Al-Eisawi, W., Shamout, F.E., and Habash, N. (2025, January 15). MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks. Proceedings of the Machine Learning for Healthcare (ML4HC), Rochester, MN, USA."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Sallam, M., Al-Mahzoum, K., Almutawaa, R.A., Alhashash, J.A., Dashti, R.A., AlSafy, D.R., Almutairi, R.A., and Barakat, M. (2024). The performance of OpenAI ChatGPT-4 and Google Gemini in virology multiple-choice questions: A comparative analysis of English and Arabic responses. BMC Res. Notes, 17.","DOI":"10.1186\/s13104-024-06920-7"},{"key":"ref_31","unstructured":"Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., and Dong, Z. (2023). A Survey of Large Language Models. arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Almeman, K., and Lee, M. (2013, January 12\u201314). Automatic Building of Arabic Multi Dialect Text Corpora by Bootstrapping Dialect Words. Proceedings of the Communications, Signal Processing, and their Applications (ICCSPA), 2013 1st International Conference on Communications, Signal Processing and Their Applications, Sharjah, United Arab Emirates.","DOI":"10.1109\/ICCSPA.2013.6487247"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Biemann, C., Shin, S.-I., and Choi, K.-S. (2004, January 23\u201327). Semiautomatic Extension of CoreNet Using a Bootstrapping Mechanism on Corpus-Based Co-Occurrences. 
Proceedings of the 20th International Conference on Computational Linguistics\u2014COLING\u201904, Geneva, Switzerland.","DOI":"10.3115\/1220355.1220533"},{"key":"ref_34","unstructured":"Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., and Brahma, S. (2022). Scaling Instruction-Finetuned Language Models. arXiv."},{"key":"ref_35","unstructured":"Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., and Kumar, A. (2023). Holistic Evaluation of Language Models. Transactions on Machine Learning Research (TMLR). arXiv."},{"key":"ref_36","first-page":"1877","article-title":"Language Models Are Few-Shot Learners","volume":"33","author":"Brown","year":"2020","journal-title":"Proc. Adv. Neural Inf. Process. Syst. (NeurIPS)"},{"key":"ref_37","unstructured":"van Deemter, K., Lin, C., and Takamura, H. (November, January 29). Best Practices for the Human Evaluation of Automatically Generated Text. Proceedings of the 12th International Conference on Natural Language Generation, Tokyo, Japan."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1016\/j.procs.2017.10.094","article-title":"AraSenTi-Tweet: A Corpus for Arabic Sentiment Analysis of Saudi Tweets","volume":"117","year":"2017","journal-title":"Procedia Comput. Sci."},{"key":"ref_39","unstructured":"Bouamor, H., Habash, N., and Oflazer, K. (2014, January 26). A Multidialectal Parallel Corpus of Arabic. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914), Reykjavik, Iceland."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Bowker, L., and Pearson, J. (2002). Working with Specialized Corpora; Studies in Corpus Linguistics, John Benjamins Publishing Company.","DOI":"10.4324\/9780203469255"},{"key":"ref_41","unstructured":"Sinclair, J. (1991). 
Corpus, Concordance, Collocation; Describing English Language, Oxford University Press."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/12\/208\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,12]],"date-time":"2025-12-12T11:36:40Z","timestamp":1765539400000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/12\/208"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,12]]},"references-count":41,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["data10120208"],"URL":"https:\/\/doi.org\/10.3390\/data10120208","relation":{},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,12]]}}}