{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T08:04:33Z","timestamp":1761206673584,"version":"build-2065373602"},"reference-count":75,"publisher":"PeerJ","license":[{"start":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T00:00:00Z","timestamp":1761177600000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100023674","name":"Deanship of Scientific Research at King Khalid University","doi-asserted-by":"crossref","award":["RGP2\/569\/46"],"award-info":[{"award-number":["RGP2\/569\/46"]}],"id":[{"id":"10.13039\/501100023674","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61901388"],"award-info":[{"award-number":["61901388"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"abstract":"<jats:p>\n                    Arabic dialect identification (ADI) aims to automatically determine the specific regional dialect of a given Arabic text. State-of-the-art ADI solutions often rely on fine-tuning Arabic-specific pre-trained language models (PLMs). Although effective, these PLMs are predominantly trained on modern standard Arabic (MSA), which limits their performance on dialectal data. Furthermore, the high degree of similarity among Arabic dialects makes it difficult to learn accurate dialect-specific representations without large volumes of labeled data. However, labeling such data is both costly and labor-intensive, particularly for low-resource languages like Arabic. To address these challenges, we propose a self-training neural approach that independently learns\n                    <jats:italic>Dialectal Indicators<\/jats:italic>\n                    . 
Our method leverages unlabeled data to construct a matrix that captures dialectal tokens frequently co-occurring in similar contexts. This matrix provides dialect-specific representations, which are integrated with PLM outputs to enhance ADI performance. We evaluate our approach on multiple ADI and related datasets. Results show that our method significantly improves PLM performance over direct fine-tuning, achieving gains of up to 36.2% in accuracy and 11.52% in macro-F1-score.\n                  <\/jats:p>","DOI":"10.7717\/peerj-cs.3127","type":"journal-article","created":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T08:00:23Z","timestamp":1761206423000},"page":"e3127","source":"Crossref","is-referenced-by-count":0,"title":["Towards boosting unlabeled text corpora for Arabic dialect identification"],"prefix":"10.7717","volume":"11","author":[{"given":"Mohammed","family":"Abdelmajeed","sequence":"first","affiliation":[{"name":"School of Computer Science, Northwestern Polytechnical University, Xi\u2019an, Shaanxi, China"}]},{"given":"Zheng","family":"Jiangbin","sequence":"additional","affiliation":[{"name":"School of Computer Science, Northwestern Polytechnical University, Xi\u2019an, Shaanxi, China"}]},{"given":"Murtadha","family":"Ahmed","sequence":"additional","affiliation":[{"name":"School of Computer Science, Northwestern Polytechnical University, Xi\u2019an, Shaanxi, China"}]},{"given":"Mohammed","family":"Abaker","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Applied College, King Khalid University, Muhayil, Asir, Saudi Arabia"}]}],"member":"4443","published-online":{"date-parts":[[2025,10,23]]},"reference":[{"key":"10.7717\/peerj-cs.3127\/ref-1","volume-title":"QADI: Arabic dialect identification in the wild","author":"Abdelali","year":"2021"},{"key":"10.7717\/peerj-cs.3127\/ref-2","doi-asserted-by":"publisher","first-page":"3471","DOI":"10.32604\/cmc.2025.059870","article-title":"Leveraging unlabeled corpus for 
Arabic dialect identification","volume":"83","author":"Abdelmajeed","year":"2025","journal-title":"Computers, Materials and Continua"},{"key":"10.7717\/peerj-cs.3127\/ref-3","first-page":"7088","volume-title":"ARBERT & MARBERT: deep bidirectional transformers for Arabic","author":"Abdul-Mageed","year":"2021"},{"key":"10.7717\/peerj-cs.3127\/ref-4","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2407.04910","article-title":"NADI 2024: the fifth nuanced Arabic dialect identification shared task","author":"Abdul-Mageed","year":"2024"},{"key":"10.7717\/peerj-cs.3127\/ref-5","volume-title":"NADI 2020: the first nuanced Arabic dialect identification shared task","author":"Abdul-Mageed","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-6","first-page":"244","volume-title":"NADI 2021: the second nuanced Arabic dialect identification shared task","author":"Abdul-Mageed","year":"2021"},{"key":"10.7717\/peerj-cs.3127\/ref-7","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2303.08774","article-title":"GPT-4 technical report","author":"Achiam","year":"2023"},{"key":"10.7717\/peerj-cs.3127\/ref-8","first-page":"153","volume-title":"AlclaM: Arabic dialect language model","author":"Ahmed","year":"2024"},{"key":"10.7717\/peerj-cs.3127\/ref-9","first-page":"488","volume-title":"DNN-driven gradual machine learning for aspect-term sentiment analysis","author":"Ahmed","year":"2021"},{"key":"10.7717\/peerj-cs.3127\/ref-10","first-page":"1","article-title":"A comprehensive framework and empirical analysis for evaluating large language models in Arabic dialect identification","author":"Al-Azani","year":"2024"},{"key":"10.7717\/peerj-cs.3127\/ref-11","volume-title":"LILI: a simple language independent approach for language identification, organizing committee","author":"Al-Badrashiny","year":"2016"},{"key":"10.7717\/peerj-cs.3127\/ref-12","first-page":"38","article-title":"SADSLyC: a corpus for Saudi Arabian multi-dialect identification through song 
lyrics","author":"Alahmari","year":"2025"},{"key":"10.7717\/peerj-cs.3127\/ref-13","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2011.00578","article-title":"ASAD: a twitter-based benchmark Arabic sentiment analysis dataset","author":"Alharbi","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-14","first-page":"260","volume-title":"Adapting MARBERT for improved Arabic dialect identification: submission to the NADI, 2021 shared task","author":"AlKhamissi","year":"2021"},{"key":"10.7717\/peerj-cs.3127\/ref-15","first-page":"185","volume-title":"Assessing the linguistic knowledge in Arabic pre-trained language models using minimal pairs","author":"Alrajhi","year":"2022"},{"key":"10.7717\/peerj-cs.3127\/ref-16","first-page":"137","volume-title":"Arabic synonym BERT-based adversarial examples for text classification","author":"Alshahrani","year":"2024"},{"key":"10.7717\/peerj-cs.3127\/ref-17","first-page":"464","volume-title":"Arabic dialect identification using machine learning and transformer-based models: submission to the NADI, 2022 shared task","author":"AlShenaifi","year":"2022"},{"issue":"17","key":"10.7717\/peerj-cs.3127\/ref-18","doi-asserted-by":"publisher","first-page":"e36280","DOI":"10.1016\/j.heliyon.2024.e36280","article-title":"Arabic dialect identification in social media: a hybrid model with transformer models and BiLSTM","volume":"10","author":"Alsuwaylimi","year":"2024","journal-title":"Heliyon"},{"key":"10.7717\/peerj-cs.3127\/ref-19","first-page":"494","volume-title":"LABR: a large scale Arabic book reviews dataset","author":"Aly","year":"2013"},{"key":"10.7717\/peerj-cs.3127\/ref-20","first-page":"9","volume-title":"AraBERT: transformer-based model for Arabic language understanding","author":"Antoun","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-21","first-page":"1814","article-title":"ANERsys 2.0: conquering the NER task for the Arabic language by combining the maximum entropy with POS-tag 
information","author":"Benajiba","year":"2007"},{"key":"10.7717\/peerj-cs.3127\/ref-22","first-page":"53","volume-title":"Spoken Arabic dialect identification using phonotactic modeling","author":"Biadsy","year":"2009"},{"key":"10.7717\/peerj-cs.3127\/ref-23","volume-title":"The MADAR Arabic dialect corpus and lexicon","author":"Bouamor","year":"2018"},{"key":"10.7717\/peerj-cs.3127\/ref-24","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2005.14165","article-title":"Language models are few-shot learners","author":"Brown","year":"2020"},{"issue":"1","key":"10.7717\/peerj-cs.3127\/ref-25","doi-asserted-by":"publisher","first-page":"1","DOI":"10.5555\/3648699.3648939","article-title":"PaLM: scaling language modeling with pathways","volume":"24","author":"Chowdhery","year":"2023","journal-title":"The Journal of Machine Learning Researh"},{"key":"10.7717\/peerj-cs.3127\/ref-26","first-page":"8440","volume-title":"Unsupervised cross-lingual representation learning at scale","author":"Conneau","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-27","doi-asserted-by":"publisher","first-page":"301","DOI":"10.1007\/978-3-030-61534-5_27","article-title":"Pre-training polish transformer-based language models at scale","volume-title":"Artificial Intelligence and Soft Computing","author":"Dadas","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-28","first-page":"318","volume-title":"CODACT: towards identifying orthographic variants in dialectal Arabic","author":"Dasigi","year":"2011"},{"key":"10.7717\/peerj-cs.3127\/ref-29","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1912.09582","article-title":"BERTje: a dutch BERT mode","author":"de Vries","year":"2019"},{"key":"10.7717\/peerj-cs.3127\/ref-30","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423","volume-title":"BERT: pre-training of deep bidirectional transformers for language 
understanding","author":"Devlin","year":"2019"},{"key":"10.7717\/peerj-cs.3127\/ref-31","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1611.04033","article-title":"1.5 billion words Arabic corpus","author":"El-Khair","year":"2016"},{"key":"10.7717\/peerj-cs.3127\/ref-32","first-page":"263","volume-title":"Deep models for Arabic dialect identification on benchmarked data","author":"Elaraby","year":"2018"},{"key":"10.7717\/peerj-cs.3127\/ref-33","first-page":"456","volume-title":"Sentence level dialect identification in Arabic","author":"Elfardy","year":"2013"},{"key":"10.7717\/peerj-cs.3127\/ref-34","first-page":"642","volume-title":"NLPeople at NADI, 2023 shared task: Arabic dialect identification with augmented context and multi-stage tuning","author":"Elkaref","year":"2023"},{"key":"10.7717\/peerj-cs.3127\/ref-35","article-title":"An Arabic speech-act and sentiment Corpus of Tweets","author":"Elmadany","year":"2018"},{"key":"10.7717\/peerj-cs.3127\/ref-36","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1007\/978-3-319-67056-0_3","article-title":"Hotel Arabic-reviews dataset construction for sentiment analysis applications","volume-title":"Intelligent Natural Language Processing: Trends and Applications","volume":"740","author":"Elnagar","year":"2018"},{"key":"10.7717\/peerj-cs.3127\/ref-37","first-page":"878","volume-title":"Language-agnostic BERT sentence embedding","author":"Feng","year":"2022"},{"key":"10.7717\/peerj-cs.3127\/ref-38","article-title":"Unified guidelines and resources for Arabic dialect orthography","author":"Habash","year":"2018"},{"key":"10.7717\/peerj-cs.3127\/ref-39","first-page":"49","article-title":"Guidelines for annotation of Arabic dialectness","author":"Habash","year":"2008"},{"key":"10.7717\/peerj-cs.3127\/ref-40","article-title":"Creating parallel Arabic dialect corpus: pitfalls to avoid","author":"Harrat","year":"2017"},{"key":"10.7717\/peerj-cs.3127\/ref-41","first-page":"388","volume-title":"An unsupervised neural 
attention model for aspect extraction","author":"He","year":"2017"},{"key":"10.7717\/peerj-cs.3127\/ref-42","first-page":"92","volume-title":"The interplay of variant, size, and task type in Arabic pre-trained language models","author":"Inoue","year":"2021"},{"key":"10.7717\/peerj-cs.3127\/ref-43","first-page":"1708","volume-title":"Morphosyntactic tagging with pre-trained language models for Arabic and its dialects","author":"Inoue","year":"2022"},{"key":"10.7717\/peerj-cs.3127\/ref-44","first-page":"1534","volume-title":"Feuding families and former friends: unsupervised learning for dynamic fictional relationships","author":"Iyyer","year":"2016"},{"key":"10.7717\/peerj-cs.3127\/ref-45","first-page":"5436","volume-title":"Standardisation of dialect comments in social networks in view of sentiment analysis: case of Tunisian dialect","author":"Kchaou","year":"2022"},{"key":"10.7717\/peerj-cs.3127\/ref-46","first-page":"2","article-title":"BERT: pre-training of deep bidirectional transformers for language understanding","author":"Kenton","year":"2019"},{"key":"10.7717\/peerj-cs.3127\/ref-47","first-page":"42","volume-title":"SemEval-2016 task 7: determining sentiment intensity of English and Arabic phrases","author":"Kiritchenko","year":"2016"},{"key":"10.7717\/peerj-cs.3127\/ref-48","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1909.11942","article-title":"ALBERT: a lite BERT for self-supervised learning of language representations","author":"Lan","year":"2019"},{"key":"10.7717\/peerj-cs.3127\/ref-49","first-page":"2479","volume-title":"FlauBERT: unsupervised language model pre-training for French","author":"Le","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-50","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1907.11692","article-title":"RoBERTa: a robustly optimized BERT pretraining 
approach","author":"Liu","year":"2019"},{"key":"10.7717\/peerj-cs.3127\/ref-51","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1007\/978-981-10-0515-2_3","article-title":"Arabic dialect identification using a parallel multidialectal corpus","volume-title":"Computational Linguistics","author":"Malmasi","year":"2016"},{"key":"10.7717\/peerj-cs.3127\/ref-52","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2007.01658","article-title":"Playing with words at the national library of Sweden\u2014making a Swedish BERT","author":"Malmsten","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-53","first-page":"7203","volume-title":"CamemBERT: a tasty French language model","author":"Martin","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-54","first-page":"108","volume-title":"Neural Arabic question answering","author":"Mozannar","year":"2019"},{"key":"10.7717\/peerj-cs.3127\/ref-55","first-page":"48","volume-title":"Overview of OSACT4 Arabic offensive language detection shared task","author":"Mubarak","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-56","first-page":"136","volume-title":"Adult content detection on Arabic Twitter: analysis and experiments","author":"Mubarak","year":"2021"},{"key":"10.7717\/peerj-cs.3127\/ref-57","first-page":"2515","volume-title":"ASTD: Arabic sentiment tweets dataset","author":"Nabil","year":"2015"},{"key":"10.7717\/peerj-cs.3127\/ref-58","first-page":"1037","volume-title":"PhoBERT: pre-trained language models for vietnamese","author":"Nguyen","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-59","first-page":"1","volume-title":"WikiBERT models: deep transfer learning for many languages","author":"Pyysalo","year":"2021"},{"key":"10.7717\/peerj-cs.3127\/ref-60","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer. 
21: Article 140","author":"Raffel","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-61","first-page":"3982","volume-title":"Sentence-BERT: sentence embeddings using siamese BERT-networks","author":"Reimers","year":"2019"},{"key":"10.7717\/peerj-cs.3127\/ref-62","first-page":"2054","volume-title":"KUISAIL at SemEval-2020 task 12: BERT-CNN for offensive speech identification in social media","author":"Safaya","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-63","first-page":"1332","volume-title":"Fine-grained Arabic dialect identification","author":"Salameh","year":"2018"},{"key":"10.7717\/peerj-cs.3127\/ref-64","doi-asserted-by":"publisher","first-page":"324","DOI":"10.1007\/978-3-319-77116-8_24","article-title":"Character-level dialect identification in Arabic using long short-term memory","volume-title":"Computational Linguistics and Intelligent Text Processing","author":"Sayadi","year":"2018"},{"key":"10.7717\/peerj-cs.3127\/ref-65","first-page":"496","volume-title":"AraProp at WANLP 2022 shared task: leveraging pre-trained language models for Arabic propaganda detection","author":"Singh","year":"2022"},{"key":"10.7717\/peerj-cs.3127\/ref-66","doi-asserted-by":"publisher","first-page":"127063","DOI":"10.1016\/j.neucom.2023.127063","article-title":"RoFormer: enhanced transformer with rotary position embedding","volume":"568","author":"Su","year":"2024","journal-title":"Neurocomputing"},{"key":"10.7717\/peerj-cs.3127\/ref-67","doi-asserted-by":"publisher","first-page":"201","DOI":"10.1007\/978-3-319-73500-9_15","article-title":"Automatic identification of moroccan colloquial Arabic","volume-title":"Arabic Language Processing: From Theory to Practice","author":"Tachicart","year":"2018"},{"key":"10.7717\/peerj-cs.3127\/ref-68","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2007.05612","article-title":"Multi-dialect Arabic BERT for country-level dialect 
identification","author":"Talafha","year":"2020"},{"key":"10.7717\/peerj-cs.3127\/ref-69","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2307.09288","article-title":"Llama 2: open foundation and fine-tuned chat models","author":"Touvron","year":"2023"},{"key":"10.7717\/peerj-cs.3127\/ref-70","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1912.07076","article-title":"Multilingual is not enough: BERT for Finnish","author":"Virtanen","year":"2019"},{"key":"10.7717\/peerj-cs.3127\/ref-71","first-page":"483","volume-title":"mT5: a massively multilingual pre-trained text-to-text transformer","author":"Xue","year":"2021"},{"issue":"6","key":"10.7717\/peerj-cs.3127\/ref-72","doi-asserted-by":"publisher","first-page":"316","DOI":"10.3390\/info15060316","article-title":"Enhancing Arabic dialect detection on social media: a hybrid model with an attention mechanism","volume":"15","author":"Yafooz","year":"2024","journal-title":"Information"},{"key":"10.7717\/peerj-cs.3127\/ref-73","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2022.aacl-main.16","volume-title":"Arabic dialect identification with a few labeled examples using generative adversarial networks","author":"Yusuf","year":"2022"},{"issue":"1","key":"10.7717\/peerj-cs.3127\/ref-74","doi-asserted-by":"publisher","first-page":"171","DOI":"10.1162\/COLI_a_00169","article-title":"Arabic dialect identification","volume":"40","author":"Zaidan","year":"2014","journal-title":"Computational Linguistics"},{"key":"10.7717\/peerj-cs.3127\/ref-75","first-page":"175","volume-title":"OSIAN: open source international Arabic News corpus\u2014preparation and integration into the CLARIN-infrastructure","author":"Zeroual","year":"2019"}],"container-title":["PeerJ Computer 
Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/peerj.com\/articles\/cs-3127.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/peerj.com\/articles\/cs-3127.xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/peerj.com\/articles\/cs-3127.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/peerj.com\/articles\/cs-3127.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T08:00:34Z","timestamp":1761206434000},"score":1,"resource":{"primary":{"URL":"https:\/\/peerj.com\/articles\/cs-3127"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,23]]},"references-count":75,"alternative-id":["10.7717\/peerj-cs.3127"],"URL":"https:\/\/doi.org\/10.7717\/peerj-cs.3127","archive":["CLOCKSS","LOCKSS","Portico"],"relation":{},"ISSN":["2376-5992"],"issn-type":[{"value":"2376-5992","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,23]]},"article-number":"e3127"}}