{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T14:40:27Z","timestamp":1778164827862,"version":"3.51.4"},"reference-count":30,"publisher":"Association for Computing Machinery (ACM)","issue":"8","license":[{"start":{"date-parts":[[2023,8,23]],"date-time":"2023-08-23T00:00:00Z","timestamp":1692748800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2023,8,31]]},"abstract":"<jats:p>In this article, a complete methodology of a corpus realization of authentic Short Message Service (SMS) from Algerian dialect and which are transcribed in Latin characters or symbols is presented. A linguistic material constituted by 6,000 SMS coming from the different geographical regions of Algeria (Middle, East, and West) corresponding to 42 administrative and geographical departments, have been collected. The coexistence of several dialects through these three regions simultaneously has obliged us to consider and operate a classification of the data for each dialect. This data classification has yielded three extracted regional dialectic corpora, each of them covering a specific number of administrative departments. These treatments are based on the so-called Data-n-gram tokenization targeting the suppression of the stop words, the stemming and the imbalance of the classes linked to the nature of the SMS. Consequently, three text classifiers based on three linear classifiers, namely, Stochastic Gradient Descent (SGD), The Ridge Regression (RDG), and Linear Support Vector Machines, to find out the number of significant corpora to extract from the collected data. A deep analysis of the results has shown that the 5-grams data representation is more representative whereas the stop-words removal and stemming process has generated an information loss that has subsequently inferred an alteration of the recognition rate of about 2%. The emerging problem of classes imbalance has been treated by using three techniques: Random Oversampling, Synthetic Minorities Oversampling Technique (SMOTE), and Adaptive Synthetic (ADASYN). This treatment produced interesting results and enhancements; particularly, the classification by region with the oversampling process SMOTE by using the RDG technique has reached a better percentage of 55.93% whereas the classification by department with the oversampling process ADASYN associated with the SGD has only yielded a maximum score of about 17.11%. The results, which undoubtedly are in favor of the classification by region, have compelled us to create three Subdialectal regional corpora, each, covering a certain number of Algerian departments.<\/jats:p>","DOI":"10.1145\/3610522","type":"journal-article","created":{"date-parts":[[2023,7,27]],"date-time":"2023-07-27T15:41:04Z","timestamp":1690472464000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["DZ-SMS: An Authentic Corpus of Algerian SMS"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6150-8903","authenticated-orcid":false,"given":"Brahim","family":"Dahou","sequence":"first","affiliation":[{"name":"University of Science and Technology Houari Boumediene, Algeria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8719-8755","authenticated-orcid":false,"given":"Leila","family":"Falek","sequence":"additional","affiliation":[{"name":"University of Science and Technology Houari Boumediene, Algeria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8291-0862","authenticated-orcid":false,"given":"Mourad","family":"Abbas","sequence":"additional","affiliation":[{"name":"Computational Linguistics Department, CRSTDLA, Algeria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9828-3779","authenticated-orcid":false,"given":"Slimane","family":"Mekaoui","sequence":"additional","affiliation":[{"name":"University of Science and Technology Houari Boumediene, Algeria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0584-1389","authenticated-orcid":false,"given":"Mohamed","family":"Lichouri","sequence":"additional","affiliation":[{"name":"Computational Linguistics Department, CRSTDLA, Algeria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4311-9912","authenticated-orcid":false,"given":"Aicha","family":"Zitouni","sequence":"additional","affiliation":[{"name":"University of Science and Technology Houari Boumediene, Algeria"}]}],"member":"320","published-online":{"date-parts":[[2023,8,23]]},"reference":[{"key":"e_1_3_1_2_1","first-page":"6405","volume-title":"Proceedings of the 12th Language Resources and Evaluation Conference","author":"Abdallah Najla Ben","year":"2020","unstructured":"Najla Ben Abdallah, Sam\u00e9h Kchaou, and Fethi Bougares. 2020. Text and speech-based Tunisian Arabic sub-dialects identification. In Proceedings of the 12th Language Resources and Evaluation Conference. 6405\u20136411."},{"key":"e_1_3_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/COMAPP.2017.8079752"},{"key":"e_1_3_1_4_1","volume-title":"Plan et cartes des villes Alg\u00e9rienne","year":"2020","unstructured":"AlgerianMap. 2020. Plan et cartes des villes Alg\u00e9rienne. Retrieved from http:\/\/www.carte-algerie.com. Accessed June 2023."},{"key":"e_1_3_1_5_1","volume-title":"Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC\u201918)","author":"Alsarsour Israa","year":"2018","unstructured":"Israa Alsarsour, Esraa Mohamed, Reem Suwaileh, and Tamer Elsayed. 2018. DART: A large dataset of dialectal Arabic tweets. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC\u201918)."},{"key":"e_1_3_1_6_1","first-page":"229","volume-title":"Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics","author":"Baldwin Timothy","year":"2010","unstructured":"Timothy Baldwin and Marco Lui. 2010. Language identification: The long and the short of the matter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 229\u2013237."},{"key":"e_1_3_1_7_1","volume-title":"Language, Mobile Phones and Internet: A Study of SMS Texting, Email, IM and SNS Chats in Computer Mediated Communication (CMC) in Kenya","author":"Barasa Sandra Nekesa","year":"2010","unstructured":"Sandra Nekesa Barasa. 2010. Language, Mobile Phones and Internet: A Study of SMS Texting, Email, IM and SNS Chats in Computer Mediated Communication (CMC) in Kenya. Netherlands Graduate School of Linguistics."},{"key":"e_1_3_1_8_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3612"},{"key":"e_1_3_1_9_1","article-title":"SMOTE for Imbalanced Classification with Python","author":"Brownlee Jason","year":"2020","unstructured":"Jason Brownlee. 2020. SMOTE for Imbalanced Classification with Python. Machine Learning Mastery. Retrieved from https:\/\/machinelearningmastery.com\/smote-oversampling-for-imbalancedclassification. Accessed March 2022.","journal-title":"Machine Learning Mastery"},{"key":"e_1_3_1_10_1","first-page":"299","article-title":"Creating a live, public short message service corpus: The NUS SMS corpus","volume":"47","author":"Chen Tao","year":"2013","unstructured":"Tao Chen and Min-Yen Kan. 2013. Creating a live, public short message service corpus: The NUS SMS corpus. Language Resources and Evaluation 47 (2013), 299\u2013335.","journal-title":"Language Resources and Evaluation"},{"key":"e_1_3_1_11_1","unstructured":"Paul Clough and Mark Sanderson. 2013. Evaluating the Performance of Information Retrieval Systems Using Test Collections . (2013). Retrieved from https:\/\/informationr.net\/ir\/18-2\/paper582.html"},{"key":"e_1_3_1_12_1","doi-asserted-by":"publisher","DOI":"10.3917\/lang.157.0036"},{"key":"e_1_3_1_13_1","doi-asserted-by":"publisher","DOI":"10.24840\/2183-6493_006.001_0005"},{"key":"e_1_3_1_14_1","doi-asserted-by":"publisher","DOI":"10.1075\/eww.29.2.02deu"},{"key":"e_1_3_1_15_1","doi-asserted-by":"crossref","unstructured":"Martin Dillon. 1983. Introduction to modern information retrieval. Information Processing & Management 19 6 (1983) 402\u2013403.","DOI":"10.1016\/0306-4573(83)90062-6"},{"key":"e_1_3_1_16_1","volume-title":"Ethnologue: Langues Du Monde","author":"(Eds.). D. M Eberhard, Gary F. Simons, and Charles D. Fennig","year":"2021","unstructured":"D. M Eberhard, Gary F. Simons, and Charles D. Fennig (Eds.).2021. Ethnologue: Langues Du Monde (26th Ed.). SIL International."},{"key":"e_1_3_1_17_1","doi-asserted-by":"crossref","unstructured":"Emmanuel Ferragne and Fran\u00e7ois Pellegrino. 2007. Automatic dialect identification: A study of British English. In Speaker Classification II: Selected Projects . Springer 243\u2013257.","DOI":"10.1007\/978-3-540-74122-0_19"},{"key":"e_1_3_1_18_1","first-page":"81","article-title":"Emprunts lexicaux dans des dialectes Arabes Alg\u00e9riens","volume":"8","author":"Guella Noureddine","year":"2011","unstructured":"Noureddine Guella. 2011. Emprunts lexicaux dans des dialectes Arabes Alg\u00e9riens. Synergies Monde Arabe 8 (2011), 81\u201388.","journal-title":"Synergies Monde Arabe"},{"key":"e_1_3_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-016-9370-7"},{"key":"e_1_3_1_20_1","volume-title":"Communiquer par SMS: Analyse Automatique du Langage et Extraction de l\u2019information V\u00e9hicul\u00e9e","author":"Kogkitsidou Eleni","year":"2018","unstructured":"Eleni Kogkitsidou. 2018. Communiquer par SMS: Analyse Automatique du Langage et Extraction de l\u2019information V\u00e9hicul\u00e9e. Ph.D. Dissertation. Universit\u00e9 Grenoble Alpes (ComUE)."},{"key":"e_1_3_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-50417-5_39"},{"key":"e_1_3_1_22_1","volume-title":"Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC\u201918)","author":"Kwaik Kathrein Abu","year":"2018","unstructured":"Kathrein Abu Kwaik, Motaz Saad, Stergios Chatzikyriakidis, and Simon Dobnik. 2018. Shami: A corpus of Levantine Arabic dialects. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC\u201918)."},{"key":"e_1_3_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/3122009.3122026"},{"key":"e_1_3_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/1040196.1040208"},{"key":"e_1_3_1_25_1","doi-asserted-by":"publisher","DOI":"10.4000\/corpus.7"},{"key":"e_1_3_1_26_1","first-page":"1202","volume-title":"Proceedings of the 12th Language Resources and Evaluation Conference (LREC\u201920)","author":"Moudjari Leila","year":"2020","unstructured":"Leila Moudjari, Karima Akli-Astouati, and Farah Benamara. 2020. An Algerian corpus and an annotation platform for opinion and emotion analysis. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC\u201920). European Language Resources Association, 1202\u20131210."},{"key":"e_1_3_1_27_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_3_1_28_1","unstructured":"Kahwadji Qassim. 2017. Carte linguistique des parlers Arabes Alg\u00e9riens. (2017). Retrieved from https:\/\/www.academia.edu. Accessed January 2022."},{"key":"e_1_3_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-52856-0_31"},{"key":"e_1_3_1_30_1","volume-title":"Adaptive \u201cAdaline\u201d Neuron Using Chemical \u201cMemistors.\u201d","author":"Widrow Bernard","year":"1960","unstructured":"Bernard Widrow. 1960. Adaptive \u201cAdaline\u201d Neuron Using Chemical \u201cMemistors.\u201d Stanford University."},{"key":"e_1_3_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00726-010-0588-1"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3610522","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3610522","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:03Z","timestamp":1750182543000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3610522"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,23]]},"references-count":30,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2023,8,31]]}},"alternative-id":["10.1145\/3610522"],"URL":"https:\/\/doi.org\/10.1145\/3610522","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"value":"2375-4699","type":"print"},{"value":"2375-4702","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,8,23]]},"assertion":[{"value":"2021-07-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-06-24","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}