{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,4,19]],"date-time":"2024-04-19T06:24:14Z","timestamp":1713507854799},"reference-count":39,"publisher":"Cambridge University Press (CUP)","issue":"6","license":[{"start":{"date-parts":[[2020,5,5]],"date-time":"2020-05-05T00:00:00Z","timestamp":1588636800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2020,11]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The occurrence of code-switching in online communication, when a writer switches among multiple languages, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To answer the challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching (PPM) compression-based classifier, implemented in Tawa, and a traditional machine learning classifier sequential minimal optimization (SMO), implemented in Waikato Environment for Knowledge Analysis, working specifically on Arabic text taken from Facebook. Three experiments were conducted in order to: (1) detect code-switching among the Egyptian dialect and English; (2) detect code-switching among the Egyptian dialect, the Saudi dialect, and English; and (3) detect code-switching among the Egyptian dialect, the Saudi dialect, Modern Standard Arabic (MSA), and English. Our experiments showed that PPM achieved a higher accuracy rate than SMO with 99.8% versus 97.5% in the first experiment and 97.8% versus 80.7% in the second. In the third experiment, PPM achieved a lower accuracy rate than SMO with 53.2% versus 60.2%. 
Code-switching between Egyptian Arabic and English text is easiest to detect because Arabic and English are generally written in different character sets. It is more difficult to distinguish between Arabic dialects and MSA as these use the same character set, and most users of Arabic, especially Saudis and Egyptians, frequently mix MSA with their dialects. We also note that the MSA corpus used for training the MSA model may not represent MSA Facebook text well, being built from news websites. This paper also describes in detail the new Arabic corpora created for this research and our experiments.<\/jats:p>","DOI":"10.1017\/s135132492000011x","type":"journal-article","created":{"date-parts":[[2020,5,5]],"date-time":"2020-05-05T08:46:38Z","timestamp":1588668398000},"page":"663-676","update-policy":"http:\/\/dx.doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":4,"title":["Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study"],"prefix":"10.1017","volume":"26","author":[{"given":"Taghreed","family":"Tarmom","sequence":"first","affiliation":[]},{"given":"William","family":"Teahan","sequence":"additional","affiliation":[]},{"given":"Eric","family":"Atwell","sequence":"additional","affiliation":[]},{"given":"Mohammad Ammar","family":"Alsalka","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2020,5,5]]},"reference":[{"key":"S135132492000011X_ref31","unstructured":"Oco, N. , Wong, J. , Ilao, J. and Roxas, R. (2013). Detecting code-switches using word bigram frequency count. In 9th National Natural Language Processing Research Symposium, Quezon City, Philippines, March, Vol. 
7."},{"key":"S135132492000011X_ref35","volume-title":"Changing English","author":"Swann","year":"2007"},{"key":"S135132492000011X_ref25","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1222"},{"key":"S135132492000011X_ref1","unstructured":"Al-Moghrabi, A.A. (2015). An Examination of Reading Strategies in Arabic (L1) and English (L2) Used by Saudi Female Public High School Adolescents, Doctoral Dissertation, The British University in Dubai (BUiD). Available at https:\/\/bspace.buid.ac.ae\/handle\/1234\/776."},{"key":"S135132492000011X_ref15","volume-title":"Life with Two Languages: An Introduction to Bilingualism","author":"Grosjean","year":"1982"},{"key":"S135132492000011X_ref2","unstructured":"Ali, M. (2018). Character level convolutional neural network for German dialect identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 172\u2013177."},{"key":"S135132492000011X_ref30","unstructured":"Oco, N. and Roxas, R.E. (2012). Pattern matching refinements to dictionary-based code-switching point detection. In Pacific Asia Conference on Language, Information and Computation (PACLIC), 7 November 2012, Bali, Indonesia, pp. 229\u2013236."},{"key":"S135132492000011X_ref29","unstructured":"Nguyen, D., Dogru\u00f6z, A.S., Ros\u00e9, C.P. and de Jong, F. (2016). Computational sociolinguistics: a survey. Computational Linguistics. Available at https:\/\/arXiv:1508.07544v2."},{"key":"S135132492000011X_ref11","volume-title":"Ethnologue: Languages of the World","author":"Eberhard","year":"2017"},{"key":"S135132492000011X_ref14","unstructured":"Global Media Insight Website (2019). Saudi Arabia Social Media Statistics 2018 \u2013 Official GMI Blog. Global Media Insight. 
Available at https:\/\/www.globalmediainsight.com\/blog\/saudi-arabia-social-media-statistics\/ (accessed 21 June 2019)."},{"key":"S135132492000011X_ref26","doi-asserted-by":"publisher","DOI":"10.1145\/1458082.1458186"},{"key":"S135132492000011X_ref20","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3906"},{"key":"S135132492000011X_ref13","unstructured":"Elfardy, H. and Diab, M. (2012). Token level identification of linguistic code switching. In Proceedings of the International Conference on Computational Linguistics (COLING): Posters, pp. 287\u2013296."},{"key":"S135132492000011X_ref39","first-page":"37","article-title":"The Arabic online commentary dataset: an annotated dataset of informal Arabic with high dialectal content.","author":"Zaidan","year":"2011","journal-title":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers"},{"key":"S135132492000011X_ref33","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4899-7687-1"},{"key":"S135132492000011X_ref27","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511620867"},{"key":"S135132492000011X_ref3","unstructured":"Alkahtani, S. (2015). Building and Verifying Parallel Corpora Between Arabic and English. Doctoral Dissertation, Prifysgol Bangor University. Available at http:\/\/e.bangor.ac.uk\/6546\/1\/saad_alkahtani_dissertation.pdf."},{"key":"S135132492000011X_ref22","unstructured":"Lignos, C. and Marcus, M. (2013). Toward web-scale analysis of codeswitching. In 87th Annual Meeting of the Linguistic Society of America, 3 January 2013, Boston."},{"key":"S135132492000011X_ref19","first-page":"99","author":"Johnson","year":"2013"},{"key":"S135132492000011X_ref9","doi-asserted-by":"publisher","DOI":"10.1109\/TCOM.1984.1096090"},{"key":"S135132492000011X_ref36","unstructured":"Tarmom, T. (2018). 
Designing and Evaluating a Compression-Based Approach to the Automatic Detection of Code-switching in Arabic Text, MSc Dissertation, Bangor University."},{"key":"S135132492000011X_ref16","doi-asserted-by":"publisher","DOI":"10.1145\/2600428.2609622"},{"key":"S135132492000011X_ref34","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3907"},{"key":"S135132492000011X_ref5","unstructured":"Alshutayri, A., Atwell, E., Alosaimy, A., Dickins, J., Ingleby, M. and Watson, J. (2016). Arabic language WEKA-based dialect classifier for Arabic automatic speech recognition transcripts. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pp. 204\u2013211."},{"key":"S135132492000011X_ref12","doi-asserted-by":"publisher","DOI":"10.1145\/2659893"},{"key":"S135132492000011X_ref24","unstructured":"Malmasi, S. and Zampieri, M. (2016). Arabic dialect identification in speech transcripts. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pp. 106\u2013113."},{"key":"S135132492000011X_ref10","volume-title":"A Dictionary of Marketing","author":"Doyle","year":"2011"},{"key":"S135132492000011X_ref32","unstructured":"Platt, J. (1998). Sequential minimal optimization: a fast algorithm for training support vector machines. In Technical Report MSR-TR-98-14. 
Microsoft Research."},{"key":"S135132492000011X_ref37","doi-asserted-by":"publisher","DOI":"10.3390\/info9120294"},{"key":"S135132492000011X_ref38","doi-asserted-by":"publisher","DOI":"10.1155\/2013\/898714"},{"key":"S135132492000011X_ref21","doi-asserted-by":"crossref","first-page":"241","DOI":"10.1017\/S1351324908004968","article-title":"Adapting SVM for data sparseness and imbalance: a case study in information extraction","volume":"15","author":"Li","year":"2009","journal-title":"Natural Language Engineering"},{"key":"S135132492000011X_ref4","first-page":"421","article-title":"Classifying and segmenting classical and modern standard Arabic using minimum cross-entropy","volume":"8","author":"Alkhazi","year":"2017","journal-title":"International Journal of Advanced Computer Science and Applications"},{"key":"S135132492000011X_ref17","unstructured":"Hale, S.A. (2014). Global connectivity and multilinguals in the Twitter network. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 26 April 2014, Toronto, Canada. ACM, pp. 
833\u2013842."},{"key":"S135132492000011X_ref6","first-page":"659","author":"Androutsopoulos","year":"2013"},{"key":"S135132492000011X_ref23","first-page":"35","volume-title":"Conference of the Pacific Association for Computational Linguistics","author":"Malmasi","year":"2015"},{"key":"S135132492000011X_ref28","volume-title":"Multiple Voices: An Introduction to Bilingualism.","author":"Myers-Scotton","year":"2006"},{"key":"S135132492000011X_ref18","doi-asserted-by":"publisher","DOI":"10.1145\/1656274.1656278"},{"key":"S135132492000011X_ref8","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/40.2_and_3.67"},{"key":"S135132492000011X_ref7","doi-asserted-by":"publisher","DOI":"10.7763\/IJCCE.2014.V3.316"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S135132492000011X","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,11,20]],"date-time":"2020-11-20T13:32:14Z","timestamp":1605879134000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S135132492000011X\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,5]]},"references-count":39,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2020,11]]}},"alternative-id":["S135132492000011X"],"URL":"https:\/\/doi.org\/10.1017\/s135132492000011x","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,5,5]]},"assertion":[{"value":"\u00a9 Cambridge University Press 2020","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}