{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,17]],"date-time":"2025-12-17T08:47:32Z","timestamp":1765961252082,"version":"3.40.5"},"reference-count":38,"publisher":"Cambridge University Press (CUP)","issue":"6","license":[{"start":{"date-parts":[[2020,4,14]],"date-time":"2020-04-14T00:00:00Z","timestamp":1586822400000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2020,11]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This work introduces robust multi-dialectal part of speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers of convolutional and recurrent networks with a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. Also, we develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per dialect results ranging between 90.2% and 95.4%. We obtained the results using a train\/dev\/test split of 70\/10\/20 for a data set of 350 tweets per dialect.<\/jats:p>","DOI":"10.1017\/s1351324920000078","type":"journal-article","created":{"date-parts":[[2020,4,14]],"date-time":"2020-04-14T10:11:42Z","timestamp":1586859102000},"page":"677-690","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":4,"title":["Effective multi-dialectal arabic POS tagging"],"prefix":"10.1017","volume":"26","author":[{"given":"Kareem","family":"Darwish","sequence":"first","affiliation":[]},{"given":"Mohammed","family":"Attia","sequence":"additional","affiliation":[]},{"given":"Hamdy","family":"Mubarak","sequence":"additional","affiliation":[]},{"given":"Younes","family":"Samih","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4160-8181","authenticated-orcid":false,"given":"Ahmed","family":"Abdelali","sequence":"additional","affiliation":[]},{"given":"Llu\u00eds","family":"M\u00e0rquez","sequence":"additional","affiliation":[]},{"given":"Mohamed","family":"Eldesouki","sequence":"additional","affiliation":[]},{"given":"Laura","family":"Kallmeyer","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2020,4,14]]},"reference":[{"key":"S1351324920000078_ref28","doi-asserted-by":"crossref","first-page":"1064","DOI":"10.18653\/v1\/P16-1101","article-title":"End-to-end sequence labeling via bi-directional lstm-cnns-crf","author":"Ma","year":"2016","journal-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)"},{"key":"S1351324920000078_ref14","doi-asserted-by":"publisher","DOI":"10.3115\/1621787.1621798"},{"key":"S1351324920000078_ref38","unstructured":"Zaidan, O.F. and Callison-Burch, C. (2011). The arabic online commentary dataset: An annotated dataset of informal arabic with high dialectal content. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers-Volume 2. Association for Computational Linguistics, pp. 37\u201341."},{"key":"S1351324920000078_ref7","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00104"},{"key":"S1351324920000078_ref25","first-page":"282","article-title":"Conditional random fields: Probabilistic models for segmenting and labeling sequence data","author":"Lafferty","year":"2001","journal-title":"Proceedings of the 18th International Conference on Machine Learning 2001 (ICML 2001)"},{"key":"S1351324920000078_ref36","doi-asserted-by":"publisher","DOI":"10.1109\/78.650093"},{"key":"S1351324920000078_ref5","first-page":"467","article-title":"Class-based n-gram models of natural language","volume":"18","author":"Brown","year":"1992","journal-title":"Computational Linguistics"},{"key":"S1351324920000078_ref32","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1035"},{"key":"S1351324920000078_ref20","unstructured":"Hinton, G.E. , Srivastava, N. , Krizhevsky, A. , Sutskever, I. and Salakhutdinov, R.R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580."},{"key":"S1351324920000078_ref31","unstructured":"Pasha, A. , Al-Badrashiny, M. , Diab, M.T. , El Kholy, A. , Eskander, R. , Habash, N. , Pooleery, M. , Rambow, O. and Roth, R. (2014). MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 1094\u20131101."},{"key":"S1351324920000078_ref27","unstructured":"Liang, P. (2005). Semi-Supervised Learning for Natural Language . PhD Thesis, Massachusetts Institute of Technology."},{"key":"S1351324920000078_ref17","unstructured":"Graja, M. , Jaoua, M. and Hadrich Belguith, L. (2010). Lexical study of a spoken dialogue corpus in tunisian dialect. In The International Arab Conference on Information Technology (ACIT), Benghazi\u2013Libya."},{"key":"S1351324920000078_ref29","unstructured":"Malmasi, S. , Refaee, E. and Dras, M. (2015). Arabic dialect identification using a parallel multidialectal corpus. In International Conference of the Pacific Association for Computational Linguistics. Springer, pp. 35\u201353."},{"key":"S1351324920000078_ref12","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1154"},{"key":"S1351324920000078_ref34","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K17-1043"},{"key":"S1351324920000078_ref19","unstructured":"Habash, N. , Roth, R. , Rambow, O. , Eskander, R. and Tomeh, N. (2013). Morphological analysis and disambiguation for dialectal arabic. In Hlt-Naacl, pp. 426\u2013432."},{"key":"S1351324920000078_ref35","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W15-3904"},{"key":"S1351324920000078_ref26","unstructured":"Lample, G. , Ballesteros, M. , Subramanian, S. , Kawakami, K. and Dyer, C. (2016). Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360."},{"key":"S1351324920000078_ref3","unstructured":"Bojanowski, P. , Grave, E. , Joulin, A. and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606."},{"key":"S1351324920000078_ref33","unstructured":"Ryan, R. , Rambow, O. , Habash, N. , Diab, M. and Rudin, C. (2008). Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of the Conference of American Association for Computational Linguistics (ACL08)."},{"key":"S1351324920000078_ref16","unstructured":"Gimpel, K. , Schneider, N. , O\u2019Connor, B. , Das, D. , Mills, D. , Eisenstein, J. , Heilman, M. , Yogatama, D. , Flanigan, J. and Smith, N.A. (2011). Part-of-speech tagging for twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers-Volume 2. Association for Computational Linguistics, pp. 42\u201347."},{"key":"S1351324920000078_ref24","first-page":"35","article-title":"A morphological analyzer for gulf arabic verbs","author":"Khalifa","year":"2017","journal-title":"In WANLP 2017 (co-located with EACL 2017)"},{"key":"S1351324920000078_ref13","unstructured":"Derczynski, L. , Ritter, A. , Clark, S. and Bontcheva, K. (2013). Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In RANLP, pp. 198\u2013206."},{"key":"S1351324920000078_ref11","unstructured":"Darwish, K. , Mubarak, H. , Abdelali, A. , Eldesouki, M. , Samih, Y. , Alharbi, R. , Attia, M. , Magdy, W. and Kallmeyer, L. (2018). Multi-dialect arabic POS tagging: A CRF approach. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7\u201312, 2018."},{"first-page":"402","year":"2000","author":"Caruana","key":"S1351324920000078_ref6"},{"key":"S1351324920000078_ref30","unstructured":"Owoputi, O. , O\u2019Connor, B. , Dyer, C. , Gimpel, K. , Schneider, N. and Smith, N.A. (2013). Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL-HLT 2013. Association for Computational Linguistics, pp. 380\u2013390."},{"key":"S1351324920000078_ref10","first-page":"130","article-title":"Arabic pos tagging: Don\u2019t abandon feature engineering just yet","author":"Darwish","year":"2017","journal-title":"WANLP 2017 (co-located with EACL 2017)"},{"key":"S1351324920000078_ref9","unstructured":"Cotterell, R. and Callison-Burch, C. (2014). A multi-dialect, multi-genre corpus of informal written arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 241\u2013245."},{"key":"S1351324920000078_ref4","unstructured":"Bouamor, H. , Habash, N. and Oflazer, K. (2014). A multidialectal parallel corpus of arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 1240\u20131245."},{"key":"S1351324920000078_ref18","unstructured":"Habash, N. , Diab, M.T. and Rambow, O. (2012). Conventional orthography for dialectal arabic. In LREC, pp. 711\u2013718."},{"key":"S1351324920000078_ref23","unstructured":"Jurafsky, D. and Martin, J.H. (2009). Speech and Language Processing, 2nd Edn. New Jersey: Pearson Prentice Hall. ISBN 978-0-13-187321-6."},{"key":"S1351324920000078_ref22","unstructured":"Huang, Z. , Xu, W. and Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs\/1508.01991."},{"key":"S1351324920000078_ref15","unstructured":"Elfardy, H. and Diab, M.T. (2013). Sentence level dialect identification in arabic. In ACL (2), pp. 456\u2013461."},{"key":"S1351324920000078_ref1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-3003"},{"key":"S1351324920000078_ref2","unstructured":"Al-Sabbagh, R. and Girju, R. (2010). Mining the web for the induction of a dialectical arabic lexicon. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), pp. 288\u2013293."},{"key":"S1351324920000078_ref21","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"S1351324920000078_ref8","first-page":"2493","article-title":"Natural language processing (almost) from scratch","volume":"12","author":"Collobert","year":"2011","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324920000078_ref37","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W15-1511"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324920000078","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,11,20]],"date-time":"2020-11-20T13:29:58Z","timestamp":1605878998000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324920000078\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,4,14]]},"references-count":38,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2020,11]]}},"alternative-id":["S1351324920000078"],"URL":"https:\/\/doi.org\/10.1017\/s1351324920000078","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"type":"print","value":"1351-3249"},{"type":"electronic","value":"1469-8110"}],"subject":[],"published":{"date-parts":[[2020,4,14]]},"assertion":[{"value":"\u00a9 Cambridge University Press 2020","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}