{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T07:04:09Z","timestamp":1763535849602,"version":"3.41.0"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2021,3,31]],"date-time":"2021-03-31T00:00:00Z","timestamp":1617148800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2021,3,31]]},"abstract":"<jats:p>Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: The first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of word stems and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this article, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core-word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.9% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA) and a CWER of 2.2% and a CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rates are 6.0% and 4.3% for MSA and CA, respectively. 
This highlights the effectiveness of feature engineering for such deep neural models.<\/jats:p>","DOI":"10.1145\/3434235","type":"journal-article","created":{"date-parts":[[2021,4,15]],"date-time":"2021-04-15T18:11:26Z","timestamp":1618510286000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Arabic Diacritic Recovery Using a Feature-rich biLSTM Model"],"prefix":"10.1145","volume":"20","author":[{"given":"Kareem","family":"Darwish","sequence":"first","affiliation":[{"name":"Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4160-8181","authenticated-orcid":false,"given":"Ahmed","family":"Abdelali","sequence":"additional","affiliation":[{"name":"Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar"}]},{"given":"Hamdy","family":"Mubarak","sequence":"additional","affiliation":[{"name":"Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar"}]},{"given":"Mohamed","family":"Eldesouki","sequence":"additional","affiliation":[{"name":"Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar"}]}],"member":"320","published-online":{"date-parts":[[2021,4,15]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dandelion Man\u00e9 Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Vi\u00e9gas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2015. 
TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from https:\/\/www.tensorflow.org\/.  Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dandelion Man\u00e9 Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Vi\u00e9gas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from https:\/\/www.tensorflow.org\/."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.5555\/2780081.2780156"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the European Conference on Information Retrieval. Springer, 341\u2013355","author":"Abbad Hamza","year":"2020","unstructured":"Hamza Abbad and Shengwu Xiong . 2020 . Multi-components system for automatic Arabic diacritization . In Proceedings of the European Conference on Information Retrieval. Springer, 341\u2013355 . Hamza Abbad and Shengwu Xiong. 2020. Multi-components system for automatic Arabic diacritization. In Proceedings of the European Conference on Information Retrieval. Springer, 341\u2013355."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICMLA.2019.00142"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-19578-0_15"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324913000284"},{"key":"e_1_2_1_8_1","unstructured":"Mohamed Bebah Chennoufi Amine Mazroui Azzeddine and Lakhouaja Abdelhak. 2014. Hybrid approaches for automatic vowelization of Arabic texts. arXiv:1410.2646. Retrieved from https:\/\/arxiv.org\/abs\/1410.2646.  
Mohamed Bebah Chennoufi Amine Mazroui Azzeddine and Lakhouaja Abdelhak. 2014. Hybrid approaches for automatic vowelization of Arabic texts. arXiv:1410.2646. Retrieved from https:\/\/arxiv.org\/abs\/1410.2646."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1274"},{"key":"e_1_2_1_10_1","unstructured":"Tim Buckwalter. 2002. Buckwalter {Arabic} morphological analyzer version 1.0.  Tim Buckwalter. 2002. Buckwalter {Arabic} morphological analyzer version 1.0."},{"key":"e_1_2_1_11_1","volume-title":"LDC Catalog Number LDC2004L02","author":"Buckwalter Tim","year":"2004","unstructured":"Tim Buckwalter . 2004 . Buckwalter Arabic morphological analyzer version 2.0 . LDC Catalog Number LDC2004L02 . Tim Buckwalter. 2004. Buckwalter Arabic morphological analyzer version 2.0. LDC Catalog Number LDC2004L02."},{"key":"e_1_2_1_12_1","unstructured":"Fran\u00e7ois Chollet et\u00a0al. 2015. Keras. Retrieved from https:\/\/keras.io.  Fran\u00e7ois Chollet et\u00a0al. 2015. Keras. Retrieved from https:\/\/keras.io."},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","volume":"1","author":"Darwish Kareem","year":"2013","unstructured":"Kareem Darwish . 2013 . Named entity recognition using cross-lingual resources: Arabic as an example . In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , Vol. 1 . 1558\u20131567. Kareem Darwish. 2013. Named entity recognition using cross-lingual resources: Arabic as an example. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 1558\u20131567."},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the 3rd Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT 3). 
62","author":"Darwish Kareem","year":"2018","unstructured":"Kareem Darwish , Ahmed Abdelali , Hamdy Mubarak , Younes Samih , and Mohammed Attia . 2018 . Diacritization of Moroccan and Tunisian Arabic dialects: A CRF approach . In Proceedings of the 3rd Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT 3). 62 . Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Younes Samih, and Mohammed Attia. 2018. Diacritization of Moroccan and Tunisian Arabic dialects: A CRF approach. In Proceedings of the 3rd Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT 3). 62."},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201914)","author":"Darwish Kareem","year":"2014","unstructured":"Kareem Darwish and Wei Gao . 2014 . Simple effective microblog named entity recognition: Arabic as an example . In Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201914) . 2513\u20132517. Kareem Darwish and Wei Gao. 2014. Simple effective microblog named entity recognition: Arabic as an example. In Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201914). 2513\u20132517."},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916)","author":"Darwish Kareem","year":"2016","unstructured":"Kareem Darwish and Hamdy Mubarak . 2016 . Farasa: A new fast and accurate Arabic word segmenter . In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916) . European Language Resources Association (ELRA). Kareem Darwish and Hamdy Mubarak. 2016. Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC\u201916). 
European Language Resources Association (ELRA)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1302"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/1776334.1776360"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1147\/sj.284.0600"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the Saudi 18th National Computer Conference","volume":"18","author":"Elshafei Moustafa","year":"2006","unstructured":"Moustafa Elshafei , Husni Al-Muhtaseb , and Mansour Alghamdi . 2006 . Statistical methods for automatic diacritization of Arabic text . In Proceedings of the Saudi 18th National Computer Conference , Vol. 18 . 301\u2013306. Moustafa Elshafei, Husni Al-Muhtaseb, and Mansour Alghamdi. 2006. Statistical methods for automatic diacritization of Arabic text. In Proceedings of the Saudi 18th National Computer Conference, Vol. 18. 301\u2013306."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.3115\/1118637.1118641"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.5555\/1614108.1614122"},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH\u201913)","author":"Harrat Salima","year":"2013","unstructured":"Salima Harrat , Mourad Abbas , Karima Meftouh , and Kamel Smaili . 2013 . Diacritics restoration for Arabic dialects . In Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH\u201913) . ISCA. Salima Harrat, Mourad Abbas, Karima Meftouh, and Kamel Smaili. 2013. Diacritics restoration for Arabic dialects. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH\u201913). 
ISCA."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2018.2865098"},{"key":"e_1_2_1_25_1","volume-title":"Salakhutdinov","author":"Hinton Geoffrey E.","year":"2012","unstructured":"Geoffrey E. Hinton , Nitish Srivastava , Alex Krizhevsky , Ilya Sutskever , and Ruslan R . Salakhutdinov . 2012 . Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580. Retrieved from https:\/\/arxiv.org\/abs\/1207.0580. Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580. Retrieved from https:\/\/arxiv.org\/abs\/1207.0580."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/DISA.2018.8490624"},{"key":"e_1_2_1_27_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba . 2014 . Adam : A Method for Stochastic Optimization. arxiv:cs.LG\/1412.6980. Retrieved from https:\/\/arxiv.org\/abs\/1412.6980. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arxiv:cs.LG\/1412.6980. Retrieved from https:\/\/arxiv.org\/abs\/1412.6980."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/IALP.2012.18"},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools. 102\u2013109","author":"Maamouri Mohammed","year":"2004","unstructured":"Mohammed Maamouri , Ann Bies , Tim Buckwalter , and Wigdan Mekki . 2004 . The Penn Arabic treebank: Building a large-scale annotated Arabic corpus . In Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools. 102\u2013109 . Mohammed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. 2004. The Penn Arabic treebank: Building a large-scale annotated Arabic corpus. In Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools. 
102\u2013109."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/1868771.1868773"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/647344.724003"},{"key":"e_1_2_1_32_1","volume-title":"Alkhalil MorphoSys. In Proceedings of the 7th International Computing Conference in Arabic. 66\u201373","author":"Ould Abdallahi Ould Bebah Mohamed","year":"2011","unstructured":"Bebah Mohamed Ould Abdallahi Ould , Abdelouafi Meziane , Azzeddine Mazroui , and Abdelhak Lakhouaja . 2011 . Alkhalil MorphoSys. In Proceedings of the 7th International Computing Conference in Arabic. 66\u201373 . Bebah Mohamed Ould Abdallahi Ould, Abdelouafi Meziane, Azzeddine Mazroui, and Abdelhak Lakhouaja. 2011. Alkhalil MorphoSys. In Proceedings of the 7th International Computing Conference in Arabic. 66\u201373."},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Mubarak Hamdy","year":"2019","unstructured":"Hamdy Mubarak , Ahmed Abdelali , Hassan Sajjad , Younes Samih , and Kareem Darwish . 2019 . Highly effective Arabic diacritization using sequence to sequence modeling . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2390\u20132395. Hamdy Mubarak, Ahmed Abdelali, Hassan Sajjad, Younes Samih, and Kareem Darwish. 2019. Highly effective Arabic diacritization using sequence to sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 
Association for Computational Linguistics, 2390\u20132395."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3617"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/1621787.1621802"},{"key":"e_1_2_1_36_1","doi-asserted-by":"crossref","unstructured":"Iroro Orife. 2018. Attentive sequence-to-sequence learning for diacritic restoration of Yor\u00f9b\u00e1 language text. arXiv:1804.00832. Retrieved from https:\/\/arxiv.org\/abs\/1804.00832.  Iroro Orife. 2018. Attentive sequence-to-sequence learning for diacritic restoration of Yor\u00f9b\u00e1 language text. arXiv:1804.00832. Retrieved from https:\/\/arxiv.org\/abs\/1804.00832.","DOI":"10.21437\/Interspeech.2018-42"},{"key":"e_1_2_1_37_1","doi-asserted-by":"crossref","first-page":"27","DOI":"10.21248\/jlcl.32.2017.213","article-title":"A survey and comparative study of arabic diacritization tools","volume":"32","author":"Hamed Osama","year":"2017","unstructured":"Osama Hamed and Torsten Zesch . 2017 . A survey and comparative study of arabic diacritization tools . J. Lang. Technol. Comput. Ling. 32 , 1 (2017), 27 \u2013 47 . Osama Hamed and Torsten Zesch. 2017. A survey and comparative study of arabic diacritization tools. J. Lang. Technol. Comput. Ling. 32, 1 (2017), 27\u201347.","journal-title":"J. Lang. Technol. Comput. Ling."},{"volume-title":"Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201914)","author":"Pasha Arfath","key":"e_1_2_1_38_1","unstructured":"Arfath Pasha , Mohamed Al-Badrashiny , Mona Diab , Ahmed El Kholy , Ramy Eskander , Nizar Habash , Manoj Pooleery , Owen Rambow , and Ryan M. Roth . 2014. Madamira: A fast, comprehensive tool for morphological analysis and disambiguation of arabic . In Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201914) . 
Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan M. Roth. 2014. Madamira: A fast, comprehensive tool for morphological analysis and disambiguation of arabic. In Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201914)."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5555\/2817174.2817183"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.5555\/1557690.1557721"},{"volume-title":"A hybrid approach for Arabic diacritization","author":"Said Ahmed","key":"e_1_2_1_41_1","unstructured":"Ahmed Said , Mohamed El-Sharqwi , Achraf Chalabi , and Eslam Kamal . 2013. A hybrid approach for Arabic diacritization . In Natural Language Processing and Information Systems, Elisabeth M\u00e9tais, Farid Meziane, Mohamad Saraee, Vijayan Sugumaran, and Sunil Vadera (Eds.). Springer , Berlin , 53\u201364. Ahmed Said, Mohamed El-Sharqwi, Achraf Chalabi, and Eslam Kamal. 2013. A hybrid approach for Arabic diacritization. In Natural Language Processing and Information Systems, Elisabeth M\u00e9tais, Farid Meziane, Mohamad Saraee, Vijayan Sugumaran, and Sunil Vadera (Eds.). Springer, Berlin, 53\u201364."},{"key":"e_1_2_1_42_1","volume-title":"INFuture2009: Digital Resources and Knowledge Sharing","author":"\u0160anti\u0107 Nikola","year":"2009","unstructured":"Nikola \u0160anti\u0107 , Jan \u0160najder , and Bojana Dalbelo Ba\u0161i\u0107 . 2009. Automatic diacritics restoration in Croatian texts . In INFuture2009: Digital Resources and Knowledge Sharing ( 2009 ), 309\u2013318. Nikola \u0160anti\u0107, Jan \u0160najder, and Bojana Dalbelo Ba\u0161i\u0107. 2009. Automatic diacritics restoration in Croatian texts. 
In INFuture2009: Digital Resources and Knowledge Sharing (2009), 309\u2013318."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/78.650093"},{"key":"e_1_2_1_44_1","volume-title":"Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201908)","author":"Tufi\u015f Dan","year":"2008","unstructured":"Dan Tufi\u015f and Alexandru Ceau\u015fu . 2008 . DIAC+: A professional diacritics recovering system . Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201908) . Dan Tufi\u015f and Alexandru Ceau\u015fu. 2008. DIAC+: A professional diacritics recovering system. Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201908)."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.5555\/1621804.1621822"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.3115\/1220175.1220248"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1016\/S1386-5056(02)00056-4"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information 
Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3434235","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3434235","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:35Z","timestamp":1750195475000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3434235"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,3,31]]},"references-count":46,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,3,31]]}},"alternative-id":["10.1145\/3434235"],"URL":"https:\/\/doi.org\/10.1145\/3434235","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2021,3,31]]},"assertion":[{"value":"2020-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-04-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}