{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2022,11,20]],"date-time":"2022-11-20T05:27:41Z","timestamp":1668922061215},"reference-count":68,"publisher":"Cambridge University Press (CUP)","issue":"2","license":[{"start":{"date-parts":[[2020,9,23]],"date-time":"2020-09-23T00:00:00Z","timestamp":1600819200000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2022,3]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Resource-limited and morphologically rich languages pose many challenges to natural language processing tasks. Their highly inflected surface forms inflate the vocabulary size and increase sparsity in an already scarce data situation. In this article, we present an unsupervised learning approach to vocabulary reduction through morphological segmentation. We demonstrate its value in the context of machine translation for dialectal Arabic (DA), the primarily spoken, orthographically unstandardized, morphologically rich and yet resource poor variants of Standard Arabic. Our approach exploits the existence of monolingual and parallel data. 
We show comparable performance to state-of-the-art supervised methods for DA segmentation.<\/jats:p>","DOI":"10.1017\/s1351324920000455","type":"journal-article","created":{"date-parts":[[2020,9,23]],"date-time":"2020-09-23T08:29:26Z","timestamp":1600849766000},"page":"223-248","update-policy":"http:\/\/dx.doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":1,"title":["Unsupervised Arabic dialect segmentation for machine translation"],"prefix":"10.1017","volume":"28","author":[{"given":"Wael","family":"Salloum","sequence":"first","affiliation":[]},{"given":"Nizar","family":"Habash","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2020,9,23]]},"reference":[{"key":"S1351324920000455_ref42","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00130"},{"key":"S1351324920000455_ref40","unstructured":"Mohamed, E. , Mohit, B. and Oflazer, K. (2012). Annotating and learning morphological segmentation of Egyptian colloquial Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC)."},{"key":"S1351324920000455_ref17","unstructured":"Eskander, R. , Habash, N. and Rambow, O. (2013). Automatic extraction of morphological lexicons from morphologically annotated corpora. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA: Association for Computational Linguistics, pp. 
1032\u20131043."},{"key":"S1351324920000455_ref24","doi-asserted-by":"publisher","DOI":"10.3115\/1219840.1219911"},{"key":"S1351324920000455_ref22","article-title":"On Arabic and its dialects","volume":"17","author":"Habash","year":"2006","journal-title":"Multilingual Magazine"},{"key":"S1351324920000455_ref1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-3003"},{"key":"S1351324920000455_ref11","article-title":"Unsupervised models for morpheme segmentation and morphology learning","volume":"4","author":"Creutz","year":"2007","journal-title":"ACM Transactions on Speech and Language Processing (TSLP)"},{"key":"S1351324920000455_ref29","unstructured":"Habash, N. , Diab, M. and Rambow, O. (2012c). Conventional orthography for dialectal Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC)."},{"key":"S1351324920000455_ref36","unstructured":"Kilany, H. , Gadalla, H. , Arram, H. , Yacoub, A. , El-Habashi, A. and McLemore, C. (2002). Egyptian Colloquial Arabic Lexicon. LDC catalog number LDC99L22."},{"key":"S1351324920000455_ref43","doi-asserted-by":"publisher","DOI":"10.1162\/089120103321337421"},{"key":"S1351324920000455_ref3","unstructured":"Al-Badrashiny, M. , Pasha, A. , Diab, M.T. , Habash, N. , Rambow, O. , Salloum, W. and Eskander, R. (2016). SPLIT: Smart Preprocessing (Quasi) Language Independent Tool. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC-2016)."},{"key":"S1351324920000455_ref12","volume-title":"Arabic Computational Morphology: Knowledge-based and Empirical Methods","author":"Diab","year":"2007"},{"key":"S1351324920000455_ref9","unstructured":"Chiang, D. , Diab, M. , Habash, N. , Rambow, O. and Shareef, S. (2006). Parsing Arabic dialects. In Proceedings of the European Chapter of ACL (EACL)."},{"key":"S1351324920000455_ref59","unstructured":"Sawaf, H. (2010). Arabic dialect handling in hybrid machine translation. 
In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA)."},{"key":"S1351324920000455_ref46","unstructured":"Oudah, M. , Almahairi, A. and Habash, N. (2019). The impact of preprocessing on Arabic-English statistical and neural machine translation. CoRR, abs\/1906.11751."},{"key":"S1351324920000455_ref28","unstructured":"Habash, N. , Eskander, R. and Hawwari, A. (2012b). A morphological analyzer for Egyptian Arabic. In Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology, pp. 1\u20139."},{"key":"S1351324920000455_ref25","doi-asserted-by":"publisher","DOI":"10.3115\/1220175.1220261"},{"key":"S1351324920000455_ref26","doi-asserted-by":"publisher","DOI":"10.3115\/1614049.1614062"},{"key":"S1351324920000455_ref35","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1305"},{"key":"S1351324920000455_ref21","unstructured":"Habash, N. , Eskander, R. and Hawwari, A. (2012a). A morphological analyzer for Egyptian Arabic. In NAACL-HLT 2012 Workshop on Computational Morphology and Phonology (SIGMORPHON2012), pp. 1\u20139."},{"key":"S1351324920000455_ref54","unstructured":"Salloum, W. and Habash, N. (2011). Dialectal to standard Arabic paraphrasing to improve Arabic-English statistical machine translation. In Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, pp. 10\u201321."},{"key":"S1351324920000455_ref63","unstructured":"Utiyama, M. and Isahara, H. (2007). A comparison of pivot methods for phrase-based statistical machine translation. In HLT-NAACL, pp. 484\u2013491."},{"key":"S1351324920000455_ref57","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K17-1043"},{"key":"S1351324920000455_ref68","doi-asserted-by":"crossref","unstructured":"Zhang, X. (1998). Dialect MT: A case study between Cantonese and Mandarin. 
In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, ACL 1998, pp. 1460\u20131464.","DOI":"10.3115\/980691.980807"},{"key":"S1351324920000455_ref39","unstructured":"Mikolov, T. , Chen, K. , Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR."},{"key":"S1351324920000455_ref52","unstructured":"Sajjad, H. , Darwish, K. and Belinkov, Y. (2013). Translating dialectal Arabic to English. In The 51st Annual Meeting of the Association for Computational Linguistics - Short Papers (ACL Short Papers 2013), Sofia, Bulgaria."},{"key":"S1351324920000455_ref66","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1173"},{"key":"S1351324920000455_ref51","doi-asserted-by":"publisher","DOI":"10.3115\/1220175.1220176"},{"key":"S1351324920000455_ref55","unstructured":"Salloum, W. and Habash, N. (2012). Elissa: A dialectal to standard Arabic machine translation system. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Demonstration Papers, pp. 385\u2013392."},{"key":"S1351324920000455_ref15","unstructured":"El Kholy, A. and Habash, N. (2010). Techniques for Arabic morphological detokenization and orthographic denormalization. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC)."},{"key":"S1351324920000455_ref53","unstructured":"Salloum, W. (2018). Machine Translation of Arabic Dialects. Ph.D. thesis, Columbia University in the City of New York."},{"key":"S1351324920000455_ref33","unstructured":"Hamdi, A. , Boujelbane, R. , Habash, N. , Nasr, A. , et al. (2013). The effects of factorizing root and pattern mapping in bidirectional Tunisian-Standard Arabic machine translation. 
MT Summit 2013."},{"key":"S1351324920000455_ref32","doi-asserted-by":"publisher","DOI":"10.3115\/974147.974149"},{"key":"S1351324920000455_ref23","doi-asserted-by":"publisher","DOI":"10.2200\/S00277ED1V01Y201008HLT010"},{"key":"S1351324920000455_ref16","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4214"},{"key":"S1351324920000455_ref13","unstructured":"Du, J. , Jiang, J. and Way, A. (2010). Facilitating translation using source language paraphrase lattices. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. EMNLP 2010, pp. 420\u2013429."},{"key":"S1351324920000455_ref2","unstructured":"Abo Bakr, H. , Shaalan, K. and Ziedan, I. (2008). A hybrid approach for converting written Egyptian colloquial dialect into Diacritized Arabic. In The 6th International Conference on Informatics and Systems, INFOS2008. Cairo University."},{"key":"S1351324920000455_ref56","unstructured":"Salloum, W. and Habash, N. (2013). Dialectal Arabic to English machine translation: Pivoting through modern standard Arabic. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)."},{"key":"S1351324920000455_ref49","unstructured":"Pasha, A. , Al-Badrashiny, M. , Diab, M.T. , El Kholy, A. , Eskander, R. , Habash, N. , Pooleery, M. , Rambow, O. and Roth, R. (2014). MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014)."},{"key":"S1351324920000455_ref50","unstructured":"Riesa, J. and Yarowsky, D. (2006). Minimally supervised morphological segmentation with applications to machine translation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA 2006), pp. 185\u2013192."},{"key":"S1351324920000455_ref67","unstructured":"Zbib, R. , Malchiodi, E. , Devlin, J. 
, Stallard, D. , Matsoukas, S. , Schwartz, R. , Makhoul, J. , Zaidan, O.F. and Callison-Burch, C. (2012). Machine translation of Arabic dialects. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montr\u00e9al, Canada: Association for Computational Linguistics, pp. 49\u201359."},{"key":"S1351324920000455_ref14","doi-asserted-by":"publisher","DOI":"10.3115\/1621787.1621798"},{"key":"S1351324920000455_ref19","volume-title":"English Gigaword, LDC Catalog No.: LDC2003T05","author":"Graff","year":"2003"},{"key":"S1351324920000455_ref60","doi-asserted-by":"publisher","DOI":"10.3115\/1117601.1117615"},{"key":"S1351324920000455_ref62","doi-asserted-by":"crossref","unstructured":"Stolcke, A. (2002). SRILM an Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing.","DOI":"10.21437\/ICSLP.2002-303"},{"key":"S1351324920000455_ref20","unstructured":"Graff, D. , Maamouri, M. , Bouziri, B. , Krouna, S. , Kulick, S. and Buckwalter, T. (2009). Standard Arabic Morphological Analyzer (SAMA) Version 3.1. Linguistic Data Consortium LDC2009E73."},{"key":"S1351324920000455_ref38","unstructured":"Kumar, S. , Och, F.J. and Macherey, W. (2007). Improving word alignment with bridge languages. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 42\u201350."},{"key":"S1351324920000455_ref7","unstructured":"Buckwalter, T. (2004). Buckwalter Arabic Morphological Analyzer Version 2.0. LDC catalog number LDC2004L02, ISBN 1-58563-324-0."},{"key":"S1351324920000455_ref64","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1073"},{"key":"S1351324920000455_ref41","unstructured":"Nakov, P. and Ng, H.T. (2011). Translating from morphologically complex languages: A paraphrase-based approach. 
In Proceedings of the Meeting of the Association for Computational Linguistics (ACL 2011)."},{"key":"S1351324920000455_ref61","unstructured":"Stallard, D. , Devlin, J. , Kayser, M. , Lee, Y.K. and Barzilay, R. (2012). Unsupervised morphology rivals supervised morphology for Arabic MT. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. Association for Computational Linguistics, pp. 322\u2013327."},{"key":"S1351324920000455_ref18","unstructured":"Eskander, R. , Habash, N. , Rambow, O. and Pasha, A. (2016). Creating resources for dialectal Arabic from a single annotation: A case study on Egyptian and Levantine. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 3455\u20133465."},{"key":"S1351324920000455_ref34","unstructured":"Khalifa, S. , Zalmout, N. and Habash, N. (2016). YAMAMA: Yet another multi-dialect Arabic morphological analyzer. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations. Osaka, Japan: The COLING 2016 Organizing Committee, pp. 223\u2013227."},{"key":"S1351324920000455_ref47","doi-asserted-by":"crossref","unstructured":"Papineni, K. , Roukos, S. , Ward, T. and Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311\u2013318.","DOI":"10.3115\/1073083.1073135"},{"key":"S1351324920000455_ref10","doi-asserted-by":"crossref","unstructured":"Creutz, M. and Lagus, K. (2002). Unsupervised discovery of morphemes. In: ACL 2002 Workshop on Morphological and Phonological Learning. ACL.","DOI":"10.3115\/1118647.1118650"},{"key":"S1351324920000455_ref37","doi-asserted-by":"publisher","DOI":"10.3115\/1557769.1557821"},{"key":"S1351324920000455_ref8","doi-asserted-by":"publisher","DOI":"10.3115\/1220835.1220838"},{"key":"S1351324920000455_ref4","unstructured":"Al-Sabbagh, R. 
and Girju, R. (2010). Mining the web for the induction of a Dialectical Arabic Lexicon. In Calzolari N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M. and Tapias, D. (eds), LREC. European Language Resources Association."},{"key":"S1351324920000455_ref65","doi-asserted-by":"publisher","DOI":"10.1515\/pralin-2017-0025"},{"key":"S1351324920000455_ref45","doi-asserted-by":"publisher","DOI":"10.1162\/089120103321337421"},{"key":"S1351324920000455_ref44","doi-asserted-by":"publisher","DOI":"10.3115\/1075096.1075117"},{"key":"S1351324920000455_ref30","unstructured":"Habash, N. , Roth, R. , Rambow, O. , Eskander, R. and Tomeh, N. (2013). Morphological analysis and disambiguation for dialectal Arabic. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)."},{"key":"S1351324920000455_ref58","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1306"},{"key":"S1351324920000455_ref5","unstructured":"Banerjee, S. and Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, pp. 65\u201372."},{"key":"S1351324920000455_ref31","unstructured":"Habash, N. , Eryani, F. , Khalifa, S. , Rambow, O. , Abdulrahim, D. , Erdmann, A. , Faraj, R. , Zaghouani, W. , Bouamor, H. , Zalmout, N. , Hassan, S. , Shargi, F.A. , Alkhereyf, S. , Abdulkareem, B. , Eskander, R. , Salameh, M. and Saddiki, H. (2018). Unified guidelines and resources for Arabic Dialect orthography. In: Proceedings of the Language Resources and Evaluation Conference (LREC)."},{"key":"S1351324920000455_ref48","unstructured":"Parker, R. , Graff, D. , Chen, K. , Kong, J. and Maeda, K. (2009). Arabic Gigaword Fourth Edition. LDC catalog number No. 
LDC2009T30, ISBN 1-58563-532-4."},{"key":"S1351324920000455_ref6","first-page":"263","article-title":"The mathematics of statistical machine translation: Parameter estimation","volume":"19","author":"Brown","year":"1993","journal-title":"Computational Linguistics"},{"key":"S1351324920000455_ref27","volume-title":"Arabic Computational Morphology: Knowledge-based and Empirical Methods","author":"Habash","year":"2007"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324920000455","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,11,20]],"date-time":"2022-11-20T00:55:38Z","timestamp":1668905738000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324920000455\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,9,23]]},"references-count":68,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,3]]}},"alternative-id":["S1351324920000455"],"URL":"https:\/\/doi.org\/10.1017\/s1351324920000455","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,23]]},"assertion":[{"value":"\u00a9 The Author(s), 2020. Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}