{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T20:46:55Z","timestamp":1778705215965,"version":"3.51.4"},"reference-count":148,"publisher":"Cambridge University Press (CUP)","issue":"6","license":[{"start":{"date-parts":[[2020,11,20]],"date-time":"2020-11-20T00:00:00Z","timestamp":1605830400000},"content-version":"unspecified","delay-in-days":19,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2020,11]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and\/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. 
Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.<\/jats:p>","DOI":"10.1017\/s1351324920000492","type":"journal-article","created":{"date-parts":[[2020,11,20]],"date-time":"2020-11-20T13:29:51Z","timestamp":1605878991000},"page":"595-612","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":29,"title":["Natural language processing for similar languages, varieties, and dialects: A survey"],"prefix":"10.1017","volume":"26","author":[{"given":"Marcos","family":"Zampieri","sequence":"first","affiliation":[]},{"given":"Preslav","family":"Nakov","sequence":"additional","affiliation":[]},{"given":"Yves","family":"Scherrer","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2020,11,20]]},"reference":[{"key":"S1351324920000492_ref53","doi-asserted-by":"publisher","DOI":"10.1017\/S135132491900038X"},{"key":"S1351324920000492_ref146","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1163"},{"key":"S1351324920000492_ref109","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3907"},{"key":"S1351324920000492_ref30","unstructured":"Corb\u00ed-Bellot, A.M. , Forcada, M.L. , Ortiz-Rojas, S. , P\u00e9rez-Ortiz, J.A. , Ram\u00edrez-S\u00e1nchez, G. , S\u00e1nchez-Mart\u00ednez, F. , Alegria, I. , Mayor, A. and Sarasola, K. (2005). An open-source shallow-transfer machine translation engine for the romance languages of Spain. In Proceedings of the Tenth Conference of the European Association for Machine Translation, EAMT\u201905, Budapest, Hungary, pp. 79\u201386."},{"key":"S1351324920000492_ref85","unstructured":"Nivre, J. , de Marneffe, M.-C. , Ginter, F. , Goldberg, Y. , Haji\u010d, J. , Manning, C.D. , McDonald, R. , Petrov, S. , Pyysalo, S. , Silveira, N. , Silveira, R. , Zeman, D. (2016). Universal dependencies v1: A multilingual treebank collection. 
In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Portoroz, Slovenia, pp. 1659\u20131666."},{"key":"S1351324920000492_ref37","unstructured":"Elfardy, H. and Diab, M. (2013). Sentence level dialect identification in Arabic. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria, pp. 456\u2013461."},{"key":"S1351324920000492_ref128","unstructured":"Wray, S. (2018). Classification of closely related sub-dialects of Arabic using support-vector machines. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan, pp. 3671\u20133674."},{"key":"S1351324920000492_ref131","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00169"},{"key":"S1351324920000492_ref89","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1226"},{"key":"S1351324920000492_ref71","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324919000299"},{"key":"S1351324920000492_ref112","unstructured":"T\u00e4ckstr\u00f6m, O. , McDonald, R. and Uszkoreit, J. (2012). Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Montr\u00e9al, Canada, pp. 477\u2013487."},{"key":"S1351324920000492_ref78","unstructured":"Mokhov, S.A. (2010). A MARF approach to DEFT 2010. In Proceedings of the 6th DEFT Workshop (DEFT\u201910), pp. 35\u201349."},{"key":"S1351324920000492_ref141","unstructured":"Zeman, D. (2008). Reusable tagset conversion using tagset drivers. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC\u201908), Marrakech, Morocco, pp. 213\u2013218."},{"key":"S1351324920000492_ref44","unstructured":"Goutte, C. , L\u00e9ger, S. , Malmasi, S. and Zampieri, M. (2016). Discriminating similar languages: Evaluations and explorations. 
In Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, pp. 1800\u20131807."},{"key":"S1351324920000492_ref50","unstructured":"Hollenstein, N. and Aepli, N. (2015). A resource for natural language processing of Swiss German dialects. In Proceedings of GSCL, pp. 108\u2013109."},{"key":"S1351324920000492_ref68","unstructured":"Lusetti, M. , Ruzsics, T. , G\u00f6hring, A. , Samard\u017ei\u0107, T. and Stark, E. 2018. Encoder-decoder methods for text normalization. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, New Mexico, USA, August. Association for Computational Linguistics, pp. 18\u201328."},{"key":"S1351324920000492_ref24","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1218"},{"key":"S1351324920000492_ref33","unstructured":"Cotterell, R. and Callison-Burch, C. (2014). A multi-dialect, multi-genre corpus of informal written Arabic. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland, pp. 241\u2013245."},{"key":"S1351324920000492_ref66","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00163"},{"key":"S1351324920000492_ref87","unstructured":"Popovi\u0107, M. , Poncelas, A. , Brkic, M. and Way, A. (2020). Neural machine translation for translating into Croatian and Serbian. In Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)."},{"key":"S1351324920000492_ref20","unstructured":"Bouamor, H. , Habash, N. and Oflazer, K. (2014). A multidialectal parallel corpus of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914), Reykjavik, Iceland, pp. 
1240\u20131245."},{"key":"S1351324920000492_ref101","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324919000287"},{"key":"S1351324920000492_ref56","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00065"},{"key":"S1351324920000492_ref80","doi-asserted-by":"publisher","DOI":"10.3115\/1699648.1699682"},{"key":"S1351324920000492_ref91","doi-asserted-by":"publisher","DOI":"10.1145\/2632188.2632207"},{"key":"S1351324920000492_ref105","unstructured":"Shapiro, P. and Duh, K. (2019). Comparing pipelined and integrated approaches to dialectal Arabic neural machine translation. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, USA, pp. 214\u2013222."},{"key":"S1351324920000492_ref140","unstructured":"Zbib, R. , Malchiodi, E. , Devlin, J. , Stallard, D. , Matsoukas, S. , Schwartz, R. , Makhoul, J. , Zaidan, O.F. and Callison-Burch, C. (2012). Machine translation of Arabic dialects. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT), Montreal, Canada, pp. 49\u201359."},{"key":"S1351324920000492_ref49","unstructured":"Han, B. , Cook, P. and Baldwin, T. (2012). Geolocation prediction in social media data by finding location indicative words. In Proceedings of the International Conference in Computational Linguistics (COLING), pp. 1045\u20131062."},{"key":"S1351324920000492_ref142","unstructured":"Zeman, D. and Resnik, P. (2008). Cross-language parser adaptation between related languages. In Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, Hyderabad, India, pp. 35\u201342."},{"key":"S1351324920000492_ref110","unstructured":"Sutskever, I. , Vinyals, O. and Le, Q.V. (2014). Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS\u201914, Montreal, Canada, pp. 
3104\u20133112."},{"key":"S1351324920000492_ref103","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-019-09457-5"},{"key":"S1351324920000492_ref121","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1224"},{"key":"S1351324920000492_ref23","unstructured":"Cao, S. , Kitaev, N. and Klein, D. (2020). Multilingual alignment of contextual word representations. In Proceedings of the 8th International Conference on Learning Representations, ICLR\u201920, Addis Ababa, Ethiopia."},{"key":"S1351324920000492_ref69","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-019-09463-7"},{"key":"S1351324920000492_ref6","first-page":"37","article-title":"Exploring twitter as a source of an Arabic dialect corpus","volume":"8","author":"Alshutayri","year":"2017","journal-title":"International Journal of Computational Linguistics (IJCL)"},{"key":"S1351324920000492_ref72","unstructured":"Marujo, L. , Grazina, N. , Lu\u00eds, T. , Ling, W. , Coheur, L. and Trancoso, I. (2011). BP2EP - adaptation of Brazilian Portuguese texts to European Portuguese. In Proceedings of the 15th Conference of the European Association for Machine Translation, EAMT\u201911, Leuven, Belgium, pp. 129\u2013136."},{"key":"S1351324920000492_ref36","volume-title":"Aggregating Dialectology, Typology, and Register Analysis. Linguistic Variation in Text and Speech","author":"Diwersy","year":"2014"},{"key":"S1351324920000492_ref111","doi-asserted-by":"crossref","first-page":"269","DOI":"10.1145\/772755.772759","article-title":"A language and character set determination method based on N-gram statistics","volume":"1","author":"Suzuki","year":"2002","journal-title":"ACM Transactions on Asian Language Information Processing (TALIP)"},{"key":"S1351324920000492_ref32","unstructured":"Costa-juss\u00e0 M.R., Zampieri M. and Pal S. (2018). A neural approach to language variety translation. 
In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial\u201918, Santa Fe, New Mexico, USA, pp. 275\u2013282."},{"key":"S1351324920000492_ref15","unstructured":"Bernier-Colborne, G. , Goutte, C. and L\u00e9ger, S. (2019). Improving cuneiform language identification with BERT. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, USA, pp. 17\u201325."},{"key":"S1351324920000492_ref38","doi-asserted-by":"publisher","DOI":"10.26615\/978-954-452-042-7_007"},{"key":"S1351324920000492_ref70","unstructured":"Malmasi, S. , Zampieri, M. , Ljube\u0161i\u0107, N. , Nakov, P. , Ali, A. and Tiedemann, J. (2016). Discriminating between similar languages and Arabic dialect identification: A report on the third DSL shared task. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Osaka, Japan, pp. 1\u201314."},{"key":"S1351324920000492_ref84","unstructured":"Nguyen, T.Q. and Chiang, D. (2017). Transfer learning across low-resource, related languages for neural machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP\u201917, Taipei, Taiwan, pp. 296\u2013301."},{"key":"S1351324920000492_ref41","unstructured":"Francis, W.N. and Kucera, H. (1979). Brown Corpus Manual."},{"key":"S1351324920000492_ref58","unstructured":"Josan, G.S. and Lehal, G.S. (2008). A Punjabi to Hindi machine translation system. In Proceedings of the 22nd International Conference on Computational Linguistics, COLING\u201908, Manchester, UK, pp. 157\u2013160."},{"key":"S1351324920000492_ref97","unstructured":"Sawaf, H. (2010). Arabic dialect handling in hybrid machine translation. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas, AMTA\u201910, Denver, Colorado, USA."},{"key":"S1351324920000492_ref62","unstructured":"Lample, G. , Conneau, A. , Ranzato, M. 
, Denoyer, L. and J\u00e9gou, H. (2018). Word translation without parallel data. In Proceedings of the 6th International Conference on Learning Representations, ICLR\u201918, Vancouver, BC, Canada."},{"key":"S1351324920000492_ref18","unstructured":"Biemann, C. , Heyer, G. , Quasthoff, U. and Richter, M. (2007). The Leipzig corpora collection-monolingual corpora of standard size. In Proceedings of Corpus Linguistics."},{"key":"S1351324920000492_ref2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1388"},{"key":"S1351324920000492_ref27","unstructured":"Ciobanu, A.M. and Dinu, L.P. (2016). A computational perspective on the Romanian dialects. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portoro\u017e, Slovenia, May, pp. 3281\u20133285."},{"key":"S1351324920000492_ref12","unstructured":"Bakr, H.A. , Shaalan, K. and Ziedan, I. (2008). A hybrid approach for converting written Egyptian colloquial dialect into diacritized Arabic. In Proceedings of the 6th International Conference on Informatics and Systems, INFOS\u201908, Egypt, pp. 27\u201333."},{"key":"S1351324920000492_ref113","unstructured":"Tan, L. , Zampieri, M. , Ljube\u0161i\u0107, N. and Tiedemann, J. (2014). Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the Workshop on Building and Using Comparable Corpora (BUCC), Reykjavik, Iceland, pp. 6\u201310."},{"key":"S1351324920000492_ref117","unstructured":"Tiedemann, J. and Nakov, P. (2013). Analyzing the use of character-level translation with sparse and noisy datasets. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP\u201913, Hissar, Bulgaria, pp. 676\u2013684."},{"key":"S1351324920000492_ref48","unstructured":"Han, B. and Baldwin, T. (2011). Lexical normalisation of short text messages: Makn sens a #twitter. 
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT\u201911, Portland, Oregon, USA, pp. 368\u2013378."},{"key":"S1351324920000492_ref73","unstructured":"McDonald, R. , Nivre, J. , Quirmbach-Brundage, Y. , Goldberg, Y. , Das, D. , Ganchev, K. , Hall, K. , Petrov, S. , Zhang, H. , T\u00e4ckstr\u00f6m, O. , Bedini, C. , Castell\u00f3, N.B. and Lee, J. (2013). Universal dependency annotation for multilingual parsing. In Proceedings of ACL."},{"key":"S1351324920000492_ref136","unstructured":"Zampieri, M. , Malmasi, S. , Scherrer, Y. , Samard\u017ei\u0107, T. , Tyers, F. , Silfverberg, M. , Klyueva, N. , Pan, T.-L. , Huang, C.-R. , Ionescu, R.T. , Butnaru, A. and Jauhiainen, T. (2019). A report on the third VarDial evaluation campaign. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). Association for Computational Linguistics, pp. 1\u201316."},{"key":"S1351324920000492_ref46","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1632"},{"key":"S1351324920000492_ref118","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-5313"},{"key":"S1351324920000492_ref55","doi-asserted-by":"publisher","DOI":"10.1613\/jair.1.11675"},{"key":"S1351324920000492_ref1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P15-2044"},{"key":"S1351324920000492_ref98","unstructured":"Scannell, K.P. (2006). Machine translation for closely related language pairs. In Proceedings of the LREC 2006 Workshop on Strategies for Developing Machine Translation for Minority Languages, Genoa, Italy, pp. 103\u2013109."},{"key":"S1351324920000492_ref123","unstructured":"Vilar, D. , Peter, J.-T. and Ney, H. (2007). Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT\u201907, Prague, Czech Republic, pp. 33\u201339."},{"key":"S1351324920000492_ref11","unstructured":"Bahdanau, D. , Cho, K. and Bengio, Y. (2015). 
Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR\u201915, San Diego, California, USA."},{"key":"S1351324920000492_ref65","unstructured":"Lui, M. and Cook, P. (2013). Classifying English documents by national dialect. In Proceedings of Australasian Language Technology Association Workshop 2013 (ALTA 2013), Brisbane, Australia, December, pp. 5\u201315."},{"key":"S1351324920000492_ref100","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1210"},{"key":"S1351324920000492_ref143","unstructured":"Zhang, X. (1998). Dialect MT: A case study between Cantonese and Mandarin. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2, ACL-COLING\u201998, Quebec, Canada, pp. 1460\u20131464."},{"key":"S1351324920000492_ref114","unstructured":"Tiedemann, J. (2012). Character-based pivot translation for under-resourced languages and domains. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL\u201912, Avignon, France, pp. 141\u2013151."},{"key":"S1351324920000492_ref148","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324919000366"},{"key":"S1351324920000492_ref54","doi-asserted-by":"crossref","unstructured":"Jauhiainen, T. , Jauhiainen, H. , Alstola, T. and Lind\u00e9n, K. (2019a). Language and dialect identification of Cuneiform texts. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Ann Arbor, Michigan. Association for Computational Linguistics, pp. 
89\u201398.","DOI":"10.18653\/v1\/W19-1409"},{"key":"S1351324920000492_ref129","doi-asserted-by":"publisher","DOI":"10.3115\/1073336.1073362"},{"key":"S1351324920000492_ref16","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1214"},{"key":"S1351324920000492_ref29","unstructured":"Conneau, A. and Lample, G. (2019). Cross-lingual language model pretraining. In Wallach H., Larochelle H., Beygelzimer A., d\u2019Alch\u00e9-Buc F., Fox E. and Garnett R. (eds), Advances in Neural Information Processing Systems 32, Vancouver, Canada, pp. 7059\u20137069."},{"key":"S1351324920000492_ref40","unstructured":"Forcada, M.L. (2006). Open-source machine translation: An opportunity for minor languages. In Proceedings of the LREC\u201906 Workshop on Strategies for Developing Machine Translation for Minority Languages, Genoa, Italy."},{"key":"S1351324920000492_ref51","unstructured":"Huang, C.-R. and Lee, L.-H. (2008). Contrastive approach towards text source classification based on top-bag-of-word similarity. In Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, Cebu City, Philippines, November, pp. 404\u2013410."},{"key":"S1351324920000492_ref42","unstructured":"G\u0103man, M. , Hovy, D. , Ionescu, R.T. , Jauhiainen, H. , Jauhiainen, T. , Lind\u00e9n, K. , Ljube\u0161i\u0107, N. , Partanen, N. , Purschke, C. , Scherrer, Y. and Zampieri, M. (2020). A report on the VarDial evaluation campaign 2020. In Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)."},{"key":"S1351324920000492_ref149","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-5309"},{"key":"S1351324920000492_ref77","unstructured":"Mikolov, T. , Le, Q.V. and Sutskever, I. (2013). Exploiting similarities among languages for machine translation. CoRR, abs\/1309.4168."},{"key":"S1351324920000492_ref137","unstructured":"Zampieri, M. , Malmasi, S. , Sulea, O.-M. and Dinu, L.P. (2016). 
A computational approach to the study of Portuguese newspapers published in Macau. In Proceedings of the Workshop on Natural Language Processing meets Journalism (NLPMJ 2016), New York City, NY, USA, pp. 47\u201351."},{"key":"S1351324920000492_ref39","unstructured":"Feldman, A. , Hana, J. and Brew, C. (2006). A cross-language approach to rapid creation of new morphosyntactically annotated resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006). European Language Resources Association (ELRA), pp. 549\u2013554."},{"key":"S1351324920000492_ref60","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-2012"},{"key":"S1351324920000492_ref82","unstructured":"Nakov, P. and Tiedemann, J. (2012). Combining word-level and character-level models for machine translation between closely-related languages. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL\u201912, Jeju Island, Korea, pp. 301\u2013305."},{"key":"S1351324920000492_ref86","unstructured":"Petrov, S. , Das, D. and McDonald, R. (2012). A universal part-of-speech tagset. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC\u201912), Istanbul, Turkey. European Language Resources Association (ELRA)."},{"key":"S1351324920000492_ref88","doi-asserted-by":"publisher","DOI":"10.37936\/ecti-cit.200622.53288"},{"key":"S1351324920000492_ref139","unstructured":"Zampieri, M. , Tan, L. , Ljube\u0161i\u0107, N. , Tiedemann, J. and Nakov, P. (2015). Overview of the DSL shared task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LTVarDial), Hissar, Bulgaria, pp. 
1\u20139."},{"key":"S1351324920000492_ref115","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1613\/jair.4785","article-title":"Synthetic treebanking for cross-lingual dependency parsing","volume":"55","author":"Tiedemann","year":"2016","journal-title":"Journal of Artificial Intelligence Research"},{"key":"S1351324920000492_ref116","unstructured":"Tiedemann, J. and Ljube\u0161i\u0107, N. (2012). Efficient discrimination between closely related languages. In Proceedings of the International Conference in Computational Linguistics (COLING), Mumbai, India, pp. 2619\u20132634."},{"key":"S1351324920000492_ref135","unstructured":"Zampieri, M. , Malmasi, S. , Nakov, P. , Ali, A. , Shon, S. , Glass, J. , Scherrer, Y. , Samard\u017ei\u0107, T. , Ljube\u0161i\u0107, N. , Tiedemann, J. , van der Lee, C. , Grondelaers, S. , Oostdijk, N. , Speelman, D. , van den Bosch, A. , Kumar, R. , Lahiri, B. and Jain, M. (2018). Language identification and morphosyntactic tagging: The second VarDial evaluation campaign. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). Association for Computational Linguistics, pp. 1\u201317."},{"key":"S1351324920000492_ref34","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1078"},{"key":"S1351324920000492_ref61","unstructured":"Lakew, S.M. , Cettolo, M. and Federico, M. (2018). A comparison of transformer and recurrent neural networks on multilingual neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, COLING\u201918, Santa Fe, New Mexico, USA, pp. 641\u2013652."},{"key":"S1351324920000492_ref63","unstructured":"Ljube\u0161i\u0107, N. , Mikeli\u0107, N. and Boras, D. (2007). Language identification: How to distinguish similar languages? In Proceedings of the 29th International Conference on Information Technology Interfaces (ITI 2007), Cavtat\/Dubrovnik, Croatia, pp. 
541\u2013546."},{"key":"S1351324920000492_ref104","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1162"},{"key":"S1351324920000492_ref81","doi-asserted-by":"publisher","DOI":"10.1613\/jair.3540"},{"key":"S1351324920000492_ref119","first-page":"53","article-title":"The clin27 shared task: Translating historical text to contemporary language for improving automatic linguistic annotation","volume":"7","author":"Tjong Kim Sang","year":"2017","journal-title":"Computational Linguistics in the Netherlands Journal"},{"key":"S1351324920000492_ref96","unstructured":"Samard\u017ei\u0107, T. , Scherrer, Y. and Glaser, E. (2016). ArchiMob \u2013 a corpus of spoken Swiss German. In Proceedings of LREC."},{"key":"S1351324920000492_ref7","unstructured":"Altintas, K. and Cicekli, I. (2002). A machine translation system between a pair of closely related languages. In Proceedings of the 17th International Symposium on Computer and Information Sciences, ISCIS\u201902, Orlando, Florida, USA, pp. 192\u2013196."},{"key":"S1351324920000492_ref67","doi-asserted-by":"crossref","unstructured":"Lui, M. , Letcher, N. , Adams, O. , Duong, L. , Cook, P. and Baldwin, T. (2014). Exploring methods and resources for discriminating similar languages. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial), Dublin, Ireland, August, pp. 129\u2013138.","DOI":"10.3115\/v1\/W14-5315"},{"key":"S1351324920000492_ref147","unstructured":"Zubiaga, A. , Vicente, I.S. , Gamallo, P. , Pichel, J.R. , Alegria, I. , Aranberri, N. , Ezeiza, A. and Fresno, V. (2014). Overview of TweetLID: Tweet language identification at SEPLN 2014. In Proceedings of the Tweet Language Identification Workshop 2014 co-located with 30th Conference of the Spanish Society for Natural Language Processing (SEPLN 2014), Girona, Spain, pp. 
1\u201311."},{"key":"S1351324920000492_ref99","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-5304"},{"key":"S1351324920000492_ref90","unstructured":"Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv e-prints, page arXiv:1706.05098."},{"key":"S1351324920000492_ref133","unstructured":"Zampieri, M. , Gebre, B.G. and Diwersy, S. (2013). N-gram language models and POS distribution for the identification of Spanish varieties. In Proceedings of la 20\u00e8me conf\u00e9rence du Traitement Automatique du Langage Naturel (TALN), Sables d\u2019Olonne, France, pp. 580\u2013587."},{"key":"S1351324920000492_ref79","unstructured":"Myint Oo, T. , Kyaw Thu, Y. and Mar Soe, K. (2019). Neural machine translation between Myanmar (Burmese) and Rakhine (Arakanese). In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, USA, pp. 80\u201388."},{"key":"S1351324920000492_ref3","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-1410"},{"key":"S1351324920000492_ref93","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-2125"},{"key":"S1351324920000492_ref74","unstructured":"McDonald, R. , Petrov, S. and Hall, K. (2011). Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, pp. 62\u201372."},{"key":"S1351324920000492_ref75","first-page":"94","article-title":"Language identification: A solved problem suitable for undergraduate instruction","volume":"20","author":"McNamee","year":"2005","journal-title":"Journal of Computing Sciences in Colleges"},{"key":"S1351324920000492_ref92","unstructured":"Sajjad, H. , Darwish, K. and Belinkov, Y. (2013). Translating dialectal Arabic to English. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL\u201913, Sofia, Bulgaria, pp. 
1\u20136."},{"key":"S1351324920000492_ref132","unstructured":"Zampieri, M. and Gebre, B.G. (2012). Automatic identification of language varieties: The case of Portuguese. In Proceedings of The 11th Conference on Natural Language Processing (KONVENS 2012), Vienna, Austria, pp. 233\u2013237."},{"key":"S1351324920000492_ref130","unstructured":"Zaidan, O.F. and Callison-Burch, C. (2011). The Arabic online commentary dataset: An annotated dataset of informal Arabic with high dialectal content. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers-Volume 2, Portland, Oregon, USA, June, pp. 37\u201341."},{"key":"S1351324920000492_ref25","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1179"},{"key":"S1351324920000492_ref4","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-1297"},{"key":"S1351324920000492_ref43","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2013.6738541"},{"key":"S1351324920000492_ref59","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K17-1024"},{"key":"S1351324920000492_ref94","unstructured":"Salloum, W. and Habash, N. (2011). Dialectal to Standard Arabic paraphrasing to improve Arabic-English statistical machine translation. In Proceedings of the Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, Stroudsburg, Pennsylvania, USA, pp. 
10\u201321."},{"key":"S1351324920000492_ref45","doi-asserted-by":"publisher","DOI":"10.1017\/S0266078400005836"},{"key":"S1351324920000492_ref106","doi-asserted-by":"publisher","DOI":"10.26615\/978-954-452-049-6_086"},{"key":"S1351324920000492_ref138","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-5307"},{"key":"S1351324920000492_ref10","doi-asserted-by":"publisher","DOI":"10.3115\/1273073.1273078"},{"key":"S1351324920000492_ref5","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2017.8268952"},{"key":"S1351324920000492_ref57","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-1130"},{"key":"S1351324920000492_ref52","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-1425"},{"key":"S1351324920000492_ref108","volume-title":"Synthesis Lectures on Human Language Technologies","author":"S\u00f8gaard","year":"2019"},{"key":"S1351324920000492_ref19","unstructured":"Bojja, N. , Nedunchezhian, A. and Wang, P. (2015). Machine translation in mobile games: Augmenting social media text normalization with incentivized feedback. In Proceedings of the 15th Machine Translation Summit (MT Users\u2019 Track), vol. 2, Miami, Florida, USA, pp. 11\u201316."},{"key":"S1351324920000492_ref28","unstructured":"Clyne, M. (1992). Pluricentric Languages: Different Norms in Different Nations, Amsterdam: De Gruyter Mouton. doi:10.1515\/9783110888140"},{"key":"S1351324920000492_ref31","doi-asserted-by":"publisher","DOI":"10.1007\/BF00994018"},{"key":"S1351324920000492_ref122","unstructured":"Vaswani, A. , Shazeer, N. , Parmar, N. , Uszkoreit, J. , Jones, L. , Gomez, A.N. , Kaiser, L. and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems, NIPS\u201917, Long Beach, California, USA, pp. 5998\u20136008."},{"key":"S1351324920000492_ref124","unstructured":"Vogel, J. and Tresner-Kirsch, D. (2012). Robust language identification in short, noisy texts: Improvements to LIGA. 
In Third International Workshop on Mining Ubiquitous and Social Environments (MUSE 2012)."},{"key":"S1351324920000492_ref64","unstructured":"Lui, M. (2014). Generalized Language Identification. PhD Thesis, University of Melbourne."},{"key":"S1351324920000492_ref14","unstructured":"Bergsma, S. , McNamee, P. , Bagdouri, M. , Fink, C. and Wilson, T. (2012). Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media, pp. 65\u201374."},{"key":"S1351324920000492_ref145","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-6393(00)00099-6"},{"key":"S1351324920000492_ref125","unstructured":"Wang, P. , Nakov, P. and Ng, H.T. (2012). Source language adaptation for resource-poor machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL\u201912, Jeju Island, Korea, pp. 286\u2013296."},{"key":"S1351324920000492_ref95","unstructured":"Salloum, W. and Habash, N. (2012). Elissa: A dialectal to standard Arabic machine translation system. In Proceedings of COLING 2012: Demonstration Papers, COLING\u201912, Mumbai, India, pp. 385\u2013392."},{"key":"S1351324920000492_ref17","unstructured":"Bick, E. and Nygaard, L. (2007). Using Danish as a CG interlingua: A wide-coverage Norwegian-English machine translation system. In Proceedings of the 16th Nordic Conference of Computational Linguistics, NODALIDA\u201907, Tartu, Estonia, pp. 21\u201328."},{"key":"S1351324920000492_ref134","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1201"},{"key":"S1351324920000492_ref127","unstructured":"Wang, P. and Ng, H.T. (2013). A beam-search decoder for normalization of social media text with application to machine translation. 
In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT\u201913, Atlanta, Georgia, USA, pp. 471\u2013481."},{"key":"S1351324920000492_ref120","unstructured":"Tyers, F. and Alperen, M.S. (2010). South-East European Times: A parallel corpus of Balkan languages. In Proceedings of the LREC workshop on Exploitation of multilingual resources and tools for Central and (South) Eastern European Languages."},{"key":"S1351324920000492_ref47","doi-asserted-by":"publisher","DOI":"10.3115\/974147.974149"},{"key":"S1351324920000492_ref102","first-page":"9","article-title":"New developments in tagging pre-modern orthodox slavic texts","volume":"18","author":"Scherrer","year":"2018","journal-title":"Scripta and e-Scripta"},{"key":"S1351324920000492_ref35","unstructured":"Devlin, J. , Chang, M.-W. , Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 4171\u20134186."},{"key":"S1351324920000492_ref21","unstructured":"Bouamor, H. , Habash, N. , Salameh, M. , Zaghouani, W. , Rambow, O. , Abdulrahim, D. , Obeid, O. , Khalifa, S. , Eryani, F. , Erdmann, A. , Oflazer, K. (2018). The MADAR arabic dialect corpus and lexicon. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), pp. 
3387\u20133396."},{"key":"S1351324920000492_ref144","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-39965-8_6"},{"key":"S1351324920000492_ref13","doi-asserted-by":"publisher","DOI":"10.3115\/991635.991645"},{"key":"S1351324920000492_ref9","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00288"},{"key":"S1351324920000492_ref8","doi-asserted-by":"publisher","DOI":"10.1007\/11751984_6"},{"key":"S1351324920000492_ref22","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4622"},{"key":"S1351324920000492_ref83","unstructured":"Nguyen, D. and Dogruoz, A.S. (2014). Word level language identification in online multilingual communication. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pp. 18\u201321."},{"key":"S1351324920000492_ref126","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00248"},{"key":"S1351324920000492_ref76","doi-asserted-by":"crossref","unstructured":"Medvedeva, M. , Kroon, M. and Plank, B. (2017). When sparse traditional models outperform dense neural networks: The curious case of discriminating between similar languages. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 156\u2013163.","DOI":"10.18653\/v1\/W17-1219"},{"key":"S1351324920000492_ref26","unstructured":"Christensen, H. (2014). Hc corpora. 
http:\/\/www.corpora.heliohost.org\/."}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324920000492","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,11,20]],"date-time":"2020-11-20T13:31:56Z","timestamp":1605879116000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324920000492\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,11]]},"references-count":148,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2020,11]]}},"alternative-id":["S1351324920000492"],"URL":"https:\/\/doi.org\/10.1017\/s1351324920000492","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,11]]},"assertion":[{"value":"\u00a9 Cambridge University Press 2020","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}