{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,11]],"date-time":"2026-06-11T09:14:15Z","timestamp":1781169255928,"version":"3.54.1"},"reference-count":46,"publisher":"MIT Press","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computational Linguistics"],"published-print":{"date-parts":[[2014,3]]},"abstract":"<jats:p>The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a non-trivial manner from the various spoken regional dialects of Arabic\u2014the true \u201cnative languages\u201d of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA's prevalence in written form, almost all Arabic data sets have predominantly MSA content. In this article, we describe the creation of a novel Arabic resource with dialect annotations. We have created a large monolingual data set rich in dialectal Arabic content called the Arabic On-line Commentary Data set (Zaidan and Callison-Burch 2011). We describe our annotation effort to identify the dialect level (and dialect itself) in each of more than 100,000 sentences from the data set by crowdsourcing the annotation task, and delve into interesting annotator behaviors (like over-identification of one's own dialect). Using this new annotated data set, we consider the task of Arabic dialect identification: Given the word sequence forming an Arabic sentence, determine the variety of Arabic in which it is written. We use the data to train and evaluate automatic classifiers for dialect identification, and establish that classifiers using dialectal data significantly and dramatically outperform baselines that use MSA-only data, achieving near-human classification accuracy. 
Finally, we apply our classifiers to discover dialectal data from a large Web crawl consisting of 3.5 million pages mined from on-line Arabic newspapers.<\/jats:p>","DOI":"10.1162\/coli_a_00169","type":"journal-article","created":{"date-parts":[[2013,6,26]],"date-time":"2013-06-26T14:39:53Z","timestamp":1372257593000},"page":"171-202","source":"Crossref","is-referenced-by-count":131,"title":["Arabic Dialect Identification"],"prefix":"10.1162","volume":"40","author":[{"given":"Omar F.","family":"Zaidan","sequence":"first","affiliation":[{"name":"Microsoft Research"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chris","family":"Callison-Burch","sequence":"additional","affiliation":[{"name":"University of Pennsylvania"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"281","reference":[{"key":"R1","volume-title":"A Reference Grammar of Egyptian Arabic.","author":"Abdel-Massih Ernest T.","year":"1979"},{"key":"R2","doi-asserted-by":"publisher","DOI":"10.1515\/9783110878769"},{"issue":"2","key":"R3","first-page":"195","volume":"25","author":"Aoun Joseph","year":"1994","journal-title":"Linguistic Inquiry"},{"key":"R4","volume-title":"A Dictionary of Egyptian Arabic.","author":"Badawi El-Said","year":"1986"},{"key":"R5","doi-asserted-by":"publisher","DOI":"10.3366\/edinburgh\/9780748623730.001.0001"},{"key":"R6","doi-asserted-by":"publisher","DOI":"10.3115\/1621774.1621784"},{"key":"R8","unstructured":"Buckwalter, Tim. 2004. Buckwalter Arabic morphological analyzer version 2.0. 
Linguistic Data Consortium, Philadelphia, PA."},{"key":"R9","first-page":"1","volume-title":"Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk","author":"Callison-Burch Chris","year":"2010"},{"key":"R10","first-page":"161","volume-title":"Proceedings of SDAIR-94","author":"Cavnar William B.","year":"1994"},{"key":"R12","first-page":"369","volume-title":"Proceedings of EACL","author":"Chiang David","year":"2006"},{"key":"R13","volume-title":"A Reference Grammar of Syrian Arabic.","author":"Cowell Mark W.","year":"1964"},{"key":"R14","first-page":"66","volume-title":"Proceedings of the LREC Workshop on Semitic Language Processing","author":"Diab Mona","year":"2010"},{"key":"R16","volume-title":"A Short Reference Grammar of Iraqi Arabic.","author":"Erwin Wallace","year":"1963"},{"key":"R17","doi-asserted-by":"publisher","DOI":"10.1037\/h0031619"},{"key":"R18","volume-title":"The Phonetics of Arabic.","author":"Gairdner William Henry Temple","year":"1925"},{"key":"R19","doi-asserted-by":"publisher","DOI":"10.3115\/1690219.1690245"},{"key":"R20","doi-asserted-by":"publisher","DOI":"10.3115\/1557690.1557706"},{"key":"R21","first-page":"711","volume-title":"Proceedings of the Language Resources and Evaluation Conference (LREC)","author":"Habash Nizar","year":"2012"},{"key":"R22","first-page":"1","volume-title":"Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology","author":"Habash Nizar","year":"2012"},{"key":"R23","doi-asserted-by":"publisher","DOI":"10.3115\/1220175.1220261"},{"key":"R24","first-page":"49","volume-title":"Proceedings of the LREC Workshop on HLT & NLP within the Arabic World","author":"Habash 
Nizar","year":"2008"},{"key":"R25","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4020-6046-5_2"},{"key":"R26","doi-asserted-by":"publisher","DOI":"10.2200\/S00277ED1V01Y201008HLT010"},{"key":"R27","doi-asserted-by":"publisher","DOI":"10.1057\/9780230107373"},{"key":"R28","volume-title":"Modern Arabic: Structures, Functions, and Varieties.","author":"Holes Clive","year":"2004"},{"key":"R29","doi-asserted-by":"publisher","DOI":"10.1075\/loall.1"},{"key":"R30","volume-title":"Phonetik und Phonologie des modernen Hocharabisch.","author":"K\u00e4stner Hartmut","year":"1981"},{"key":"R31","doi-asserted-by":"publisher","DOI":"10.2307\/2529310"},{"key":"R32","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2010.2045184"},{"key":"R33","doi-asserted-by":"crossref","DOI":"10.1093\/oso\/9780198151517.001.0001","volume-title":"Pronouncing Arabic.","author":"Mitchell Terence Frederick","year":"1990"},{"key":"R34","first-page":"99","volume":"4","author":"Mohand Tilmatine","year":"1999","journal-title":"Estudios de Dialectolog\u00eda Norteafricana y Andalus\u00ed"},{"key":"R35","first-page":"63","volume":"100","author":"Newman Daniel L.","year":"2002","journal-title":"Antwerp Papers in Linguistics"},{"key":"R36","doi-asserted-by":"crossref","first-page":"541","DOI":"10.21437\/Interspeech.2011-226","volume-title":"Interspeech","author":"Novotney Scott","year":"2011"},{"key":"R37","doi-asserted-by":"crossref","unstructured":"\u0158eh\u016f\u0159ek, Radim and Milan Kolkus. 2009. Language Identification on the Web: Extending the Dictionary Method, volume 5449 of Lecture Notes in Computer Science, pages 357\u2013368. 
SpringerLink.","DOI":"10.1007\/978-3-642-00382-0_29"},{"key":"R38","first-page":"10","volume-title":"Proceedings of the EMNLP Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties","author":"Salloum Wael","year":"2011"},{"key":"R39","doi-asserted-by":"crossref","DOI":"10.1093\/oso\/9780195108668.001.0001","volume-title":"Clause Structure and Word Order in Hebrew and Arabic: An Essay in Comparative Semitic Syntax.","author":"Shlonsky Ur","year":"1997"},{"key":"R40","first-page":"183","volume":"13","author":"Souter Clive","year":"1994","journal-title":"Hermes Journal of Linguistics"},{"key":"R41","volume-title":"Arabic Sociolinguistics.","author":"Suleiman Yasir","year":"1994"},{"key":"R42","doi-asserted-by":"publisher","DOI":"10.1017\/S0025100300004266"},{"key":"R43","doi-asserted-by":"crossref","unstructured":"Verma, Brijesh, Hong Lee, and John Zakos. 2009. An Automatic Intelligent Language Classifier, volume 5507 of Lecture Notes in Computer Science, pages 639\u2013646. SpringerLink.","DOI":"10.1007\/978-3-642-03040-6_78"},{"key":"R44","volume-title":"The Arabic Language.","author":"Versteegh Kees","year":"2001"},{"key":"R45","first-page":"260","volume-title":"Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference","author":"Zaidan Omar","year":"2007"},{"key":"R46","unstructured":"Zaidan, Omar F. 2012. Crowdsourcing Annotation for Machine Learning in Natural Language Processing Tasks. Ph.D. 
thesis, Johns Hopkins University, Baltimore, MD."},{"key":"R47","first-page":"37","author":"Zaidan Omar F.","year":"2011","journal-title":"Proceedings of ACL"},{"key":"R48","first-page":"49","volume-title":"2012 Conference of the North American Chapter of the Association for Computational Linguistics","author":"Zbib Rabih","year":"2012"},{"key":"R49","doi-asserted-by":"publisher","DOI":"10.1109\/TSA.1996.481450"}],"container-title":["Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mitpressjournals.org\/doi\/pdf\/10.1162\/COLI_a_00169","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,13]],"date-time":"2024-05-13T01:50:11Z","timestamp":1715565011000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/40\/1\/171-202\/1458"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,3]]},"references-count":46,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2014,3]]}},"alternative-id":["10.1162\/COLI_a_00169"],"URL":"https:\/\/doi.org\/10.1162\/coli_a_00169","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"value":"0891-2017","type":"print"},{"value":"1530-9312","type":"electronic"}],"subject":[],"published":{"date-parts":[[2014,3]]}}}