{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,5,13]],"date-time":"2025-05-13T22:00:09Z","timestamp":1747173609723,"version":"3.40.5"},"reference-count":41,"publisher":"Cambridge University Press (CUP)","issue":"6","license":[{"start":{"date-parts":[[2020,3,10]],"date-time":"2020-03-10T00:00:00Z","timestamp":1583798400000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2020,11]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The rise of social media empowers people to interact and communicate with anyone anywhere in the world. The possibility of being anonymous avoids censorship and enables freedom of expression. Nevertheless, this anonymity might lead to cybersecurity issues, such as opinion spam, sexual harassment, incitement to hatred or even terrorism propaganda. In such cases, there is a need to know more about the anonymous users and this could be useful in several domains beyond security and forensics such as marketing, for example. In this paper, we focus on a fine-grained analysis of language varieties while considering also the authors\u2019 demographics. We present a Low-Dimensionality Statistical Embedding method to represent text documents. We compared the performance of this method with the best performing teams in the Author Profiling task at PAN 2017. We obtained an average accuracy of 92.08% versus 91.84% for the best performing team at PAN 2017. We also analyse the relationship of the language variety identification with the authors\u2019 gender. Furthermore, we applied our proposed method to a more fine-grained annotated corpus of Arabic varieties covering 22 Arab countries and obtained an overall accuracy of 88.89%. We have also investigated the effect of the authors\u2019 age and gender on the identification of the different Arabic varieties, as well as the effect of the corpus size on the performance of our method.<\/jats:p>","DOI":"10.1017\/s1351324920000108","type":"journal-article","created":{"date-parts":[[2020,3,10]],"date-time":"2020-03-10T08:55:45Z","timestamp":1583830545000},"page":"641-661","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":10,"title":["Fine-grained analysis of language varieties and demographics"],"prefix":"10.1017","volume":"26","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6583-3682","authenticated-orcid":false,"given":"Francisco","family":"Rangel","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Paolo","family":"Rosso","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wajdi","family":"Zaghouani","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Anis","family":"Charfi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"56","published-online":{"date-parts":[[2020,3,10]]},"reference":[{"unstructured":"Kestemont, M. , Tschuggnall, M. , Stamatatos, E. , Daelemans, W. , Specht, G. , Stein, B. and Potthast, M. (2018). Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.","key":"S1351324920000108_ref17"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref19","DOI":"10.1007\/BF02295996"},{"unstructured":"Lui, M. and Cook, P. (2013). Classifying english documents by national dialect. In Proceedings of the Australasian Language Technology Association Workshop, Citeseer pp. 5\u201315.","key":"S1351324920000108_ref18"},{"unstructured":"Basile, A. , Dwyer, G. , Medvedeva, M. , Rawee, J. , Haagsma, H. and Nissim, M. (2017). Is there life beyond n-grams? A simple SVM-based author profiling system. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds), CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http:\/\/ceur-ws.org\/Vol-\/. CLEF and CEUR-WS.org.","key":"S1351324920000108_ref2"},{"unstructured":"Elfardy, H. and Diab, M.T. (2013). Sentence level dialect identification in arabic. In Association for Computational Linguistics (ACL), pp. 456\u2013461.","key":"S1351324920000108_ref6"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref31","DOI":"10.1016\/0306-4573(88)90021-0"},{"unstructured":"Zaghouani, W. and Charfi, A. (2018a). ArapTweet: A large MultiDialect Twitter corpus for gender, age and language variety identification. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan.","key":"S1351324920000108_ref35"},{"unstructured":"Zampieri, M. , Tan, L. , Ljube\u0161i\u0107, N. , Tiedemann, J. and Nakov, P. (2015). Overview of the DSL shared task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 1\u20139.","key":"S1351324920000108_ref40"},{"unstructured":"Huang, C.-R. and Lee, L.-H. (2008). Contrastive approach towards text source classification based on top-bag-of-word similarity. In PACLIC, pp. 404\u2013410.","key":"S1351324920000108_ref14"},{"key":"S1351324920000108_ref23","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1016\/j.ipm.2015.06.003","article-title":"On the impact of emotions on author profiling","volume":"52","author":"Rangel","year":"2016","journal-title":"Information Processing and Management"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref37","DOI":"10.1162\/COLI_a_00169"},{"unstructured":"Grouin, C. , Forest, D. , Paroubek, P. and Zweigenbaum, P. (2011). Pr\u00e9sentation et r\u00e9sultats du d\u00e9fi fouille de texte DEFT2011 Quand un article de presse a t-il \u00e9t\u00e9 \u00e9crit? \u00c0 quel article scientifique correspond ce r\u00e9sum\u00e9? Actes du septi\u00e8me D\u00e9fi Fouille de Textes, p. 3.","key":"S1351324920000108_ref9"},{"unstructured":"Martinc, M. , Skrjanec, I. , Zupan, K. and Pollak, S. Pan (2017). Author profiling \u2013 gender and language variety prediction. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds), CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http:\/\/ceur-ws.org\/Vol-\/. CLEF and CEUR-WS.org.","key":"S1351324920000108_ref22"},{"unstructured":"Rangel, F. , Rosso, P. and Franco-Salvador, M. (2016b). A low dimensionality representation for language variety identification. In 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, LNCS. Springer-Verlag, arxiv:1705.10754.","key":"S1351324920000108_ref24"},{"key":"S1351324920000108_ref29","first-page":"17","article-title":"Profile of a Terrorist","volume":"1","author":"Russell","year":"1977","journal-title":"Studies in Conflict and Terrorism"},{"key":"S1351324920000108_ref8","first-page":"534","article-title":"Variability and mutability, contribution to the study of statistical distributions and relations. Studi cconomico-giuridici della r. Universita de Cagliari. Reviewed in: Light R.J. and Margolin B.H. An analysis of variance for categorical data","volume":"66","author":"Gini","year":"1912\/1971","journal-title":"Journal of American Statistical Association"},{"unstructured":"Hagen, M. , Potthast, M. and Stein, B. (2018). Overview of the Author Obfuscation Task at PAN 2018. CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.","key":"S1351324920000108_ref10"},{"unstructured":"Zampieri, M. and Gebre, B.G. (2012). Automatic identification of language varieties: The case of portuguese. In The 11th Conference on Natural Language Processing (KONVENS), pp. 233\u2013237 (2012)","key":"S1351324920000108_ref38"},{"unstructured":"Rangel, F. , Rosso, P. , Montes-y-G\u00f3mez, M. , Potthast, M. and Stein, B. (2018). Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter. In CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.","key":"S1351324920000108_ref26"},{"key":"S1351324920000108_ref11","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-031-02139-8","volume-title":"Introduction to Arabic Natural Language Processing","volume":"3","author":"Habash","year":"2010"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref12","DOI":"10.1007\/BF00302543"},{"unstructured":"Inches, G. and Crestani, F. (2012). Overview of the International Sexual Predator Identification Competition at PAN-2012. CLEF Online working notes\/labs\/workshop, vol. 30.","key":"S1351324920000108_ref15"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref39","DOI":"10.3115\/v1\/W14-5307"},{"key":"S1351324920000108_ref34","article-title":"Sentence-level dialects identification in the Greater China region","volume":"5","author":"Xu","year":"2016","journal-title":"International Journal on Natural Language Computing (IJNLC)"},{"unstructured":"Rosso, P. , Rangel Pardo, F.M. , Ghanem, B. and Charfi, A. (2018b). ARAP: Arabic Author Profiling Project for Cyber-Security. Sociedad Espa\u00f1ola para el Procesamiento del Lenguaje Natural (SEPLN).","key":"S1351324920000108_ref28"},{"volume-title":"Digital Crime and Digital Terrorism","year":"2014","author":"Taylor","key":"S1351324920000108_ref32"},{"unstructured":"Agi\u0107, \u017d. , Tiedemann, J. , Dobrovoljc, K. , Krek, S. , Merkler, D. , Mo\u017ee, S. , Nakov, P. , Osenova, P. and Vertan, C. (2014). Proceedings of the EMNLP 2014 Workshop on Language Technology for Closely Related Languages and Language Variants. Association for Computational Linguistics.","key":"S1351324920000108_ref1"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref30","DOI":"10.3115\/v1\/W14-5904"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref7","DOI":"10.1007\/978-3-319-24027-5_3"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref27","DOI":"10.1111\/lnc3.12275"},{"unstructured":"Malmasi, S. , Zampieri, M. , Ljube\u0161i\u0107, N. , Nakov, P. , Ali, A. and Tiedemann, J. (2016). Discriminating between similar languages and arabic dialect identification: A report on the third DSL shared task. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pp. 1\u201314.","key":"S1351324920000108_ref21"},{"unstructured":"Rangel, F. , Rosso, P. , Potthast, M. and Stein, B. (2017). Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In Cappellato L., Ferro N., Goeuriot, L. and Mandl T. (eds), Working Notes Papers of the CLEF 2017 Evaluation Labs, p. 1613\u20130073, CLEF and CEUR-WS.org.","key":"S1351324920000108_ref25"},{"doi-asserted-by":"crossref","unstructured":"Zampieri, M. , Malmasi, S. , Ljube\u0161i\u0107, N. , Nakov, P. , Ali, A. , Tiedemann, J. , Scherrer, Y. , Aepli, N. (2017). Findings of the vardial evaluation campaign 2017. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 1\u201315.","key":"S1351324920000108_ref41","DOI":"10.18653\/v1\/W17-1201"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref3","DOI":"10.1016\/j.csl.2013.04.007"},{"doi-asserted-by":"crossref","unstructured":"Maier, W. and G\u00f3mez-Rodr\u00edguez, C. (2014). Language Variety Identification in Spanish Tweets. LT4CloseLang.","key":"S1351324920000108_ref20","DOI":"10.3115\/v1\/W14-4204"},{"key":"S1351324920000108_ref4","first-page":"467","article-title":"Method of moments","volume":"5","author":"Bowman","year":"1985","journal-title":"Encyclopedia of Statistical Sciences"},{"doi-asserted-by":"crossref","unstructured":"Castro, D. , Souza, E. , de Oliveira, A.L.I. (2016). Discriminating between Brazilian and European Portuguese national varieties on Twitter texts. In 5th Brazilian Conference on Intelligent Systems (BRACIS), pp. 265\u2013270.","key":"S1351324920000108_ref5","DOI":"10.1109\/BRACIS.2016.056"},{"unstructured":"Zaghouani, W. and Charfi, A. (2018b). Guidelines and annotation framework for Arabic author profiling. In Proceedings of the 3rd Workshop on Open-Source Arabic Corpora and Processing Tools, 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan.","key":"S1351324920000108_ref36"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref13","DOI":"10.1016\/j.ipm.2014.11.001"},{"unstructured":"Tellez, E.S. , Miranda-Jim\u00e9nez, S. , Graff, M. and Moctezuma, D. (2017). Gender and language variety identification with microtc. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds). CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http:\/\/ceur-ws.org\/Vol-\/. CLEF and CEUR-WS.org.","key":"S1351324920000108_ref33"},{"doi-asserted-by":"publisher","key":"S1351324920000108_ref16","DOI":"10.1145\/2517840.2517865"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324920000108","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,10,18]],"date-time":"2022-10-18T04:48:32Z","timestamp":1666068512000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324920000108\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,10]]},"references-count":41,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2020,11]]}},"alternative-id":["S1351324920000108"],"URL":"https:\/\/doi.org\/10.1017\/s1351324920000108","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"type":"print","value":"1351-3249"},{"type":"electronic","value":"1469-8110"}],"subject":[],"published":{"date-parts":[[2020,3,10]]},"assertion":[{"value":"\u00a9 Cambridge University Press 2020","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}}]}}