{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T15:41:51Z","timestamp":1778168511328,"version":"3.51.4"},"reference-count":47,"publisher":"Cambridge University Press (CUP)","issue":"6","license":[{"start":{"date-parts":[[2023,8,10]],"date-time":"2023-08-10T00:00:00Z","timestamp":1691625600000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2023,11]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We introduce a generic, language-independent method to collect a large percentage of offensive and hate tweets regardless of their topics or genres. We harness the extralinguistic information embedded in the emojis to collect a large number of offensive tweets. We apply the proposed method on Arabic tweets and compare it with English tweets\u2014analyzing key cultural differences. We observed a constant usage of these emojis to represent offensiveness throughout different timespans on Twitter. We manually annotate and publicly release the largest Arabic dataset for <jats:italic>offensive<\/jats:italic>, <jats:italic>fine-grained hate speech<\/jats:italic>, <jats:italic>vulgar,<\/jats:italic> and <jats:italic>violence<\/jats:italic> content. Furthermore, we benchmark the dataset for detecting offensiveness and hate speech using different transformer architectures and perform in-depth linguistic analysis. We evaluate our models on external datasets\u2014a Twitter dataset collected using a completely different method, and a multi-platform dataset containing comments from Twitter, YouTube, and Facebook, for assessing generalization capability. Competitive results on these datasets suggest that the data collected using our method capture universal characteristics of offensive language. Our findings also highlight the common words used in offensive communications, common targets for hate speech, specific patterns in violence tweets, and pinpoint common classification errors that can be attributed to limitations of NLP models. We observe that even state-of-the-art transformer models may fail to take into account culture, background, and context or understand nuances present in real-world data such as sarcasm.<\/jats:p>","DOI":"10.1017\/s1351324923000402","type":"journal-article","created":{"date-parts":[[2023,8,10]],"date-time":"2023-08-10T04:45:03Z","timestamp":1691642703000},"page":"1436-1457","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":37,"title":["Emojis as anchors to detect Arabic offensive language and hate speech"],"prefix":"10.1017","volume":"29","author":[{"given":"Hamdy","family":"Mubarak","sequence":"first","affiliation":[]},{"given":"Sabit","family":"Hassan","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1331-2543","authenticated-orcid":false,"given":"Shammur Absar","family":"Chowdhury","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2023,8,10]]},"reference":[{"key":"S1351324923000402_ref3","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.semeval-1.275"},{"key":"S1351324923000402_ref32","first-page":"237","volume-title":"Social Informatics","author":"Mubarak","year":"2020"},{"key":"S1351324923000402_ref4","doi-asserted-by":"publisher","DOI":"10.1109\/ASONAM.2018.8508247"},{"key":"S1351324923000402_ref44","doi-asserted-by":"publisher","DOI":"10.1186\/s13673-019-0205-6"},{"key":"S1351324923000402_ref5","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3033666"},{"key":"S1351324923000402_ref18","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"S1351324923000402_ref20","volume-title":"Proceedings of the 15th International Workshop on Semantic Evaluation, SemEval","volume":"21","author":"Dimitrov","year":"2021"},{"key":"S1351324923000402_ref43","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939778"},{"key":"S1351324923000402_ref19","volume-title":"Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP\u201921)","author":"Dimitrov","year":"2021"},{"key":"S1351324923000402_ref21","first-page":"118","volume-title":"Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis","author":"Donato","year":"2017"},{"key":"S1351324923000402_ref25","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.semeval-1.249"},{"key":"S1351324923000402_ref45","doi-asserted-by":"publisher","DOI":"10.4159\/harvard.9780674065086"},{"key":"S1351324923000402_ref12","first-page":"6203","volume-title":"Proceedings of The 12th Language Resources and Evaluation Conference","author":"Chowdhury","year":"2020"},{"key":"S1351324923000402_ref41","unstructured":"Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines."},{"key":"S1351324923000402_ref47","volume-title":"Proceedings of SemEval","author":"Zampieri","year":"2020"},{"key":"S1351324923000402_ref14","doi-asserted-by":"crossref","unstructured":"Conneau, A. , Khandelwal, K. , Goyal, N. , Chaudhary, V. , Wenzek, G. , Guzm\u00e1n, F. , Grave, E. , Ott, M. , Zettlemoyer, L. and Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv: 1911.","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"S1351324923000402_ref39","unstructured":"Nakov, P. , Nayak, V. , Dent, K. , Bhatawdekar, A. , Sarwar, S. M. , Hardalov, M. , Dinkov, Y. , Zlatkova, D. , Bouchard, G. and Augenstein, I. (2021). Detecting abusive language on online platforms: A critical analysis. arXiv preprint arXiv: 2103.00153."},{"key":"S1351324923000402_ref46","first-page":"369","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume","author":"Wiegand","year":"2021"},{"key":"S1351324923000402_ref15","first-page":"89","article-title":"Political polarization on twitter","volume":"133","author":"Conover","year":"2011","journal-title":"ICWSM"},{"key":"S1351324923000402_ref10","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2014.2357012"},{"key":"S1351324923000402_ref17","volume-title":"Proceedings of the International AAAI Conference on Web and Social Media","volume":"11","author":"Davidson","year":"2017"},{"key":"S1351324923000402_ref13","doi-asserted-by":"crossref","unstructured":"Chung, Y.-L. , Kuzmenko, E. , Tekiroglu, S. S. and Guerini, M. (2019). Conan\u2013counter narratives through nichesourcing: A multilingual dataset of responses to fight online hate speech. arXiv preprint arXiv: 1910.","DOI":"10.18653\/v1\/P19-1271"},{"key":"S1351324923000402_ref11","first-page":"226","volume-title":"Proceedings of the Fifth Arabic Natural Language Processing Workshop","author":"Chowdhury","year":"2020"},{"key":"S1351324923000402_ref38","unstructured":"Mubarak, H. , Rashed, A. , Darwish, K. , Samih, Y. and Abdelali, A. (2020c). Arabic offensive language on twitter: Analysis and experiments. arXiv preprint arXiv: 2004.02192."},{"key":"S1351324923000402_ref27","volume-title":"OSACT","volume":"4","author":"Husain","year":"2020"},{"key":"S1351324923000402_ref36","volume-title":"Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection","author":"Mubarak","year":"2020"},{"key":"S1351324923000402_ref1","unstructured":"Abdelali, A. , Hassan, S. , Mubarak, H. , Darwish, K. and Samih, Y. (2021a). Pre-training BERT on arabic tweets: Practical considerations. CoRR, abs\/2102.10684."},{"key":"S1351324923000402_ref33","first-page":"1","volume-title":"Proceedings of the EMNLP. 2014 Workshop on Arabic Natural Language Processing (ANLP)","author":"Mubarak","year":"2014"},{"key":"S1351324923000402_ref6","first-page":"12","volume-title":"Proceedings of the Fifth Arabic Natural Language Processing Workshop","author":"Alshaalan","year":"2020"},{"key":"S1351324923000402_ref2","first-page":"1","volume-title":"Proceedings of the Sixth Arabic Natural Language Processing Workshop","author":"Abdelali","year":"2021"},{"key":"S1351324923000402_ref35","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-3008"},{"key":"S1351324923000402_ref29","unstructured":"Kiela, D. , Firooz, H. , Mohan, A. , Goswami, V. , Singh, A. , Ringshia, P. and Testuggine, D. (2020). The hateful memes challenge: Detecting hate speech in multimodal memes. arXiv preprint arXiv: 2005.04790."},{"key":"S1351324923000402_ref40","doi-asserted-by":"crossref","unstructured":"Ousidhoum, N. , Lin, Z. , Zhang, H. , Song, Y. and Yeung, D.-Y. (2019). Multilingual and multi-aspect hate speech analysis. arXiv preprint arXiv: 1908.11049.","DOI":"10.18653\/v1\/D19-1474"},{"key":"S1351324923000402_ref22","volume-title":"Kurzfassung eines (auf Deutsch) zur Publikation eingereichten Manuskripts","author":"D\u00fcrscheid","year":"2017"},{"key":"S1351324923000402_ref23","doi-asserted-by":"publisher","DOI":"10.1016\/j.sbspro.2010.03.602"},{"key":"S1351324923000402_ref24","first-page":"113","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations","author":"Hassan","year":"2021"},{"key":"S1351324923000402_ref8","first-page":"9","volume-title":"Proceedings of The 4th Workshop on Open-Source Arabic Corpora and Processing Tools","author":"Antoun","year":"2020"},{"key":"S1351324923000402_ref28","first-page":"71","volume-title":"9th International Conference on Social Computing and Social Media, SCSM. 2017 held as part of the 19th International Conference on Human-Computer Interaction, HCI International 2017","author":"Intapong","year":"2017"},{"key":"S1351324923000402_ref16","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-67217-5_7"},{"key":"S1351324923000402_ref26","article-title":"Cross-lingual emotion detection","author":"Hassan","year":"2021","journal-title":"CoRR"},{"key":"S1351324923000402_ref9","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2978950"},{"key":"S1351324923000402_ref31","doi-asserted-by":"publisher","DOI":"10.1145\/3308560.3316541"},{"key":"S1351324923000402_ref30","doi-asserted-by":"publisher","DOI":"10.2307\/2529310"},{"key":"S1351324923000402_ref34","volume-title":"Weaving Relations of Trust in Crowd Work: Transparency and Reputation across Platforms","author":"Mubarak","year":"2016"},{"key":"S1351324923000402_ref42","volume-title":"CLiC-it","author":"Polignano","year":"2019"},{"key":"S1351324923000402_ref37","first-page":"136","volume-title":"Proceedings of the Sixth Arabic Natural Language Processing Workshop","author":"Mubarak","year":"2021"},{"key":"S1351324923000402_ref7","first-page":"15","volume-title":"TA-COS 2018: 2nd Workshop on Text Analytics for Cybersecurity and Online Safety","author":"Alshehri","year":"2018"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324923000402","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,6]],"date-time":"2023-12-06T10:48:37Z","timestamp":1701859717000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324923000402\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,10]]},"references-count":47,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2023,11]]}},"alternative-id":["S1351324923000402"],"URL":"https:\/\/doi.org\/10.1017\/s1351324923000402","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,8,10]]},"assertion":[{"value":"\u00a9 The Author(s), 2023. Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial licence (http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.","name":"license","label":"License","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This content has been made available to all.","name":"free","label":"Free to read"}]}}