{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T11:40:19Z","timestamp":1778067619451,"version":"3.51.4"},"reference-count":80,"publisher":"Cambridge University Press (CUP)","issue":"5","license":[{"start":{"date-parts":[[2022,6,20]],"date-time":"2022-06-20T00:00:00Z","timestamp":1655683200000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2023,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Hate speech is any kind of communication that attacks a person or a group based on their characteristics, such as gender, religion and race. Due to the availability of online platforms where people can express their (hateful) opinions, the amount of hate speech is steadily increasing that often leads to offline hate crimes. This paper focuses on understanding and detecting hate speech in underground hacking and extremist forums where cybercriminals and extremists, respectively, communicate with each other, and some of them are associated with criminal activity. Moreover, due to the lengthy posts, it would be beneficial to identify the specific span of text containing hateful content in order to assist site moderators with the removal of hate speech. This paper describes a hate speech dataset composed of posts extracted from HackForums, an online hacking forum, and Stormfront and Incels.co, two extremist forums. We combined our dataset with a Twitter hate speech dataset to train a multi-platform classifier. Our evaluation shows that a classifier trained on multiple sources of data does not always improve the performance compared to a mono-platform classifier. Finally, this is the first work on extracting hate speech spans from longer texts. 
The paper fine-tunes BERT (Bidirectional Encoder Representations from Transformers) and adopts two approaches \u2013 span prediction and sequence labelling. Both approaches successfully extract hateful spans and achieve an F1-score of at least 69%.<\/jats:p>","DOI":"10.1017\/s1351324922000262","type":"journal-article","created":{"date-parts":[[2022,6,20]],"date-time":"2022-06-20T08:44:26Z","timestamp":1655714666000},"page":"1247-1274","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":15,"title":["Automated hate speech detection and span extraction in underground hacking and extremist forums"],"prefix":"10.1017","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5503-9701","authenticated-orcid":false,"given":"Linda","family":"Zhou","sequence":"first","affiliation":[]},{"given":"Andrew","family":"Caines","sequence":"additional","affiliation":[]},{"given":"Ildiko","family":"Pete","sequence":"additional","affiliation":[]},{"given":"Alice","family":"Hutchings","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2022,6,20]]},"reference":[{"key":"S1351324922000262_ref6","first-page":"1137","article-title":"A neural probabilistic language model","volume":"3","author":"Bengio","year":"2003","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324922000262_ref52","unstructured":"Mikolov, T. , Chen, K. , Corrado, G. and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop."},{"key":"S1351324922000262_ref56","doi-asserted-by":"publisher","DOI":"10.1145\/3178876.3186178"},{"key":"S1351324922000262_ref63","doi-asserted-by":"publisher","DOI":"10.1186\/s13673-019-0205-6"},{"key":"S1351324922000262_ref57","doi-asserted-by":"crossref","unstructured":"Pastrana, S. , Hutchings, A. , Caines, A. and Buttery, P. (2018b). Characterizing eve: Analysing cybercrime actors in a large underground forum. 
In Proceedings of 21st International Symposium, pp. 207\u2013227.","DOI":"10.1007\/978-3-030-00470-5_10"},{"key":"S1351324922000262_ref48","unstructured":"Liu, Y. , Ott, M. , Goyal, N. , Du, J. , Joshi, M. , Chen, D. , Levy, O. , Lewis, M. , Zettlemoyer, L. and Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692."},{"key":"S1351324922000262_ref74","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-3012"},{"key":"S1351324922000262_ref22","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-3504"},{"key":"S1351324922000262_ref11","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-020-09488-3"},{"key":"S1351324922000262_ref23","doi-asserted-by":"crossref","unstructured":"Davidson, T. , Warmsley, D. , Macy, M. and Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. In\u00a0Proceedings of the International AAAI Conference on Web and Social Media, vol. 11, pp. 512\u2013515.","DOI":"10.1609\/icwsm.v11i1.14955"},{"key":"S1351324922000262_ref64","first-page":"69","article-title":"Spinning the Web of hate: Web-based hate propagation by extremist organizations","volume":"9","author":"Schafer","year":"2002","journal-title":"Journal of Criminal Justice and Popular Culture"},{"key":"S1351324922000262_ref68","doi-asserted-by":"crossref","first-page":"2709","DOI":"10.1177\/1077801221996453","article-title":"\u201cI Don\u2019t Hate All Women, Just Those Stuck-Up Bitches\u201d: How incels and mainstream pornography speak the same extreme language of misogyny","volume":"27","author":"Tranchese","year":"2021","journal-title":"Violence Against Women"},{"key":"S1351324922000262_ref9","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"S1351324922000262_ref39","article-title":"Male supremacism and the Hanau terrorist attack: between online misogyny and far-right violence","volume":"20","author":"Jasser","year":"2020","journal-title":"The International Centre for 
Counter-Terrorism\u2013The Hague"},{"key":"S1351324922000262_ref8","unstructured":"Binny, M. , Saha, P. , Yimam, S.M. , Biemann, C. , Goyal, P. and Mukherjee, A. (2021). HateXplain: A benchmark dataset for explainable hate speech detection. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 14867\u201314875."},{"key":"S1351324922000262_ref67","unstructured":"Stricker, G. (2014). The 2014 #YearOnTwitter. https:\/\/blog.twitter.com\/official\/en_us\/a\/2014\/the-2014-yearontwitter.html (accessed May 2021)."},{"key":"S1351324922000262_ref16","doi-asserted-by":"crossref","unstructured":"Chhablani, G. , Bhartia, Y. , Sharma, A. , Pandey, H. and Suthaharan, S. (2021). NLRG at SemEval-2021 Task 5: Toxic Spans Detection Leveraging BERT-based Token Classification and Span Prediction Techniques. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021). Association for Computational Linguistics, pp. 233\u2013242.","DOI":"10.18653\/v1\/2021.semeval-1.27"},{"key":"S1351324922000262_ref41","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.semeval-1.187"},{"key":"S1351324922000262_ref46","unstructured":"Lafferty, J.D. , McCallum, A. and Pereira, F.C.N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., pp. 282\u2013289."},{"key":"S1351324922000262_ref36","unstructured":"Wiki, Incels . (2018). https:\/\/incels.wiki\/w\/Main_Page (accessed May 2021)."},{"key":"S1351324922000262_ref14","doi-asserted-by":"publisher","DOI":"10.1186\/s40163-018-0094-4"},{"key":"S1351324922000262_ref45","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v27i1.8539"},{"key":"S1351324922000262_ref62","unstructured":"Reja, M. (2021). Trump\u2019s \u2018Chinese Virus\u2019 tweet helped lead to rise in racist anti-Asian Twitter content: Study. 
https:\/\/abcnews.go.com\/Health\/trumps-chinese-virus-tweet-helped-lead-rise-racist\/story?id=76530148 (accessed May 2021)."},{"key":"S1351324922000262_ref13","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-5109"},{"key":"S1351324922000262_ref55","doi-asserted-by":"publisher","DOI":"10.1109\/ICCMC.2018.8488096"},{"key":"S1351324922000262_ref5","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S19-2007"},{"key":"S1351324922000262_ref27","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-73003-5_304"},{"key":"S1351324922000262_ref34","unstructured":"Hinton, G.E. , Srivastava, N. , Krizhevsky, A. , Sutskever, I. and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580."},{"key":"S1351324922000262_ref54","doi-asserted-by":"crossref","unstructured":"Mozafari, M. , Farahbakhsh, R. and No\u00ebl, C. (2019). A BERT-Based Transfer Learning Approach for Hate Speech Detection in Online Social Media. In\u00a0International Conference on Complex Networks and Their Applications, pp. 928\u2013940.","DOI":"10.1007\/978-3-030-36687-2_77"},{"key":"S1351324922000262_ref60","doi-asserted-by":"publisher","DOI":"10.1109\/KSE.2019.8919368"},{"key":"S1351324922000262_ref53","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.semeval-1.228"},{"key":"S1351324922000262_ref76","doi-asserted-by":"publisher","DOI":"10.1093\/bjc\/azz064"},{"key":"S1351324922000262_ref79","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-93417-4_48"},{"key":"S1351324922000262_ref26","doi-asserted-by":"crossref","first-page":"378","DOI":"10.1037\/h0031619","article-title":"Measuring nominal scale agreement among many raters","volume":"76","author":"Fleiss","year":"1971","journal-title":"Psychological Bulletin"},{"key":"S1351324922000262_ref29","unstructured":"Gokaslan, A. and Cohen, V. (2019). OpenWebText Corpus. 
http:\/\/Skylion007.github.io\/OpenWebTextCorpus (accessed May 2021)."},{"key":"S1351324922000262_ref38","doi-asserted-by":"publisher","DOI":"10.1075\/jlac.00026.jak"},{"key":"S1351324922000262_ref35","unstructured":"Holpuch, A. (2014). Almost 100 hate-crime murders linked to single website. https:\/\/www.theguardian.com\/world\/2014\/apr\/18\/hate-crime-murders-website-stormfront-report (accessed May 2021)."},{"key":"S1351324922000262_ref43","first-page":"46","article-title":"Reclaiming critical analysis: The social harms of \u2018bitch.\u2019","volume":"3","author":"Kleinman","year":"2009","journal-title":"Sociological Analysis"},{"key":"S1351324922000262_ref50","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00227"},{"key":"S1351324922000262_ref44","unstructured":"Krebs, B. (2017). Who Is Marcus Hutchins?. https:\/\/krebsonsecurity.com\/2017\/09\/who-is-marcus-hutchins\/ (accessed May 2021)."},{"key":"S1351324922000262_ref59","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"S1351324922000262_ref40","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00300"},{"key":"S1351324922000262_ref65","unstructured":"Smith, K.L. (2018). Twitter Is Deleting Accounts And These Are The Words That Might Get You Suspended. https:\/\/www.popbuzz.com\/internet\/social-media\/twitter-account-suspension-trigger-words\/ (accessed May 2021)."},{"key":"S1351324922000262_ref3","unstructured":"Assimakopoulos, S. , Vella Muskat, R. , van der Plas, L. and Gatt, A. (2020). Annotating for hate speech: The MaNeCo corpus and some input from critical discourse analysis. In Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, pp. 5088\u20135097."},{"key":"S1351324922000262_ref51","doi-asserted-by":"publisher","DOI":"10.1145\/3292522.3326034"},{"key":"S1351324922000262_ref66","doi-asserted-by":"publisher","DOI":"10.1108\/eb026526"},{"key":"S1351324922000262_ref72","unstructured":"Vu, A.V. , Wilson, L. , Chua, Y.T. 
, Shumailov, I. and Anderson, R. (2021). ExtremeBB: Enabling Large-Scale Research into Extremism, the Manosphere and Their Correlation by Online Forum Data. arXiv preprint arXiv:2111.04479."},{"key":"S1351324922000262_ref18","unstructured":"Cohn, D. (2010). Active learning. In Encyclopedia of Machine Learning, vol. 32. USA: Springer, pp. 10\u201314."},{"key":"S1351324922000262_ref25","unstructured":"Devlin, J. , Chang, M. , Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pp. 4171\u20134186."},{"key":"S1351324922000262_ref32","unstructured":"Harrison, S. (2019). Twitter and Instagram Unveil New Ways to Combat Hate\u2014Again. https:\/\/www.wired.com\/story\/twitter-instagram-unveil-new-ways-combat-hate-again\/ (accessed May 2021)."},{"key":"S1351324922000262_ref2","doi-asserted-by":"publisher","DOI":"10.3390\/app10238614"},{"key":"S1351324922000262_ref1","unstructured":"Abadi, M. , Agarwal, A. , Barham, P. , Chen, J. , Chen, Z. , Davis, A. , Dean, J. , Devin, M. , Ghemawat, S. , Irving, G. , Isard, M. , Kudlur, M. , Levenberg, J. , Monga, R. , Moore, S. , Murray, D.G. , Steiner, B. , Tucker, P. , Vasudevan, V. , Warden, P. , Wicke, M. , Yu, Y. and Zheng, X. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201916), pp. 265\u2013283."},{"key":"S1351324922000262_ref28","doi-asserted-by":"publisher","DOI":"10.1111\/j.1530-2415.2003.00013.x"},{"key":"S1351324922000262_ref77","doi-asserted-by":"crossref","unstructured":"Wolf, T. , Debut, L. , Sanh, V. , Chaumond, J. , Delangue, C. , Moi, A. , Cistac, P. , Rault, T. , Louf, R. , Funtowicz, M. , Davison, J. , Shleifer, S. 
, von Platen, P. , Ma, C. , Jernite, Y. , Plu, J. , Xu, C. , Le Scao, T. , Gugger, S. , Drame, M. , Lhoest, Q. and Rush, A. (2020). Transformers: State-of-the-Art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, pp. 38\u201345.","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"S1351324922000262_ref4","doi-asserted-by":"publisher","DOI":"10.1145\/3041021.3054223"},{"key":"S1351324922000262_ref61","doi-asserted-by":"crossref","unstructured":"Rajpurkar, P. , Zhang, J. , Lopyrev, K. and Liang, P. (2016). SQuAD: 100000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 2383\u20132392.","DOI":"10.18653\/v1\/D16-1264"},{"key":"S1351324922000262_ref69","unstructured":"UN. (2020). https:\/\/www.un.org\/en\/genocideprevention\/documents\/UN"},{"key":"S1351324922000262_ref58","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324922000262_ref15","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-2029"},{"key":"S1351324922000262_ref78","doi-asserted-by":"crossref","unstructured":"Wulczyn, E. , Thain, N. and Dixon, L. (2016). Ex Machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web (WWW\u201917). International World Wide Web Conferences Steering Committee, pp. 1391\u20131399.","DOI":"10.1145\/3038912.3052591"},{"key":"S1351324922000262_ref71","unstructured":"Vu, X. , Vu, T. , Tran, M. , Le-Cong, T. and Nguyen, H.T.M. (2020). HSD Shared Task in VLSP Campaign 2019:Hate Speech Detection for Social Good. 
arXiv preprint arXiv:2007.06493."},{"key":"S1351324922000262_ref12","doi-asserted-by":"publisher","DOI":"10.1080\/10576100903259951"},{"key":"S1351324922000262_ref24","doi-asserted-by":"crossref","unstructured":"de Gibert, O. , Perez, N. , Garc\u00eda-Pablos, A. and Cuadros, M. (2018). Hate speech dataset from a white supremacy forum. In Proceedings of the 2nd Workshop on Abusive Language Online. Association for Computational Linguistics, pp. 11\u201320.","DOI":"10.18653\/v1\/W18-5102"},{"key":"S1351324922000262_ref80","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.11"},{"key":"S1351324922000262_ref10","doi-asserted-by":"publisher","DOI":"10.1145\/130385.130401"},{"key":"S1351324922000262_ref17","unstructured":"Chollet, F. (2015). Keras. https:\/\/keras.io (accessed April 2021)."},{"key":"S1351324922000262_ref21","unstructured":"Daum\u00e9 III, H. (2009). Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, pp. 256\u2013263."},{"key":"S1351324922000262_ref37","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S19-2009"},{"key":"S1351324922000262_ref47","doi-asserted-by":"publisher","DOI":"10.2307\/2529310"},{"key":"S1351324922000262_ref70","unstructured":"Vaswani, A. , Shazeer, N. , Parmar, N. , Uszkoreit, J. , Jones, L. , Gomez, A.N. , Kaiser, \u0141. and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS\u201917). Curran Associates Inc., pp. 6000\u20136010."},{"key":"S1351324922000262_ref49","doi-asserted-by":"publisher","DOI":"10.1145\/3368567.3368584"},{"key":"S1351324922000262_ref30","doi-asserted-by":"publisher","DOI":"10.1613\/jair.4992"},{"key":"S1351324922000262_ref33","unstructured":"Hatebase Inc. (2020). 
https:\/\/hatebase.org\/ (accessed January 2021)."},{"key":"S1351324922000262_ref20","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.semeval-1.186"},{"key":"S1351324922000262_ref19","unstructured":"Corazza, M. , Menini, S. , Cabrio, E. , Tonelli, S. and Villata, S. (2019). Cross-platform evaluation for Italian hate speech detection. In CLiC-it 2019 \u2013 6th Annual Conference of the Italian Association for Computational Linguistics, vol. 2481."},{"key":"S1351324922000262_ref75","doi-asserted-by":"crossref","unstructured":"Waseem, Z. and Hovy, D. (2016). Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop. Association for Computational Linguistics, pp. 88\u201393.","DOI":"10.18653\/v1\/N16-2013"},{"key":"S1351324922000262_ref42","unstructured":"Kingma, D.P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations. Conference Track Proceedings, pp. 11\u201320."},{"key":"S1351324922000262_ref7","doi-asserted-by":"crossref","unstructured":"Bhalerao, R. , Aliapoulios, M. , Shumailov, I. , Afroz, S. , Mccoy, D. , Levchenko, K. and Paxson, V. (2018). Mapping the Underground: Towards Automatic Discovery of Cybercrime Supply Chains. 16. arXiv preprint arXiv:1812.00381.","DOI":"10.1109\/eCrime47957.2019.9037582"},{"key":"S1351324922000262_ref73","unstructured":"Warner, W. and Hirschberg, J. (2012). Detecting hate speech on the World Wide Web. In Proceedings of the Second Workshop on Language in Social Media. Association for Computational Linguistics, pp. 
19\u201326."},{"key":"S1351324922000262_ref31","first-page":"7","article-title":"Preprocessing techniques for text mining - An overview","volume":"5","author":"Gurusamy","year":"2015","journal-title":"International Journal of Computer Science and Communication Networks"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324922000262","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,11]],"date-time":"2023-09-11T02:07:11Z","timestamp":1694398031000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324922000262\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,20]]},"references-count":80,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2023,9]]}},"alternative-id":["S1351324922000262"],"URL":"https:\/\/doi.org\/10.1017\/s1351324922000262","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,20]]},"assertion":[{"value":"\u00a9 The Author(s), 2022. Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https:\/\/creativecommons.org\/licenses\/by\/4.0\/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.","name":"license","label":"License","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This content has been made available to all.","name":"free","label":"Free to read"}]}}