{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T13:57:25Z","timestamp":1768226245752,"version":"3.49.0"},"reference-count":45,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T00:00:00Z","timestamp":1768176000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"General Authority for Defense Development","award":["262"],"award-info":[{"award-number":["262"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>This paper introduces Saudi Dialects Cyber Violence Detection (SD-CVD) corpus, a large-scale, class-balanced Saudi-dialect corpus for fine-grained cyber violence detection on online platforms. The dataset contains 88,687 Saudi Arabic tweets annotated using a three-level hierarchical scheme that assigns each tweet to one of 11 mutually exclusive classes, covering benign sentiment (positive, neutral, negative), cyberbullying, and seven hate-speech subtypes (incitement to violence, gender, national, social class, tribal, religious, and regional discrimination). To mitigate the class imbalance common in Arabic cyber violence datasets, data augmentation was applied to achieve a near-uniform class distribution. Annotation quality was ensured through multi-stage review, yielding excellent inter-annotator agreement (Fleiss\u2019 \u03ba &gt; 0.89). We evaluate three modeling paradigms: traditional machine learning with TF\u2013IDF and n-gram features (SVM, logistic regression, random forest), deep learning models trained on fixed sentence embeddings (LSTM, RNN, MLP, CNN), and fine-tuned transformer models (AraBERTv02-Twitter, CAMeLBERT-MSA). Experimental results show that transformers perform best, with AraBERTv02-Twitter achieving the highest weighted F1-score (0.882) followed by CAMeLBERT-MSA (0.869). Among non-transformer baselines, SVM is most competitive (0.853), while CNN performs worst (0.561). Overall, SD-CVD provides a high-quality benchmark and strong baselines to support future research on robust and interpretable Arabic cyber-violence detection.<\/jats:p>","DOI":"10.3390\/info17010076","type":"journal-article","created":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T10:25:23Z","timestamp":1768213523000},"page":"76","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["SD-CVD Corpus: Towards Robust Detection of Fine-Grained Cyber-Violence Across Saudi Dialects in Online Platforms"],"prefix":"10.3390","volume":"17","author":[{"given":"Abrar","family":"Alsayed","sequence":"first","affiliation":[{"name":"Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1710-9286","authenticated-orcid":false,"given":"Salma","family":"Elhag","sequence":"additional","affiliation":[{"name":"Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3018-048X","authenticated-orcid":false,"given":"Sahar","family":"Badri","sequence":"additional","affiliation":[{"name":"Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,1,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Asiri, A., and Saleh, M. (2024). Sod: A corpus for saudi offensive language detection classification. Computers, 13.","DOI":"10.3390\/computers13080211"},{"key":"ref_2","first-page":"104","article-title":"Deep learning-based approaches for abusive content detection and classification for multi-class online user-generated data","volume":"5","author":"Kaur","year":"2024","journal-title":"Int. J. Cogn. Comput. Eng."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1436","DOI":"10.1017\/S1351324923000402","article-title":"Emojis as anchors to detect arabic offensive language and hate speech","volume":"29","author":"Mubarak","year":"2023","journal-title":"Nat. Lang. Eng."},{"key":"ref_4","first-page":"200376","article-title":"Detection of arabic offensive language in social media using machine learning models","volume":"22","author":"Mousa","year":"2024","journal-title":"Intell. Syst. Appl."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Ahmad, A., Azzeh, M., Alnagi, E., Al-Haija, Q.A., Halabi, D., Aref, A., and AbuHour, Y. (2024). Hate speech detection in the arabic language: Corpus design, construction, and evaluation. Front. Artif. Intell., 7.","DOI":"10.3389\/frai.2024.1345445"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"75018","DOI":"10.1109\/ACCESS.2022.3190960","article-title":"Offensive language detection in arabic social networks using evolutionary-based classifiers learned from fine-tuned embeddings","volume":"10","author":"Shannaq","year":"2022","journal-title":"IEEE Access"},{"key":"ref_7","unstructured":"Council of Europe (2025, December 18). Cyberviolence. Council of Europe Website, 2025. Available online: https:\/\/www.coe.int\/en\/web\/cyberviolence."},{"key":"ref_8","unstructured":"Ministry of Health (2025, March 25). Available online: https:\/\/www.moh.gov.sa\/HealthAwareness\/EducationalContent\/BabyHealth\/Pages\/Bullying.aspx."},{"key":"ref_9","unstructured":"KAICIID (2025, November 03). Stop Hate Speech. Available online: https:\/\/www.kaiciid.org\/resources\/publications\/quick-guide-hate-speech-prevention."},{"key":"ref_10","unstructured":"Statista (2025, October 17). Number of Active Twitter Users in Selected Countries. Available online: https:\/\/www.statista.com\/statistics\/242606\/number-of-active-twitter-users-in-selected-countries\/."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1016\/j.procs.2017.10.094","article-title":"Arasenti-tweet: A corpus for arabic sentiment analysis of saudi tweets","volume":"117","year":"2017","journal-title":"Procedia Comput. Sci."},{"key":"ref_12","unstructured":"(2025, December 19). Saudi Vision 2030, Available online: https:\/\/www.vision2030.gov.sa\/en."},{"key":"ref_13","first-page":"1033","article-title":"A machine learning approach to cyberbullying detection in arabic tweets","volume":"80","author":"Dhiaa","year":"2024","journal-title":"Comput. Mater. Contin."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"4001","DOI":"10.1007\/s13369-021-05383-3","article-title":"A deep learning framework for automatic detection of hate speech embedded in arabic tweets","volume":"46","author":"Duwairi","year":"2021","journal-title":"Arab. J. Sci. Eng."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Charfi, A., Besghaier, M., Akasheh, R., Atalla, A., and Zaghouani, W. (2024). Hate speech detection with adhar: A multi-dialectal hate speech corpus in arabic. Front. Artif. Intell., 7.","DOI":"10.3389\/frai.2024.1391472"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"101652","DOI":"10.1016\/j.jksuci.2023.101652","article-title":"Cyberbullying detection framework for short and imbalanced arabic datasets","volume":"35","author":"Alzaqebah","year":"2023","journal-title":"J. King Saud Univ.-Comput. Inf. Sci."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"e510","DOI":"10.7717\/peerj-cs.510","article-title":"Aracust: A saudi telecom tweets corpus for sentiment analysis","volume":"7","author":"Almuqren","year":"2021","journal-title":"PeerJ Comput. Sci."},{"key":"ref_18","unstructured":"Alshaalan, R., and Al-Khalifa, H. (2020, January 12). Hate Speech Detection in Saudi Twittersphere: A Deep Learning Approach. Proceedings of the Fifth Arabic Natural Language Processing Workshop, Barcelona, Spain."},{"key":"ref_19","unstructured":"Zaghouani, W., Mubarak, H., and Biswas, M.R. (2024, January 20\u201325). So hateful! building a multi-label hate speech annotated arabic dataset. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"17","DOI":"10.21833\/ijaas.2021.10.003","article-title":"Automatic detection of cyberbullying and threatening in saudi tweets using machine learning","volume":"8","author":"Alghamdi","year":"2021","journal-title":"Int. J. Adv. Appl. Sci."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Almaliki, M., Almars, A.M., Gad, I., and Atlam, E.-S. (2023). Abmm: Arabic bert-mini model for hate-speech detection on social media. Electronics, 12.","DOI":"10.3390\/electronics12041048"},{"key":"ref_22","unstructured":"Qarah, F. (2024). Saudibert: A large language model pretrained on saudi dialect corpora. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"81793","DOI":"10.1109\/ACCESS.2024.3382836","article-title":"Emo-sl framework: Emoji sentiment lexicon using text-based features and machine learning for sentiment analysis","volume":"12","author":"Alfreihat","year":"2024","journal-title":"IEEE Access"},{"key":"ref_24","first-page":"972","article-title":"Bert-based approach to arabic hate speech and offensive language detection in twitter: Exploiting emojis and sentiment analysis","volume":"13","author":"Althobaiti","year":"2022","journal-title":"Int. J. Adv. Comput. Sci. Appl."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"110","DOI":"10.1016\/j.aej.2023.08.038","article-title":"A survey on hate speech detection and sentiment analysis using machine learning and deep learning models","volume":"80","author":"Subramanian","year":"2023","journal-title":"Alex. Eng. J."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"322","DOI":"10.1007\/s12559-021-09862-5","article-title":"Emotionally informed hate speech detection: A multi-target perspective","volume":"14","author":"Chiril","year":"2022","journal-title":"Cogn. Comput."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Al Anezi, F.Y. (2022). Arabic hate speech detection using deep recurrent neural networks. Appl. Sci., 12.","DOI":"10.3390\/app12126010"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Aldjanabi, W., Dahou, A., Al-Qaness, M.A.A., Elaziz, M.A., Helmi, A.M., and Dama\u0161evi\u010dius, R. (2021). Arabic offensive and hate speech detection using a cross-corpora multi-task learning model. Informatics, 8.","DOI":"10.3390\/informatics8040069"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"112478","DOI":"10.1109\/ACCESS.2021.3103697","article-title":"A multi-task learning approach to hate speech detection leveraging sentiment analysis","volume":"9","year":"2021","journal-title":"IEEE Access"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"121507","DOI":"10.1109\/ACCESS.2024.3452987","article-title":"Enhancing multilingual hate speech detection: From language-specific insights to cross-linguistic integration","volume":"12","author":"Hashmi","year":"2024","journal-title":"IEEE Access"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"e1966","DOI":"10.7717\/peerj-cs.1966","article-title":"A systematic literature review of hate speech identification on Arabic Twitter data: Research challenges and future directions","volume":"10","author":"Alhazmi","year":"2024","journal-title":"PeerJ Comput. Sci."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"287","DOI":"10.3758\/s13428-025-02746-8","article-title":"Measuring agreement among several raters classifying subjects into one or more (hierarchical) categories: A generalization of fleiss\u2019 kappa","volume":"57","author":"Moons","year":"2025","journal-title":"Behav. Res. Methods"},{"key":"ref_33","unstructured":"Mubarak, H., Rashed, A., Darwish, K., Samih, Y., and Abdelali, A. (2020). Arabic offensive language on twitter: Analysis and experiments. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Alrashidi, B., Jamal, A., and Alkhathlan, A. (2023). Abusive content detection in arabic tweets using multi-task learning and transformer-based models. Appl. Sci., 13.","DOI":"10.3390\/app13105825"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"101887","DOI":"10.1016\/j.inffus.2023.101887","article-title":"Enhancing social network hate detection using back translation and gpt-3 augmentations during training and test-time","volume":"99","author":"Cohen","year":"2023","journal-title":"Inf. Fusion"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Jovel, J., and Greiner, R. (2021). An introduction to machine learning approaches for biomedical research. Front. Med., 8.","DOI":"10.3389\/fmed.2021.771607"},{"key":"ref_37","first-page":"5308","article-title":"Cyberbullying detection and recognition with type determination based on machine learning","volume":"75","author":"Nahar","year":"2023","journal-title":"Comput. Mater. Contin."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"El Koshiry, A.M., Eliwa, E.H.I., El-Hafeez, T.A., and Omar, A. (2023). Arabic toxic tweet classification: Leveraging the AraBERT model. Big Data Cogn. Comput., 7.","DOI":"10.3390\/bdcc7040170"},{"key":"ref_39","unstructured":"Qarah, F. (2024). EgyBERT: A large language model pretrained on Egyptian dialect corpora. arXiv."},{"key":"ref_40","unstructured":"Inoue, G., Alhafni, B., Baimukan, N., Bouamor, H., and Habash, N. (2021). The interplay of variant, size, and task type in Arabic pre-trained language models. arXiv."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"106956","DOI":"10.1016\/j.neunet.2024.106956","article-title":"Large language model enhanced logic tensor network for stance detection","volume":"183","author":"Dai","year":"2024","journal-title":"Neural Netw."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"103214","DOI":"10.1016\/j.inffus.2025.103214","article-title":"Logic-augmented multi-decision fusion framework for stance detection on social media","volume":"122","author":"Zhang","year":"2025","journal-title":"Inf. Fusion"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Naous, T., Ryan, M., Xu, W., Ritter, A., and Van Durme, B. (2024, January 11\u201316). Having Beer After Prayer? Measuring Cultural Bias in Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand.","DOI":"10.18653\/v1\/2024.acl-long.862"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"320","DOI":"10.1007\/s10462-025-11328-1","article-title":"Content moderation by large language models: From accuracy to legitimacy","volume":"58","author":"Huang","year":"2025","journal-title":"Artif. Intell. Rev."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"1098","DOI":"10.1162\/coli_a_00524","article-title":"Bias and fairness in large language models: A survey","volume":"50","author":"Gallegos","year":"2024","journal-title":"Comput. Linguist."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/1\/76\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T10:55:05Z","timestamp":1768215305000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/1\/76"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,12]]},"references-count":45,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,1]]}},"alternative-id":["info17010076"],"URL":"https:\/\/doi.org\/10.3390\/info17010076","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,12]]}}}