{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T05:30:40Z","timestamp":1781587840968,"version":"3.54.5"},"reference-count":35,"publisher":"Emerald","issue":"2","license":[{"start":{"date-parts":[[2024,11,29]],"date-time":"2024-11-29T00:00:00Z","timestamp":1732838400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["DTA"],"published-print":{"date-parts":[[2025,4,11]]},"abstract":"<jats:sec><jats:title content-type=\"abstract-subheading\">Purpose<\/jats:title><jats:p>The occurrence and dissemination of hate videos on social media platforms can cause serious harm to both society and individuals. However, the characteristics of hate videos make the detection task more difficult. Hate content is usually presented in a relatively covert manner in videos, and textual content in videos plays an important role in hate video detection. In this work, we propose a textual context enhanced dynamic bimodal fusion (TCE-DBF) method for hate video detection.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Design\/methodology\/approach<\/jats:title><jats:p>The proposed method TCE-DBF introduces a dynamic modality gate (DMG) and a bimodal fusion transformer network to dynamically integrate multiple modalities. Moreover, in order to enhance the textual modality in videos, two types of textual context from the video are taken as the input of TCE-DBF. One is extracted from video frames in the visual modality. The other is extracted from audio in the acoustic modality. 
Specifically, TCE-DBF splits the original audio and learns a sequence representation to capture acoustic temporal information.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Findings<\/jats:title><jats:p>Hate video detection has been an active topic in recent work. However, it still faces two serious challenges. The first challenge is that hate content in videos is presented across multiple modalities. The second challenge is how to evaluate the importance of different modalities for multimodal fusion modeling. TCE-DBF aims to tackle these challenges. Experimental results on the hate video dataset HateMM demonstrate that TCE-DBF outperforms state-of-the-art methods, and the visualization results show that the textual modality plays a more important role in hate video detection. Therefore, it is vital to consider the text in videos.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Originality\/value<\/jats:title><jats:p>TCE-DBF can be used to detect hate videos on social media effectively. Besides the transcript, TCE-DBF considers text in video frames, which makes detection more accurate. Meanwhile, to better achieve multimodal fusion, TCE-DBF uses the DMG and the bimodal fusion transformer network to dynamically assign different weights to the three modalities and integrate them. 
The proposed TCE-DBF is novel in terms of capturing multimodal features, enhancing the textual modality and achieving high detection performance for hate video detection.<\/jats:p><\/jats:sec>","DOI":"10.1108\/dta-02-2024-0211","type":"journal-article","created":{"date-parts":[[2024,11,28]],"date-time":"2024-11-28T08:43:23Z","timestamp":1732783403000},"page":"201-215","source":"Crossref","is-referenced-by-count":1,"title":["TCE-DBF: textual context enhanced dynamic bimodal fusion for hate video detection"],"prefix":"10.1108","volume":"59","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3505-4925","authenticated-orcid":false,"given":"Haitao","family":"Xiong","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-3319-1100","authenticated-orcid":false,"given":"Wei","family":"Jiao","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuanyuan","family":"Cai","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"140","published-online":{"date-parts":[[2024,11,29]]},"reference":[{"key":"key2025041114531143800_ref001","doi-asserted-by":"publisher","first-page":"122136","DOI":"10.1109\/access.2022.3223444","article-title":"Mel frequency cepstral coefficient and its applications: a review","volume":"10","year":"2022","journal-title":"IEEE Access"},{"key":"key2025041114531143800_ref002","doi-asserted-by":"publisher","DOI":"10.1016\/j.bspc.2023.105592","article-title":"Dual mode information fusion with pre-trained CNN models and transformer for video-based non-invasive anaemia detection","volume":"88","year":"2024","journal-title":"Biomedical Signal Processing and Control"},{"key":"key2025041114531143800_ref003","article-title":"wav2vec 2.0: a framework for self-supervised learning of speech 
representations","year":"2020"},{"key":"key2025041114531143800_ref004","doi-asserted-by":"publisher","first-page":"1014","DOI":"10.1609\/icwsm.v17i1.22209","article-title":"HateMM: a multi-modal dataset for hate video classification","year":"2023"},{"key":"key2025041114531143800_ref005","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","year":"2019"},{"key":"key2025041114531143800_ref006","article-title":"An image is worth 16x16 words: transformers for image recognition at scale","year":"2020"},{"key":"key2025041114531143800_ref007","first-page":"457","article-title":"Multimodal compact bilinear pooling for visual question answering and visual grounding","year":"2016","journal-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing"},{"key":"key2025041114531143800_ref008","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1145\/3576913","article-title":"HateCircle and unsupervised hate speech detection incorporating emotion and contextual semantics","volume":"22","year":"2023","journal-title":"ACM Transactions on Asian and Low-Resource Language Information Processing"},{"key":"key2025041114531143800_ref009","doi-asserted-by":"publisher","first-page":"1109","DOI":"10.1145\/3437963.3441668","article-title":"Multilingual and multimodal hate speech analysis in Twitter","author":"Gretel Liz De La Pe\u00f1a Sarrac\u00e9n","year":"2021"},{"key":"key2025041114531143800_ref010","doi-asserted-by":"publisher","first-page":"63373","DOI":"10.1109\/access.2019.2916887","article-title":"Deep multimodal representation learning: a survey","volume":"7","year":"2019","journal-title":"IEEE Access"},{"key":"key2025041114531143800_ref011","doi-asserted-by":"publisher","first-page":"5995","DOI":"10.24963\/ijcai.2023\/665","article-title":"Decoding the underlying meaning of multimodal hateful 
memes","year":"2023"},{"issue":"4","key":"key2025041114531143800_ref012","doi-asserted-by":"publisher","first-page":"664","DOI":"10.1109\/tpami.2016.2598339","article-title":"Deep visual-semantic alignments for generating image descriptions","volume":"39","year":"2017","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"key2025041114531143800_ref013","article-title":"Adam: a method for stochastic","year":"2014"},{"issue":"24","key":"key2025041114531143800_ref014","doi-asserted-by":"publisher","first-page":"8788","DOI":"10.1073\/pnas.1320040111","article-title":"Experimental evidence of massive-scale emotional contagion through social networks","volume":"111","year":"2014","journal-title":"Proceedings of the National Academy of Sciences of the United States of America"},{"key":"key2025041114531143800_ref036","doi-asserted-by":"publisher","first-page":"346","DOI":"10.1109\/bigcomp51126.2021.00075","volume-title":"2021 IEEE International Conference on Big Data and Smart Computing (BigComp)","year":"2020"},{"key":"key2025041114531143800_ref015","first-page":"101","article-title":"Comparative studies of detecting abusive language on Twitter","year":"2018","journal-title":"Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)"},{"key":"key2025041114531143800_ref016","article-title":"HateXplain: a benchmark dataset for explainable hate speech detection","year":"2020"},{"issue":"25","key":"key2025041114531143800_ref017","doi-asserted-by":"publisher","first-page":"38667","DOI":"10.1007\/s11042-023-15118-1","article-title":"Deep fusion framework for speech 
command recognition using acoustic and linguistic features","volume":"82","year":"2023","journal-title":"Multimedia Tools and Applications"},{"key":"key2025041114531143800_ref018","doi-asserted-by":"publisher","first-page":"488","DOI":"10.1007\/978-3-031-20044-1_28","article-title":"Temporal and cross-modal attention for audio-visual zero-shot learning","year":"2022"},{"key":"key2025041114531143800_ref019","doi-asserted-by":"publisher","first-page":"2156","DOI":"10.1109\/cvpr.2017.232","article-title":"Dual attention networks for multimodal reasoning and matching","year":"2017"},{"key":"key2025041114531143800_ref020","article-title":"GloVe: global vectors for word representation","year":"2014"},{"key":"key2025041114531143800_ref021","article-title":"EmMixformer: mix transformer for eye movement recognition","year":"2024","journal-title":"Multimedia Tools and Applications"},{"key":"key2025041114531143800_ref022","article-title":"Robust speech recognition via large-scale weak supervision","year":"2022"},{"issue":"6","key":"key2025041114531143800_ref023","doi-asserted-by":"publisher","first-page":"96","DOI":"10.1109\/MSP.2017.2738401","article-title":"Deep multimodal learning: a survey on recent advances and trends","volume":"34","year":"2017","journal-title":"IEEE Signal Processing Magazine"},{"key":"key2025041114531143800_ref024","unstructured":"Rana, A. and Jha, S. 
(2022), \u201cEmotion based hate speech detection using multimodal learning\u201d, ArXiv, available at: https:\/\/api.semanticscholar.org\/CorpusID:246822635"},{"key":"key2025041114531143800_ref025","article-title":"Hate speech in social media: an exploration of the problem and its proposed solutions","year":"2013"},{"key":"key2025041114531143800_ref026","doi-asserted-by":"publisher","first-page":"585","DOI":"10.1109\/csci51800.2020.00104","article-title":"Detection of hate speech in videos using machine learning","year":"2020"},{"issue":"1","key":"key2025041114531143800_ref027","first-page":"2949","article-title":"Multimodal learning with deep Boltzmann machines","volume":"15","year":"2012","journal-title":"Journal of Machine Learning Research"},{"key":"key2025041114531143800_ref028","doi-asserted-by":"publisher","first-page":"2818","DOI":"10.1109\/cvpr.2016.308","article-title":"Rethinking the inception architecture for computer vision","year":"2015"},{"key":"key2025041114531143800_ref029","first-page":"5998","article-title":"Attention is all you need","year":"2017","journal-title":"31st Conference on Neural Information Processing Systems (NIPS 2017)"},{"key":"key2025041114531143800_ref030","doi-asserted-by":"publisher","first-page":"6255","DOI":"10.24963\/ijcai.2023\/694","article-title":"Evaluating GPT-3 generated explanations for hateful content moderation","year":"2023"},{"key":"key2025041114531143800_ref031","article-title":"Cross-attention is not enough: incongruity-aware multimodal sentiment analysis and emotion recognition","year":"2023"},{"issue":"6","key":"key2025041114531143800_ref032","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1109\/mis.2016.94","article-title":"Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages","volume":"31","year":"2016","journal-title":"IEEE Intelligent 
Systems"},{"key":"key2025041114531143800_ref033","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1145\/3512576.3512594","article-title":"Traffic matrix prediction with attention-based recurrent neural network","year":"2022"},{"key":"key2025041114531143800_ref034","article-title":"Cross-attention is all you need: real-time streaming transformers for personalised speech enhancement","year":"2022"}],"container-title":["Data Technologies and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/DTA-02-2024-0211\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/DTA-02-2024-0211\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T23:14:59Z","timestamp":1753398899000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/dta\/article\/59\/2\/201-215\/1246549"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,29]]},"references-count":35,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2024,11,29]]},"published-print":{"date-parts":[[2025,4,11]]}},"alternative-id":["10.1108\/DTA-02-2024-0211"],"URL":"https:\/\/doi.org\/10.1108\/dta-02-2024-0211","relation":{},"ISSN":["2514-9288"],"issn-type":[{"value":"2514-9288","type":"print"}],"subject":[],"published":{"date-parts":[[2024,11,29]]}}}