{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,22]],"date-time":"2026-01-22T05:57:29Z","timestamp":1769061449261,"version":"3.49.0"},"reference-count":29,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2024,10,28]],"date-time":"2024-10-28T00:00:00Z","timestamp":1730073600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>In this paper, we present a novel model that enhances performance by extending the dual-modality TEVAD model\u2014originally leveraging visual and textual information\u2014into a multi-modal framework that integrates visual, audio, and textual data. Additionally, we refine the multi-scale temporal network (MTN) to improve feature extraction across multiple temporal scales between video snippets. Using the XD-Violence dataset, which includes audio data for violence detection, we conduct experiments to evaluate various feature fusion methods. The proposed model achieves an average precision (AP) of 83.9%, surpassing the performance of single-modality approaches (visual: 73.9%, audio: 67.1%, textual: 29.9%) and dual-modality approaches (visual + audio: 78.8%, visual + textual: 78.5%). These findings demonstrate that the proposed model outperforms models based on the original MTN and reaffirm the efficacy of multi-modal approaches in enhancing violence detection compared to single- or dual-modality methods.<\/jats:p>","DOI":"10.3390\/make6040119","type":"journal-article","created":{"date-parts":[[2024,10,28]],"date-time":"2024-10-28T09:51:26Z","timestamp":1730109086000},"page":"2422-2434","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Leveraging Multi-Modality and Enhanced Temporal Networks for Robust Violence Detection"],"prefix":"10.3390","volume":"6","author":[{"given":"Gwangho","family":"Na","sequence":"first","affiliation":[{"name":"Department of Computer Science, Chungbuk National University, Cheongju 28644, Republic of Korea"}]},{"given":"Jaepil","family":"Ko","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Kumoh National Institute of Technology, Gumi 39177, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2076-9119","authenticated-orcid":false,"given":"Kyungjoo","family":"Cheoi","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Chungbuk National University, Cheongju 28644, Republic of Korea"}]}],"member":"1968","published-online":{"date-parts":[[2024,10,28]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Sultani, W., Chen, C., and Shah, M. (2018, January 18\u201322). Real-World Anomaly Detection in Surveillance Videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00678"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Zhong, J.X., Li, N., Kong, W., Liu, S., Li, T.H., and Li, G. (2019, January 15\u201320). Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00133"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Tian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J.W., and Carneiro, G. (2021, January 10\u201317). Weakly-Supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00493"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Panariello, A., Porrello, A., Calderara, S., and Cucchiara, R. (2022, January 23\u201327). Consistency-Based Self-Supervised Learning for Temporal Anomaly Localization. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-25072-9_22"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"3513","DOI":"10.1109\/TIP.2021.3062192","article-title":"Learning Causal Temporal Relation and Feature Discrimination for Anomaly Detection","volume":"30","author":"Wu","year":"2021","journal-title":"IEEE Trans. Image Process."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"4067","DOI":"10.1109\/TMM.2021.3112814","article-title":"Contrastive attention for video anomaly detection","volume":"24","author":"Chang","year":"2021","journal-title":"IEEE Trans. Multimed."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Chen, W., Ma, K.T., Yew, Z.J., Hur, M., and Khoo, D.A.A. (2023, January 17\u201324). TEVAD: Improved video anomaly detection with captions. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPRW59228.2023.00587"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., and Yang, Z. (2020, January 23\u201328). Not only look, but also listen: Learning multimodal violence detection under weak supervision. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58577-8_20"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1674","DOI":"10.1109\/TMM.2022.3147369","article-title":"Weakly supervised audio-visual violence detection","volume":"25","author":"Wu","year":"2022","journal-title":"IEEE Trans. Multimed."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Zhang, C., Li, G., Qi, Y., Wang, S., Qing, L., Huang, Q., and Yang, M.H. (2023, January 17\u201324). Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01561"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Pang, W.F., He, Q.H., Hu, Y.J., and Li, Y.X. (2021, January 6\u201311). Violence detection in videos based on fusing visual and audio information. Proceedings of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9413686"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zhou, H., Yu, J., and Yang, W. (2023, January 7\u201314). Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA.","DOI":"10.1609\/aaai.v37i3.25489"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Lin, K., Li, L., Lin, C.C., Ahmed, F., Gan, Z., Liu, Z., and Wang, L. (2022, January 18\u201324). Swinbert: End-to-end transformers with sparse attention for video captioning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01742"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Gao, T., Yao, X., and Chen, D. (2021, January 7\u201311). SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings of the EMNLP 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic.","DOI":"10.18653\/v1\/2021.emnlp-main.552"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201326). Quo vadis, action recognition? A new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_16","unstructured":"Wang, X., Wu, J., Chen, J., Li, L., Wang, Y.F., and Wang, W.Y. (November, January 27). Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_17","unstructured":"Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., and Zisserman, A. (2018). A short note about kinetics-600. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., and Hu, H. (2022, January 18\u201324). Video swin transformer. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00320"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Hershey, S., Chaudhuri, S., Ellis, D.P., Gemmeke, J.F., Jansen, A., Moore, R.C., and Wilson, K. (2017, January 5\u20139). CNN architectures for large-scale audio classification. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952132"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., and Ritter, M. (2017, January 5\u20139). Audio set: An ontology and human-labeled dataset for audio events. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952261"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Liu, C., Xu, X., and Zhang, Y. (2018, January 7\u201310). Temporal attention network for action proposal. Proceedings of the 2018 25th IEEE International Conference on Image Processing, Athens, Greece.","DOI":"10.1109\/ICIP.2018.8451429"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., and He, K. (2018, January 18\u201323). Non-local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201322). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_24","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Hassner, T., Itcher, Y., and Kliper-Gross, O. (2012, January 16\u201321). Violent flows: Real-time detection of violent crowd behavior. Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA.","DOI":"10.1109\/CVPRW.2012.6239348"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Perez, M., Kot, A.C., and Rocha, A. (2019, January 12\u201317). Detection of real-world fights in surveillance videos. Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683676"},{"key":"ref_27","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA."},{"key":"ref_28","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Yuan, T., Zhang, X., Liu, K., Liu, B., Chen, C., Jin, J., and Jiao, Z. (2024, January 17\u201321). Towards Surveillance Video-and-Language Understanding: New Dataset Baselines and Challenges. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.02082"}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/6\/4\/119\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:22:21Z","timestamp":1760113341000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/6\/4\/119"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,28]]},"references-count":29,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["make6040119"],"URL":"https:\/\/doi.org\/10.3390\/make6040119","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,10,28]]}}}