{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T15:19:18Z","timestamp":1773155958350,"version":"3.50.1"},"reference-count":50,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T00:00:00Z","timestamp":1760054400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61971007"],"award-info":[{"award-number":["61971007"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61571013"],"award-info":[{"award-number":["61571013"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"North China University of Technology Research Start-up Fund Project","award":["11005136025XN076-043"],"award-info":[{"award-number":["11005136025XN076-043"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Imaging"],"abstract":"<jats:p>Weakly Supervised Video Anomaly Detection (WSVAD) is a critical task in computer vision. It aims to localize and recognize abnormal behaviors using only video-level labels. Without frame-level annotations, it becomes significantly challenging to model temporal dependencies. Given the diversity of abnormal events, it is also difficult to model semantic representations. Recently, the cross-modal pre-trained model Contrastive Language-Image Pretraining (CLIP) has shown a strong ability to align visual and textual information. This provides new opportunities for video anomaly detection. 
Inspired by CLIP, WSVAD-CLIP is proposed as a framework that uses its cross-modal knowledge to bridge the semantic gap between text and vision. First, the Axial-Graph (AG) Module is introduced. It combines an Axial Transformer and Lite Graph Attention Networks (LiteGAT) to capture global temporal structures and local abnormal correlations. Second, a Text Prompt mechanism is designed. It fuses a learnable prompt with a knowledge-enhanced prompt to improve the semantic expressiveness of category embeddings. Third, the Abnormal Visual-Guided Text Prompt (AVGTP) mechanism is proposed to aggregate anomalous visual context for adaptively refining textual representations. Extensive experiments on UCF-Crime and XD-Violence datasets show that WSVAD-CLIP notably outperforms existing methods in coarse-grained anomaly detection. It also achieves superior performance in fine-grained anomaly recognition tasks, validating its effectiveness and generalizability.<\/jats:p>","DOI":"10.3390\/jimaging11100354","type":"journal-article","created":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T09:45:01Z","timestamp":1760089501000},"page":"354","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["WSVAD-CLIP: Temporally Aware and Prompt Learning with CLIP for Weakly Supervised Video Anomaly Detection"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5304-1477","authenticated-orcid":false,"given":"Min","family":"Li","sequence":"first","affiliation":[{"name":"School of Artificial Intelligence and Computer Science, North China University of Technology, No. 5 Jinyuanzhuang Road, Beijing 100144, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-5145-8349","authenticated-orcid":false,"given":"Jing","family":"Sang","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence and Computer Science, North China University of Technology, No. 
5 Jinyuanzhuang Road, Beijing 100144, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0019-3028","authenticated-orcid":false,"given":"Yuanyao","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence and Computer Science, North China University of Technology, No. 5 Jinyuanzhuang Road, Beijing 100144, China"}]},{"given":"Lina","family":"Du","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence and Computer Science, North China University of Technology, No. 5 Jinyuanzhuang Road, Beijing 100144, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,10,10]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Wu, P., Liu, J., He, X., Peng, Y., Wang, P., and Zhang, Y. (2023). Towards video anomaly retrieval from video anomaly detection: New benchmarks and model. arXiv.","DOI":"10.1109\/TIP.2024.3374070"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Sabokrou, M., Fathy, M., Hoseini, M., and Klette, R. (2015, January 7\u201312). Real-time anomaly detection and localization in crowded scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA.","DOI":"10.1109\/CVPRW.2015.7301284"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., and Yang, Z. (2020, January 23\u201328). Not only look, but also listen: Learning multimodal violence detection under weak supervision. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58577-8_20"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Cong, Y., Yuan, J., and Liu, J. (2011, January 20\u201325). Sparse reconstruction cost for abnormal event detection. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA.","DOI":"10.1109\/CVPR.2011.5995434"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Sultani, W., Chen, C., and Shah, M. (2018, January 18\u201323). Real-world anomaly detection in surveillance videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00678"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Liu, K., and Ma, H. (2019, January 21\u201325). Exploring background-bias for anomaly detection in surveillance videos. Proceedings of the 27th ACM International Conference on Multimedia, Nice, France.","DOI":"10.1145\/3343031.3350998"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Liu, W., Luo, W., Lian, D., and Gao, S. (2018, January 18\u201323). Future frame prediction for anomaly detection\u2014A new baseline. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00684"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Lv, H., Chen, C., Cui, Z., Xu, C., Li, Y., and Yang, J. (2021, January 20\u201325). Learning normal dynamics in videos with meta prototype network. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01517"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Park, H., Noh, J., and Ham, B. (2020, January 13\u201319). Learning memory-guided normality for anomaly detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01438"},{"key":"ref_10","unstructured":"Bai, S., He, Z., Lei, Y., Wu, W., Zhu, C., Sun, M., and Yan, J. (2019, January 16\u201320). Traffic anomaly detection via perspective map based on spatial-temporal information matrix. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA."},{"key":"ref_11","unstructured":"Wang, G., Yuan, X., Zheng, A., Hsu, H.M., and Hwang, J.N. (2019, January 16\u201320). Anomaly candidate identification and starting time estimation of vehicles from traffic videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zaheer, M.Z., Mahmood, A., Khan, M.H., Segu, M., Yu, F., and Lee, S.I. (2022, January 18\u201324). Generative cooperative learning for unsupervised video anomaly detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01433"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Li, G., Cai, G., Zeng, X., and Zhao, R. (2022, January 23\u201327). Scale-aware spatio-temporal relation learning for video anomaly detection. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-19772-7_20"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Chen, Y., Liu, Z., Zhang, B., Fok, W., Qi, X., and Wu, Y.C. (2023, January 7\u201315). MGFN: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence, Virtual.","DOI":"10.1609\/aaai.v37i1.25112"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zhang, J., Qing, L., and Miao, J. (2019, January 22\u201325). Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. Proceedings of the IEEE International Conference on Image Processing, Taipei, Taiwan.","DOI":"10.1109\/ICIP.2019.8803657"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Tian, Y., Pang, G., Chen, Y., Singh, R., Verjans, J.W., and Carneiro, G. 
(2021, January 10\u201317). Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00493"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning spatiotemporal features with 3D convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201326). Quo vadis, action recognition? A new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_19","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_20","unstructured":"Li, S., Liu, F., and Jiao, L. (2022, February 22\u2013March 1). Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence, Virtual."},{"key":"ref_21","unstructured":"Kim, W., Son, B., and Kim, I. (2021, January 18\u201324). Vilt: Vision-and-language transformer without convolution or region supervision. Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_22","unstructured":"Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., and Duerig, T. (2021, January 18\u201324). Scaling up visual and vision-language representation learning with noisy text supervision. 
Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_23","unstructured":"Wang, M., Xing, J., and Liu, Y. (2021). Actionclip: A new paradigm for video action recognition. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1007\/s11633-022-1369-5","article-title":"VLP: A survey on vision-language pre-training","volume":"20","author":"Chen","year":"2023","journal-title":"Mach. Intell. Res."},{"key":"ref_25","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Joo, H.K., Vo, K., Yamazaki, K., and Le, N. (2023, January 8\u201311). CLIP-TSA: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection. Proceedings of the IEEE International Conference on Image Processing, Kuala Lumpur, Malaysia.","DOI":"10.1109\/ICIP49359.2023.10222289"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Lv, H., Yue, Z., Sun, Q., Luo, B., Cui, Z., and Zhang, H. (2023, January 17\u201324). Unbiased multiple instance learning for weakly supervised video anomaly detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00775"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"104163","DOI":"10.1016\/j.cviu.2024.104163","article-title":"Delving into CLIP latent space for video anomaly recognition","volume":"249","author":"Zanella","year":"2024","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_29","unstructured":"Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2017). Graph attention networks. 
arXiv."},{"key":"ref_30","unstructured":"Ho, J., Kalchbrenner, N., Weissenborn, D., and Salimans, T. (2019). Axial attention in multidimensional transformers. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","article-title":"Learning to prompt for vision-language models","volume":"130","author":"Zhou","year":"2022","journal-title":"Int. J. Comput. Vis."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"4923","DOI":"10.1109\/TIP.2024.3451935","article-title":"Learning prompt-enhanced context features for weakly-supervised video anomaly detection","volume":"33","author":"Pu","year":"2024","journal-title":"IEEE Trans. Image Process."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Wu, P., Zhou, X., Pang, G., Zhou, L., Yan, Q., Wang, P., and Zhang, Y. (2024, January 20\u201327). VadCLIP: Adapting vision-language models for weakly supervised video anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada.","DOI":"10.1609\/aaai.v38i6.28423"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"394","DOI":"10.1109\/TMM.2019.2929931","article-title":"Video anomaly detection and localization based on an adaptive intra-frame classification network","volume":"22","author":"Xu","year":"2020","journal-title":"IEEE Trans. Multimed."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Cho, M., Kim, M., Hwang, S., Park, C., Lee, K., and Lee, S. (2023, January 17\u201324). Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01168"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"12627","DOI":"10.1109\/TNNLS.2023.3263966","article-title":"Distilling privileged knowledge for anomalous event detection from weakly labeled videos","volume":"35","author":"Liu","year":"2024","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"4505","DOI":"10.1109\/TIP.2021.3072863","article-title":"Localizing anomalies from weakly-labeled videos","volume":"30","author":"Lv","year":"2021","journal-title":"IEEE Trans. Image Process."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Zhong, J.X., Li, N., Kong, W., Liu, S., Li, T.H., and Li, G. (2019, January 16\u201320). Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00133"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Zhou, H., Yu, J., and Yang, W. (2023, January 7\u201314). Dual memory units with uncertainty regulation for weakly supervised video anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA.","DOI":"10.1609\/aaai.v37i3.25489"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"1674","DOI":"10.1109\/TMM.2022.3147369","article-title":"Weakly supervised audio-visual violence detection","volume":"25","author":"Wu","year":"2023","journal-title":"IEEE Trans. Multimed."},{"key":"ref_41","unstructured":"Zanella, L., Menapace, W., Mancini, M., Wang, Y., and Ricci, E. (2022, January 18\u201324). Harnessing Large Language Models for Training-Free Video Anomaly Detection. 
Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Dev, P.P., Hazari, R., and Das, P. (2024, January 1\u20135). MCANet: Multimodal caption aware training-free video anomaly detection via large language model. Proceedings of the International Conference on Pattern Recognition, Kolkata, India.","DOI":"10.1007\/978-3-031-78125-4_25"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Zhou, X., Girdhar, R., Joulin, A., Kr\u00e4henb\u00fchl, P., and Misra, I. (2022, January 23\u201327). Detecting twenty-thousand classes using image-level supervision. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-20077-9_21"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Lin, Y., Chen, M., Wang, W., Wu, B., Li, K., Lin, B., Liu, H., and He, X. (2023, January 17\u201324). CLIP is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01469"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Xu, H., Ghosh, G., Huang, P.Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and Feichtenhofer, C. (2021). VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. arXiv.","DOI":"10.18653\/v1\/2021.emnlp-main.544"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Yang, Z., Liu, J., and Wu, P. (2024, January 16\u201322). Text prompt with normality guidance for weakly supervised video anomaly detection. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.01788"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Wu, P., Zhou, X., Pang, G., Sun, Y., Liu, J., Wang, P., and Zhang, Y. (2024, January 16\u201322). Open-vocabulary video anomaly detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR52733.2024.01732"},{"key":"ref_48","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A.K., and Davis, L.S. (2016, January 27\u201330). Learning Temporal Regularity in Video Sequences. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.86"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Wang, J., and Cherian, A. (2019, October 27\u2013November 2). GODS: Generalized One-Class Discriminative Subspaces for Anomaly Detection. 
Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00829"}],"container-title":["Journal of Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2313-433X\/11\/10\/354\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T10:36:14Z","timestamp":1760092574000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2313-433X\/11\/10\/354"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,10]]},"references-count":50,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2025,10]]}},"alternative-id":["jimaging11100354"],"URL":"https:\/\/doi.org\/10.3390\/jimaging11100354","relation":{},"ISSN":["2313-433X"],"issn-type":[{"value":"2313-433X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,10]]}}}