{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:02:55Z","timestamp":1760058175380,"version":"build-2065373602"},"reference-count":61,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2025,3,16]],"date-time":"2025-03-16T00:00:00Z","timestamp":1742083200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Voice activity detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in an audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments focusing on the upper body of an individual, while the text encoder processes textual descriptions generated by a Generative Large Multimodal Model, i.e., the Large Language and Vision Assistant (LLaVA). Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance of our method compared to existing visual VAD approaches. 
Notably, our approach outperforms several audio-visual methods despite its simplicity and without requiring pretraining on extensive audio-visual datasets.<\/jats:p>","DOI":"10.3390\/info16030233","type":"journal-article","created":{"date-parts":[[2025,3,17]],"date-time":"2025-03-17T06:36:23Z","timestamp":1742193383000},"page":"233","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-6359-9611","authenticated-orcid":false,"given":"Andrea","family":"Appiani","sequence":"first","affiliation":[{"name":"Department of Management, Information and Production Engineering, University of Bergamo, 24127 Dalmine, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9583-0087","authenticated-orcid":false,"given":"Cigdem","family":"Beyan","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Verona, 37134 Verona, Italy"}]}],"member":"1968","published-online":{"date-parts":[[2025,3,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"101178","DOI":"10.1016\/j.csl.2020.101178","article-title":"Turn-taking in conversational systems and human\u2013robot interaction: A review","volume":"67","author":"Skantze","year":"2021","journal-title":"Comput. Speech Lang."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Xu, E.Z., Song, Z., Tsutsui, S., Feng, C., Ye, M., and Shou, M.Z. (2022, January 10\u201314). Ava-avd: Audio-visual speaker diarization in the wild. Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal.","DOI":"10.1145\/3503161.3548027"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Wang, Q., Downey, C., Wan, L., Mansfield, P.A., and Moreno, I.L. (2018, January 15\u201320). Speaker diarization with LSTM. 
Proceedings of the 2018 IEEE ICASSP, Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462628"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Chung, J.S., Huh, J., Nagrani, A., Afouras, T., and Zisserman, A. (2020). Spot the conversation: Speaker diarisation in the wild. arXiv.","DOI":"10.21437\/Interspeech.2020-2337"},{"key":"ref_5","unstructured":"Hung, H., and Ba, S.O. (2025, January 11). Speech\/Non-Speech Detection in Meetings from Automatically Extracted Low Resolution Visual Features. Available online: https:\/\/infoscience.epfl.ch\/entities\/publication\/0659b34f-3f4d-44e6-86a8-898c01b6b857."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"2107","DOI":"10.1109\/TMM.2019.2895505","article-title":"A sequential data analysis approach to detect emergent leaders in small groups","volume":"21","author":"Beyan","year":"2019","journal-title":"IEEE Trans. Multimed."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"664","DOI":"10.1016\/j.specom.2010.03.003","article-title":"Improved likelihood ratio test based voice activity detector applied to speech recognition","volume":"52","author":"Lang","year":"2010","journal-title":"Speech Commun."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1368","DOI":"10.1109\/TASLP.2021.3066303","article-title":"An overview of deep-learning-based audio-visual speech enhancement and separation","volume":"29","author":"Michelsanti","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Moine, C.L., Obin, N., and Roebel, A. (2021). Speaker attentive speech emotion recognition. arXiv.","DOI":"10.21437\/Interspeech.2021-573"},{"key":"ref_10","unstructured":"Moattar, M.H., and Homayounpour, M.M. (2009, January 24\u201328). A simple but efficient real-time voice activity detection algorithm. 
Proceedings of the 2009 17th European Signal Processing Conference, Glasgow, Scotland, UK."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1032","DOI":"10.1109\/TMM.2014.2305632","article-title":"Simultaneous-speaker voice activity detection and localization using mid-fusion of SVM and HMMs","volume":"16","author":"Minotto","year":"2014","journal-title":"IEEE Trans. Multimed."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"967","DOI":"10.1109\/TMM.2016.2535357","article-title":"Visual voice activity detection in the wild","volume":"18","author":"Patrona","year":"2016","journal-title":"IEEE Trans. Multimed."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Tao, R., Pan, Z., Das, R.K., Qian, X., Shou, M.Z., and Li, H. (2021, January 20\u201324). Is someone speaking? exploring long-term temporal features for\naudio-visual active speaker detection. Proceedings of the 29th ACM International Conference on Multimedia, Virtual.","DOI":"10.1145\/3474085.3475587"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"K\u00f6p\u00fckl\u00fc, O., Taseska, M., and Rigoll, G. (2021, January 11\u201317). How to design a three-stage architecture for audio-visual active speaker detection in the wild. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00123"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"5800","DOI":"10.1109\/TMM.2022.3199109","article-title":"Look&listen: Multi-modal correlation learning for active speaker detection and speech enhancement","volume":"25","author":"Xiong","year":"2022","journal-title":"IEEE Trans. Multimed."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Shahid, M., Beyan, C., and Murino, V. (2021, January 5\u20139). S-vvad: Visual voice activity detection by motion segmentation. 
Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Virtual.","DOI":"10.1109\/WACV48630.2021.00238"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"2071","DOI":"10.1109\/TMM.2020.3007350","article-title":"RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis","volume":"23","author":"Beyan","year":"2020","journal-title":"IEEE Trans. Multimed."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Shahid, M., Beyan, C., and Murino, V. (2019, January 27\u201329). Voice activity detection by upper body motion analysis and unsupervised domain adaptation. Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea.","DOI":"10.1109\/ICCVW.2019.00159"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Liang, S., Yang, S., Liu, X., Wu, Z., Shan, S., and Chen, X. (2021, January 20\u201324). Unicon: Unified context network for robust active speaker detection. Proceedings of the 29th ACM International Conference on Multimedia, Virtual.","DOI":"10.1145\/3474085.3475275"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1016\/j.cviu.2018.02.001","article-title":"Learning to lip read words by watching videos","volume":"173","author":"Chung","year":"2018","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_21","unstructured":"Liu, Q., Wang, W., and Jackson, P. (2011, January 27\u201329). A visual voice activity detection method with adaboosting. Proceedings of the Sensor Signal Processing for Defence (SSPD 2011), London, UK."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1184","DOI":"10.1121\/1.3050257","article-title":"A study of lip movements during spontaneous dialog and its application to voice activity detection","volume":"125","author":"Sodoyer","year":"2009","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_23","unstructured":"Chung, J.S., and Zisserman, A. 
(2016, January 20\u201324). Out of time: Automated lip sync in the wild. Proceedings of the ACCV 2016 Workshops, Taipei, Taiwan."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Geeroms, W., Allebosch, G., Kindt, S., Kadri, L., Veelaert, P., and Madhu, N. (2022, January 12-14). Audio-Visual Active Speaker Identification: A comparison of dense image-based features and sparse facial landmark-based features. Proceedings of the 2022 Sensor Data Fusion: Trends, Solutions, Applications (SDF), Bonn, Germany.","DOI":"10.1109\/SDF55338.2022.9931697"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Huang, C., and Koishida, K. (2020, January 17\u201318). Improved active speaker detection based on optical flow. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA.","DOI":"10.1109\/CVPRW50498.2020.00483"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Cristani, M., Pesarin, A., Vinciarelli, A., Crocco, M., and Murino, V. (2011, January 16\u201318). Look at who\u2019s talking: Voice activity detection by automated gesture analysis. Proceedings of the Constructing Ambient Intelligence: AmI 2011 Workshops, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-642-31479-7_14"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Gebre, B.G., Wittenburg, P., and Heskes, T. (2013, January 26\u201331). The gesturer is the speaker. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638359"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Shahid, M., Beyan, C., and Murino, V. (2019, January 9\u201313). Comparisons of Visual Activity Primitives for Voice Activity Detection. Proceedings of the Image Analysis and Processing\u2014ICIAP 2019: 20th International Conference, Trento, Italy. 
Proceedings, Part I.","DOI":"10.1007\/978-3-030-30642-7_5"},{"key":"ref_29","unstructured":"Xenos, A., Foteinopoulou, N.M., Ntinou, I., and Patras, I.e.a. (2024). VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning. arXiv."},{"key":"ref_30","unstructured":"Liu, H., Li, C., Wu, Q., and Lee, Y.J. (2023). Visual instruction tuning. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Liu, H., Li, C., Li, Y., and Lee, Y.J. (2023). Improved baselines with visual instruction tuning. arXiv.","DOI":"10.1109\/CVPR52733.2024.02484"},{"key":"ref_32","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Auty, D., and Mikolajczyk, K. (2023, January 1\u20136). Learning to Prompt CLIP for Monocular Depth Estimation: Exploring the Limits of Human Language. Proceedings of the IEEE\/CVF ICCV, Paris, France.","DOI":"10.1109\/ICCVW60793.2023.00218"},{"key":"ref_34","unstructured":"Bondielli, A., and Passaro, L.C. (2021, January 4\u20135). Leveraging CLIP for Image Emotion Recognition. Proceedings of the CEUR WORKSHOP PROCEEDINGS, Virtual."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Chen, D., and Gou, G. (2023, January 10\u201316). Unleash the Capabilities of the Vision-Language Pre-training Model in Gaze Object Prediction. Proceedings of the Conference on Neural Information Processing Systems, New Orleans, LA, USA.","DOI":"10.1007\/978-981-99-8141-0_34"},{"key":"ref_36","unstructured":"Tao, F., and Busso, C. (August, January Sweden). Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection. 
Proceedings of the INTERSPEECH, Stockholm, Sweden."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1016\/j.specom.2019.07.003","article-title":"End-to-end audiovisual speech activity detection with bimodal recurrent neural models","volume":"113","author":"Tao","year":"2019","journal-title":"Speech Commun."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Roth, J., Chaudhuri, S., Klejch, O., Marvin, R., Gallagher, A., Kaver, L., Ramaswamy, S., Stopczynski, A., Schmid, C., and Xi, Z. (2020, January 4\u20138). Ava active speaker: An audio-visual dataset for active speaker detection. Proceedings of the IEEE ICASSP, Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053900"},{"key":"ref_39","unstructured":"Sharma, R., Somandepalli, K., and Narayanan, S. (2020). Crossmodal learning for audio-visual speech event localization. arXiv."},{"key":"ref_40","unstructured":"Shvets, M., Liu, W., and Berg, A.C. (2019, October 27\u2013November 2). Leveraging long-range temporal relationships between proposals for video object detection. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1086","DOI":"10.1109\/TPAMI.2017.2648793","article-title":"Audio-visual speaker diarization based on spatiotemporal bayesian fusion","volume":"40","author":"Gebru","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_42","unstructured":"Chakravarty, P., and Tuytelaars, T. (2016, January 11\u201314). Cross-modal supervision for learning active speaker detection in video. Proceedings of the Computer Vision\u2014ECCV 2016: 14th European Conference, Amsterdam, The Netherlands. Proceedings, Part V 14."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"183","DOI":"10.1007\/s12193-015-0187-2","article-title":"Voice activity detection based on facial movement","volume":"9","author":"Joosten","year":"2015","journal-title":"J. 
Multimodal User Interfaces"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Haider, F., Campbell, N., and Luz, S. (2016, January 7\u20139). Active speaker detection in human machine multiparty dialogue using visual prosody information. Proceedings of the 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Washington, DC, USA.","DOI":"10.1109\/GlobalSIP.2016.7906033"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Stefanov, K., Beskow, J., and Salvi, G. (2017, January 25). Vision-based active speaker detection in multiparty interaction. Proceedings of the Grounding Language Understanding (GLU2017), Stockholm, Sweden.","DOI":"10.21437\/GLU.2017-10"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"250","DOI":"10.1109\/TCDS.2019.2927941","article-title":"Self-supervised vision-based detection of the active speaker as support for socially aware language acquisition","volume":"12","author":"Stefanov","year":"2019","journal-title":"IEEE Trans. Cogn. Dev. Syst."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Wortsman, M., Ilharco, G., Kim, J.W., Li, M., Kornblith, S., Roelofs, R., Lopes, R.G., Hajishirzi, H., Farhadi, A., and Namkoong, H. (2022, January 18\u201324). Robust fine-tuning of zero-shot models. Proceedings of the IEEE\/CVF CVPR, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00780"},{"key":"ref_48","unstructured":"Shen, S., Li, L.H., Tan, H., Bansal, M., Rohrbach, A., Chang, K.W., Yao, Z., and Keutzer, K. (2021). How much can clip benefit vision-and-language tasks?. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Yuan, M., Lv, N., Xie, Y., Lu, F., and Zhan, K. (2023, January 8\u201311). CLIP-FG:Selecting Discriminative Image Patches by Contrastive Language-Image Pre-Training for Fine-Grained Image Classification. 
Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia.","DOI":"10.1109\/ICIP49359.2023.10223197"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Srivastava, M.M. (2023). RetailKLIP: Finetuning OpenCLIP backbone using metric learning on a single GPU for Zero-shot retail product image classification. arXiv.","DOI":"10.5220\/0012576000003660"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., and Liu, T. (2022, January 18\u201324). Cris: Clip-driven referring image segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01139"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Liu, J., Zhang, Y., Chen, J.N., Xiao, J., Lu, Y., A Landman, B., Yuan, Y., Yuille, A., Tang, Y., and Zhou, Z. (2023, January 1\u20136). Clip-driven universal model for organ segmentation and tumor detection. Proceedings of the IEEE\/CVF ICCV, Paris, France.","DOI":"10.1109\/ICCV51070.2023.01934"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Liang, Z., Li, C., Zhou, S., Feng, R., and Loy, C.C. (2023, January 1\u20136). Iterative prompt learning for unsupervised backlit image enhancement. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00743"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Sanghi, A., Chu, H., Lambourne, J.G., Wang, Y., Cheng, C.Y., Fumero, M., and Malekshan, K.R. (2022, January 18\u201324). Clip-forge: Towards zero-shot text-to-shape generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01805"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Kim, G., Kwon, T., and Ye, J.C. (2022, January 18\u201324). 
Diffusionclip: Text-guided diffusion models for robust image manipulation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00246"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Wang, H., Li, Y., Yao, H., and Li, X. (2023, January 18\u201324). Clipn for zero-shot ood detection: Teaching clip to say no. Proceedings of the IEEE\/CVF International Conference on Computer Vision, New Orleans, LA, USA.","DOI":"10.1109\/ICCV51070.2023.00173"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Wang, M., and Yang, N. (2024). EmoAsst: Emotion recognition assistant via text-guided transfer learning on pre-trained visual and acoustic models. Front. Comput. Sci., 6.","DOI":"10.3389\/fcomp.2024.1304687"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Garg, B., Kim, K., and Ranjan, S. (2022, January 24\u201328). From Video to Images: Contrastive Pretraining for Emotion Recognition from Single Image. Proceedings of the AAAI Conference on Artificial Intelligence, Pomona, CA, USA.","DOI":"10.1609\/aaai.v36i11.21612"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Afouras, T., Owens, A., Chung, J.S., and Zisserman, A. (2020, January 23\u201328). Self-supervised learning of audio-visual objects from video. Proceedings of the ECCV, Glasgow, UK.","DOI":"10.1007\/978-3-030-58523-5_13"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Truong, T.D., Duong, C.N., Pham, H.A., Raj, B., Le, N., and Luu, K. (2021, January 11\u201317). The right to talk: An audio-visual transformer approach. 
Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00114"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"225","DOI":"10.1109\/OJSP.2023.3267269","article-title":"Audio-visual activity guided cross-modal identity association for active speaker detection","volume":"4","author":"Sharma","year":"2023","journal-title":"IEEE Open J. Signal Process."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/3\/233\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:54:32Z","timestamp":1760028872000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/3\/233"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,16]]},"references-count":61,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2025,3]]}},"alternative-id":["info16030233"],"URL":"https:\/\/doi.org\/10.3390\/info16030233","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2025,3,16]]}}}