{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,20]],"date-time":"2026-04-20T16:00:40Z","timestamp":1776700840919,"version":"3.51.2"},"reference-count":56,"publisher":"MDPI AG","issue":"23","license":[{"start":{"date-parts":[[2019,11,25]],"date-time":"2019-11-25T00:00:00Z","timestamp":1574640000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Speaker diarization systems aim to find \u2018who spoke when?\u2019 in multi-speaker recordings. The dataset usually consists of meetings, TV\/talk shows, telephone, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the respective audio. For that purpose, short video segments composed of face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. On the basis of high-confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. A significant improvement is observed with the proposed method in terms of diarization error rate (DER) when compared to conventional and fully supervised audio-based speaker diarization. 
The results of the proposed technique are very close to those of complex state-of-the-art multimodal diarization systems, which shows the significance of such a simple yet effective technique.<\/jats:p>","DOI":"10.3390\/s19235163","type":"journal-article","created":{"date-parts":[[2019,11,25]],"date-time":"2019-11-25T11:12:21Z","timestamp":1574680341000},"page":"5163","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model"],"prefix":"10.3390","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0194-6653","authenticated-orcid":false,"given":"Rehan","family":"Ahmad","sequence":"first","affiliation":[{"name":"Department of Electrical Engineering, International Islamic University, Islamabad 44000, Pakistan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0897-2448","authenticated-orcid":false,"given":"Syed","family":"Zubair","sequence":"additional","affiliation":[{"name":"Analytics Camp, Islamabad 44000, Pakistan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8445-7742","authenticated-orcid":false,"given":"Hani","family":"Alquhayz","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Information, College of Science in Zulfi, Majmaah University, Al-Majmaah 11952, Saudi Arabia"}]},{"given":"Allah","family":"Ditta","sequence":"additional","affiliation":[{"name":"Division of Science &amp; Technology, University of Education, Township, Lahore 54770, Pakistan"}]}],"member":"1968","published-online":{"date-parts":[[2019,11,25]]},"reference":[{"key":"ref_1","unstructured":"Wooters, C., Fung, J., Peskin, B., and Anguera, X. (2004). Towards Robust Speaker Segmentation: The Icsi-Sri Fall 2004 Diarization System, Polytechnical University of Catalonia (UPC)."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Anguera, X., Wooters, C., and Pardo, J.M. (2006, January 1\u20134). 
Robust Speaker Diarization for Meetings. Proceedings of the MLMI: International Workshop on Machine Learning for Multimodal Interaction, Bethesda, MD, USA.","DOI":"10.21437\/Interspeech.2006-466"},{"key":"ref_3","first-page":"402","article-title":"Robust speaker segmentation for meetings: The ICSI-SRI spring 2005 diarization system","volume":"Volume 3869","author":"Anguera","year":"2006","journal-title":"Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)"},{"key":"ref_4","first-page":"248","article-title":"Automatic cluster complexity and quantity selection: Towards robust speaker diarization","volume":"Volume 4299","author":"Anguera","year":"2006","journal-title":"International Workshop on Machine Learning for Multimodal Interaction, Bethesda, MD, USA, 1\u20134 May 2006"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Han, K.J., and Narayanan, S.S. (2008, January 22\u201326). Agglomerative hierarchical speaker clustering using incremental Gaussian mixture cluster modeling. Proceedings of the Annual Conference of the International Speech Communication Association, Interspeech, Brisbane, Australia.","DOI":"10.21437\/Interspeech.2008-3"},{"key":"ref_6","first-page":"509","article-title":"The ICSI RT07s speaker diarization system","volume":"Volume 4625","author":"Wooters","year":"2008","journal-title":"Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)"},{"key":"ref_7","first-page":"520","article-title":"The LIA RT\u201907 speaker diarization system","volume":"Volume 4625","author":"Fredouille","year":"2008","journal-title":"Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Imseng, D., and Friedland, G. (2009, January 13\u201317). 
Robust Speaker Diarization for short speech recordings. Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2009, Merano\/Meran, Italy.","DOI":"10.1109\/ASRU.2009.5373254"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Gonina, E., Friedland, G., Cook, H., and Keutzer, K. (2011, January 11\u201315). Fast speaker diarization using a high-level scripting language. Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Waikoloa, HI, USA.","DOI":"10.1109\/ASRU.2011.6163887"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"371","DOI":"10.1109\/TASL.2011.2158419","article-title":"The ICSI RT-09 Speaker Diarization System","volume":"20","author":"Friedland","year":"2012","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_11","first-page":"67","article-title":"Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion","volume":"6","author":"Chen","year":"1998","journal-title":"Proc. DARPA Broadcast News Transcr. Underst. Work."},{"key":"ref_12","unstructured":"Molau, S., Pitz, M., Schluter, R., and Ney, H. (2001, January 7\u201311). Computing Mel-frequency cepstral coefficients on the power spectrum. Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (Cat. No.01CH37221), Salt Lake City, UT, USA."},{"key":"ref_13","first-page":"1","article-title":"Front-end factor analysis for speaker verification","volume":"19","author":"Dehak","year":"2010","journal-title":"Audio Speech"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"980","DOI":"10.1109\/TASL.2008.925147","article-title":"A study of interspeaker variability in speaker verification","volume":"16","author":"Kenny","year":"2008","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Sell, G., and Garcia-Romero, D. 
(2014, January 7\u201310). Speaker diarization with plda i-vector scoring and unsupervised calibration. Proceedings of the 2014 IEEE Workshop on Spoken Language Technology, SLT 2014-Proceedings, South Lake Tahoe, NV, USA.","DOI":"10.1109\/SLT.2014.7078610"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"3393","DOI":"10.1007\/s00034-015-0206-2","article-title":"Improved i-vector representation for speaker diarization","volume":"35","author":"Xu","year":"2015","journal-title":"Cir. Syst. Signal Process."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Madikeri, S., Himawan, I., Motlicek, P., and Ferras, M. (2015, January 6\u201310). Integrating online i-vector extractor with information bottleneck based speaker diarization system. Proceedings of the Annual Conference of the International Speech Communication Association, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-111"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wang, Q., Downey, C., Wan, L., Mansfield, P.A., and Moreno, I.L. (2018, January 15\u201320). Speaker diarization with LSTM. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462628"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018, January 15\u201320). X-Vectors: Robust DNN Embeddings for Speaker Recognition. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461375"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Cyrta, P., Trzci, T., and Stokowiec, W. (2017). 
Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings, Proceedings of the Advances in Intelligent Systems and Computing, Szklarska Por\u0119ba, Poland, 17\u201319 September 2017, Springer.","DOI":"10.1007\/978-3-319-67220-5_10"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Garcia-Romero, D., Snyder, D., Sell, G., Povey, D., and McCree, A. (2017, January 5\u20139). Speaker diarization using deep neural network embeddings. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953094"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Zhang, A., Wang, Q., Zhu, Z., Paisley, J., and Wang, C. (2018). Fully Supervised Speaker Diarization. arXiv.","DOI":"10.1109\/ICASSP.2019.8683892"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Yin, R., Bredin, H., and Barras, C. (2018, January 2\u20136). Neural speech turn segmentation and affinity propagation for speaker diarization. Proceedings of the Annual Conference of the International Speech Communication Association, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1750"},{"key":"ref_24","unstructured":"Bredin, H., and Gelly, G. (2007, January 24\u201329). Improving Speaker Diarization of TV Series using Talking-Face Detection and Clustering. Proceedings of the 24th ACM international conference on Multimedia, Vancouver, BC, Canada."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1694","DOI":"10.1109\/TMM.2015.2463722","article-title":"Multimodal Multi-Channel On-Line Speaker Diarization Using Sensor Fusion Through SVM","volume":"17","author":"Lee","year":"2015","journal-title":"IEEE Trans. 
Multimed."},{"key":"ref_26","first-page":"2","article-title":"Multimodal person discovery in broadcast TV at MediaEval 2016","volume":"1739","author":"Bredin","year":"2016","journal-title":"CEUR Workshop Proc."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1007\/s11042-014-2274-x","article-title":"Audio-visual speaker diarization using fisher linear semi-discriminant analysis","volume":"75","author":"Sarafianos","year":"2016","journal-title":"Multimed. Tools Appl."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Bost, X., Linares, G., and Gueye, S. (2015, January 19\u201324). Audiovisual speaker diarization of TV series. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia.","DOI":"10.1109\/ICASSP.2015.7178882"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"747","DOI":"10.1007\/s11042-012-1080-6","article-title":"Audiovisual diarization of people in video content","volume":"68","author":"Joly","year":"2014","journal-title":"Multimed. Tools Appl."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1109\/TPAMI.2011.47","article-title":"Multimodal Speaker diarization","volume":"34","author":"Noulas","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"2223","DOI":"10.1007\/s11042-015-3181-5","article-title":"Multimodal speaker clustering in full length movies","volume":"76","author":"Kapsouras","year":"2017","journal-title":"Multimed. Tools Appl."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"27685","DOI":"10.1007\/s11042-018-5944-2","article-title":"Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis","volume":"77","author":"Lucena","year":"2018","journal-title":"Multimed. 
Tools Appl."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1086","DOI":"10.1109\/TPAMI.2017.2648793","article-title":"Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion","volume":"40","author":"Gebru","year":"2018","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_34","unstructured":"Chung, J.S., and Zisserman, A. (2016, January 20\u201324). Out of time: Automated lip sync in the wild. Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Komai, Y., Ariki, Y., and Takiguchi, T. (2011). Audio-Visual Speech Recognition Based on AAM Parameter and Phoneme Analysis of Visual Feature, Proceedings of the Lecture Notes in Computer Science (Including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Gwangju, Korea, 20\u201323 November 2011, Springer.","DOI":"10.1007\/978-3-642-25367-6_9"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"1306","DOI":"10.1109\/JPROC.2003.817150","article-title":"Recent advances in the automatic recognition of audiovisual speech","volume":"91","author":"Potamianos","year":"2003","journal-title":"IEEE"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1109\/TASL.2006.872619","article-title":"Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures","volume":"15","author":"Rivet","year":"2007","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"108","DOI":"10.1109\/TMM.2009.2037387","article-title":"Onsets coincidence for cross-modal analysis","volume":"12","author":"Barzelay","year":"2010","journal-title":"IEEE Trans. Multimed."},{"key":"ref_39","unstructured":"Fisher, J.W., Darrell, T., Freeman, W.T., and Viola, P. (2001, January 3\u20138). 
Learning joint statistical models for audio-visual fusion and segregation. Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Siracusa, M.R., and Fisher, J.W. (2007, January 15\u201320). Dynamic dependency tests for audio-visual speaker association. Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing-Proceedings, Honolulu, HI, USA.","DOI":"10.1109\/ICASSP.2007.366271"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Noulas, A.K., and Krose, B.J.A. (2007). On-line multi-modal speaker diarization, Proceedings of the 9th International Conference on Multimodal Interfaces, ICMI\u201907, Nagoya, Aichi, Japan, 12\u201315 November 2007, ACM Press.","DOI":"10.1145\/1322192.1322254"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"488","DOI":"10.1007\/3-540-45113-7_48","article-title":"Speaker localisation using audio-visual synchrony: An empirical study","volume":"2728","author":"Nock","year":"2003","journal-title":"Lect. Notes Comput. Sci."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Friedland, G., Hung, H., Yeo, C., and Berkeley, U.C. (2009, January 19\u201324). Multi-modal speaker diarization of real-world meetings using compressed-domain video features int. Computer Science Institute Rue Marconi 19 CH-1920 Martigny. Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan.","DOI":"10.1109\/ICASSP.2009.4960522"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Garau, G., Dielmann, A., and Bourlard, H. (2010, January 26\u201330). Audio-visual synchronisation for speaker diarisation. 
Proceedings of the 11th Annual Conference of the International Speech Communication Association, Makuhari, Japan.","DOI":"10.21437\/Interspeech.2010-704"},{"key":"ref_45","first-page":"28","article-title":"The AMI Meeting Corpus: A Pre-announcement Machine Learning for Multimodal Interaction","volume":"Volume 3869","author":"Carletta","year":"2006","journal-title":"International Workshop on Machine Learning for Multimodal Interaction, Edinburgh, UK, July 11\u201313 2005"},{"key":"ref_46","unstructured":"(2019, November 23). Rehan-Ahmad\/MultimodalDiarization: Multimodal Speaker Diarization Using Pre-Trained Audio-Visual Synchronization Model. Available online: https:\/\/github.com\/Rehan-Ahmad\/MultimodalDiarization."},{"key":"ref_47","unstructured":"(2019, November 24). AMI Corpus. Available online: http:\/\/groups.inf.ed.ac.uk\/ami\/corpus\/."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Yin, R., Bredin, H., and Barras, C. (2017, January 20\u201324). Speaker change detection in broadcast TV using bidirectional long short-term memory networks. Proceedings of the Annual Conference of the International Speech Communication Association, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-65"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Graves, A., Jaitly, N., and Mohamed, A.R. (2013, January 8\u201312). Hybrid speech recognition with Deep Bidirectional LSTM. Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic.","DOI":"10.1109\/ASRU.2013.6707742"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Bredin, H. (2017, January 5\u20139). TristouNet: Triplet loss for speaker turn embedding. 
Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953194"},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"972","DOI":"10.1126\/science.1136800","article-title":"Clustering by passing messages between data points","volume":"315","author":"Frey","year":"2007","journal-title":"Science"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Gebru, I.D., Ba, S., Evangelidis, G., and Horaud, R. (2015, January 11\u201312). Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCVW.2015.96"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"718","DOI":"10.1109\/TASLP.2015.2405475","article-title":"Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression","volume":"23","author":"Deleforge","year":"2015","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Mcfee, B., Raffel, C., Liang, D., Ellis, D.P.W., Mcvicar, M., Battenberg, E., and Nieto, O. (2015, January 6\u201312). librosa: Audio and Music Signal Analysis in Python. Proceedings of the 14th Python in Science Conference, Austin, TX, USA.","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"ref_55","first-page":"1755","article-title":"Dlib-ml: A Machine Learning Toolkit","volume":"10","author":"King","year":"2009","journal-title":"J. Mach. Learn. Res."},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Bredin, H. (2017, January 20\u201324). Pyannote.metrics: A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. 
Proceedings of the Annual Conference of the International Speech Communication Association, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-411"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/19\/23\/5163\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:37:26Z","timestamp":1760189846000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/19\/23\/5163"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,25]]},"references-count":56,"journal-issue":{"issue":"23","published-online":{"date-parts":[[2019,12]]}},"alternative-id":["s19235163"],"URL":"https:\/\/doi.org\/10.3390\/s19235163","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,11,25]]}}}