{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:30:56Z","timestamp":1750307456719,"version":"3.41.0"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2010,11,1]],"date-time":"2010-11-01T00:00:00Z","timestamp":1288569600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2010,11]]},"abstract":"<jats:p>The following article presents a novel audio-visual approach for unsupervised speaker localization in both time and space and systematically analyzes its unique properties. Using recordings from a single, low-resolution room overview camera and a single far-field microphone, a state-of-the-art audio-only speaker diarization system (speaker localization in time) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the speech regions and estimates \u201cwho spoke when,\u201d then, in a second step, the visual models are used to infer the location of the speakers in the video. We call this process \u201cdialocalization.\u201d The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system is able to exploit audio-visual integration to not only improve the accuracy of a state-of-the-art (audio-only) speaker diarization, but also adds visual speaker localization at little incremental engineering and computation costs. 
The combined algorithm has different properties, such as increased robustness, that cannot be observed in algorithms based on single modalities. The article describes the algorithm, presents benchmarking results, explains its properties, and systematically discusses the contributions of each modality.<\/jats:p>","DOI":"10.1145\/1865106.1865111","type":"journal-article","created":{"date-parts":[[2010,11,23]],"date-time":"2010-11-23T15:00:38Z","timestamp":1290524438000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Dialocalization"],"prefix":"10.1145","volume":"6","author":[{"given":"Gerald","family":"Friedland","sequence":"first","affiliation":[{"name":"International Computer Science Institute, Berkeley, CA"}]},{"given":"Chuohao","family":"Yeo","sequence":"additional","affiliation":[{"name":"Institute for Infocomm Research, Singapore"}]},{"given":"Hayley","family":"Hung","sequence":"additional","affiliation":[{"name":"IDIAP Research Institute, Martigny, Switzerland"}]}],"member":"320","published-online":{"date-parts":[[2010,11,26]]},"reference":[{"volume-title":"Proceedings of ISCA International Conference on Spoken Language Processing. 4--7.","author":"Adami A.","key":"e_1_2_2_1_1","unstructured":"Adami , A. , Burget , L. , Dupont , S. , Garudadri , H. , Grezl , F. , Hermansky , H. , Jain , P. , Kajarekar , S. , Morgan , N. , and Sivadas , S . 2002. Qualcomm-ICSI-OGI features for ASR . In Proceedings of ISCA International Conference on Spoken Language Processing. 4--7. Adami, A., Burget, L., Dupont, S., Garudadri, H., Grezl, F., Hermansky, H., Jain, P., Kajarekar, S., Morgan, N., and Sivadas, S. 2002. Qualcomm-ICSI-OGI features for ASR. In Proceedings of ISCA International Conference on Spoken Language Processing. 
4--7."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSMCB.2008.927274"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:VISI.0000011205.11775.fd"},{"volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Beymer D.","key":"e_1_2_2_4_1","unstructured":"Beymer , D. , McLauchlan , P. , Coifman , B. , and Malik , J . 1997. A Real-time Computer Vision System for Measuring Traffic Parameters . In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). Beymer, D., McLauchlan, P., Coifman, B., and Malik, J. 1997. A Real-time Computer Vision System for Measuring Traffic Parameters. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)."},{"volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4353--4356","author":"Boakye K.","key":"e_1_2_2_5_1","unstructured":"Boakye , K. , Trueba-Hornero , B. , Vinyals , O. , and Friedland , G . 2008. Overlapped speech detection for improved speaker diarization in multiparty meetings . In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4353--4356 . Boakye, K., Trueba-Hornero, B., Vinyals, O., and Friedland, G. 2008. Overlapped speech detection for improved speaker diarization in multiparty meetings. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4353--4356."},{"key":"e_1_2_2_6_1","unstructured":"Campbell N. and Suzuki N. 2006. Working with very sparse data to detect speaker and listener participation in a meetings corpus. http:\/\/www.speech=data.gp\/nick\/pubs\/MM.pdf.  Campbell N. and Suzuki N. 2006. Working with very sparse data to detect speaker and listener participation in a meetings corpus. 
http:\/\/www.speech=data.gp\/nick\/pubs\/MM.pdf."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/11677482_3"},{"volume-title":"Proceedings of the DARPA Speech Recognition Workshop.","author":"Chen S.","key":"e_1_2_2_8_1","unstructured":"Chen , S. and Gopalakrishnan , P . 1998. Speaker, environment and channel change detection and clustering via the Bayesian information criterion . In Proceedings of the DARPA Speech Recognition Workshop. Chen, S. and Gopalakrishnan, P. 1998. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In Proceedings of the DARPA Speech Recognition Workshop."},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1996.545722"},{"volume-title":"Proceedings of the IEEE International Conference on Multimedia and Expo (ICME). 239--243","author":"Chien S.-Y.","key":"e_1_2_2_10_1","unstructured":"Chien , S.-Y. , Huang , Y.-W. , Ma , S.-Y. , and Chen , L . -G. 2001. Automatic video segmentation for MPEG-4 using predictive watersheds . In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME). 239--243 . Chien, S.-Y., Huang, Y.-W., Ma, S.-Y., and Chen, L.-G. 2001. Automatic video segmentation for MPEG-4 using predictive watersheds. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME). 239--243."},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2004.827503"},{"volume-title":"Proceedings of the Conference on Neural Information Processing Systems (NIPS). 772--778","author":"Fisher J. W.","key":"e_1_2_2_12_1","unstructured":"Fisher , J. W. , Darrell , T. , Freeman , W. T. , and Viola , P. A . 2000. Learning joint statistical models for audio-visual fusion and segregation . In Proceedings of the Conference on Neural Information Processing Systems (NIPS). 772--778 . Fisher, J. W., Darrell, T., Freeman, W. T., and Viola, P. A. 2000. 
Learning joint statistical models for audio-visual fusion and segregation. In Proceedings of the Conference on Neural Information Processing Systems (NIPS). 772--778."},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2009.4960522"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1142\/S1793351X07000123"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2009.2015089"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1631272.1631301"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1631272.1631387"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2006.881678"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.868683"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1037\/h0039516"},{"volume-title":"PrintPartners Ipskamp","author":"Huijbregts M.","key":"e_1_2_2_22_1","unstructured":"Huijbregts , M. 2008. Segmentation , Diarization, and Speech Transcription : Surprise Data Unraveled. PrintPartners Ipskamp , Enschede, Netherlands . Huijbregts, M. 2008. Segmentation, Diarization, and Speech Transcription: Surprise Data Unraveled. PrintPartners Ipskamp, Enschede, Netherlands."},{"volume-title":"Proceedings of the Workshop on Multi-Camera and Multi-Modal Sensor Fusion Algorithms and Applications in conjunction with ECCV.","author":"Hung H.","key":"e_1_2_2_23_1","unstructured":"Hung , H. and Friedland , G . 2008. Towards audio-visual on-line diarization of participants in group meetings . In Proceedings of the Workshop on Multi-Camera and Multi-Modal Sensor Fusion Algorithms and Applications in conjunction with ECCV. Hung, H. and Friedland, G. 2008. Towards audio-visual on-line diarization of participants in group meetings. 
In Proceedings of the Workshop on Multi-Camera and Multi-Modal Sensor Fusion Algorithms and Applications in conjunction with ECCV."},{"volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 835--838","author":"Hung H.","key":"e_1_2_2_24_1","unstructured":"Hung , H. , Huang , Y. , Friedland , G. , and Gatica-Perez , D . 2008. Estimating the dominant person in multi-party conversations using speaker diarization strategies . In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 835--838 . Hung, H., Huang, Y., Friedland, G., and Gatica-Perez, D. 2008. Estimating the dominant person in multi-party conversations using speaker diarization strategies. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 835--838."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2010.2066267"},{"key":"e_1_2_2_26_1","unstructured":"Huynh B.-L. 2008. Towards Multimodal Speaker Diarization. Master Thesis Ecole Polytechnique Federale de Lausanne.  Huynh B.-L. 2008. Towards Multimodal Speaker Diarization. Master Thesis Ecole Polytechnique Federale de Lausanne."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1137\/S1052623496303470"},{"key":"e_1_2_2_28_1","doi-asserted-by":"crossref","unstructured":"McGurk H. and MacDonald J. 1976. Hearing lips and seeing voices. Nature 264 5588 746--48.  McGurk H. and MacDonald J. 1976. Hearing lips and seeing voices. Nature 264 5588 746--48.","DOI":"10.1038\/264746a0"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0031-3203(98)00066-1"},{"volume-title":"Cambridge University Press","author":"McNeill D.","key":"e_1_2_2_30_1","unstructured":"McNeill , D. 2000. Language and Gesture. Cambridge University Press , Cambridge, UK . McNeill, D. 2000. Language and Gesture. 
Cambridge University Press, Cambridge, UK."},{"key":"e_1_2_2_31_1","unstructured":"Mermelstein P. 1976. Distance measures for speech recognition psychological and instrumental. Patt. Recog. Art. Intel. 374--388.  Mermelstein P. 1976. Distance measures for speech recognition psychological and instrumental. Patt. Recog. Art. Intel. 374--388."},{"volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 1--5.","author":"Misra H.","key":"e_1_2_2_32_1","unstructured":"Misra , H. , Bourlard , H. , and Tyagi , V . 2003. New entropy based combination rules in HMM\/ANN multi-stream ASR . In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 1--5. Misra, H., Bourlard, H., and Tyagi, V. 2003. New entropy based combination rules in HMM\/ANN multi-stream ASR. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 1--5."},{"volume-title":"Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR). 488--499","author":"Nock H. J.","key":"e_1_2_2_33_1","unstructured":"Nock , H. J. , Iyengar , G. , and Neti , C . 2003. Speaker localisation using audio-visual synchrony: An empirical study . In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR). 488--499 . Nock, H. J., Iyengar, G., and Neti, C. 2003. Speaker localisation using audio-visual synchrony: An empirical study. In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR). 488--499."},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/1322192.1322254"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2007.1077"},{"volume-title":"Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 2017--2020","author":"Patterson E. K.","key":"e_1_2_2_36_1","unstructured":"Patterson , E. K. , Gurbuz , S. , Tufekci , Z. 
, and Gowdy , J. N . 2002. CUAVE: A new audio-visual database for multimodal human-computer interface research . In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 2017--2020 . Patterson, E. K., Gurbuz, S., Tufekci, Z., and Gowdy, J. N. 2002. CUAVE: A new audio-visual database for multimodal human-computer interface research. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. 2017--2020."},{"volume-title":"Proceedings of the International Picture Coding Symposium.","author":"Rao R.","key":"e_1_2_2_37_1","unstructured":"Rao , R. and Chen , T . 1996. Exploiting audio-visual correlation in coding of talking head sequences . Proceedings of the International Picture Coding Symposium. Rao, R. and Chen, T. 1996. Exploiting audio-visual correlation in coding of talking head sequences. Proceedings of the International Picture Coding Symposium."},{"volume-title":"Proceedings of the International Conference Audio and Speech Signal Processing. 953--960","author":"Reynolds D. A.","key":"e_1_2_2_39_1","unstructured":"Reynolds , D. A. and Torres-Carrasquillo , P . 2005. Approaches and applications of audio diarization . In Proceedings of the International Conference Audio and Speech Signal Processing. 953--960 . Reynolds, D. A. and Torres-Carrasquillo, P. 2005. Approaches and applications of audio diarization. In Proceedings of the International Conference Audio and Speech Signal Processing. 953--960."},{"volume-title":"264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia","author":"Richardson I.","key":"e_1_2_2_40_1","unstructured":"Richardson , I. 2003. H. 264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia . John Wiley &amp; Sons Inc. Richardson, I. 2003. H. 264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia. John Wiley &amp; Sons Inc."},{"key":"e_1_2_2_41_1","doi-asserted-by":"crossref","unstructured":"Simon M. 
Behnke S. and Rojas R. 2001. Robust real time color tracking. In RoboCup 2000: Robot Soccer World Cup IV. Springer Berlin Germany 239--248.   Simon M. Behnke S. and Rojas R. 2001. Robust real time color tracking. In RoboCup 2000: Robot Soccer World Cup IV. Springer Berlin Germany 239--248.","DOI":"10.1007\/3-540-45324-5_22"},{"key":"e_1_2_2_42_1","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).","volume":"2","author":"Siracusa M.","unstructured":"Siracusa , M. and Fisher , J . 2007. Dynamic dependency tests for audio-visual speaker association . In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vol. 2 . 457--460. Siracusa, M. and Fisher, J. 2007. Dynamic dependency tests for audio-visual speaker association. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vol. 2. 457--460."},{"volume-title":"Real World Speech Processing","author":"Tamura S.","key":"e_1_2_2_43_1","unstructured":"Tamura , S. , Iwano , K. , and FURUI, S. 2004. Multi-modal speech recognition using optical-flow analysis for lip images . In Real World Speech Processing . Kluwer Academic Publishers . Tamura, S., Iwano, K., and FURUI, S. 2004. Multi-modal speech recognition using optical-flow analysis for lip images. In Real World Speech Processing. Kluwer Academic Publishers."},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2006.283"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2008.2005602"},{"volume-title":"Proceedings of the Rich Transcription Meeting Recognition Evaluation Workshop.","author":"Wooters C.","key":"e_1_2_2_46_1","unstructured":"Wooters , C. and Huijbregts , M . 2007. The ICSI RT07s speaker diarization system . In Proceedings of the Rich Transcription Meeting Recognition Evaluation Workshop. Wooters, C. and Huijbregts, M. 2007. 
The ICSI RT07s speaker diarization system. In Proceedings of the Rich Transcription Meeting Recognition Evaluation Workshop."},{"key":"e_1_2_2_47_1","unstructured":"Yeo C. and Ramchandran K. 2008. Compressed domain video processing of meetings for activity estimation in dominance classification and slide transition detection. Tech. rep. UCB\/EECS-2008-79 EECS Department University of California Berkeley.  Yeo C. and Ramchandran K. 2008. Compressed domain video processing of meetings for activity estimation in dominance classification and slide transition detection. Tech. rep. UCB\/EECS-2008-79 EECS Department University of California Berkeley."},{"volume-title":"Proceedings of the IEEE International Workshop on Multimedia Signal Processing (MMSP).","author":"Zhang C.","key":"e_1_2_2_48_1","unstructured":"Zhang , C. , Yin , P. , Rui , Y. , Cutler , R. , and Viola , P . 2006. Boosting-based multimodal speaker detection for distributed meetings . Proceedings of the IEEE International Workshop on Multimedia Signal Processing (MMSP). Zhang, C., Yin, P., Rui, Y., Cutler, R., and Viola, P. 2006. Boosting-based multimodal speaker detection for distributed meetings. 
Proceedings of the IEEE International Workshop on Multimedia Signal Processing (MMSP)."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1865106.1865111","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1865106.1865111","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T12:08:17Z","timestamp":1750248497000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1865106.1865111"}},"subtitle":["Acoustic speaker diarization and visual localization as joint optimization problem"],"short-title":[],"issued":{"date-parts":[[2010,11]]},"references-count":46,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2010,11]]}},"alternative-id":["10.1145\/1865106.1865111"],"URL":"https:\/\/doi.org\/10.1145\/1865106.1865111","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2010,11]]},"assertion":[{"value":"2010-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2010-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2010-11-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}