{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,17]],"date-time":"2026-06-17T16:19:39Z","timestamp":1781713179622,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":67,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"DSO National Laboratories - Singapore"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548027","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:46Z","timestamp":1665416566000},"page":"3838-3847","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":32,"title":["AVA-AVD: Audio-visual Speaker Diarization in the Wild"],"prefix":"10.1145","author":[{"given":"Eric Zhongcong","family":"Xu","sequence":"first","affiliation":[{"name":"Showlab, National University of Singapore, Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zeyang","family":"Song","sequence":"additional","affiliation":[{"name":"Showlab, National University of Singapore, Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Satoshi","family":"Tsutsui","sequence":"additional","affiliation":[{"name":"Showlab, National University of Singapore, Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chao","family":"Feng","sequence":"additional","affiliation":[{"name":"Showlab, National University of Singapore, Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Mang","family":"Ye","sequence":"additional","affiliation":[{"name":"Wuhan University, Wuhan, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Mike Zheng","family":"Shou","sequence":"additional","affiliation":[{"name":"Showlab, National University of Singapore, Singapore, Singapore"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"crossref","unstructured":"T. Afouras J. S. Chung and A. Zisserman. 2018. Deep Lip Reading: a comparison of models and an online application. In INTERSPEECH.  T. Afouras J. S. Chung and A. Zisserman. 2018. Deep Lip Reading: a comparison of models and an online application. In INTERSPEECH.","DOI":"10.21437\/Interspeech.2018-1943"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW53098.2021.00188"},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1590\/S2317-17822013000400003"},{"key":"e_1_3_2_2_4_1","volume-title":"Face","author":"Brown Andrew","year":"2021","unstructured":"Andrew Brown , Vicky Kalogeiton , and Andrew Zisserman . 2021. Face , Body, Voice : Video Person-Clustering with Multiple Modalities. ICCVWorkshop ( 2021 ). Andrew Brown, Vicky Kalogeiton, and Andrew Zisserman. 2021. Face, Body, Voice: Video Person-Clustering with Multiple Modalities. ICCVWorkshop (2021)."},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2018.00020"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_2_7_1","volume-title":"The AMI meeting corpus: A pre-announcement. In International workshop on machine learning for multimodal interaction. Springer, 28--39","author":"Carletta Jean","year":"2005","unstructured":"Jean Carletta , Simone Ashby , Sebastien Bourban , Mike Flynn , Mael Guillemot , Thomas Hain , Jaroslav Kadlec , Vasilis Karaiskos , Wessel Kraaij , Melissa Kronenthal , 2005 . The AMI meeting corpus: A pre-announcement. 
In International workshop on machine learning for multimodal interaction. Springer, 28--39 . Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. 2005. The AMI meeting corpus: A pre-announcement. In International workshop on machine learning for multimodal interaction. Springer, 28--39."},{"key":"e_1_3_2_2_8_1","volume-title":"Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han.","author":"Chung Joon Son","year":"2020","unstructured":"Joon Son Chung , Jaesung Huh , Seongkyu Mun , Minjae Lee , Hee Soo Heo , Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han. 2020 . In defence of metric learning for speaker recognition. In INTERSPEECH. Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han. 2020. In defence of metric learning for speaker recognition. In INTERSPEECH."},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"crossref","unstructured":"Joon Son Chung Jaesung Huh Arsha Nagrani Triantafyllos Afouras and Andrew Zisserman. 2020. Spot the conversation: speaker diarisation in the wild. In INTERSPEECH.  Joon Son Chung Jaesung Huh Arsha Nagrani Triantafyllos Afouras and Andrew Zisserman. 2020. Spot the conversation: speaker diarisation in the wild. In INTERSPEECH.","DOI":"10.21437\/Interspeech.2020-2337"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"crossref","unstructured":"Joon Son Chung Bong-Jin Lee and Icksang Han. 2019. Who said that?: Audiovisual speaker diarisation of real-world meetings. In INTERSPEECH.  Joon Son Chung Bong-Jin Lee and Icksang Han. 2019. Who said that?: Audiovisual speaker diarisation of real-world meetings. In INTERSPEECH.","DOI":"10.21437\/Interspeech.2019-3116"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"crossref","unstructured":"J. S. Chung A. Nagrani and A. Zisserman. 2018. VoxCeleb2: Deep Speaker Recognition. 
In INTERSPEECH.  J. S. Chung A. Nagrani and A. Zisserman. 2018. VoxCeleb2: Deep Speaker Recognition. In INTERSPEECH.","DOI":"10.21437\/Interspeech.2018-1929"},{"key":"e_1_3_2_2_12_1","volume-title":"Workshop on Multi-view Lip-reading, ACCV.","author":"Chung J. S.","unstructured":"J. S. Chung and A. Zisserman . 2016. Out of time: automated lip sync in the wild . In Workshop on Multi-view Lip-reading, ACCV. J. S. Chung and A. Zisserman. 2016. Out of time: automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV."},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8682524"},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01890115"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2010.2064307"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00525"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00482"},{"key":"e_1_3_2_2_18_1","unstructured":"Mireia Diez Luk\u00e1\u0161 Burget Federico Landini Shuai Wang and Honza \u010cernock\u00fd  Mireia Diez Luk\u00e1\u0161 Burget Federico Landini Shuai Wang and Honza \u010cernock\u00fd"},{"key":"e_1_3_2_2_19_1","volume-title":"ICASSP 2020--2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6519--6523","year":"2020","unstructured":"2020 . Optimizing Bayesian HMM based x-vector clustering for the second DIHARD speech diarization challenge . In ICASSP 2020--2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6519--6523 . 2020. Optimizing Bayesian HMM based x-vector clustering for the second DIHARD speech diarization challenge. In ICASSP 2020--2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6519--6523."},{"key":"e_1_3_2_2_20_1","volume-title":"Audiovisual diarization of people in video content. 
Multimedia tools and applications 68, 3","author":"Khoury Elie El","year":"2014","unstructured":"Elie El Khoury , Christine S\u00e9nac , and Philippe Joly . 2014. Audiovisual diarization of people in video content. Multimedia tools and applications 68, 3 ( 2014 ), 747--775. Elie El Khoury, Christine S\u00e9nac, and Philippe Joly. 2014. Audiovisual diarization of people in video content. Multimedia tools and applications 68, 3 (2014), 747--775."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01524"},{"key":"e_1_3_2_2_22_1","unstructured":"Yixiao Ge Feng Zhu Dapeng Chen Rui Zhao and Hongsheng Li. 2020. Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID. In Advances in Neural Information Processing Systems.  Yixiao Ge Feng Zhu Dapeng Chen Rui Zhao and Hongsheng Li. 2020. Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_2_2_23_1","volume-title":"Audio-visual speaker diarization based on spatiotemporal Bayesian fusion","author":"Gebru Israel D","year":"2017","unstructured":"Israel D Gebru , Sileye Ba , Xiaofei Li , and Radu Horaud . 2017. Audio-visual speaker diarization based on spatiotemporal Bayesian fusion . IEEE transactions on pattern analysis and machine intelligence 40, 5 ( 2017 ), 1086--1099. Israel D Gebru, Sileye Ba, Xiaofei Li, and Radu Horaud. 2017. Audio-visual speaker diarization based on spatiotemporal Bayesian fusion. IEEE transactions on pattern analysis and machine intelligence 40, 5 (2017), 1086--1099."},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01842"},{"key":"e_1_3_2_2_25_1","volume-title":"European conference on computer vision (ECCV). Springer, 87--102","author":"Guo Yandong","year":"2016","unstructured":"Yandong Guo , Lei Zhang , Yuxiao Hu , Xiaodong He , and Jianfeng Gao . 2016 . 
MS-Celeb-1M: A dataset and benchmark for large-scale face recognition . In European conference on computer vision (ECCV). Springer, 87--102 . Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. 2016. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European conference on computer vision (ECCV). Springer, 87--102."},{"key":"e_1_3_2_2_26_1","volume-title":"Audio vision: Using audio-visual synchrony to locate sounds. Advances in neural information processing systems 12","author":"Hershey John","year":"1999","unstructured":"John Hershey and Javier Movellan . 1999. Audio vision: Using audio-visual synchrony to locate sounds. Advances in neural information processing systems 12 ( 1999 ), 813--819. John Hershey and Javier Movellan. 1999. Audio vision: Using audio-visual synchrony to locate sounds. Advances in neural information processing systems 12 (1999), 813--819."},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-1022"},{"key":"e_1_3_2_2_28_1","volume-title":"Breakthroughs in statistics","author":"Hotelling Harold","unstructured":"Harold Hotelling . 1992. Relations between two sets of variates . In Breakthroughs in statistics . Springer , 162--190. Harold Hotelling. 1992. Relations between two sets of variates. In Breakthroughs in statistics. Springer, 162--190."},{"key":"e_1_3_2_2_29_1","volume-title":"International conference on machine learning. PMLR, 1558--1567","author":"Hu Weihua","year":"2017","unstructured":"Weihua Hu , Takeru Miyato , Seiya Tokui , Eiichi Matsumoto , and Masashi Sugiyama . 2017 . Learning discrete representations via information maximizing self-augmented training . In International conference on machine learning. PMLR, 1558--1567 . Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. 2017. Learning discrete representations via information maximizing self-augmented training. In International conference on machine learning. 
PMLR, 1558--1567."},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053760"},{"key":"e_1_3_2_2_31_1","volume-title":"NIST RT'05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings. In International Workshop on Machine Learning for Multimodal Interaction. Springer, 428--439","author":"Istrate Dan","year":"2005","unstructured":"Dan Istrate , Corinne Fredouille , Sylvain Meignier , Laurent Besacier , and Jean Fran\u00e7ois Bonastre . 2005 . NIST RT'05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings. In International Workshop on Machine Learning for Multimodal Interaction. Springer, 428--439 . Dan Istrate, Corinne Fredouille, Sylvain Meignier, Laurent Besacier, and Jean Fran\u00e7ois Bonastre. 2005. NIST RT'05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings. In International Workshop on Machine Learning for Multimodal Interaction. Springer, 428--439."},{"key":"e_1_3_2_2_32_1","volume-title":"Proceedings of the 33rd International Conference on Neural Information Processing Systems. 5888--5892","author":"Jiang Yangbangyan","year":"2019","unstructured":"Yangbangyan Jiang , Qianqian Xu , Zhiyong Yang , Xiaochun Cao , and Qingming Huang . 2019 . Dm2c: Deep mixed-modal clustering . In Proceedings of the 33rd International Conference on Neural Information Processing Systems. 5888--5892 . Yangbangyan Jiang, Qianqian Xu, Zhiyong Yang, Xiaochun Cao, and Qingming Huang. 2019. Dm2c: Deep mixed-modal clustering. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. 5888--5892."},{"key":"e_1_3_2_2_33_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV). 5276--5285","author":"Jin SouYoung","year":"2017","unstructured":"SouYoung Jin , Hang Su , Chris Stauffer , and Erik Learned-Miller . 2017 . 
End-to-end face detection and cast grouping in movies using Erdos-Renyi clustering . In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 5276--5285 . SouYoung Jin, Hang Su, Chris Stauffer, and Erik Learned-Miller. 2017. End-to-end face detection and cast grouping in movies using Erdos-Renyi clustering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 5276--5285."},{"key":"e_1_3_2_2_34_1","unstructured":"Alan B Johnston and Daniel C Burnett. 2012. WebRTC: APIs and RTCWEB protocols of the HTML5 real-time web. Digital Codex LLC.  Alan B Johnston and Daniel C Burnett. 2012. WebRTC: APIs and RTCWEB protocols of the HTML5 real-time web. Digital Codex LLC."},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-015-3181-5"},{"key":"e_1_3_2_2_36_1","volume-title":"Odyssey","volume":"14","author":"Kenny Patrick","year":"2010","unstructured":"Patrick Kenny . 2010 . Bayesian speaker verification with heavy-tailed priors . In Odyssey , Vol. 14 . Patrick Kenny. 2010. Bayesian speaker verification with heavy-tailed priors. In Odyssey, Vol. 14."},{"key":"e_1_3_2_2_37_1","unstructured":"Gregory Koch Richard Zemel Ruslan Salakhutdinov et al. 2015. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop Vol. 2. Lille.  Gregory Koch Richard Zemel Ruslan Salakhutdinov et al. 2015. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop Vol. 2. Lille."},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9414315"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.csl.2021.101254"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"crossref","unstructured":"Ivan Medennikov Maxim Korenevsky Tatiana Prisyach Yuri Khokhlov Mariya Korenevskaya Ivan Sorokin Tatiana Timofeeva Anton Mitrofanov Andrei Andrusenko Ivan Podluzhny et al. 2020. 
Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario. In INTERSPEECH.  Ivan Medennikov Maxim Korenevsky Tatiana Prisyach Yuri Khokhlov Mariya Korenevskaya Ivan Sorokin Tatiana Timofeeva Anton Mitrofanov Andrei Andrusenko Ivan Podluzhny et al. 2020. Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario. In INTERSPEECH.","DOI":"10.21437\/Interspeech.2020-1602"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01261-8_5"},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00879"},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"crossref","unstructured":"A. Nagrani J. S. Chung and A. Zisserman. 2017. VoxCeleb: a large-scale speaker identification dataset. In INTERSPEECH.  A. Nagrani J. S. Chung and A. Zisserman. 2017. VoxCeleb: a large-scale speaker identification dataset. In INTERSPEECH.","DOI":"10.21437\/Interspeech.2017-950"},{"key":"e_1_3_2_2_44_1","unstructured":"Andrew Y Ng Michael I Jordan and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems. 849--856.  Andrew Y Ng Michael I Jordan and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems. 849--856."},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2011.47"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"crossref","unstructured":"Omkar M Parkhi Andrea Vedaldi and Andrew Zisserman. 2015. Deep face recognition. (2015).  Omkar M Parkhi Andrea Vedaldi and Andrew Zisserman. 2015. Deep face recognition. 
(2015).","DOI":"10.5244\/C.29.41"},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3057230"},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00809"},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053900"},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"crossref","unstructured":"Neville Ryant Kenneth Church Christopher Cieri Alejandrina Cristia Jun Du Sriram Ganapathy and Mark Liberman. 2019. The second DIHARD diarization challenge: Dataset, task, and baselines. In INTERSPEECH.  Neville Ryant Kenneth Church Christopher Cieri Alejandrina Cristia Jun Du Sriram Ganapathy and Mark Liberman. 2019. The second DIHARD diarization challenge: Dataset, task, and baselines. In INTERSPEECH.","DOI":"10.21437\/Interspeech.2019-1268"},{"key":"e_1_3_2_2_51_1","volume-title":"ICASSP 2021--2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6194--6198","author":"Sar\u0131 Leda","year":"2021","unstructured":"Leda Sar\u0131 , Kritika Singh , Jiatong Zhou , Lorenzo Torresani , Nayan Singhal , and Yatharth Saraf . 2021 . A Multi-View Approach to Audio-Visual Speaker Verification . In ICASSP 2021--2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6194--6198 . Leda Sar\u0131, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, and Yatharth Saraf. 2021. A Multi-View Approach to Audio-Visual Speaker Verification. In ICASSP 2021--2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6194--6198."},{"key":"e_1_3_2_2_52_1","volume-title":"Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge. In INTERSPEECH. 
2808--2812.","author":"Sell Gregory","year":"2018","unstructured":"Gregory Sell , David Snyder , Alan McCree , Daniel Garcia-Romero , Jes\u00fas Villalba , Matthew Maciejewski , Vimal Manohar , Najim Dehak , Daniel Povey , Shinji Watanabe, 2018 . Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge. In INTERSPEECH. 2808--2812. Gregory Sell, David Snyder, Alan McCree, Daniel Garcia-Romero, Jes\u00fas Villalba, Matthew Maciejewski, Vimal Manohar, Najim Dehak, Daniel Povey, Shinji Watanabe, et al. 2018. Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge. In INTERSPEECH. 2808--2812."},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461375"},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01422"},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00131"},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475587"},{"key":"e_1_3_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00513"},{"key":"e_1_3_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462665"},{"key":"e_1_3_2_2_59_1","volume-title":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5239--5243","author":"Wang Quan","year":"2018","unstructured":"Quan Wang, Carlton Downey , Li Wan, Philip Andrew Mansfield , and Ignacio Lopez Moreno . 2018 . Speaker diarization with LSTM . In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5239--5243 . Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno. 2018. Speaker diarization with LSTM. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 
IEEE, 5239--5243."},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01608"},{"key":"e_1_3_2_2_61_1","volume-title":"Disjoint Mapping Network for Cross-modal Matching of Voices and Faces. In International Conference on Learning Representations.","author":"Wen Yandong","year":"2018","unstructured":"Yandong Wen , Mahmoud Al Ismail , Weiyang Liu , Bhiksha Raj , and Rita Singh . 2018 . Disjoint Mapping Network for Cross-modal Matching of Voices and Faces. In International Conference on Learning Representations. Yandong Wen, Mahmoud Al Ismail, Weiyang Liu, Bhiksha Raj, and Rita Singh. 2018. Disjoint Mapping Network for Cross-modal Matching of Voices and Faces. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58610-2_11"},{"key":"e_1_3_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413832"},{"key":"e_1_3_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58598-3_13"},{"key":"e_1_3_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-6393(98)00048-X"},{"key":"e_1_3_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683892"},{"key":"e_1_3_2_2_67_1","volume-title":"Proceedings of the IEEE international conference on computer vision (ICCV). 192--201","author":"Zhang Shifeng","year":"2017","unstructured":"Shifeng Zhang , Xiangyu Zhu , Zhen Lei , Hailin Shi , Xiaobo Wang , and Stan Z Li . 2017 . S3fd: Single shot scale-invariant face detector . In Proceedings of the IEEE international conference on computer vision (ICCV). 192--201 . Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z Li. 2017. S3fd: Single shot scale-invariant face detector. In Proceedings of the IEEE international conference on computer vision (ICCV). 
192--201."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548027","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548027","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:29Z","timestamp":1750186949000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548027"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":67,"alternative-id":["10.1145\/3503161.3548027","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548027","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}