{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T10:01:30Z","timestamp":1775815290613,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":50,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Robotics Programme, Science and Engineering Research Council, Agency of Science, Technology and Research, Singapore","award":["1922500054"],"award-info":[{"award-number":["1922500054"]}]},{"name":"Human Robot Collaborative AI for AME, Singapore Government?s Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain), Singapore","award":["A18A2b0046"],"award-info":[{"award-number":["A18A2b0046"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,17]]},"DOI":"10.1145\/3474085.3475587","type":"proceedings-article","created":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T20:00:05Z","timestamp":1634587205000},"page":"3927-3935","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":122,"title":["Is Someone Speaking?"],"prefix":"10.1145","author":[{"given":"Ruijie","family":"Tao","sequence":"first","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"given":"Zexu","family":"Pan","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"given":"Rohan Kumar","family":"Das","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"given":"Xinyuan","family":"Qian","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"given":"Mike Zheng","family":"Shou","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]},{"given":"Haizhou","family":"Li","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}]}],"member":"320","published-online":{"date-parts":[[2021,10,17]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Andrew Senior, Oriol Vinyals, and Andrew Zisserman.","author":"Afouras Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras , Joon Son Chung , Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018 c. Deep audio-visual speech recognition. IEEE transactions on pattern analysis and machine intelligence (TPAMI) ( 2018). Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018c. Deep audio-visual speech recognition. IEEE transactions on pattern analysis and machine intelligence (TPAMI) (2018)."},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1400"},{"key":"e_1_3_2_2_3_1","volume-title":"Joon Son Chung, and Andrew Zisserman","author":"Afouras Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras , Joon Son Chung, and Andrew Zisserman . 2018 b. LRS 3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496 (2018). Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. 2018b. LRS3-TED: a large-scale dataset for visual speech recognition. 
arXiv preprint arXiv:1809.00496 (2018)."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58523-5_13"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01248"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2019.2901195"},{"key":"e_1_3_2_2_7_1","volume-title":"RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis","author":"Beyan Cigdem","year":"2020","unstructured":"Cigdem Beyan, Muhammad Shahid, and Vittorio Murino. 2020. RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis. IEEE Transactions on Multimedia (2020)."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-24673-2_3"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2818346.2820780"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_18"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413869"},{"key":"e_1_3_2_2_12_1","volume-title":"Naver at ActivityNet Challenge 2019--Task B Active Speaker Detection (AVA). arXiv preprint arXiv:1906.10555","author":"Chung Joon Son","year":"2019","unstructured":"Joon Son Chung. 2019. Naver at ActivityNet Challenge 2019--Task B Active Speaker Detection (AVA). arXiv preprint arXiv:1906.10555 (2019)."},{"key":"e_1_3_2_2_13_1","volume-title":"Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han. 2020 a. In defence of metric learning for speaker recognition. In Interspeech","author":"Chung Joon Son","year":"2020","unstructured":"Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han. 2020a. In defence of metric learning for speaker recognition. In Interspeech 2020."},{"key":"e_1_3_2_2_14_1","first-page":"299","article-title":"b","volume":"2020","author":"Chung Joon Son","year":"2020","unstructured":"Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, and Andrew Zisserman. 2020b. Spot the Conversation: Speaker Diarisation in the Wild. In Proc. Interspeech 2020. 299--303.","journal-title":"Spot the Conversation: Speaker Diarisation in the Wild. In Proc. Interspeech"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-3116"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1929"},{"key":"e_1_3_2_2_17_1","volume-title":"Asian conference on computer vision (ACCV). Springer, 251--263","author":"Chung Joon Son","year":"2016","unstructured":"Joon Son Chung and Andrew Zisserman. 2016. Out of time: automated lip sync in the wild. 
In Asian conference on computer vision (ACCV). Springer, 251--263."},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8682524"},{"key":"e_1_3_2_2_19_1","volume-title":"Oxford guide to plain English","author":"Cutts Martin","unstructured":"Martin Cutts. 2020. Oxford guide to plain English. Oxford University Press, USA."},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.21437\/Odyssey.2020-62"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3197517.3201357"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compbiomed.2020.104152"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00483"},{"key":"e_1_3_2_2_25_1","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Ko T.","year":"2017","unstructured":"T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur. 2017. A study on data augmentation of reverberant speech for robust speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2017. 5220--5224."},{"key":"e_1_3_2_2_26_1","volume-title":"Ali Thabet, and Bernard Ghanem.","author":"Le\u00f3n-Alc\u00e1zar Juan","year":"2021","unstructured":"Juan Le\u00f3n-Alc\u00e1zar, Fabian Caba Heilbron, Ali Thabet, and Bernard Ghanem. 2021. MAAS: Multi-modal assignation for active speaker detection. arXiv preprint arXiv:2101.03682 (2021)."},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2015.2463722"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01231-1_39"},{"key":"e_1_3_2_2_29_1","volume-title":"Muse: Multi-modal target speaker extraction with visual cues. arXiv preprint arXiv:2010.07775","author":"Pan Zexu","year":"2020","unstructured":"Zexu Pan, Ruijie Tao, Chenglin Xu, and Haizhou Li. 2020. Muse: Multi-modal target speaker extraction with visual cues. arXiv preprint arXiv:2010.07775 (2020)."},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2016.2535357"},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8682912"},{"key":"e_1_3_2_2_32_1","volume-title":"IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU).","author":"Povey Daniel","year":"2011","unstructured":"Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. 
In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU)."},{"key":"e_1_3_2_2_33_1","volume-title":"2021 a. Audio-visual tracking of concurrent speakers","author":"Qian Xinyuan","year":"2021","unstructured":"Xinyuan Qian, Alessio Brutti, Oswald Lanz, Maurizio Omologo, and Andrea Cavallaro. 2021a. Audio-visual tracking of concurrent speakers. IEEE Transactions on Multimedia (2021)."},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413776"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053900"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2800728"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-30642-7_5"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00238"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093490"},{"key":"e_1_3_2_2_40_1","volume-title":"Crossmodal learning for audio-visual speech event localization. arXiv preprint arXiv:2003.04358","author":"Sharma Rahul","year":"2020","unstructured":"Rahul Sharma, Krishna Somandepalli, and Shrikanth Narayanan. 2020. Crossmodal learning for audio-visual speech event localization. arXiv preprint arXiv:2003.04358 (2020)."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00985"},{"key":"e_1_3_2_2_42_1","volume-title":"MUSAN: A Music, Speech, and Noise Corpus. CoRR","author":"Snyder D.","year":"2015","unstructured":"D. Snyder, G. Chen, and D. Povey. 2015. MUSAN: A Music, Speech, and Noise Corpus. CoRR, Vol. abs\/1510.08484 (2015). http:\/\/arxiv.org\/abs\/1510.08484"},{"key":"e_1_3_2_2_43_1","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Snyder D.","year":"2018","unstructured":"D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur. 2018. X-Vectors: robust DNN embeddings for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018. 5329--5333."},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"crossref","unstructured":"Fei Tao and Carlos Busso. 2017. Bimodal recurrent neural network for audiovisual voice activity detection. In INTERSPEECH. 
1938--1942.","DOI":"10.21437\/Interspeech.2017-1573"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2019.07.003"},{"key":"e_1_3_2_2_46_1","volume-title":"Speech rates in british english. Applied linguistics","author":"Tauroza Steve","year":"1990","unstructured":"Steve Tauroza and Desmond Allison. 1990. Speech rates in British English. Applied Linguistics, Vol. 11, 1 (1990), 90--105."},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00931"},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.5555\/1771530.1771554"},{"key":"e_1_3_2_2_49_1","volume-title":"Proceedings of the IEEE international conference on computer vision (ICCV). 192--201","author":"Zhang Shifeng","year":"2017","unstructured":"Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z Li. 2017. S3FD: Single shot scale-invariant face detector. In Proceedings of the IEEE international conference on computer vision (ICCV). 192--201."},{"key":"e_1_3_2_2_50_1","unstructured":"Yuan-Hang Zhang, Jingyun Xiao, Shuang Yang, and Shiguang Shan. 2019. Multi-Task Learning for Audio-Visual Active Speaker Detection. (2019)."}],"event":{"name":"MM '21: ACM Multimedia Conference","location":"Virtual Event China","acronym":"MM '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 29th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475587","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3474085.3475587","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:49:11Z","timestamp":1750193351000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475587"}},"subtitle":["Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection"],"short-title":[],"issued":{"date-parts":[[2021,10,17]]},"references-count":50,"alternative-id":["10.1145\/3474085.3475587","10.1145\/3474085"],"URL":"https:\/\/doi.org\/10.1145\/3474085.3475587","relation":{},"subject":[],"published":{"date-parts":[[2021,10,17]]},"assertion":[{"value":"2021-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}