{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,16]],"date-time":"2025-07-16T13:55:47Z","timestamp":1752674147446,"version":"3.37.3"},"reference-count":47,"publisher":"Springer Science and Business Media LLC","issue":"5","license":[{"start":{"date-parts":[[2024,12,14]],"date-time":"2024-12-14T00:00:00Z","timestamp":1734134400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,12,14]],"date-time":"2024-12-14T00:00:00Z","timestamp":1734134400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001659","name":"Deutsche Forschungsgemeinschaft","doi-asserted-by":"publisher","award":["TRR 169"],"award-info":[{"award-number":["TRR 169"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100005711","name":"Universit\u00e4t Hamburg","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005711","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput &amp; Applic"],"published-print":{"date-parts":[[2025,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Active speaker detection (ASD) in multimodal environments is crucial for various applications, from video conferencing to human-robot interaction. This paper introduces FabuLight-ASD, an advanced ASD model that integrates facial, audio, and body pose information to enhance detection accuracy and robustness. Our model builds upon the existing Light-ASD framework by incorporating human pose data, represented through skeleton graphs, which minimises computational overhead. Using the Wilder Active Speaker Detection (WASD) dataset, renowned for reliable face and body bounding box annotations, we demonstrate FabuLight-ASD\u2019s effectiveness in real-world scenarios. Achieving an overall mean average precision (mAP) of 94.3%, FabuLight-ASD outperforms Light-ASD, which has an overall mAP of 93.7% across various challenging scenarios. The incorporation of body pose information shows a particularly advantageous impact, with notable improvements in mAP observed in scenarios with speech impairment, face occlusion, and human voice background noise. Furthermore, efficiency analysis indicates only a modest increase in parameter count (27.3%) and multiply-accumulate operations (up to 2.4%), underscoring the model\u2019s efficiency and feasibility. These findings validate the efficacy of FabuLight-ASD in enhancing ASD performance through the integration of body pose data. FabuLight-ASD\u2019s code and model weights are available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/knowledgetechnologyuhh\/FabuLight-ASD\" ext-link-type=\"uri\">https:\/\/github.com\/knowledgetechnologyuhh\/FabuLight-ASD<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s00521-024-10792-0","type":"journal-article","created":{"date-parts":[[2024,12,14]],"date-time":"2024-12-14T08:28:33Z","timestamp":1734164913000},"page":"3561-3579","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["FabuLight-ASD: unveiling speech activity via body language"],"prefix":"10.1007","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5094-5908","authenticated-orcid":false,"given":"Hugo","family":"Carneiro","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Stefan","family":"Wermter","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,12,14]]},"reference":[{"key":"10792_CR1","doi-asserted-by":"publisher","unstructured":"Afouras T, Chung JS, Zisserman A (2018) The conversation: Deep audio-visual speech enhancement. In: Yegnanarayana B (ed) Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018. ISCA, pp 3244\u20133248, https:\/\/doi.org\/10.21437\/INTERSPEECH.2018-1400","DOI":"10.21437\/INTERSPEECH.2018-1400"},{"key":"10792_CR2","doi-asserted-by":"publisher","unstructured":"Alc\u00e1zar JL, Caba F, Mai L, et\u00a0al (2020) Active speakers in context. In: 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation \/ IEEE, pp 12462\u201312471, https:\/\/doi.org\/10.1109\/CVPR42600.2020.01248","DOI":"10.1109\/CVPR42600.2020.01248"},{"key":"10792_CR3","doi-asserted-by":"publisher","unstructured":"Alc\u00e1zar JL, Heilbron FC, Thabet AK, et\u00a0al (2021) MAAS: multi-modal assignation for active speaker detection. In: 2021 IEEE\/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, pp 265\u2013274, https:\/\/doi.org\/10.1109\/ICCV48922.2021.00033","DOI":"10.1109\/ICCV48922.2021.00033"},{"key":"10792_CR4","doi-asserted-by":"publisher","unstructured":"Alc\u00e1zar JL, Cordes M, Zhao C, et\u00a0al (2022) End-to-end active speaker detection. In: Avidan S, Brostow GJ, Ciss\u00e9 M, et\u00a0al (eds) Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXVII, Lecture Notes in Computer Science, vol 13697. Springer, pp 126\u2013143, https:\/\/doi.org\/10.1007\/978-3-031-19836-6_8","DOI":"10.1007\/978-3-031-19836-6_8"},{"key":"10792_CR5","doi-asserted-by":"publisher","unstructured":"Carneiro HCC, Weber C, Wermter S (2021) FaVoA: Face-voice association favours ambiguous speaker detection. In: Farka\u0161 I, Masulli P, Otte S, et\u00a0al (eds) Artificial Neural Networks and Machine Learning - ICANN 2021 - 30th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14-17, 2021, Proceedings, Part I, Lecture Notes in Computer Science, vol 12891. Springer, pp 439\u2013450, https:\/\/doi.org\/10.1007\/978-3-030-86362-3_36","DOI":"10.1007\/978-3-030-86362-3_36"},{"key":"10792_CR6","doi-asserted-by":"publisher","DOI":"10.1016\/J.NEUCOM.2023.126271","volume":"545","author":"HCC Carneiro","year":"2023","unstructured":"Carneiro HCC, Weber C, Wermter S (2023) Whose emotion matters? Speaking activity localisation without prior knowledge. Neurocomputing 545:126271. https:\/\/doi.org\/10.1016\/J.NEUCOM.2023.126271","journal-title":"Neurocomputing"},{"key":"10792_CR7","doi-asserted-by":"publisher","unstructured":"Chakravarty P, Mirzaei S, Tuytelaars T, et\u00a0al (2015) Who\u2019s speaking?: Audio-supervised classification of active speakers in video. In: Zhang Z, Cohen P, Bohus D, et\u00a0al (eds) Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, November 09 - 13, 2015. ACM, pp 87\u201390, https:\/\/doi.org\/10.1145\/2818346.2820780","DOI":"10.1145\/2818346.2820780"},{"key":"10792_CR8","doi-asserted-by":"publisher","unstructured":"Cho K, van Merrienboer B, G\u00fcl\u00e7ehre \u00c7, et\u00a0al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Moschitti A, Pang B, Daelemans W (eds) Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, pp 1724\u20131734, https:\/\/doi.org\/10.3115\/V1\/D14-1179","DOI":"10.3115\/V1\/D14-1179"},{"key":"10792_CR9","unstructured":"Chung JS (2019) Naver at ActivityNet challenge 2019 - Task B active speaker detection (AVA). https:\/\/static.googleusercontent.com\/media\/research.google.com\/en\/\/ava\/2019\/Naver_Corporation.pdf"},{"key":"10792_CR10","doi-asserted-by":"publisher","unstructured":"Chung JS, Lee B, Han I (2019) Who said that?: Audio-visual speaker diarisation of real-world meetings. In: Kubin G, Kacic Z (eds) Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019. ISCA, pp 371\u2013375, https:\/\/doi.org\/10.21437\/INTERSPEECH.2019-3116","DOI":"10.21437\/INTERSPEECH.2019-3116"},{"key":"10792_CR11","doi-asserted-by":"publisher","unstructured":"Chung JS, Huh J, Nagrani A, et\u00a0al (2020) Spot the conversation: Speaker diarisation in the wild. In: Meng H, Xu B, Zheng TF (eds) Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020. ISCA, pp 299\u2013303, https:\/\/doi.org\/10.21437\/INTERSPEECH.2020-2337","DOI":"10.21437\/INTERSPEECH.2020-2337"},{"key":"10792_CR12","doi-asserted-by":"publisher","unstructured":"Cutler R, Davis LS (2000) Look who\u2019s talking: Speaker detection using video and audio correlation. In: 2000 IEEE International Conference on Multimedia and Expo, ICME 2000, New York, NY, USA, July 30 - August 2, 2000. IEEE Computer Society, pp 1589\u20131592, https:\/\/doi.org\/10.1109\/ICME.2000.871073","DOI":"10.1109\/ICME.2000.871073"},{"key":"10792_CR13","doi-asserted-by":"publisher","unstructured":"Datta G, Etchart T, Yadav V, et\u00a0al (2022) ASD-Transformer: Efficient active speaker detection using self and multimodal transformers. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022. IEEE, pp 4568\u20134572, https:\/\/doi.org\/10.1109\/ICASSP43922.2022.9746991","DOI":"10.1109\/ICASSP43922.2022.9746991"},{"key":"10792_CR14","doi-asserted-by":"publisher","unstructured":"Donley J, Tourbabin V, Lee J, et\u00a0al (2021) EasyCom: An augmented reality dataset to support algorithms for easy communication in noisy environments. https:\/\/doi.org\/10.48550\/ARXIV.2107.04174","DOI":"10.48550\/ARXIV.2107.04174"},{"key":"10792_CR15","doi-asserted-by":"publisher","unstructured":"Grauman K, Westbury A, Byrne E, et\u00a0al (2022) Ego4D: Around the world in 3,000 hours of egocentric video. In: IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, pp 18973\u201318990, https:\/\/doi.org\/10.1109\/CVPR52688.2022.01842","DOI":"10.1109\/CVPR52688.2022.01842"},{"key":"10792_CR16","unstructured":"Hegde SB, Zisserman A (2023) GestSync: Determining who is speaking without a talking head. In: 34th British Machine Vision Conference 2023, BMVC 2023, Aberdeen, UK, November 20-24, 2023. BMVA Press, pp 506\u2013509, http:\/\/proceedings.bmvc2023.org\/506\/"},{"key":"10792_CR17","doi-asserted-by":"publisher","unstructured":"Howard AG, Zhu M, Chen B, et\u00a0al (2017) MobileNets: Efficient convolutional neural networks for mobile vision applications. https:\/\/doi.org\/10.48550\/ARXIV.1704.04861","DOI":"10.48550\/ARXIV.1704.04861"},{"key":"10792_CR18","unstructured":"Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Bach FR, Blei DM (eds) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, JMLR Workshop and Conference Proceedings, vol\u00a037. JMLR.org, pp 448\u2013456, http:\/\/proceedings.mlr.press\/v37\/ioffe15.html"},{"key":"10792_CR19","doi-asserted-by":"publisher","unstructured":"Jiang Y, Tao R, Pan Z, et\u00a0al (2023) Target active speaker detection with audio-visual cue. In: Harte N, Carson-Berndsen J, Jones G (eds) Interspeech 2023, 24th Annual Conference of the International Speech Communication Association, Dublin, Ireland, 20 - 24 August 2023. ISCA, pp 3152\u20133156, https:\/\/doi.org\/10.21437\/INTERSPEECH.2023-574","DOI":"10.21437\/INTERSPEECH.2023-574"},{"key":"10792_CR20","doi-asserted-by":"publisher","unstructured":"Jung C, Lee S, Nam K, et\u00a0al (2024) TalkNCE: Improving active speaker detection with talk-aware contrastive learning. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 8391\u20138395, https:\/\/doi.org\/10.1109\/ICASSP48485.2024.10448124","DOI":"10.1109\/ICASSP48485.2024.10448124"},{"key":"10792_CR21","doi-asserted-by":"publisher","unstructured":"Kim YJ, Heo H, Choe S, et\u00a0al (2021) Look who\u2019s talking: Active speaker detection in the wild. In: Hermansky H, Cernock\u00fd H, Burget L, et\u00a0al (eds) Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021. ISCA, pp 3675\u20133679, https:\/\/doi.org\/10.21437\/INTERSPEECH.2021-2041","DOI":"10.21437\/INTERSPEECH.2021-2041"},{"key":"10792_CR22","doi-asserted-by":"publisher","unstructured":"K\u00f6p\u00fckl\u00fc O, Taseska M, Rigoll G (2021) How to design a three-stage architecture for audio-visual active speaker detection in the wild. In: 2021 IEEE\/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, pp 1173\u20131183, https:\/\/doi.org\/10.1109\/ICCV48922.2021.00123","DOI":"10.1109\/ICCV48922.2021.00123"},{"key":"10792_CR23","doi-asserted-by":"publisher","unstructured":"Liao J, Duan H, Feng K, et\u00a0al (2023) A light weight model for active speaker detection. In: IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, pp 22932\u201322941, https:\/\/doi.org\/10.1109\/CVPR52729.2023.02196","DOI":"10.1109\/CVPR52729.2023.02196"},{"key":"10792_CR24","doi-asserted-by":"publisher","unstructured":"Lin T, Maire M, Belongie SJ, et\u00a0al (2014) Microsoft COCO: Common objects in context. In: Fleet DJ, Pajdla T, Schiele B, et\u00a0al (eds) Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, Lecture Notes in Computer Science, vol 8693. Springer, pp 740\u2013755, https:\/\/doi.org\/10.1007\/978-3-319-10602-1_48","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"10792_CR25","doi-asserted-by":"publisher","unstructured":"Min K, Roy S, Tripathi S, et\u00a0al (2022) Learning long-term spatial-temporal graphs for active speaker detection. In: Avidan S, Brostow GJ, Ciss\u00e9 M, et\u00a0al (eds) Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXV, Lecture Notes in Computer Science, vol 13695. Springer, pp 371\u2013387, https:\/\/doi.org\/10.1007\/978-3-031-19833-5_22","DOI":"10.1007\/978-3-031-19833-5_22"},{"key":"10792_CR26","unstructured":"MMPose Contributors (2020) OpenMMLab pose estimation toolbox and benchmark. https:\/\/github.com\/open-mmlab\/mmpose"},{"key":"10792_CR27","doi-asserted-by":"publisher","unstructured":"Qian X, Madhavi MC, Pan Z, et\u00a0al (2021) Multi-target DoA estimation with an audio-visual fusion mechanism. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021. IEEE, pp 4280\u20134284, https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9413776","DOI":"10.1109\/ICASSP39728.2021.9413776"},{"key":"10792_CR28","doi-asserted-by":"publisher","first-page":"942","DOI":"10.1109\/TMM.2021.3061800","volume":"24","author":"X Qian","year":"2022","unstructured":"Qian X, Brutti A, Lanz O et al (2022) Audio-visual tracking of concurrent speakers. IEEE Transactions on Multimedia 24:942\u2013954. https:\/\/doi.org\/10.1109\/TMM.2021.3061800","journal-title":"IEEE Transactions on Multimedia"},{"key":"10792_CR29","doi-asserted-by":"publisher","unstructured":"Qu L, Weber C, Wermter S (2020) Multimodal target speech separation with voice and face references. In: Meng H, Xu B, Zheng TF (eds) Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020. ISCA, pp 1416\u20131420, https:\/\/doi.org\/10.21437\/INTERSPEECH.2020-1697","DOI":"10.21437\/INTERSPEECH.2020-1697"},{"key":"10792_CR30","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-024-18457-9","author":"A Radman","year":"2024","unstructured":"Radman A, Laaksonen J (2024) AS-Net: Active speaker detection using deep audio-visual attention. Multimedia Tools and Applications. https:\/\/doi.org\/10.1007\/s11042-024-18457-9","journal-title":"Multimedia Tools and Applications"},{"key":"10792_CR31","doi-asserted-by":"publisher","unstructured":"Roth J, Chaudhuri S, Klejch O, et\u00a0al (2020) AVA Active Speaker: An audio-visual dataset for active speaker detection. In: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. IEEE, pp 4492\u20134496, https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9053900","DOI":"10.1109\/ICASSP40776.2020.9053900"},{"key":"10792_CR32","doi-asserted-by":"publisher","DOI":"10.1109\/TBIOM.2024.3412821","author":"T Roxo","year":"2024","unstructured":"Roxo T, Costa JC, In\u00e1cio PRM et al (2024) WASD: A wilder active speaker detection dataset. IEEE Transactions on Biometrics, Behavior, and Identity Science. https:\/\/doi.org\/10.1109\/TBIOM.2024.3412821","journal-title":"IEEE Transactions on Biometrics, Behavior, and Identity Science"},{"key":"10792_CR33","doi-asserted-by":"publisher","unstructured":"Shahid M, Beyan C, Murino V (2021) S-VVAD: Visual voice activity detection by motion segmentation. In: IEEE Winter Conference on Applications of Computer Vision, WACV 2021, Waikoloa, HI, USA, January 3-8, 2021. IEEE, pp 2331\u20132340, https:\/\/doi.org\/10.1109\/WACV48630.2021.00238","DOI":"10.1109\/WACV48630.2021.00238"},{"key":"10792_CR34","doi-asserted-by":"publisher","first-page":"7825","DOI":"10.1109\/TMM.2022.3229975","volume":"25","author":"R Sharma","year":"2023","unstructured":"Sharma R, Somandepalli K, Narayanan S (2023) Cross modal video representations for weakly supervised active speaker localization. IEEE Transactions on Multimedia 25:7825\u20137836. https:\/\/doi.org\/10.1109\/TMM.2022.3229975","journal-title":"IEEE Transactions on Multimedia"},{"key":"10792_CR35","doi-asserted-by":"publisher","unstructured":"Stefanov K, Sugimoto A, Beskow J (2016) Look who\u2019s talking: visual identification of the active speaker in multi-party human-robot interaction. In: Truong KP, Heylen D, Nishida T, et\u00a0al (eds) Proceedings of the 2nd Workshop on Advancements in Social Signal Processing for Multimodal Interaction, ASSP4MI@ICMI 2016, Tokyo, Japan, November 12 - 16, 2016. ACM, pp 22\u201327, https:\/\/doi.org\/10.1145\/3005467.3005470","DOI":"10.1145\/3005467.3005470"},{"key":"10792_CR36","doi-asserted-by":"publisher","unstructured":"Stefanov K, Beskow J, Salvi G (2017) Vision-based active speaker detection in multiparty interaction. In: Salvi G, Dupont S (eds) Proceedings of GLU 2017 International Workshop on Grounding Language Understanding. ISCA, pp 47\u201351, https:\/\/doi.org\/10.21437\/GLU.2017-10","DOI":"10.21437\/GLU.2017-10"},{"key":"10792_CR37","doi-asserted-by":"publisher","unstructured":"Sun K, Xiao B, Liu D, et\u00a0al (2019) Deep high-resolution representation learning for human pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation \/ IEEE, pp 5693\u20135703, https:\/\/doi.org\/10.1109\/CVPR.2019.00584","DOI":"10.1109\/CVPR.2019.00584"},{"key":"10792_CR38","doi-asserted-by":"publisher","unstructured":"Tao R, Pan Z, Das RK, et\u00a0al (2021) Is someone speaking?: Exploring long-term temporal features for audio-visual active speaker detection. In: Shen HT, Zhuang Y, Smith JR, et\u00a0al (eds) MM \u201921: ACM Multimedia Conference, Virtual Event, China, October 20 - 24, 2021. ACM, pp 3927\u20133935, https:\/\/doi.org\/10.1145\/3474085.3475587","DOI":"10.1145\/3474085.3475587"},{"issue":"11","key":"10792_CR39","doi-asserted-by":"publisher","first-page":"1608","DOI":"10.1109\/TCSVT.2008.2005602","volume":"18","author":"H Vajaria","year":"2008","unstructured":"Vajaria H, Sarkar S, Kasturi R (2008) Exploring co-occurence between speech and body movement for audio-guided video localization. IEEE Transactions on Circuits and Systems for Video Technology 18(11):1608\u20131617. https:\/\/doi.org\/10.1109\/TCSVT.2008.2005602","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"10792_CR40","doi-asserted-by":"crossref","unstructured":"Wang X, Cheng F, Bertasius G, et\u00a0al (2024) LoCoNet: Long-short context network for active speaker detection. In: 2024 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 17-21, 2024. Computer Vision Foundation \/ IEEE, pp 18462\u201318472, https:\/\/openaccess.thecvf.com\/content\/CVPR2024\/html\/Wang_LoCoNet_Long-Short_Context_Network_for_Active_Speaker_Detection_CVPR_2024_paper.html","DOI":"10.1109\/CVPR52733.2024.01747"},{"key":"10792_CR41","doi-asserted-by":"publisher","unstructured":"Wuerkaixi A, Zhang Y, Duan Z, et\u00a0al (2022) Rethinking audio-visual synchronization for active speaker detection. In: 32nd IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2022, Xi\u2019an, China, August 22-25, 2022. IEEE, pp 1\u20136, https:\/\/doi.org\/10.1109\/MLSP55214.2022.9943352","DOI":"10.1109\/MLSP55214.2022.9943352"},{"key":"10792_CR42","doi-asserted-by":"publisher","first-page":"5800","DOI":"10.1109\/TMM.2022.3199109","volume":"25","author":"J Xiong","year":"2023","unstructured":"Xiong J, Zhou Y, Zhang P et al (2023) Look &listen: Multi-modal correlation learning for active speaker detection and speech enhancement. IEEE Transactions on Multimedia 25:5800\u20135812. https:\/\/doi.org\/10.1109\/TMM.2022.3199109","journal-title":"IEEE Transactions on Multimedia"},{"key":"10792_CR43","doi-asserted-by":"publisher","unstructured":"Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: McIlraith SA, Weinberger KQ (eds) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. AAAI Press, pp 7444\u20137452, https:\/\/doi.org\/10.1609\/AAAI.V32I1.12328","DOI":"10.1609\/AAAI.V32I1.12328"},{"key":"10792_CR44","unstructured":"Zhang Y, Xiao J, Yang S, et\u00a0al (2019) Multi-task learning for audio-visual active speaker detection. https:\/\/static.googleusercontent.com\/media\/research.google.com\/en\/\/ava\/2019\/Multi_Task_Learning_for_Audio_Visual_Active_Speaker_Detection.pdf"},{"key":"10792_CR45","unstructured":"Zhang Y, Liang S, Yang S, et\u00a0al (2021a) ICTCAS-UCAS-TAL submission to the AVA-ActiveSpeaker task at ActivityNet Challenge 2021. https:\/\/static.googleusercontent.com\/media\/research.google.com\/en\/\/ava\/2021\/S1_ICTCAS-UCAS-TAL.pdf"},{"key":"10792_CR46","doi-asserted-by":"publisher","unstructured":"Zhang Y, Liang S, Yang S, et\u00a0al (2021b) UniCon: Unified context network for robust active speaker detection. In: Shen HT, Zhuang Y, Smith JR, et\u00a0al (eds) MM \u201921: ACM Multimedia Conference, Virtual Event, China, October 20 - 24, 2021. ACM, pp 3964\u20133972, https:\/\/doi.org\/10.1145\/3474085.3475275","DOI":"10.1145\/3474085.3475275"},{"key":"10792_CR47","unstructured":"Zhang Y, Liang S, Yang S, et\u00a0al (2022) UniCon+: ICTCAS-UCAS submission to the AVA-ActiveSpeaker task at ActivityNet Challenge 2022. https:\/\/static.googleusercontent.com\/media\/research.google.com\/en\/\/ava\/2022\/S1_ICTCAS_UCAS_UniCon+.pdf"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-024-10792-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-024-10792-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-024-10792-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,7]],"date-time":"2025-02-07T09:01:19Z","timestamp":1738918879000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-024-10792-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,14]]},"references-count":47,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,2]]}},"alternative-id":["10792"],"URL":"https:\/\/doi.org\/10.1007\/s00521-024-10792-0","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"type":"print","value":"0941-0643"},{"type":"electronic","value":"1433-3058"}],"subject":[],"published":{"date-parts":[[2024,12,14]]},"assertion":[{"value":"26 July 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 November 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 December 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no conflict of interest to declare that are relevant to the content of this article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics Approval"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to Participate"}},{"value":"Not applicable.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for Publication"}},{"value":"FabuLight-ASD\u2019s code and model weights are available at .","order":6,"name":"Ethics","group":{"name":"EthicsHeading","label":"Code Availability"}}]}}