{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,14]],"date-time":"2025-10-14T00:34:43Z","timestamp":1760402083139,"version":"build-2065373602"},"reference-count":38,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2021,4,10]],"date-time":"2021-04-10T00:00:00Z","timestamp":1618012800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia (FCT) - Portugal","award":["UIDB\/00048\/2020"],"award-info":[{"award-number":["UIDB\/00048\/2020"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Applied Sciences"],"abstract":"<jats:p>Human beings have developed fantastic abilities to integrate information from various sensory sources exploring their inherent complementarity. Perceptual capabilities are therefore heightened, enabling, for instance, the well-known \"cocktail party\" and McGurk effects, i.e., speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, neuroscience has successfully identified the superior colliculus region in the brain as the one responsible for this modality fusion, with a handful of biological models having been proposed to approach its underlying neurophysiological process. Deriving inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability can have a wide range of applications, from teleconferencing systems to social robotics. The detection approach initially routes auditory and visual information through two specialized neural network structures. 
The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates spatial neuron cross-mapping of unimodal perceptual fields. The validation process employed two publicly available datasets, with achieved results confirming and greatly surpassing initial expectations.<\/jats:p>","DOI":"10.3390\/app11083397","type":"journal-article","created":{"date-parts":[[2021,4,12]],"date-time":"2021-04-12T03:04:06Z","timestamp":1618196646000},"page":"3397","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Bio-Inspired Modality Fusion for Active Speaker Detection"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4015-4111","authenticated-orcid":false,"given":"Gustavo","family":"Assun\u00e7\u00e3o","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, University of Coimbra, 3004-531 Coimbra, Portugal"},{"name":"Institute of Systems and Robotics, 3030-194 Coimbra, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1854-049X","authenticated-orcid":false,"given":"Nuno","family":"Gon\u00e7alves","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, University of Coimbra, 3004-531 Coimbra, Portugal"},{"name":"Institute of Systems and Robotics, 3030-194 Coimbra, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4903-3554","authenticated-orcid":false,"given":"Paulo","family":"Menezes","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, University of Coimbra, 3004-531 Coimbra, Portugal"},{"name":"Institute of Systems and Robotics, 3030-194 Coimbra, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2021,4,10]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Nagrani, A., Chung, J.S., and Zisserman, A. (2017). 
Voxceleb: A large-scale speaker identification dataset. arXiv.","DOI":"10.21437\/Interspeech.2017-950"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3197517.3201357","article-title":"Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation","volume":"37","author":"Ephrat","year":"2018","journal-title":"ACM Trans. Graph."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1608","DOI":"10.1109\/TCSVT.2008.2005602","article-title":"Exploring co-occurrence between speech and body movement for audio-guided video localization","volume":"18","author":"Vajaria","year":"2008","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1557","DOI":"10.1109\/TASL.2006.878256","article-title":"An overview of automatic speaker diarization systems","volume":"14","author":"Tranter","year":"2006","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1016\/0167-6393(94)00056-G","article-title":"Study of a voice activity detector and its influence on a noise reduction system","volume":"16","author":"Faucon","year":"1995","journal-title":"Speech Comm."},{"key":"ref_6","unstructured":"Liu, D., and Kubala, F. (2003, January 6\u201310). Online speaker clustering. Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 Proceedings, (ICASSP \u201903), Hong Kong, China."},{"key":"ref_7","unstructured":"Maraboina, S., Kolossa, D., Bora, P.K., and Orglmeister, R. (2006, January 4\u20138). Multi-speaker voice activity detection using ICA and beampattern analysis. Proceedings of the 2006 14th European Signal Processing Conference, Florence, Italy."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Bertrand, A., and Moonen, M. (2010, January 14\u201319). 
Energy-based multi-speaker voice activity detection with an ad hoc microphone array. Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA.","DOI":"10.1109\/ICASSP.2010.5496183"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"133","DOI":"10.1109\/TCSVT.2008.2009262","article-title":"Visual Lip Activity Detection and Speaker Detection Using Mouth Region Intensities","volume":"19","author":"Siatras","year":"2009","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Ahmad, R., Raza, S.P., and Malik, H. (2013, January 4\u20137). Visual Speech Detection Using an Unsupervised Learning Framework. Proceedings of the 2013 12th International Conference on Machine Learning and Applications, Miami, FL, USA.","DOI":"10.1109\/ICMLA.2013.171"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Stefanov, K., Sugimoto, A., and Beskow, J. (2016, January 16). Look who\u2019s talking: Visual identification of the active speaker in multi-party human\u2013robot interaction. Proceedings of the Workshop Advancements in Social Signal Processing for Multimodal Interaction, Tokyo, Japan.","DOI":"10.1145\/3005467.3005470"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1032","DOI":"10.1109\/TMM.2014.2305632","article-title":"Simultaneous-Speaker Voice Activity Detection and Localization Using Mid-Fusion of SVM and HMMs","volume":"16","author":"Minotto","year":"2014","journal-title":"IEEE Trans. Multimed."},{"key":"ref_13","unstructured":"Cutler, R., and Davis, L. (August, January 30). Look who\u2019s talking: Speaker detection using video and audio correlation. Proceedings of the 2000 IEEE International Conference on Multimedia and Expo, ICME2000, Proceedings, Latest Advances in the Fast Changing World of Multimedia (Cat. 
No.00TH8532), New York, NY, USA."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Chakravarty, P., Mirzaei, S., Tuytelaars, T., and Vanhamme, H. (2015, January 9). Who\u2019s speaking? audio-supervised classification of active speakers in video. Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA.","DOI":"10.1145\/2818346.2820780"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Chakravarty, P., and Tuytelaars, T. (2016). Cross-modal supervision for learning active speaker detection in video. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46454-1_18"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Stefanov, K., Beskow, J., and Salvi, G. (2017, January 25). Vision-based active speaker detection in multiparty interaction. Proceedings of the GLU 2017 International Workshop on Grounding Language Understanding, Stockholm, Sweden.","DOI":"10.21437\/GLU.2017-10"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1007\/BF00344251","article-title":"Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position","volume":"36","author":"Fukushima","year":"1980","journal-title":"Biol. Cybern."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1113\/jphysiol.1968.sp008455","article-title":"Receptive fields and functional architecture of monkey striate cortex","volume":"195","author":"Hubel","year":"1968","journal-title":"J. Physiol."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"250","DOI":"10.1109\/TCDS.2019.2927941","article-title":"Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition","volume":"12","author":"Stefanov","year":"2019","journal-title":"IEEE Trans. Cogn. Dev. 
Syst."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Ren, J., Hu, Y., Tai, Y.-W., Wang, C., Xu, L., Sun, W., and Yan, Q. (2016, January 12\u201317). Look, listen and learn\u2014A multimodal LSTM for speaker identification. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA.","DOI":"10.1609\/aaai.v30i1.10471"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Cech, J., Mittal, R., Deleforge, A., Sanchez-Riera, J.X., Alameda-Pineda, X., and Horaud, R. (2013, January 15\u201317). Active-speaker detection and localization with microphones and cameras embedded into a robotic head. Proceedings of the 2013 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids), Atlanta, GA, USA.","DOI":"10.1109\/HUMANOIDS.2013.7029977"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Gebru, I.D., Alameda-Pineda, X., Horaud, R., and Forbes, F. (2014, January 21\u201324). Audio-visual speaker localization via weighted clustering. Proceedings of the 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Reims, France.","DOI":"10.1109\/MLSP.2014.6958874"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hoover, K., Chaudhuri, S., Pantofaru, C., Sturdy, I., and Slaney, M. (2018, January 15\u201320). Using audio-visual information to understand speaker activity: Tracking active speakers on and off screen. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461891"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1704","DOI":"10.1109\/TPAMI.2011.235","article-title":"Aggregating Local Image Descriptors into Compact Codes","volume":"34","author":"Jegou","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Roth, J., Chaudhuri, S., Klejch, O., Marvin, R., Gallagher, A.C., Kaver, L., Ramaswamy, S., Stopczynski, A., Schmid, C., and Xi, Z. (2019). AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection. arXiv.","DOI":"10.1109\/ICCVW.2019.00460"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Cao, Q., Shen, L., Xie, W., Parkhi, O., and Zisserman, A. (2018, January 15\u201319). VGGFace2: A Dataset for Recognising Faces across Pose and Age. Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition, Xi\u2019an, China.","DOI":"10.1109\/FG.2018.00020"},{"key":"ref_28","unstructured":"(2020, November 23). The Keras-VGGFace Package. Available online: https:\/\/pypi.org\/project\/keras-vggface\/."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1007\/s10827-008-0096-4","article-title":"Multisensory integration in the superior colliculus: A neural network model","volume":"26","author":"Ursino","year":"2008","journal-title":"J. Comput. Neurosci."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. (2014, January 1\u20135). Return of the devil in the details: Delving deep into convolutional nets. Proceedings of the British Machine Vision Conference, Nottingham, UK.","DOI":"10.5244\/C.28.6"},{"key":"ref_31","unstructured":"Chen, C.H. (1976). Distance measures for speech recognition, psychological and instrumental. Pattern Recognition and Artificial Intelligence, Academic."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"He, K., and Sun, J. 
(2015, January 7\u201312). Convolutional Neural Networks at Constrained Timecost. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299173"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Kankanamge, S., Fookes, C., and Sridharan, S. (2017, January 17\u201320). Facial analysis in the wild with LSTM networks. Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China.","DOI":"10.1109\/ICIP.2017.8296442"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Xu, Z., Li, S., and Deng, W. (2015, January 3\u20136). Learning temporal features using LSTM-CNN architecture for face anti-spoofing. Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia.","DOI":"10.1109\/ACPR.2015.7486482"},{"key":"ref_35","unstructured":"Chollet, F. (2020, November 23). Keras. Available online: https:\/\/keras.io."},{"key":"ref_36","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_37","unstructured":"Ioffe, S., and Szegedy, C. (2015). Batch normalization: Acceleratingdeep network training by reducing internal covariate shift. arXiv."},{"key":"ref_38","first-page":"1929","article-title":"Dropout: A simple way to prevent neural networks from overfitting","volume":"15","author":"Srivastava","year":"2014","journal-title":"J. Mach. Learn. 
Res."}],"container-title":["Applied Sciences"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2076-3417\/11\/8\/3397\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,13]],"date-time":"2025-10-13T13:26:41Z","timestamp":1760362001000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2076-3417\/11\/8\/3397"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,4,10]]},"references-count":38,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2021,4]]}},"alternative-id":["app11083397"],"URL":"https:\/\/doi.org\/10.3390\/app11083397","relation":{},"ISSN":["2076-3417"],"issn-type":[{"type":"electronic","value":"2076-3417"}],"subject":[],"published":{"date-parts":[[2021,4,10]]}}}