{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T18:10:24Z","timestamp":1776881424425,"version":"3.51.2"},"reference-count":37,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,4,8]],"date-time":"2021-04-08T00:00:00Z","timestamp":1617840000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,4,8]],"date-time":"2021-04-08T00:00:00Z","timestamp":1617840000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100010661","name":"Horizon 2020 Framework Programme","doi-asserted-by":"publisher","award":["871245"],"award-info":[{"award-number":["871245"]}],"id":[{"id":"10.13039\/100010661","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100006245","name":"Ministry of Science and Technology, Israel","doi-asserted-by":"publisher","award":["3-16416"],"award-info":[{"award-number":["3-16416"]}],"id":[{"id":"10.13039\/501100006245","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In this study, we present a deep neural network-based online multi-speaker localization algorithm based on a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, time-frequency (TF) bin is dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. Elaborated experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we further show that the proposed method is also capable of separating moving speakers by the application of the obtained TF masks.<\/jats:p>","DOI":"10.1186\/s13636-021-00203-w","type":"journal-article","created":{"date-parts":[[2021,4,8]],"date-time":"2021-04-08T13:03:17Z","timestamp":1617886997000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":30,"title":["Dynamically localizing multiple speakers based on the time-frequency domain"],"prefix":"10.1186","volume":"2021","author":[{"given":"Hodaya","family":"Hammer","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shlomo E.","family":"Chazan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jacob","family":"Goldberger","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2885-170X","authenticated-orcid":false,"given":"Sharon","family":"Gannot","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2021,4,8]]},"reference":[{"issue":"3","key":"203_CR1","doi-asserted-by":"publisher","first-page":"276","DOI":"10.1109\/TAP.1986.1143830","volume":"34","author":"R. Schmidt","year":"1986","unstructured":"R. Schmidt, Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag.34(3), 276\u2013280 (1986).","journal-title":"IEEE Trans. Antennas Propag."},{"key":"203_CR2","unstructured":"J. P. Dmochowski, J. Benesty, S. Affes, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. Broadband MUSIC: opportunities and challenges for multiple source localization, (2007), pp. 18\u201321."},{"key":"203_CR3","unstructured":"M. S. Brandstein, H. F. Silverman, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1. A robust method for speech signal time-delay estimation in reverberant rooms, (1997), pp. 375\u2013378."},{"key":"203_CR4","doi-asserted-by":"publisher","first-page":"1620","DOI":"10.1109\/TASLP.2020.2990485","volume":"28","author":"C. Evers","year":"2020","unstructured":"C. Evers, H. W. L\u00f6llmann, H. Mellmann, A. Schmidt, H. Barfuss, P. A. Naylor, W. Kellermann, The LOCATA challenge: acoustic source localization and tracking. IEEE\/ACM Trans. Audio Speech Lang. Process.28:, 1620\u20131643 (2020).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"issue":"1","key":"203_CR5","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1109\/JSTSP.2019.2901664","volume":"13","author":"S. Chakrabarty","year":"2019","unstructured":"S. Chakrabarty, E. A. P. Habets, Multi-speaker DOA estimation using deep convolutional networks trained with noise signals. IEEE J. Sel. Top. Signal Process.13(1), 8\u201321 (2019).","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"203_CR6","unstructured":"X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, H. Li, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A learning-based approach to direction of arrival estimation in noisy and reverberant environments, (2015), pp. 2814\u20132818."},{"key":"203_CR7","unstructured":"F. Vesperini, P. Vecchiotti, E. Principi, S. Squartini, F. Piazza, in IEEE International Workshop on Machine Learning for Signal Processing (MLSP). A neural network based algorithm for speaker localization in a multi-room environment, (2016), pp. 1\u20136."},{"key":"203_CR8","unstructured":"R. Takeda, K. Komatani, in IEEE Spoken Language Technology Workshop (SLT). Discriminative multiple sound source localization based on deep neural networks using independent location model, (2016), pp. 603\u2013609."},{"key":"203_CR9","unstructured":"H. Pujol, E. Bavu, A. Garcia, in International Congress on Acoustics (ICA). Source localization in reverberant rooms using deep learning and microphone arrays, (2019), pp. 1\u20138."},{"issue":"10","key":"203_CR10","doi-asserted-by":"publisher","first-page":"3418","DOI":"10.3390\/s18103418","volume":"18","author":"J. M. Vera-Diaz","year":"2018","unstructured":"J. M. Vera-Diaz, D. Pizarro, J. Macias-Guarasa, Towards end-to-end acoustic localization using deep learning: from audio signals to source position coordinates. Sensors. 18(10), 3418 (2018).","journal-title":"Sensors"},{"key":"203_CR11","unstructured":"S. Chakrabarty, E. A. P. Habets, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Broadband DOA estimation using convolutional neural networks trained with noise signals, (2017), pp. 136\u2013140."},{"key":"203_CR12","unstructured":"S. Rickard, O. Yilmaz, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). On the approximate W-disjoint orthogonality of speech, (2002), pp. 3049\u20133052."},{"key":"203_CR13","doi-asserted-by":"crossref","unstructured":"O. Ronneberger, P. Fischer, T. Brox, in International Conference on Medical Image Computing and Computer-Assisted Intervention. U-net: convolutional networks for biomedical image segmentation (Springer, 2015), pp. 234\u2013241.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"203_CR14","unstructured":"O. Ernst, S. E. Chazan, S. Gannot, J. Goldberger, in The 26th European Signal Processing Conference (EUSIPCO). Speech dereverberation using fully convolutional networks, (2018), pp. 390\u2013394."},{"key":"203_CR15","doi-asserted-by":"crossref","unstructured":"S. E. Chazan, H. Hammer, G. Hazan, J. Goldberger, S. Gannot, in European Signal Processing Conference (EUSIPCO). Multi-microphone speaker separation based on deep DOA estimation, (2019).","DOI":"10.23919\/EUSIPCO.2019.8903121"},{"key":"203_CR16","unstructured":"S. D. Grechkov, V. P. Semenov, A. A. Bezrukov, in IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus). Comparative analysis of the usage of neural networks for sound processing, (2020), pp. 1389\u20131391."},{"key":"203_CR17","unstructured":"Y. Zhang, Q. Duan, Y. Liao, J. Liu, R. Wu, B. Xie, in The 4th International Conference on Mechanical, Control and Computer Engineering (ICMCCE). Research on speech enhancement algorithm based on SA-Unet, (2019), pp. 818\u20138183."},{"key":"203_CR18","unstructured":"R. Giri, U. Isik, A. Krishnaswamy, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Attention wave-U-net for speech enhancement, (2019), pp. 249\u2013253."},{"key":"203_CR19","unstructured":"E. Hadad, F. Heese, P. Vary, S. Gannot, in International Workshop on Acoustic Signal Enhancement (IWAENC). Multichannel audio database in various acoustic environments, (2014), pp. 313\u2013317."},{"issue":"8","key":"203_CR20","doi-asserted-by":"publisher","first-page":"1614","DOI":"10.1109\/78.934132","volume":"49","author":"S. Gannot","year":"2001","unstructured":"S. Gannot, D. Burshtein, E. Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process.49(8), 1614\u20131626 (2001).","journal-title":"IEEE Trans. Signal Process."},{"key":"203_CR21","unstructured":"S. Stenzel, J. Freudenberger, G. Schmidt, in 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA). A minimum variance beamformer for spatially distributed microphones using a soft reference selection, (2014), pp. 127\u2013131."},{"issue":"7","key":"203_CR22","doi-asserted-by":"publisher","first-page":"1830","DOI":"10.1109\/TSP.2004.828896","volume":"52","author":"O. Yilmaz","year":"2004","unstructured":"O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process.52(7), 1830\u20131847 (2004).","journal-title":"IEEE Trans. Signal Process."},{"issue":"12","key":"203_CR23","doi-asserted-by":"publisher","first-page":"2481","DOI":"10.1109\/TPAMI.2016.2644615","volume":"39","author":"V. Badrinarayanan","year":"2017","unstructured":"V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell.39(12), 2481\u20132495 (2017).","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"203_CR24","unstructured":"O. Ronneberger, P. Fischer, T. Brox, in International Conference on Medical Image Computing and Computer-assisted Intervention. U-net: convolutional networks for biomedical image segmentation, (2015), pp. 234\u2013241."},{"issue":"10","key":"203_CR25","doi-asserted-by":"publisher","first-page":"1702","DOI":"10.1109\/TASLP.2018.2842159","volume":"26","author":"D. Wang","year":"2018","unstructured":"D. Wang, J. Chen, Supervised speech separation based on deep learning: an overview. IEEE\/ACM Trans. Audio Speech Lang. Process.26(10), 1702\u20131726 (2018).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"issue":"4","key":"203_CR26","doi-asserted-by":"publisher","first-page":"943","DOI":"10.1121\/1.382599","volume":"65","author":"J. B. Allen","year":"1979","unstructured":"J. B. Allen, D. A. Berkley, Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am.65(4), 943\u2013950 (1979).","journal-title":"J. Acoust. Soc. Am."},{"key":"203_CR27","unstructured":"D. B. Paul, J. M. Baker, in Workshop on Speech and Natural Language. The design for the Wall Street Journal-based CSR corpus, (1992), pp. 357\u2013362. https:\/\/www.aclweb.org\/anthology\/H92-1073."},{"key":"203_CR28","unstructured":"D. P. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"issue":"6","key":"203_CR29","doi-asserted-by":"publisher","first-page":"1071","DOI":"10.1109\/TASL.2009.2016395","volume":"17","author":"S. Markovich-Golan","year":"2009","unstructured":"S. Markovich-Golan, S. Gannot, I. Cohen, Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals. IEEE Trans. Audio Speech Lang. Process.17(6), 1071\u20131086 (2009). https:\/\/doi.org\/10.1109\/TASL.2009.2016395.","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"issue":"5","key":"203_CR30","doi-asserted-by":"publisher","first-page":"825","DOI":"10.1109\/JSTSP.2015.2415761","volume":"9","author":"I. Dokmani\u0107","year":"2015","unstructured":"I. Dokmani\u0107, R. Scheibler, M. Vetterli, Raking the cocktail party. IEEE J Sel. Top. Signal Process.9(5), 825\u2013836 (2015).","journal-title":"IEEE J Sel. Top. Signal Process."},{"issue":"01","key":"203_CR31","doi-asserted-by":"publisher","first-page":"1440003","DOI":"10.1142\/S0129065714400036","volume":"25","author":"A. Deleforge","year":"2015","unstructured":"A. Deleforge, F. Forbes, R. Horaud, Acoustic space learning for sound-source separation and localization on binaural manifolds. Int. J. Neural Syst.25(01), 1440003 (2015).","journal-title":"Int. J. Neural Syst."},{"issue":"8","key":"203_CR32","doi-asserted-by":"publisher","first-page":"1393","DOI":"10.1109\/TASLP.2016.2555085","volume":"24","author":"B. Laufer-Goldshtein","year":"2016","unstructured":"B. Laufer-Goldshtein, R. Talmon, S. Gannot, Semi-supervised sound source localization based on manifold regularization. IEEE\/ACM Trans. Audio Speech Lang. Process.24(8), 1393\u20131407 (2016).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"203_CR33","unstructured":"S. E. Chazan, L. Wolf, E. Nachmani, Y. Adi, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Single channel voice separation for unknown number of speakers under reverberant and noisy settings, (2021). https:\/\/doi.org\/http:\/\/arxiv.org\/abs\/2011.02329."},{"issue":"4","key":"203_CR34","doi-asserted-by":"publisher","first-page":"692","DOI":"10.1109\/TASLP.2016.2647702","volume":"25","author":"S. Gannot","year":"2017","unstructured":"S. Gannot, E. Vincent, S. Markovich-Golan, A. Ozerov, A consolidated perspective on multimicrophone speech enhancement and source separation. IEEE\/ACM Trans. Audio Speech Lang. Process.25(4), 692\u2013730 (2017). Invited tutorial paper.","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"203_CR35","first-page":"1","volume":"2006","author":"S. Gannot","year":"2006","unstructured":"S. Gannot, T. G. Dvorkind, Microphone array speaker localizers using spatial-temporal information. EURASIP J. Adv. Signal Process.2006:, 1\u201317 (2006).","journal-title":"EURASIP J. Adv. Signal Process."},{"issue":"4","key":"203_CR36","doi-asserted-by":"publisher","first-page":"725","DOI":"10.1109\/TASLP.2018.2790707","volume":"26","author":"B. Laufer-Goldshtein","year":"2018","unstructured":"B. Laufer-Goldshtein, R. Talmon, S. Gannot, A hybrid approach for speaker tracking based on TDOA and data-driven models. IEEE\/ACM Trans. Audio Speech Lang. Process.26(4), 725\u2013735 (2018).","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"issue":"2","key":"203_CR37","doi-asserted-by":"publisher","first-page":"590","DOI":"10.1177\/1475921718762154","volume":"18","author":"A. K. Das","year":"2019","unstructured":"A. K. Das, T. T. Lai, C. W. Chan, C. K. Y. Leung, A new non-linear framework for localization of acoustic sources. Struct. Health Monit.18(2), 590\u2013601 (2019).","journal-title":"Struct. Health Monit."}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00203-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s13636-021-00203-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00203-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,4,8]],"date-time":"2021-04-08T13:11:52Z","timestamp":1617887512000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-021-00203-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,4,8]]},"references-count":37,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["203"],"URL":"https:\/\/doi.org\/10.1186\/s13636-021-00203-w","relation":{},"ISSN":["1687-4722"],"issn-type":[{"value":"1687-4722","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,4,8]]},"assertion":[{"value":"1 December 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 March 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 April 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"All authors agree to the publication in this journal.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"16"}}