{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,17]],"date-time":"2026-02-17T12:15:10Z","timestamp":1771330510616,"version":"3.50.1"},"reference-count":35,"publisher":"MDPI AG","issue":"16","license":[{"start":{"date-parts":[[2024,8,18]],"date-time":"2024-08-18T00:00:00Z","timestamp":1723939200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Chongqing Technology Innovation and Application Development Project","award":["2023TIAD-GPX0007"],"award-info":[{"award-number":["2023TIAD-GPX0007"]}]},{"name":"Chongqing Technology Innovation and Application Development Project","award":["CYS23752"],"award-info":[{"award-number":["CYS23752"]}]},{"name":"Chongqing Postgraduate Research and Innovation Program 2023","award":["2023TIAD-GPX0007"],"award-info":[{"award-number":["2023TIAD-GPX0007"]}]},{"name":"Chongqing Postgraduate Research and Innovation Program 2023","award":["CYS23752"],"award-info":[{"award-number":["CYS23752"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Sound Event Detection and Localization (SELD) is a comprehensive task that aims to solve the subtasks of Sound Event Detection (SED) and Sound Source Localization (SSL) simultaneously. The task of SELD lies in the need to solve both sound recognition and spatial localization problems, and different categories of sound events may overlap in time and space, making it more difficult for the model to distinguish between different events occurring at the same time and to locate the sound source. In this study, the Dual-conv Coordinate Attention Module (DCAM) combines dual convolutional blocks and Coordinate Attention, and based on this, the network architecture based on the two-stage strategy is improved to form the SELD-oriented Two-Stage Dual-conv Coordinate Attention Model (TDCAM) for SELD. 
TDCAM draws on the concepts of Visual Geometry Group (VGG) networks and Coordinate Attention to effectively capture critical local information by focusing on the coordinate space information of the feature map and dealing with the relationship between the feature map channels to enhance the feature selection capability of the model. To address the limitation of a single-layer Bi-directional Gated Recurrent Unit (Bi-GRU) in the two-stage network in terms of timing processing, we add to the structure of the two-layer Bi-GRU and introduce the data enhancement techniques of the frequency mask and time mask to improve the modeling and generalization ability of the model for timing features. Through experimental validation on the TAU Spatial Sound Events 2019 development dataset, our approach significantly improves the performance of SELD compared to the two-stage network baseline model. Furthermore, the effectiveness of DCAM and the two-layer Bi-GRU structure is confirmed by performing ablation experiments.<\/jats:p>","DOI":"10.3390\/s24165336","type":"journal-article","created":{"date-parts":[[2024,8,19]],"date-time":"2024-08-19T06:41:31Z","timestamp":1724049691000},"page":"5336","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["A Study of Improved Two-Stage Dual-Conv Coordinate Attention Model for Sound Event Detection and Localization"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-7186-7417","authenticated-orcid":false,"given":"Guorong","family":"Chen","sequence":"first","affiliation":[{"name":"School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, No. 
20, Daxuecheng East Road, Shapingba District, Chongqing 401331, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-2664-9413","authenticated-orcid":false,"given":"Yuan","family":"Yu","sequence":"additional","affiliation":[{"name":"School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, No. 20, Daxuecheng East Road, Shapingba District, Chongqing 401331, China"}]},{"given":"Yuan","family":"Qiao","sequence":"additional","affiliation":[{"name":"School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, No. 20, Daxuecheng East Road, Shapingba District, Chongqing 401331, China"}]},{"given":"Junliang","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, No. 20, Daxuecheng East Road, Shapingba District, Chongqing 401331, China"}]},{"given":"Chongling","family":"Du","sequence":"additional","affiliation":[{"name":"School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, No. 20, Daxuecheng East Road, Shapingba District, Chongqing 401331, China"}]},{"given":"Zhang","family":"Qian","sequence":"additional","affiliation":[{"name":"School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, No. 
20, Daxuecheng East Road, Shapingba District, Chongqing 401331, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3867-900X","authenticated-orcid":false,"given":"Xiao","family":"Huang","sequence":"additional","affiliation":[{"name":"Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR 999077, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,8,18]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"7564","DOI":"10.1109\/ACCESS.2020.3048675","article-title":"Polyphonic sound event detection based on residual convolutional recurrent neural network with semi-supervised loss function","volume":"9","author":"Kim","year":"2021","journal-title":"IEEE Access"},{"key":"ref_2","unstructured":"Kumar, A., Hegde, R.M., Singh, R., and Raj, B. (2013, January 9\u201313). Event detection in short duration audio using gaussian mixture model and random forest classifier. Proceedings of the 21st European Signal Processing Conference (EUSIPCO 2013), Marrakech, Morocco."},{"key":"ref_3","unstructured":"Mesaros, A., Heittola, T., Eronen, A., and Virtanen, T. (2010, January 23\u201327). Acoustic event detection in real life recordings. Proceedings of the 2010 18th European Signal Processing Conference, Aalborg, Denmark."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1687","DOI":"10.1109\/TASLP.2024.3369529","article-title":"On Local Temporal Embedding for Semi-Supervised Sound Event Detection","volume":"32","author":"Gao","year":"2024","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"943","DOI":"10.1109\/89.966097","article-title":"Real-time passive source localization: A practical linear-correction least-squares approach","volume":"9","author":"Huang","year":"2001","journal-title":"IEEE Trans. 
Speech Audio Process."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1016\/j.sigpro.2018.05.010","article-title":"Robust AOA based acoustic source localization method with unreliable measurements","volume":"152","author":"Yan","year":"2018","journal-title":"Signal Process."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1109\/LSP.2010.2091502","article-title":"A modified SRP-PHAT functional for robust real-time sound source localization with scalable spatial sampling","volume":"18","author":"Cobos","year":"2010","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"2505913","DOI":"10.1109\/TIM.2023.3348907","article-title":"IFAN: An Icosahedral Feature Attention Network for Sound Source Localization","volume":"73","author":"Zhu","year":"2024","journal-title":"IEEE Trans. Instrum. Meas."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1109\/TITS.2015.2470216","article-title":"Audio surveillance of roads: A system for detecting anomalous sounds","volume":"17","author":"Foggia","year":"2015","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_10","unstructured":"Sun, H., Liu, X., Xu, K., Miao, J., and Luo, Q. (2021). Emergency vehicles audio detection and localization in autonomous driving. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"107949","DOI":"10.1016\/j.apacoust.2021.107949","article-title":"Optimizing passive acoustic systems for marine mammal detection and localization: Application to real-time monitoring north Atlantic right whales in Gulf of St. Lawrence","volume":"178","author":"Gervaise","year":"2021","journal-title":"Appl. Acoust."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"He, W., Motlicek, P., and Odobez, J.M. (2018, January 21\u201325). Deep neural networks for multiple speaker detection and localization. 
Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia.","DOI":"10.1109\/ICRA.2018.8461267"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1109\/MSP.2014.2326181","article-title":"Acoustic scene classification: Classifying environments from the sounds they produce","volume":"32","author":"Barchiesi","year":"2015","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Piczak, K.J. (2015, January 17\u201320). Environmental sound classification with convolutional neural networks. Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA.","DOI":"10.1109\/MLSP.2015.7324337"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Parascandolo, G., Huttunen, H., and Virtanen, T. (2016, January 20\u201325). Recurrent neural networks for polyphonic sound event detection in real life recordings. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472917"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1109\/JSTSP.2018.2885636","article-title":"Sound event localization and detection of overlapping sources using convolutional recurrent neural networks","volume":"13","author":"Adavanne","year":"2018","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Cao, Y., Kong, Q., Iqbal, T., An, F., Wang, W., and Plumbley, M.D. (2019). Polyphonic sound event detection and localization using a two-stage strategy. 
arXiv.","DOI":"10.33682\/4jhy-bj81"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1186\/s13636-023-00292-9","article-title":"Dual-branch attention module-based network with parameter sharing for joint sound event detection and localization","volume":"2023","author":"Zhou","year":"2023","journal-title":"EURASIP J. Audio Speech Music Process."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Hou, Q., Zhou, D., and Feng, J. (2021, January 20\u201325). Coordinate attention for efficient mobile network design. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01350"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Wang, K.C., Zhang, J., Huang, J., Li, Q., Sun, M.T., Sakai, K., and Ku, W.S. (2023, January 26\u201330). CA-Wav2Lip: Coordinate Attention-based Speech To Lip Synthesis In The Wild. Proceedings of the 2023 IEEE International Conference on Smart Computing (SMARTCOMP), Nashville, TN, USA.","DOI":"10.1109\/SMARTCOMP58114.2023.00018"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Jiang, Y., Yang, Q., Zhang, Y., Wu, Z., Liu, S., and Liu, S. (2024). CCAUNet: Boosting Feature Representation Using Complex Coordinate Attention for Monaural Speech Enhancement. SSRN Electron. J.","DOI":"10.2139\/ssrn.4492197"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Adavanne, S., Politis, A., and Virtanen, T. (2019). A multi-room reverberant dataset for sound event localization and detection. arXiv.","DOI":"10.33682\/1xwd-5v76"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Delikaris-Manias, S., Pavlidi, D., Mouchtaris, A., and Pulkki, V. (2017, January 5\u20139). DOA estimation with histogram analysis of spatially constrained active intensity vectors. 
Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952211"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"320","DOI":"10.1109\/TASSP.1976.1162830","article-title":"The generalized correlation method for estimation of time delay","volume":"24","author":"Knapp","year":"1976","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_25","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D., and Le, Q.V. (2019). Specaugment: A simple data augmentation method for automatic speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1007\/s10479-005-5724-z","article-title":"A tutorial on the cross-entropy method","volume":"134","author":"Kroese","year":"2005","journal-title":"Ann. Oper. Res."},{"key":"ref_28","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Mesaros, A., Heittola, T., and Virtanen, T. (2016). Metrics for polyphonic sound event detection. Appl. Sci., 6.","DOI":"10.3390\/app6060162"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Adavanne, S., Politis, A., and Virtanen, T. (2018, January 3\u20137). Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. 
Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy.","DOI":"10.23919\/EUSIPCO.2018.8553182"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1002\/nav.3800020109","article-title":"The Hungarian method for the assignment problem","volume":"2","author":"Kuhn","year":"1955","journal-title":"Nav. Res. Logist. Q."},{"key":"ref_32","unstructured":"Noh, K., Jeong-Hwan, C., Dongyeop, J., and Joon-Hyuk, C. (2019). Three-stage approach for sound event localization and detection. DCASE2019 Challenge, Technical Report."},{"key":"ref_33","unstructured":"Zhang, J., Ding, W., and He, L. (2019). Data augmentation and prior knowledge-based regularization for sound event localization and detection. DCASE2019 Challenge, Technical Report."},{"key":"ref_34","unstructured":"Nguyen, T.N.T., Jones, D.L., Ranjan, R., Jayabalan, S., and Gan, W.S. (2019). A two-step system for sound event localization and detection. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Grondin, F., Glass, J., Sobieraj, I., and Plumbley, M.D. (2019). Sound event localization and detection using CRNN on pairs of microphones. 
arXiv.","DOI":"10.33682\/4v2a-7q02"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/16\/5336\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T15:38:28Z","timestamp":1760110708000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/16\/5336"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,18]]},"references-count":35,"journal-issue":{"issue":"16","published-online":{"date-parts":[[2024,8]]}},"alternative-id":["s24165336"],"URL":"https:\/\/doi.org\/10.3390\/s24165336","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,8,18]]}}}