{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,22]],"date-time":"2025-12-22T12:53:52Z","timestamp":1766408032430,"version":"build-2065373602"},"reference-count":24,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2024,12,30]],"date-time":"2024-12-30T00:00:00Z","timestamp":1735516800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>To address the interdependence of local time-frequency information in audio scene recognition, a segment-based time-frequency feature fusion method based on cross-attention is proposed. Since audio scene recognition is highly sensitive to individual sound events within a scene, the input features are segmented into multiple segments along the time dimension to obtain local features, allowing the subsequent attention mechanism to focus on the time slices of key sound events. Furthermore, to leverage the advantages of both convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which are mainstream structures in audio scene recognition tasks, this paper employs a symmetry structure to separately obtain the time-frequency features output by CNNs and RNNs and then fuses the two sets of features using cross-attention. Experiments on the TUT2018, TAU2019, and TAU2020 datasets demonstrate that the performance of this algorithm improves the official baseline results by 17.78%, 15.95%, and 20.13%, respectively.<\/jats:p>","DOI":"10.3390\/sym17010049","type":"journal-article","created":{"date-parts":[[2024,12,31]],"date-time":"2024-12-31T14:21:12Z","timestamp":1735654872000},"page":"49","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Local Time-Frequency Feature Fusion Using Cross-Attention for Acoustic Scene Classification"],"prefix":"10.3390","volume":"17","author":[{"given":"Rong","family":"Huang","sequence":"first","affiliation":[{"name":"Information Construction and Management Office, Nanjing University of Posts and Telecommunications, Nanjing 210049, China"}]},{"given":"Yue","family":"Xie","sequence":"additional","affiliation":[{"name":"School of Communication and Artificial Intelligence, School of Integrated Circuits, Nanjing Institute of Technology, Nanjing 211167, China"}]},{"given":"Pengxu","family":"Jiang","sequence":"additional","affiliation":[{"name":"School of Information Science and Engineering, Southeast University, Nanjing 210018, China"}]}],"member":"1968","published-online":{"date-parts":[[2024,12,30]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Surendiran, J., Prabhakar, P.B.E., Ibrahim, M.M., Saritha, G., K, S., and Vijayan, V.B. (2024, January 8\u20139). A Systemic Review on Automatic Acoustic Scene Classification. Proceedings of the 2024 International Conference on Power, Energy, Control and Transmission Systems (ICPECTS), Chennai, India.","DOI":"10.1109\/ICPECTS62210.2024.10780377"},{"key":"ref_2","first-page":"20","article-title":"Auditory context awareness via wearable computing","volume":"400","author":"Clarkson","year":"1998","journal-title":"Energy"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1016\/S0003-682X(97)00105-9","article-title":"Automatic classification of environmental noise events by hidden Markov models","volume":"54","author":"Couvreur","year":"1998","journal-title":"Appl. Acoust."},{"key":"ref_4","unstructured":"Abuirbaiha, R.A.A., Lee, C.-H., and Lien, C.C. (2024, January 9\u201311). Acoustic Scene Classification Using Perceptually Weighted Log Mel Spectrogram and Buttom-Up Broadcast Neural Network. Proceedings of the 2024 International Conference on Consumer Electronics-Taiwan (ICCE-Taiwan), Taichung, Taiwan."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Eronen, A., Tuomi, J., Klapuri, A., Fagerlund, S., Sorsa, T., Lorho, G., and Huopaniemi, J. (2003, January 6\u201310). Audio-based context awareness-acoustic modeling and perceptual evaluation. Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP \u201903), Hong Kong, China.","DOI":"10.1109\/ASPAA.2003.1285814"},{"key":"ref_6","unstructured":"Heittola, T., Mesaros, A., Eronen, A.J., and Virtanen, T. (2010, January 23\u201327). Audio context recognition using audio event histogram. Proceedings of the 2010 18th European Signal Processing Conference, Aalborg, Denmark."},{"key":"ref_7","unstructured":"Valenti, M., Diment, A., Parascandolo, G., Squartini, S., and Virtanen, T. (2016, January 3). DCASE 2016 acoustic scene classification using convolutional neural network. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016, Budapest, Hungary."},{"key":"ref_8","unstructured":"Xu, Y., Huang, Q., Wang, W., and Plumbley, M.D. (2016, January 3). Hierarchical learning for DNN-based acoustic scene classification. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016, Budapest, Hungary."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Basbug, A.M., and Sert, M. (February, January 30). Acoustic Scene Classification Using Spatial Pyramid Pooling with Convolutional Neural Networks. Proceedings of the 2019 IEEE 13th International Conference on Semantic Computing (ICSC), Newport Beach, CA, USA.","DOI":"10.1109\/ICOSC.2019.8665547"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"950","DOI":"10.1109\/LSP.2020.2996085","article-title":"Learning Temporal Relations from Semantic Neighbors for Acoustic Scene Classification","volume":"27","author":"Zhang","year":"2020","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Cai, Y., Zhang, P., and Li, S. (2024, January 14\u201319). TF-SepNet: An Efficient 1D Kernel Design in Cnns for Low-Complexity Acoustic Scene Classification. Proceedings of the ICASSP 2024\u20142024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10447999"},{"key":"ref_12","unstructured":"Z\u00f6hrer, M., and Pernkopf, F. (2016, January 3). Gated recurrent networks applied to acoustic scene classification and acoustic event detection. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016, Budapest, Hungary."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Li, Y., Li, X., Zhang, Y., Wang, W., Liu, M., and Feng, X. (2018, January 16\u201317). Acoustic Scene Classification Using Deep Audio Feature and BLSTM Network. Proceedings of the 2018 International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, China.","DOI":"10.1109\/ICALIP.2018.8455765"},{"key":"ref_14","unstructured":"Vij, D., and Aggarwal, N. (2017, January 16). Performance evaluation of deep learning architectures for acoustic scene classification. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017, Munich, Germany."},{"key":"ref_15","unstructured":"Hao, W., Zhao, L., Zhang, Q., Zhao, H., and Wang, J. (2018, January 19\u201320). DCASE 2018 task 1a: Acoustic scene classification by bi-LSTM-CNN-net multichannel fusion. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018, Surrey, UK."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Cho, K., Van Merri\u00ebnboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_18","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS\u201917), Long Beach, CA, USA."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1733","DOI":"10.1109\/TMM.2015.2428998","article-title":"Detection and classification of acoustic scenes and events","volume":"17","author":"Stowell","year":"2015","journal-title":"IEEE Trans. Multimed."},{"key":"ref_20","unstructured":"Naranjo-Alcazar, J., Perez-Castanos, S., Zuccarello, P., and Cobos, M. (2024, October 24). DCASE 2019: CNN Depth Analysis with Different Channel Inputs for Acoustic Scene Classification. DCASE2019 Challenge, Tech. Rep. June 2019. Available online: https:\/\/dcase.community\/documents\/challenge2019\/technical_reports\/DCASE2019_Naranjo-Alcazar_13.pdf."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Wang, Y., Feng, C., and Anderson, D.V. (2021, January 6\u201311). A Multi-Channel Temporal Attention Convolutional Neural Network Model for Environmental Sound Classification. Proceedings of the ICASSP 2021\u20142021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9413498"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Shim, H.J., Jung, J.W., Kim, J.H., and Yu, H.J. (2020). Capturing scattered discriminative information using a deep architecture in acoustic scene classification. arXiv.","DOI":"10.3390\/app11188361"},{"key":"ref_23","unstructured":"Vilouras, K. (2024, October 24). Acoustic Scene Classification Using Fully Convolutional Neural Networks and Per-Channel Energy Normalization. DCASE 2020 Challenge, Tech. Rep. June 2020. Available online: https:\/\/dcase.community\/documents\/challenge2020\/technical_reports\/DCASE2020_Vilouras_3.pdf."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"222","DOI":"10.1109\/TEVC.2022.3185543","article-title":"A Genetic Algorithm Approach to Automate Architecture Design for Acoustic Scene Classification","volume":"27","author":"Hasan","year":"2023","journal-title":"IEEE Trans. Evol. Comput."}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/17\/1\/49\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:57:09Z","timestamp":1760115429000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/17\/1\/49"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,30]]},"references-count":24,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,1]]}},"alternative-id":["sym17010049"],"URL":"https:\/\/doi.org\/10.3390\/sym17010049","relation":{},"ISSN":["2073-8994"],"issn-type":[{"type":"electronic","value":"2073-8994"}],"subject":[],"published":{"date-parts":[[2024,12,30]]}}}