{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,9]],"date-time":"2026-05-09T08:08:35Z","timestamp":1778314115217,"version":"3.51.4"},"reference-count":51,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2021,9,18]],"date-time":"2021-09-18T00:00:00Z","timestamp":1631923200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Spanish Ministry of Economy, Industry and Competitiveness","award":["TEC2017-84395-P"],"award-info":[{"award-number":["TEC2017-84395-P"]}]},{"name":"Spanish Ministry of Economy, Industry and Competitiveness","award":["TEC2017-84593-C2-1-R"],"award-info":[{"award-number":["TEC2017-84593-C2-1-R"]}]},{"DOI":"10.13039\/501100006318","name":"Universidad Carlos III de Madrid","doi-asserted-by":"publisher","award":["Strategic Action 2018\/00071\/001"],"award-info":[{"award-number":["Strategic Action 2018\/00071\/001"]}],"id":[{"id":"10.13039\/501100006318","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Speech intelligibility is a crucial element in oral communication that can be influenced by multiple elements, such as noise, channel characteristics, or speech disorders. In this paper, we address the task of speech intelligibility classification (SIC) in this last circumstance. Taking as a starting point our previous work, a SIC system based on an attentional long short-term memory (LSTM) network, we address the problem of inadequate learning of the attention weights due to training data scarcity. 
To overcome this issue, the main contribution of this paper is a novel type of weighted pooling (WP) mechanism, called saliency pooling, where the WP weights are not automatically learned during the training process of the network, but are obtained from an external source of information, Kalinli\u2019s auditory saliency model. In this way, it is intended to take advantage of the apparent symmetry between the human auditory attention mechanism and the attentional models integrated into deep learning networks. The developed systems are assessed on the UA-speech dataset that comprises speech uttered by subjects with several dysarthria levels. Results show that all the systems with saliency pooling significantly outperform a reference support vector machine (SVM)-based system and LSTM-based systems with mean pooling and attention pooling, suggesting that Kalinli\u2019s saliency can be successfully incorporated into the LSTM architecture as an external cue for the estimation of the speech intelligibility level.<\/jats:p>","DOI":"10.3390\/sym13091728","type":"journal-article","created":{"date-parts":[[2021,9,21]],"date-time":"2021-09-21T22:35:20Z","timestamp":1632263720000},"page":"1728","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["An Auditory Saliency Pooling-Based LSTM Model for Speech Intelligibility Classification"],"prefix":"10.3390","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9322-3128","authenticated-orcid":false,"given":"Ascensi\u00f3n","family":"Gallardo-Antol\u00edn","sequence":"first","affiliation":[{"name":"Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Avda. 
de la Universidad, 30, Legan\u00e9s, 28911 Madrid, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7908-5400","authenticated-orcid":false,"given":"Juan M.","family":"Montero","sequence":"additional","affiliation":[{"name":"Speech Technology Group, E.T.S.I. Telecomunicaci\u00f3n, Universidad Polit\u00e9cnica de Madrid, Avda. de la Complutense, 30, 28040 Madrid, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,9,18]]},"reference":[{"key":"ref_1","first-page":"309","article-title":"Dysarthric speech: A comparison of computerized speech recognition and listener intelligibility","volume":"34","author":"Doyle","year":"1997","journal-title":"J. Rehabil. Res. Dev."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1016\/S0021-9924(02)00065-5","article-title":"Intelligibility as a linear combination of dimensions in dysarthric speech","volume":"35","year":"2002","journal-title":"J. Commun. Disord."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"622","DOI":"10.1016\/j.specom.2011.03.007","article-title":"Characterization of atypical vocal source excitation, temporal dynamics, and prosody for objective measurement of dysarthric word intelligibility","volume":"54","author":"Falk","year":"2012","journal-title":"Speech Commun."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"408","DOI":"10.3109\/17549507.2014.927922","article-title":"Automatic Assessment of Speech Intelligibility for Individuals With Aphasia","volume":"16","author":"Landa","year":"2014","journal-title":"Int. J. Speech-Lang. Pathol."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1246","DOI":"10.1044\/1092-4388(2010\/09-0121)","article-title":"Discriminating dysarthria type from envelope modulation spectra","volume":"53","author":"Liss","year":"2010","journal-title":"J. Speech Lang. Hear. 
Res."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Sarria-Paja, M., and Falk, T. (2012, January 9\u201313). Automated dysarthria severity classification for improved objective intelligibility assessment of spastic dysarthric speech. Proceedings of the 13th Annual Conference of the International Speech Communication Association (INTERSPEECH), Portland, OR, USA.","DOI":"10.21437\/Interspeech.2012-26"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1016\/j.bbe.2013.10.003","article-title":"Classification of speech intelligibility in Parkinson\u2019s disease","volume":"34","author":"Khan","year":"2014","journal-title":"Biocybern. Biomed. Eng."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"103976","DOI":"10.1016\/j.engappai.2020.103976","article-title":"An attention Long Short-Term Memory based system for automatic classification of speech intelligibility","volume":"96","year":"2020","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Tripathi, A., Bhosale, S., and Kopparapu, S.K. (2020, January 4\u20138). Improved Speaker Independent Dysarthria Intelligibility Classification Using Deepspeech Posteriors. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054492"},{"key":"ref_10","first-page":"88","article-title":"Developing A Model for Predicting the Speech Intelligibility of South Korean Children with Cochlear Implantation using a Random Forest Algorithm","volume":"9","author":"Byeon","year":"2018","journal-title":"Int. J. Adv. Comput. Sci. 
Appl."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1016\/j.neucom.2021.05.065","article-title":"On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification","volume":"456","author":"Montero","year":"2021","journal-title":"Neurocomputing"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Hummel, R., Chan, W.Y., and Falk, T.H. (2011, January 27\u201331). Spectral Features for Automatic Blind Intelligibility Estimation of Spastic Dysarthric Speech. Proceedings of the Interspeech 2011, Florence, Italy.","DOI":"10.21437\/Interspeech.2011-755"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zlotnik, A., Montero, J.M., San-Segundo, R., and Gallardo-Antol\u00edn, A. (2015, January 6\u201310). Random Forest-Based Prediction of Parkinson\u2019s Disease Progression Using Acoustic, ASR and Intelligibility Features. Proceedings of the Interspeech 2015, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-184"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Kao, C.C., Sun, M., Wang, W., and Wang, C. (2020, January 4\u20138). A Comparison of Pooling Methods on LSTM Models for Rare Acoustic Event Classification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053150"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Yu, D., and Deng, L. (2014). Automatic Speech Recognition\u2014A Deep Learning Approach, Springer.","DOI":"10.1007\/978-1-4471-5779-3"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Huang, C.W., and Narayanan, S.S. (2016, January 8\u201312). Attention Assisted Discovery of Sub-Utterance Structure in Speech Emotion Recognition. 
Proceedings of the Interspeech 2016, San Francisco, CA, USA.","DOI":"10.21437\/Interspeech.2016-448"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Mirsamadi, S., Barsoum, E., and Zhang, C. (2017, January 5\u20139). Automatic speech emotion recognition using recurrent neural networks with local attention. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952552"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Lieskovsk\u00e1, E., Jakubec, M., Jarina, R., and Chmul\u00edk, M. (2021). A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism. Electronics, 10.","DOI":"10.3390\/electronics10101163"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Gallardo-Antol\u00edn, A., and Montero, J.M. (2019, January 15\u201319). A Saliency-Based Attention LSTM Model for Cognitive Load Classification from Speech. Proceedings of the Interspeech 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-1603"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Gallardo-Antol\u00edn, A., and Montero, J.M. (2019). External Attention LSTM Models for Cognitive Load Classification from Speech, Springer. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).","DOI":"10.1007\/978-3-030-31372-2_12"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Gallardo-Antol\u00edn, A., and Montero, J.M. (2021). Detecting Deception from Gaze and Speech Using a Multimodal Attention LSTM-Based Framework. Appl. Sci., 11.","DOI":"10.3390\/app11146393"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Geng, M., Liu, S., Yu, J., Xie, X., Hu, S., Ye, Z., Jin, Z., Liu, X., and Meng, H. (September, January 30). Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition. 
Proceedings of the Interspeech 2021, Brno, Czech Republic.","DOI":"10.21437\/Interspeech.2021-60"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"390","DOI":"10.1109\/JSTSP.2019.2949912","article-title":"Spectro-Temporal Representation of Speech for Intelligibility Assessment of Dysarthria","volume":"14","author":"Chandrashekar","year":"2020","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"322","DOI":"10.1109\/JSTSP.2020.2967652","article-title":"Automatic Assessment of Sentence-Level Dysarthria Intelligibility Using BLSTM","volume":"14","author":"Bhat","year":"2020","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_25","first-page":"577","article-title":"Attention-Based Models for Speech Recognition","volume":"Volume 1","author":"Chorowski","year":"2015","journal-title":"Proceedings of the 28th International Conference on Neural Information Processing Systems-NIPS\u201915"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Zacarias-Morales, N., Pancardo, P., Hern\u00e1ndez-Nolasco, J.A., and Garcia-Constantino, M. (2021). Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review. Symmetry, 13.","DOI":"10.3390\/sym13020214"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1379","DOI":"10.1016\/j.specom.2006.07.007","article-title":"Band-pass filtering of the time sequences of spectral parameters for robust wireless speech recognition","volume":"48","year":"2006","journal-title":"Speech Commun."},{"key":"ref_28","unstructured":"Anderson, R. (2004). Cognitive Psychology and Its Implications, Worth Publishers."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"D202","DOI":"10.2741\/Alain","article-title":"Selectively attending to auditory objects","volume":"5","author":"Alain","year":"2000","journal-title":"Front. Biosci. J. 
Virtual Libr."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1943","DOI":"10.1016\/j.cub.2005.09.040","article-title":"Mechanisms for allocating auditory attention: An auditory saliency map","volume":"15","author":"Kayser","year":"2005","journal-title":"Curr. Biol."},{"key":"ref_31","unstructured":"Tsuchida, T., and Cottrell, G. (2012, January 1\u20134). Auditory saliency using natural statistics. Proceedings of the 34th Annual Meeting of the Cognitive Science Society, Sapporo, Japan."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Schauerte, B., and Stiefelhagen, R. (2013, January 26\u201331). \u201cWow!\u201d Bayesian surprise for salient acoustic event detection. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638898"},{"key":"ref_33","first-page":"1","article-title":"Modelling auditory attention","volume":"372","author":"Kaya","year":"2017","journal-title":"Philos. Trans. R. Soc. B"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1016\/j.eswa.2018.07.018","article-title":"Echoic log-surprise: A multi-scale scheme for acoustic saliency detection","volume":"114","year":"2018","journal-title":"Expert Syst. Appl."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Kalinli, O., and Narayanan, S.S. (2007, January 27\u201331). A saliency-based auditory attention model with applications to unsupervised prominent syllable detection in speech. Proceedings of the Interspeech 2007, Antwerp, Belgium.","DOI":"10.21437\/Interspeech.2007-44"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Kalinli, O., and Narayanan, S.S. (2008, January 22\u201326). Combining task-dependent information with auditory attention cues for prominence detection in speech. 
Proceedings of the Interspeech 2008, Brisbane, Australia.","DOI":"10.21437\/Interspeech.2008-329"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"1009","DOI":"10.1109\/TASL.2009.2014795","article-title":"Prominence Detection Using Auditory Attention Cues and Task-Dependent High Level Information","volume":"17","author":"Kalinli","year":"2009","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Harding, S., Cooke, M., and K\u00f6nig, P. (2007, January 8). Auditory Gist Perception: An Alternative to Attentional Selection of Auditory Streams?. Proceedings of the WAPCV 2007, Hyderabad, India.","DOI":"10.1007\/978-3-540-77343-6_26"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Kim, H., Hasegawa-Johnson, M., Perlman, A., Gunderson, J., Huang, T.S., Watkin, K., and Frame, S. (2008, January 22\u201326). Dysarthric speech database for universal access research. Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH), ISCA, Brisbane, Australia.","DOI":"10.21437\/Interspeech.2008-480"},{"key":"ref_40","unstructured":"Macaluso, E. (2021, August 05). MT_TOOLS: Computation of Saliency and Feature-Specific Maps. Available online: https:\/\/www.brainreality.eu\/mt_tools."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"340","DOI":"10.1016\/S1364-6613(00)01704-6","article-title":"On the role of space and time in auditory processing","volume":"5","author":"Shamma","year":"2001","journal-title":"Trends Cogn. Sci."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-term Memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_43","first-page":"115","article-title":"Learning Precise Timing with LSTM Recurrent Networks","volume":"3","author":"Gers","year":"2003","journal-title":"J. Mach. Learn. 
Res."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Huang, C., and Narayanan, S. (2017, January 10\u201314). Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition. Proceedings of the ICME 2017, Hong Kong, China.","DOI":"10.1109\/ICME.2017.8019296"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Guo, J., Xu, N., Li, L.J., and Alwan, A. (2017, January 20\u201324). Attention based CLDNNs for short-duration acoustic scene classification. Proceedings of the Interspeech 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-440"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Kalinli, O., Sundaram, S., and Narayanan, S. (2009, January 5\u20137). Saliency-driven unstructured acoustic scene classification using latent perceptual indexing. Proceedings of the 2009 IEEE International Workshop on Multimedia Signal Processing, Rio de Janeiro, Brazil.","DOI":"10.1109\/MMSP.2009.5293267"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"V\u00e1zquez-Romero, A., and Gallardo-Antol\u00edn, A. (2020). Automatic Detection of Depression in Speech Using Ensemble Convolutional Neural Networks. Entropy, 22.","DOI":"10.3390\/e22060688"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Piczak, K.J. (2015, January 17\u201320). Environmental sound classification with convolutional neural networks. Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA.","DOI":"10.1109\/MLSP.2015.7324337"},{"key":"ref_49","unstructured":"McFee, B., Lostanlen, V., McVicar, M., Metsai, A., Balke, S., Thom\u00e9, C., Raffel, C., Malek, A., Lee, D., and Zalkow, F. (2021, August 05). LibROSA\/LibROSA: 0.7.2. Available online: https:\/\/librosa.org."},{"key":"ref_50","unstructured":"Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., and Devin, M. (2021, August 05). 
TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Available online: https:\/\/www.tensorflow.org."},{"key":"ref_51","unstructured":"Chollet, F. (2021, August 05). Keras. Available online: https:\/\/keras.io."}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/13\/9\/1728\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:01:41Z","timestamp":1760166101000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/13\/9\/1728"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,18]]},"references-count":51,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2021,9]]}},"alternative-id":["sym13091728"],"URL":"https:\/\/doi.org\/10.3390\/sym13091728","relation":{},"ISSN":["2073-8994"],"issn-type":[{"value":"2073-8994","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,9,18]]}}}