{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T14:03:54Z","timestamp":1767621834997,"version":"3.48.0"},"reference-count":35,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2026,1,3]],"date-time":"2026-01-03T00:00:00Z","timestamp":1767398400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Discovery Grant from Natural Sciences and Engineering Research Council (NSERC), Canada"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>In this study, we propose a novel automated model for speech quality estimation that objectively evaluates perceptual dysphonia severity and breathiness in audio samples, demonstrating strong correlation with expert ratings. The proposed model integrates Whisper encoder embeddings with Mel spectrograms augmented by second-order delta features combined with a sequential-attention fusion network feature mapping path. This hybrid approach enhances the model\u2019s sensitivity to phonetic, high-level feature representation, and spectral variations, enabling more accurate predictions of perceptual speech quality. A sequential-attention fusion network feature mapping module captures long-range dependencies through the multi-head attention network, while LSTM layers refine the learned representations by modeling temporal dynamics. Comparative analysis against state-of-the-art methods for dysphonia assessment demonstrates our model\u2019s better correlation with clinician\u2019s judgments across test samples. 
Our findings underscore the effectiveness of ASR-derived embeddings alongside the deep feature mapping structure in disordered speech quality assessment, offering a promising pathway for advancing automated evaluation systems.<\/jats:p>","DOI":"10.3390\/info17010032","type":"journal-article","created":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T10:53:50Z","timestamp":1767610430000},"page":"32","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Automated Severity and Breathiness Assessment of Disordered Speech Using a Speech Foundation Model"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2074-9863","authenticated-orcid":false,"given":"Vahid","family":"Ashkanichenarlogh","sequence":"first","affiliation":[{"name":"National Centre for Audiology, Western University, London, ON N6A 3K7, Canada"},{"name":"Department of Electrical and Computer Engineering, Western University, London, ON N6A 3K7, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4359-5254","authenticated-orcid":false,"given":"Arman","family":"Hassanpour","sequence":"additional","affiliation":[{"name":"National Centre for Audiology, Western University, London, ON N6A 3K7, Canada"},{"name":"School of Communication Sciences and Disorders, Faculty of Health Sciences, Western University, London, ON N6A 3K7, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4305-8347","authenticated-orcid":false,"given":"Vijay","family":"Parsa","sequence":"additional","affiliation":[{"name":"National Centre for Audiology, Western University, London, ON N6A 3K7, Canada"},{"name":"Department of Electrical and Computer Engineering, Western University, London, ON N6A 3K7, Canada"},{"name":"School of Communication Sciences and Disorders, Faculty of Health Sciences, Western University, London, ON N6A 3K7, 
Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,1,3]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"183","DOI":"10.1016\/j.anl.2014.11.001","article-title":"Assessment of voice quality: Current state-of-the-art","volume":"42","author":"Barsties","year":"2015","journal-title":"Auris Nasus Larynx"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"62","DOI":"10.1044\/vvd20.2.62","article-title":"Perceptual Assessment of Voice Quality: Past, Present, and Future","volume":"20","author":"Kreiman","year":"2010","journal-title":"Perspect. Voice Voice Disord."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"856","DOI":"10.1136\/jnnp-2014-308043","article-title":"Distinct phenotypes of speech and voice disorders in Parkinson\u2019s disease after subthalamic nucleus deep brain stimulation","volume":"86","author":"Tsuboi","year":"2015","journal-title":"J. Neurol. Neurosurg. Psychiatry"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1547","DOI":"10.1007\/s00702-017-1804-x","article-title":"Early detection of speech and voice disorders in Parkinson\u2019s disease patients treated with subthalamic nucleus deep brain stimulation: A 1-year follow-up study","volume":"124","author":"Tsuboi","year":"2017","journal-title":"J. Neural Transm."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Kim, S., Le, D., Zheng, W., Singh, T., Arora, A., Zhai, X., Fuegen, C., Kalinli, O., and Seltzer, M.L. (2022). Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric. arXiv.","DOI":"10.21437\/Interspeech.2022-11144"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"846.e1","DOI":"10.1016\/j.jvoice.2022.10.020","article-title":"Automatic GRBAS Scoring of Pathological Voices using Deep Learning and a Small Set of Labeled Voice Data","volume":"39","author":"Hidaka","year":"2025","journal-title":"J. 
Voice"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1044\/1058-0360.0503.07","article-title":"Hearing and Believing","volume":"5","author":"Kent","year":"1996","journal-title":"Am. J. Speech-Lang. Pathol."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1097\/MOO.0b013e3282fe96ce","article-title":"Voice assessment: Updates on perceptual, acoustic, aerodynamic, and endoscopic imaging methods","volume":"16","author":"Mehta","year":"2008","journal-title":"Curr. Opin. Otolaryngol. Head Neck Surg."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"685","DOI":"10.1016\/j.jvoice.2022.11.014","article-title":"Clinical Use of the CAPE-V Scales: Agreement, Reliability and Notes on Voice Quality","volume":"39","author":"Nagle","year":"2025","journal-title":"J. Voice"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"2619","DOI":"10.1121\/1.3224706","article-title":"Acoustic measurement of overall voice quality: A meta-analysis","volume":"126","author":"Maryn","year":"2009","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"236","DOI":"10.1016\/j.engappai.2019.03.027","article-title":"Emulating the perceptual capabilities of a human evaluator to map the GRB scale for the assessment of voice disorders","volume":"82","year":"2019","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1016\/j.jvoice.2014.06.015","article-title":"Objective Dysphonia Measures in the Program Praat: Smoothed Cepstral Peak Prominence and Acoustic Voice Quality Index","volume":"29","author":"Maryn","year":"2015","journal-title":"J. Voice"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Leng, Y., Tan, X., Zhao, S., Soong, F., Li, X.-Y., and Qin, T. (2021, January 6\u201311). MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network. 
Proceedings of the ICASSP 2021\u20142021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9413877"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"54","DOI":"10.1109\/TASLP.2022.3205757","article-title":"Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features","volume":"31","author":"Zezario","year":"2023","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Dong, X., and Williamson, D.S. (2020, January 4\u20138). An Attention Enhanced Multi-Task Model for Objective Speech Assessment in Real-World Environments. Proceedings of the ICASSP 2020\u20142020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053366"},{"key":"ref_16","unstructured":"Zezario, R.E., Fu, S.-W., Fuh, C.-S., Tsao, Y., and Wang, H.-M. (2020, January 7\u201310). STOI-Net: A Deep Learning based Non-Intrusive Speech Intelligibility Assessment Model. Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand. Available online: https:\/\/ieeexplore.ieee.org\/abstract\/document\/9306495."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Fu, S.-W., Tsao, Y., Hwang, H.-T., and Wang, H.-M. (2018). Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM. arXiv.","DOI":"10.21437\/Interspeech.2018-1802"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Liu, Y., Yang, L.-C., Pawlicki, A., and Stamenovic, M. (2022, January 18\u201322). CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment. 
Proceedings of the Interspeech 2022, Incheon, Republic of Korea.","DOI":"10.21437\/Interspeech.2022-10857"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Kumar, A., Tan, K., Ni, Z., Manocha, P., Zhang, X., Henderson, E., and Xu, B. (2023, January 4\u201310). Torchaudio-Squim: Reference-Less Speech Quality and Intelligibility Measures in Torchaudio. Proceedings of the ICASSP 2023\u20142023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10096680"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Gao, Y., Shi, H., Chu, C., and Kawahara, T. (2024, January 14\u201319). Enhancing Two-Stage Finetuning for Speech Emotion Recognition Using Adapters. Proceedings of the ICASSP 2024\u20142024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10446645"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Gao, Y., Chu, C., and Kawahara, T. (2023, January 20\u201324). Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining. Proceedings of the INTERSPEECH 2023, ISCA, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-756"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Tian, J., Hu, D., Shi, X., He, J., Li, X., Gao, Y., Toda, T., Xu, X., and Hu, X. (2023, January 29). Semi-supervised Multimodal Emotion Recognition with Consensus Decision-making and Label Correction. Proceedings of the 1st International Workshop on Multimodal and Responsible Affective Computing, Ottawa, ON, Canada.","DOI":"10.1145\/3607865.3613182"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Dang, S., Matsumoto, T., Takeuchi, Y., and Kudo, H. (2023, January 20\u201324). Using Semi-supervised Learning for Monaural Time-domain Speech Separation with a Self-supervised Learning-based SI-SNR Estimator. 
Proceedings of the INTERSPEECH 2023, ISCA, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-85"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Sun, H., Zhao, S., Wang, X., Zeng, W., Chen, Y., and Qin, Y. (2024, January 14\u201319). Fine-Grained Disentangled Representation Learning For Multimodal Emotion Recognition. Proceedings of the ICASSP 2024\u20142024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10447667"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Cuervo, S., and Marxer, R. (2024, January 14\u201319). Speech Foundation Models on Intelligibility Prediction for Hearing-Impaired Listeners. Proceedings of the ICASSP 2024\u20142024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10447907"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Mogridge, R., Close, G., Sutherland, R., Hain, T., Barker, J., Goetze, S., and Ragni, A. (2024, January 14\u201319). Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users Using Intermediate ASR Features and Human Memory Models. 
Proceedings of the ICASSP 2024\u20142024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10447597"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"658","DOI":"10.1002\/ohn.809","article-title":"A Scoping Review of Artificial Intelligence Detection of Voice Pathology: Challenges and Opportunities","volume":"171","author":"Liu","year":"2024","journal-title":"Otolaryngol.\u2013Head Neck Surg."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1111\/1460-6984.12783","article-title":"Deep learning in automatic detection of dysphonia: Comparing acoustic features and developing a generalizable framework","volume":"58","author":"Chen","year":"2023","journal-title":"Int. J. Lang. Commun. Disord."},{"key":"ref_29","unstructured":"Garc\u00eda, M.A., and Rosset, A.L. (2022). Deep Neural Network for Automatic Assessment of Dysphonia. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Dang, S., Matsumoto, T., Takeuchi, Y., Tsuboi, T., Tanaka, Y., Nakatsubo, D., Maesawa, S., Saito, R., Katsuno, M., and Kudo, H. (2024). Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features. arXiv.","DOI":"10.21437\/Interspeech.2024-1577"},{"key":"ref_31","first-page":"1440","article-title":"A Machine-Learning Algorithm for the Automated Perceptual Evaluation of Dysphonia Severity","volume":"39","author":"Chen","year":"2023","journal-title":"J. Voice"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Lin, Y.-H., Tseng, W.-H., Chen, L.-C., Tan, C.-T., and Tsao, Y. (2024, January 6\u20138). Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice. 
Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA.","DOI":"10.1109\/ICCE59016.2024.10444177"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"4229924","DOI":"10.1155\/2023\/4229924","article-title":"Mathematical Analysis and Performance Evaluation of the GELU Activation Function in Deep Learning","volume":"2023","author":"Lee","year":"2023","journal-title":"J. Math."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"875.e15","DOI":"10.1016\/j.jvoice.2020.10.001","article-title":"Perceptual Voice Qualities Database (PVQD): Database Characteristics","volume":"36","author":"Walden","year":"2022","journal-title":"J. Voice"},{"key":"ref_35","unstructured":"Ensar, B., Searl, J., and Doyle, P. (2024, January 5\u20137). Stability of Auditory-Perceptual Judgments of Vocal Quality by Inexperienced Listeners. Proceedings of the American Speech and Hearing Convention, Seattle, WA, USA."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/1\/32\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T11:05:33Z","timestamp":1767611133000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/1\/32"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,3]]},"references-count":35,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,1]]}},"alternative-id":["info17010032"],"URL":"https:\/\/doi.org\/10.3390\/info17010032","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,3]]}}}