{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T18:15:18Z","timestamp":1775067318608,"version":"3.50.1"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,3,14]],"date-time":"2025-03-14T00:00:00Z","timestamp":1741910400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,3,14]],"date-time":"2025-03-14T00:00:00Z","timestamp":1741910400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"d.hip"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Hum-Cent Intell Syst"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>This study presents a speech and spoken language analysis framework, leveraging a robust, end-to-end deep learning model developed in our prior work. The framework represents a foundational step towards a comprehensive solution for analyzing speech articulation and spoken language. Unlike traditional approaches that rely on separate specialized models, our architecture integrates multiple prediction tasks into a single multi-task learning setup: nine articulatory trajectories, a phoneme sequence, and phoneme alignment. While conceptually distinct, these outputs share a strong underlying relation: phonemes, as the fundamental building blocks of language, emerge from specific articulatory configurations, and phoneme alignment provides crucial temporal structure. We bridge the gap between abstract linguistic representations and their physical realizations by integrating phoneme recognition, articulatory trajectory prediction, and phoneme alignment within a single deep learning framework. Phonemes, as abstract speech units, manifest as concrete articulatory gestures, which can be precisely captured through EMA and analyzed using deep learning methods. This integration lays the foundation for diverse applications, including intelligibility assessment and therapeutic feedback. Extensive experiments validate the model\u2019s capabilities and demonstrate its potential in real-world contexts. These include evaluations of articulatory and phoneme-related metrics, intelligibility estimation using phoneme error rates, and open vocabulary keyword spotting. A case study on stroke-related datasets highlights how the framework provides detailed articulatory feedback and supports therapy progress tracking. While not a complete solution, this work shows that an integrated, end-to-end deep learning approach can effectively address multiple facets of speech analysis. Ultimately, it serves as a foundation for developing scalable and robust frameworks to tackle challenges in speech and language processing.<\/jats:p>","DOI":"10.1007\/s44230-025-00094-6","type":"journal-article","created":{"date-parts":[[2025,3,14]],"date-time":"2025-03-14T09:09:03Z","timestamp":1741943343000},"page":"103-122","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Towards End-to-End Speech Articulation and Spoken Language Analysis Using Deep Learning"],"prefix":"10.1007","volume":"5","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6225-1348","authenticated-orcid":false,"given":"Tobias","family":"Weise","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kubilay Can","family":"Demir","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Paula Andrea","family":"P\u00e9rez-Toro","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tomas","family":"Arias-Vergara","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andreas","family":"Maier","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Elmar","family":"N\u00f6th","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Maria","family":"Schuster","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bj\u00f6rn","family":"Heismann","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Seung Hee","family":"Yang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,3,14]]},"reference":[{"issue":"5","key":"94_CR1","doi-asserted-by":"publisher","first-page":"1042","DOI":"10.1016\/j.neuron.2018.04.031","volume":"98","author":"J Chartier","year":"2018","unstructured":"Chartier J, Anumanchipalli GK, Johnson K, et al. Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron. 2018;98(5):1042\u201354.","journal-title":"Neuron"},{"issue":"5","key":"94_CR2","first-page":"3074","volume":"121","author":"SS Narayanan","year":"2004","unstructured":"Narayanan SS, Nayak KS, Lee S, Sethy A, Byrd D. Magnetic resonance imaging and electromagnetic articulography database for speech research. J Acoust Soc Am. 2004;121(5):3074\u201382.","journal-title":"J Acoust Soc Am"},{"issue":"1","key":"94_CR3","first-page":"47","volume":"13","author":"H Tanabe","year":"2021","unstructured":"Tanabe H, Ertan A, Byrd D, Narayanan SS. Vocal tract imaging using real-time MRI for articulatory phonetics and speech science research. Phon Speech Sci. 2021;13(1):47\u201359.","journal-title":"Phon. Speech Sci"},{"issue":"6\u20137","key":"94_CR4","doi-asserted-by":"publisher","first-page":"605","DOI":"10.1080\/02699200500113943","volume":"19","author":"BM Bernhardt","year":"2005","unstructured":"Bernhardt BM, Gick B, Bacsfalvi P, Adler-Bock M. Ultrasound in speech therapy with adolescents and adults. Clin Linguist Phon. 2005;19(6\u20137):605\u201317. https:\/\/doi.org\/10.1080\/02699200500113943.","journal-title":"Clin Linguist Phon"},{"key":"94_CR5","unstructured":"Bradlow AR, Lee E-K. The use of MRI and ultrasound technology in teaching about Spanish and general phonetics and pronunciation. In: Proceedings of the 6th International Conference on Spanish Phonetics and Phonology, 2015."},{"issue":"3","key":"94_CR6","doi-asserted-by":"publisher","first-page":"299","DOI":"10.1016\/S0095-4470(19)30376-6","volume":"18","author":"CP Browman","year":"1990","unstructured":"Browman CP, Goldstein L. Gestural specification using dynamically-defined articulatory structures. J Phon. 1990;18(3):299\u2013320.","journal-title":"J Phon"},{"key":"94_CR7","unstructured":"Ji A. Speaker independent acoustic-to-articulatory inversion. PhD thesis, Marquette University, 2014."},{"issue":"1","key":"94_CR8","doi-asserted-by":"publisher","first-page":"19","DOI":"10.1016\/0167-6393(94)90055-8","volume":"14","author":"RS McGowan","year":"1994","unstructured":"McGowan RS. Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: preliminary model tests. Speech Commun. 1994;14(1):19\u201348.","journal-title":"Speech Commun"},{"key":"94_CR9","doi-asserted-by":"crossref","unstructured":"Weise T, et al. Speaker-and text-independent estimation of articulatory movements and phoneme alignments from speech. 2024. arXiv preprint (INTERSPEECH) arXiv:2407.03132.","DOI":"10.21437\/Interspeech.2024-1208"},{"key":"94_CR10","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1016\/j.ymeth.2018.07.007","volume":"151","author":"N Cummins","year":"2018","unstructured":"Cummins N, Baird A, Schuller BW. Speech analysis for health: current state-of-the-art and the increasing impact of deep learning. Methods. 2018;151:41\u201354.","journal-title":"Methods"},{"issue":"1","key":"94_CR11","doi-asserted-by":"publisher","first-page":"186","DOI":"10.1186\/s13195-022-01131-3","volume":"14","author":"Q Yang","year":"2022","unstructured":"Yang Q, Li X, Ding X, Xu F, Ling Z. Deep learning-based speech analysis for Alzheimer\u2019s disease detection: a literature review. Alzheimer\u2019s Res Therapy. 2022;14(1):186.","journal-title":"Alzheimer\u2019s Res Therapy"},{"key":"94_CR12","first-page":"448","volume":"25","author":"A-R Mohamed","year":"2012","unstructured":"Mohamed A-R, Dahl GE, Hinton G. Deep belief networks for phoneme recognition. Adv Neural Inf Process Syst. 2012;25:448\u201356.","journal-title":"Adv Neural Inf Process Syst"},{"key":"94_CR13","first-page":"165","volume":"77","author":"Z Wu","year":"2016","unstructured":"Wu Z, King S, Renals S. Articulatory-to-acoustic conversion using deep neural networks. Speech Commun. 2016;77:165\u201377.","journal-title":"Speech Commun"},{"key":"94_CR14","first-page":"213","volume":"15","author":"F Rudzicz","year":"2012","unstructured":"Rudzicz F. Dysarthria recognition using deep belief networks. Int J Speech Technol. 2012;15:213\u201327.","journal-title":"Int J Speech Technol"},{"key":"94_CR15","unstructured":"Schatz T, Mitra V, Feldman N, Goldwater S. Phoneme recognition using articulatory features and deep learning. In: Proceedings of INTERSPEECH 2013."},{"key":"94_CR16","first-page":"990","volume":"22","author":"A Senior","year":"2014","unstructured":"Senior A, Narayanan A. Computational challenges in deep learning for speech recognition. IEEE Trans Audio Speech Lang Process. 2014;22:990\u20131000.","journal-title":"IEEE Trans Audio Speech Lang Process"},{"issue":"1","key":"94_CR17","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1023\/A:1007379606734","volume":"28","author":"R Caruana","year":"1997","unstructured":"Caruana R. Multitask learning. Mach Learn. 1997;28(1):41\u201375.","journal-title":"Mach Learn"},{"key":"94_CR18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TASLP.2019.2917338","volume":"27","author":"H Liu","year":"2019","unstructured":"Liu H, Chen Y, Yu D. Self-supervised learning for speech representation learning: a survey. IEEE Trans Speech Audio Process. 2019;27:1\u201310.","journal-title":"IEEE Trans Speech Audio Process"},{"key":"94_CR19","first-page":"3171","volume":"32","author":"Y Ren","year":"2019","unstructured":"Ren Y, Ruan Y, Tan X, Qin T, Zhao S, Zhao Z, Liu T-Y. FastSpeech: fast, robust and controllable text to speech. Adv Neural Inf Process Syst. 2019;32:3171\u201380.","journal-title":"Adv Neural Inf Process Syst"},{"key":"94_CR20","first-page":"12449","volume":"33","author":"A Baevski","year":"2020","unstructured":"Baevski A, et al. wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv Neural Inf Process Syst. 2020;33:12449\u201360.","journal-title":"Adv Neural Inf Process Syst"},{"key":"94_CR21","unstructured":"Hendrycks D, Gimpel K. Gaussian error linear units (GELUS). 2016. arXiv preprint arXiv:1606.08415."},{"key":"94_CR22","first-page":"5998","volume":"30","author":"A Vaswani","year":"2017","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:5998\u20136008.","journal-title":"Adv Neural Inf Process Syst"},{"key":"94_CR23","doi-asserted-by":"crossref","unstructured":"Graves A. Connectionist temporal classification. In: Supervised sequence labelling with recurrent neural networks. Springer; 2012. p. 61\u201393.","DOI":"10.1007\/978-3-642-24797-2_7"},{"key":"94_CR24","doi-asserted-by":"crossref","unstructured":"Zhu J, Zhang C, Jurgens D. Phone-to-audio alignment without text: a semi-supervised approach. In: ICASSP 2022. IEEE. p. 8167\u201371.","DOI":"10.1109\/ICASSP43922.2022.9746112"},{"key":"94_CR25","doi-asserted-by":"crossref","unstructured":"Badlani R, \u0141a\u0144cucki A, Shih KJ, Valle R, Ping W, Catanzaro B. One TTS alignment to rule them all. In: ICASSP 2022, 2022. IEEE. p. 6092\u20136.","DOI":"10.1109\/ICASSP43922.2022.9747707"},{"key":"94_CR26","unstructured":"Shih KJ, Valle R, et al. Rad-TTS: parallel flow-based TTS with robust alignment learning and diverse synthesis. In: ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021."},{"key":"94_CR27","doi-asserted-by":"crossref","unstructured":"Parrot M, Millet J, Dunbar E. Independent and automatic evaluation of acoustic-to-articulatory inversion models. In: Proc. Interspeech 2020.","DOI":"10.21437\/Interspeech.2020-1746"},{"issue":"5","key":"94_CR28","doi-asserted-by":"publisher","first-page":"425","DOI":"10.1016\/j.specom.2009.01.004","volume":"51","author":"A Maier","year":"2009","unstructured":"Maier A, Haderlein T, Eysholdt U, Rosanowski F, Batliner A, Schuster M, N\u00f6th E. Peaks\u2014a system for the automatic evaluation of voice and speech disorders. Speech Commun. 2009;51(5):425\u201337.","journal-title":"Speech Commun"},{"key":"94_CR29","unstructured":"Klumpp P, et al. Common phone: a multilingual dataset for robust acoustic modelling. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, p. 763\u20138."},{"key":"94_CR30","unstructured":"Ardila R, Branson M, Davis K, Henretty M, Kohler M, Meyer J, et al. Common voice: a massively-multilingual speech corpus. 2019. arXiv preprint arXiv:1912.06670."},{"key":"94_CR31","unstructured":"Garofolo JS. TIMIT acoustic phonetic continuous speech corpus. Linguistic Data Consortium, 1993, 1993."},{"issue":"5\u2013 Supplement","key":"94_CR32","doi-asserted-by":"publisher","first-page":"3580","DOI":"10.1121\/1.4987629","volume":"141","author":"M Tiede","year":"2017","unstructured":"Tiede M, Espy-Wilson CY, et al. Quantifying kinematic aspects of reduction in a contrasting rate production task. J Acoust Soc Am. 2017;141(5\u2013 Supplement):3580.","journal-title":"J Acoust Soc Am"},{"key":"94_CR33","doi-asserted-by":"publisher","first-page":"523","DOI":"10.1007\/s10579-011-9145-0","volume":"46","author":"F Rudzicz","year":"2012","unstructured":"Rudzicz F, Namasivayam AK, Wolff T. The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Lang Resour Eval. 2012;46:523\u201341.","journal-title":"Lang Resour Eval"},{"key":"94_CR34","doi-asserted-by":"crossref","unstructured":"Wu P, Chen L-W, Cho CJ, Watanabe S, Goldstein L, Black AW, Anumanchipalli GK. Speaker-independent acoustic-to-articulatory speech inversion. 2023. arXiv:2302.06774 [eess.AS].","DOI":"10.1109\/ICASSP49357.2023.10096796"},{"key":"94_CR35","doi-asserted-by":"crossref","unstructured":"Kim H, Hasegawa-Johnson M, Perlman A, Gunderson J, Watkin K, Frame S. Dysarthric speech database for universal access research. 2008:1741\u20134","DOI":"10.21437\/Interspeech.2008-480"},{"key":"94_CR36","doi-asserted-by":"publisher","first-page":"326","DOI":"10.1016\/j.csl.2017.01.005","volume":"45","author":"T Kisler","year":"2017","unstructured":"Kisler T, Reichel U, Schiel F. Multilingual processing of speech via web services. Comput Speech Lang. 2017;45:326\u201347.","journal-title":"Comput Speech Lang"},{"key":"94_CR37","doi-asserted-by":"crossref","unstructured":"Seneviratne N, Sivaraman G, Espy-Wilson CY. Multi-corpus acoustic-to-articulatory speech inversion. In: Interspeech, 2019, p. 859\u201363.","DOI":"10.21437\/Interspeech.2019-3168"},{"key":"94_CR38","unstructured":"Berndt DJ, Clifford J. Using dynamic time warping to find patterns in time series. In: KDD Workshop, vol. 10. Seattle; 1994. p. 359\u201370."},{"issue":"1","key":"94_CR39","doi-asserted-by":"publisher","first-page":"1","DOI":"10.5334\/labphon.237","volume":"12","author":"T Rebernik","year":"2021","unstructured":"Rebernik T, Jacobi J, Jonkers R, Noiray A, Wieling M. A review of data collection practices using electromagnetic articulography. Lab Phonol. 2021;12(1):1\u201334. https:\/\/doi.org\/10.5334\/labphon.237.","journal-title":"Lab Phonol"},{"key":"94_CR40","doi-asserted-by":"crossref","unstructured":"Schu G, et al. On using the UA-Speech and TORGO databases to validate automatic dysarthric speech classification approaches. 2022. arXiv preprint arXiv:2211.08833.","DOI":"10.1109\/ICASSP49357.2023.10095981"},{"key":"94_CR41","doi-asserted-by":"publisher","first-page":"1147","DOI":"10.1109\/TNSRE.2022.3169814","volume":"30","author":"AA Joshy","year":"2022","unstructured":"Joshy AA, Rajan R. Automated dysarthria severity classification: a study on acoustic features and deep learning techniques. IEEE Trans Neural Syst Rehabil Eng. 2022;30:1147\u201357.","journal-title":"IEEE Trans Neural Syst Rehabil Eng"},{"key":"94_CR42","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2024.103047","volume":"158","author":"F Javanmardi","year":"2024","unstructured":"Javanmardi F, Kadiri SR, Alku P. Pre-trained models for detection and severity level classification of dysarthria from speech. Speech Commun. 2024;158: 103047.","journal-title":"Speech Commun"},{"key":"94_CR43","unstructured":"Klumpp P. Phonetic transfer learning from healthy references for the analysis of pathological speech. PhD thesis, Friedrich-Alexander Universit\u00e4t Erlangen-N\u00fcrnberg, 2024."}],"container-title":["Human-Centric Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44230-025-00094-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44230-025-00094-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44230-025-00094-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,23]],"date-time":"2025-04-23T13:02:49Z","timestamp":1745413369000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44230-025-00094-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,14]]},"references-count":43,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,3]]}},"alternative-id":["94"],"URL":"https:\/\/doi.org\/10.1007\/s44230-025-00094-6","relation":{},"ISSN":["2667-1336"],"issn-type":[{"value":"2667-1336","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,14]]},"assertion":[{"value":"9 December 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 February 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 March 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}