{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T17:18:05Z","timestamp":1767633485275,"version":"3.48.0"},"reference-count":38,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2026,1,4]],"date-time":"2026-01-04T00:00:00Z","timestamp":1767484800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Reading proficiency in early childhood is crucial for academic success and intellectual development. However, more and more children are struggling with reading. According to the last PISA study in Austria, one out of five children is dealing with reading difficulties. The reasons for this are diverse, but an application that tracks children while reading aloud and guides them when they experience difficulties could offer meaningful help. Therefore, this proposal explores a prototyping approach for a core component that tracks children\u2019s reading using a self-supervised Wav2Vec2 model with a limited amount of data. Self-supervised learning allows models to learn general representations from large amounts of unlabeled audio, which can then be fine-tuned on smaller, task-specific datasets, making it especially useful when labeled data is limited. Our model is operating on the phonetic level with the help of the International Phonetic Alphabet (IPA). To implement this, the KidsTALC dataset from the Leibniz University Hannover was used, which contains spontaneous speech recordings of German-speaking children. To enhance the training data and improve robustness, several data augmentation techniques were applied and evaluated, including pitch shifting, formant shifting, and speed variation. The models were trained using different data configurations to compare the effects of data variety and quality on recognition performance. The best model trained in this work achieved a phoneme error rate (PER) of 14.3% and a word error rate (WER) of 31.6% on unseen child speech data, demonstrating the potential of self-supervised models for such use cases.<\/jats:p>","DOI":"10.3390\/info17010040","type":"journal-article","created":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T12:38:56Z","timestamp":1767616736000},"page":"40","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Listen Closely: Self-Supervised Phoneme Tracking for Children\u2019s Reading Assessment"],"prefix":"10.3390","volume":"17","author":[{"given":"Philipp","family":"Ollmann","sequence":"first","affiliation":[{"name":"Department for Smart and Interconnected Living, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0646-782X","authenticated-orcid":false,"given":"Erik","family":"Sonnleitner","sequence":"additional","affiliation":[{"name":"Department for Smart and Interconnected Living, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2437-0589","authenticated-orcid":false,"given":"Marc","family":"Kurz","sequence":"additional","affiliation":[{"name":"Department for Smart and Interconnected Living, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-2492-5647","authenticated-orcid":false,"given":"Jens","family":"Kr\u00f6sche","sequence":"additional","affiliation":[{"name":"Department for Smart and Interconnected Living, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Stephan","family":"Selinger","sequence":"additional","affiliation":[{"name":"Department for Smart and Interconnected Living, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,1,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"359","DOI":"10.1017\/S0033291723001381","article-title":"Early-initiated childhood reading for pleasure: Associations with better cognitive performance, mental well-being and brain structure in young adolescence","volume":"54","author":"Sun","year":"2024","journal-title":"Psychol. Med."},{"key":"ref_2","unstructured":"ORF Science (2025, December 11). Lesende Kinder Werden Zufriedene Jugendliche. Available online: https:\/\/science.orf.at\/stories\/3219992\/."},{"key":"ref_3","unstructured":"Der Standard (2025, December 11). PIRLS-Studie: Jedes f\u00fcnfte Kind in \u00d6sterreich hat Probleme beim Lesen. Available online: https:\/\/www.derstandard.at\/story\/2000146458792\/."},{"key":"ref_4","unstructured":"Statista (2025, December 11). Mediennutzung von Kindern-Statistiken und Umfragen. Available online: https:\/\/de.statista.com\/themen\/2660\/mediennutzung-von-kindern\/."},{"key":"ref_5","unstructured":"The Access Center (2024, December 09). Early Reading Assessment: A Guiding Tool for Instruction. Available online: https:\/\/www.readingrockets.org\/topics\/assessment-and-evaluation\/articles\/early-reading-assessment-guiding-tool-instruction."},{"key":"ref_6","unstructured":"Professional Development Service for Teachers (PDST) (2025, December 11). Running Records. Available online: https:\/\/www.pdst.ie\/sites\/default\/files\/Running%20Records%20Final%2015%20Oct.pdf."},{"key":"ref_7","unstructured":"Bildung durch Sprache und Schrift (BiSS) (2025, April 09). Salzburger Lesescreening f\u00fcr die Schulstufen 2\u20139 (SLS 2\u20139). Available online: https:\/\/www.biss-sprachbildung.de\/btools\/salzburger-lesescreening-fuer-die-schulstufen-2-9\/."},{"key":"ref_8","unstructured":"SoapBox Labs (2025, April 09). Children\u2019s Speech Recognition & Voice Technology Solutions. Available online: https:\/\/www.soapboxlabs.com\/."},{"key":"ref_9","unstructured":"Hello Ello (2025, April 09). Read with Ello. Available online: https:\/\/www.ello.com\/."},{"key":"ref_10","unstructured":"Ravaglia, R. (2025, December 11). Microsoft Reading Coach\u2014Seamless AI Support For Students and Teachers. Forbes. Available online: https:\/\/www.forbes.com\/sites\/rayravaglia\/2024\/01\/18\/microsoft-reading-coach-seamless-ai-support-for-students-and-teachers\/."},{"key":"ref_11","unstructured":"Google (2025, April 09). Read Along. Available online: https:\/\/readalong.google.com\/."},{"key":"ref_12","first-page":"1053","article-title":"Challenges and limitations in speech recognition technology: A critical review of speech signal processing algorithms, tools and systems","volume":"135","author":"Basak","year":"2023","journal-title":"CMES-Comput. Model. Eng. Sci."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Yeung, G., and Alwan, A. (2018, January 2\u20136). On the difficulties of automatic speech recognition for kindergarten-aged children. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-2297"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"D\u2019Arcy, S., and Russell, M.J. (2005, January 4\u20138). A comparison of human and computer recognition accuracy for children\u2019s speech. Proceedings of the Interspeech 2005, Lisbon, Portugal.","DOI":"10.21437\/Interspeech.2005-697"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Bhardwaj, V., Ben Othman, M.T., Kukreja, V., Belkhier, Y., Bajaj, M., Goud, B.S., Rehman, A.U., Shafiq, M., and Hamam, H. (2022). Automatic speech recognition (asr) systems for children: A systematic literature review. Appl. Sci., 12.","DOI":"10.3390\/app12094419"},{"key":"ref_16","first-page":"111","article-title":"Challenges for computer recognition of children\u2019s speech","volume":"108","author":"Russell","year":"2007","journal-title":"SLaTE"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D., and Le, Q.V. (2019, January 15\u201319). SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Proceedings of the Interspeech 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1109\/TASSP.1978.1163055","article-title":"Dynamic programming algorithm optimization for spoken word recognition","volume":"26","author":"Sakoe","year":"1978","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_19","first-page":"524","article-title":"Dynamic time warping (dtw) algorithm in speech: A review","volume":"6","author":"Yadav","year":"2018","journal-title":"Int. J. Res. Electron. Comput. Eng."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1109\/5.18626","article-title":"A tutorial on hidden Markov models and selected applications in speech recognition","volume":"77","author":"Rabiner","year":"1989","journal-title":"Proc. IEEE"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"032033","DOI":"10.1088\/1742-6596\/1802\/3\/032033","article-title":"Study on a CNN-HMM approach for audio-based musical chord recognition","volume":"1802","author":"Li","year":"2021","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_22","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30, Curran Associates Inc."},{"key":"ref_23","unstructured":"Kushwaha, N. (2023). A Basic High-Level View of Transformer Architecture & LLMs from 35000 Feet. Python Plain Engl., Available online: https:\/\/python.plainenglish.io\/a-basic-high-level-view-of-transformer-architecture-llms-from-35000-feet-c036dd2f7c25."},{"key":"ref_24","unstructured":"Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 30, Curran Associates Inc."},{"key":"ref_25","unstructured":"Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (2020, January 5\u201310). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","article-title":"Hubert: Self-supervised speech representation learning by masked prediction of hidden units","volume":"29","author":"Hsu","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1505","DOI":"10.1109\/JSTSP.2022.3188113","article-title":"Wavlm: Large-scale self-supervised pre-training for full stack speech processing","volume":"16","author":"Chen","year":"2022","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_28","unstructured":"Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023, January 23\u201329). Robust speech recognition via large-scale weak supervision. Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA. PMLR."},{"key":"ref_29","unstructured":"Mostert, N. (2024). Implementing Wav2Vec 2.0 into an Automated Reading Tutor. [Master\u2019s Thesis, Utrecht University]."},{"key":"ref_30","unstructured":"Medin, L.B., Pellegrini, T., and Gelin, L. (2024, January 1\u20135). Self-Supervised Models for Phoneme Recognition: Applications in Children\u2019s Speech for Reading Learning. Proceedings of the 25th Interspeech Conference (Interspeech 2024), Kos Island, Greece."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Jain, R., Barcovschi, A., Yiwere, M., Corcoran, P., and Cucu, H. (2023). Adaptation of Whisper models to child speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2023-935"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Fan, R., Shankar, N.B., and Alwan, A. (2024). Benchmarking Children\u2019s ASR with Supervised and Self-supervised Speech Foundation Models. arXiv.","DOI":"10.21437\/Interspeech.2024-1353"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Rumberg, L., Gebauer, C., Ehlert, H., Wallbaum, M., Bornholt, L., Ostermann, J., and L\u00fcdtke, U. (2022, January 18\u201322). kidsTALC: A Corpus of 3- to 11-year-old German Children\u2019s Connected Natural Speech. Proceedings of the Interspeech 2022, Incheon, Republic of Korea.","DOI":"10.21437\/Interspeech.2022-330"},{"key":"ref_34","unstructured":"Calzolari, N., B\u00e9chet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., and Mariani, J. (2020, January 11\u201316). Common Voice: A Massively-Multilingual Speech Corpus. Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France."},{"key":"ref_35","unstructured":"International Phonetic Association (2025, October 21). The International Phonetic Alphabet (2020 Revision). Available online: https:\/\/www.internationalphoneticassociation.org\/IPAcharts\/IPA_chart_orig\/pdfs\/IPA_Kiel_2020_full.pdf."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"3958","DOI":"10.21105\/joss.03958","article-title":"Phonemizer: Text to Phones Transcription for Multiple Languages in Python","volume":"6","author":"Bernard","year":"2021","journal-title":"J. Open Source Softw."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., and Funtowicz, M. (2020, January 16\u201320). Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online.","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"ref_38","unstructured":"Biewald, L. (2025, December 11). Experiment Tracking with Weights and Biases. Available online: https:\/\/wandb.ai\/site\/experiment-tracking\/."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/1\/40\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T12:51:59Z","timestamp":1767617519000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/17\/1\/40"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,4]]},"references-count":38,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,1]]}},"alternative-id":["info17010040"],"URL":"https:\/\/doi.org\/10.3390\/info17010040","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2026,1,4]]}}}