{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T02:18:16Z","timestamp":1760149096919,"version":"build-2065373602"},"reference-count":50,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2023,7,3]],"date-time":"2023-07-03T00:00:00Z","timestamp":1688342400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003246","name":"Gravitation program of the Dutch Ministry of Education, Culture, and Science and the Netherlands Organization for Scientific Research","doi-asserted-by":"publisher","award":["024.001.003","HAI Summer 2021 Small Grant"],"award-info":[{"award-number":["024.001.003","HAI Summer 2021 Small Grant"]}],"id":[{"id":"10.13039\/501100003246","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Utrecht University\u2019s Human-centered Artificial Intelligence focus area","award":["024.001.003","HAI Summer 2021 Small Grant"],"award-info":[{"award-number":["024.001.003","HAI Summer 2021 Small Grant"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MTI"],"abstract":"<jats:p>There is a large interest in the annotation of speech addressed to infants. Infant-directed speech (IDS) has acoustic properties that might pose a challenge to automatic speech recognition (ASR) tools developed for adult-directed speech (ADS). While ASR tools could potentially speed up the annotation process, their effectiveness on this speech register is currently unknown. In this study, we assessed to what extent open-source ASR tools can successfully transcribe IDS. We used speech data from 21 Dutch mothers reading picture books containing target words to their 18- and 24-month-old children (IDS) and the experimenter (ADS). In Experiment 1, we examined how the ASR tool Kaldi-NL performs at annotating target words in IDS vs. ADS. We found that Kaldi-NL only found 55.8% of target words in IDS, while it annotated 66.8% correctly in ADS. In Experiment 2, we aimed to assess the difficulties in annotating IDS more broadly by transcribing all IDS utterances manually and comparing the word error rates (WERs) of two different ASR systems: Kaldi-NL and WhisperX. We found that WhisperX performs significantly better than Kaldi-NL. While there is much room for improvement, the results show that automatic transcriptions provide a promising starting point for researchers who have to transcribe a large amount of speech directed at infants.<\/jats:p>","DOI":"10.3390\/mti7070068","type":"journal-article","created":{"date-parts":[[2023,7,4]],"date-time":"2023-07-04T01:42:47Z","timestamp":1688434967000},"page":"68","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Using Open-Source Automatic Speech Recognition Tools for the Annotation of Dutch Infant-Directed Speech"],"prefix":"10.3390","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5807-0667","authenticated-orcid":false,"given":"Anika","family":"van der Klis","sequence":"first","affiliation":[{"name":"Institute for Language Sciences, Utrecht University, 3512 JK Utrecht, The Netherlands"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4342-7947","authenticated-orcid":false,"given":"Frans","family":"Adriaans","sequence":"additional","affiliation":[{"name":"Institute for Language Sciences, Utrecht University, 3512 JK Utrecht, The Netherlands"}]},{"given":"Mengru","family":"Han","sequence":"additional","affiliation":[{"name":"Department of Chinese Language and Literature, East China Normal University, Shanghai 200241, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5811-839X","authenticated-orcid":false,"given":"Ren\u00e9","family":"Kager","sequence":"additional","affiliation":[{"name":"Institute for Language Sciences, Utrecht University, 3512 JK Utrecht, The Netherlands"}]}],"member":"1968","published-online":{"date-parts":[[2023,7,3]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"104","DOI":"10.1037\/0012-1649.20.1.104","article-title":"Expanded intonation contours in mothers\u2019 speech to newborns","volume":"20","author":"Fernald","year":"1984","journal-title":"Dev. Psychol."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"477","DOI":"10.1017\/S0305000900010679","article-title":"A cross-language study of prosodic modifications in mothers\u2019 and fathers\u2019 speech to preverbal infants","volume":"16","author":"Fernald","year":"1989","journal-title":"J. Child Lang."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"684","DOI":"10.1126\/science.277.5326.684","article-title":"Cross-language analysis of phonetic units in language addressed to infants","volume":"277","author":"Kuhl","year":"1997","journal-title":"Science"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"501","DOI":"10.1016\/j.dr.2007.06.002","article-title":"Beyond babytalk: Re-evaluating the nature and content of speech input to preverbal infants","volume":"27","author":"Soderstrom","year":"2007","journal-title":"Dev. Rev."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"534","DOI":"10.1121\/1.4828977","article-title":"A multimodal corpus of speech to infant and adult listeners","volume":"134","author":"Johnson","year":"2013","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1080\/15475441.2020.1855182","article-title":"Language Specificity of Infant-directed Speech: Speaking Rate and Word Position in Word-learning Contexts","volume":"17","author":"Han","year":"2021","journal-title":"Lang. Learn. Dev."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.dr.2016.12.001","article-title":"Does prosody make the difference? A meta-analysis on relations between prosodic aspects of infant-directed speech and infant outcomes","volume":"44","author":"Spinelli","year":"2017","journal-title":"Dev. Rev."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"389","DOI":"10.1121\/1.3419786","article-title":"Effects of the acoustic properties of infant-directed speech on infant word recognition","volume":"128","author":"Song","year":"2010","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1207\/s15327647jcd0602_2","article-title":"Dynamics of Word Comprehension in Infancy: Developments in Timing, Accuracy, and Resistance to Acoustic Degradation","volume":"6","author":"Zangl","year":"2005","journal-title":"J. Cogn. Dev."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"654","DOI":"10.1080\/15250000903263973","article-title":"Influences of infant-directed speech on early word recognition","volume":"14","author":"Singh","year":"2009","journal-title":"Infancy"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"797","DOI":"10.1111\/infa.12006","article-title":"Infant-directed prosody helps infants map sounds to meanings","volume":"18","author":"Estes","year":"2013","journal-title":"Infancy"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1017\/S0305000919000813","article-title":"Pitch properties of infant-directed speech specific to word-learning contexts: A cross-linguistic investigation of Mandarin Chinese and Dutch","volume":"47","author":"Han","year":"2020","journal-title":"J. Child Lang."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"181","DOI":"10.1016\/S0163-6383(85)80005-9","article-title":"Four-month-old infants prefer to listen to motherese","volume":"8","author":"Fernald","year":"1985","journal-title":"Infant Behav. Dev."},{"key":"ref_14","first-page":"1","article-title":"Preference for infant-directed speech in preverbal young children","volume":"5","author":"Dunst","year":"2012","journal-title":"Cent. Early Lit. Learn."},{"key":"ref_15","first-page":"1728","article-title":"ManyBabies1: Infants\u2019 preference for infant-directed speech","volume":"145","author":"Soderstrom","year":"2019","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"913","DOI":"10.1017\/S0305000912000669","article-title":"The hyperarticulation hypothesis of infant-directed speech","volume":"41","author":"Cristia","year":"2014","journal-title":"J. Child Lang."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1016\/j.cognition.2017.05.003","article-title":"Vowels in infant-directed speech: More breathy and more variable, but not clearer","volume":"166","author":"Miyazawa","year":"2017","journal-title":"Cognition"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"3070","DOI":"10.1121\/1.4982246","article-title":"Prosodic exaggeration within infant-directed speech: Consequences for vowel learnability","volume":"141","author":"Adriaans","year":"2017","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"372","DOI":"10.1016\/S0163-6383(02)00086-3","article-title":"Universality and specificity in infant-directed speech: Pitch modifications as a function of infant age and sex in a tonal and non-tonal language","volume":"24","author":"Kitamura","year":"2001","journal-title":"Infant Behav. Dev."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Sjons, J., H\u00f6rberg, T., \u00d6stling, R., and Bjerva, J. (2017). Articulation rate in Swedish child-directed speech increases as a function of the age of the child even when surprisal is controlled for INTERSPEECH. arXiv.","DOI":"10.21437\/Interspeech.2017-1052"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1016\/S0167-6393(00)00067-4","article-title":"Transcriber: Development and use of a tool for assisting speech corpora production","volume":"33","author":"Barras","year":"2001","journal-title":"Speech Commun."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Gaur, Y., Lasecki, W.S., Metze, F., and Bigham, J.P. (2016, January 11\u201313). The effects of automatic speech recognition quality on human transcription latency. Proceedings of the 13th International Web for All Conference, New York, NY, USA.","DOI":"10.1145\/2899475.2899478"},{"key":"ref_23","unstructured":"Burnham, D., Kalashnikova, M., Muawiyath, S., Cassidy, S., and Estival, D. (April, January 30). Infant-Directed Speech Research Made Easy: A Database, Some Tools and a Virtual Laboratory. Proceedings of the Abstract and Paper Presented at the 43rd Experimental Psychology Conference, Melbourne, Australia."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Mohamed, A., Hinton, G., and Penn, G. (2012, January 25\u201330). Understanding how Deep Belief Networks perform acoustic modelling. Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan.","DOI":"10.1109\/ICASSP.2012.6288863"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"181","DOI":"10.1016\/j.specom.2009.10.001","article-title":"Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates","volume":"52","author":"Goldwater","year":"2010","journal-title":"Speech Commun."},{"key":"ref_26","unstructured":"Kawahara, T., Nanjo, H., Shinozaki, T., and Furui, S. (2003, January 13\u201316). Benchmark test for speech recognition using the corpus of spontaneous Japanese. Proceedings of the ISCA\/IEEE Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Shinozaki, T., Hori, C., and Furui, S. (2001, January 3\u20137). Towards automatic transcription of spontaneous presentations. Proceedings of the Seventh European Conference on Speech Communication and Technology, Aalborg, Denmark.","DOI":"10.21437\/Eurospeech.2001-129"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1016\/j.specom.2004.01.006","article-title":"Prosodic and other cues to speech recognition failures","volume":"43","author":"Hirschberg","year":"2004","journal-title":"Speech Commun."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"2238","DOI":"10.1121\/1.1869172","article-title":"Statistical properties of infant-directed versus adult-directed speech: Insights from speech recognition","volume":"117","author":"Kirchhoff","year":"2005","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1500","DOI":"10.1121\/1.3183593","article-title":"Characteristics of speaking style and implications for speech recognition","volume":"126","author":"Shinozaki","year":"2009","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_31","unstructured":"Han, M. (2019). The Role of Prosodic input in Word Learning: A Cross-Linguistic Investigation of Dutch and Mandarin Chinese Infant-Directed Speech. [Ph.D Dissertation, Utrecht University]."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"643","DOI":"10.3758\/BRM.42.3.643","article-title":"SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles","volume":"42","author":"Keuleers","year":"2010","journal-title":"Behav. Res. Methods"},{"key":"ref_33","unstructured":"Boersma, P., and Weenink, D. (2023, May 23). Praat: Doing Phonetics by Computer [Computer Program]. Version 6.1.09. Available online: https:\/\/www.fon.hum.uva.nl\/praat\/."},{"key":"ref_34","unstructured":"Yilmaz, E., and Gompel, M. (2020, March 18). Automatic Transcription of Dutch Speech Recordings [ASR Tool]. Available online: https:\/\/webservices.cls.ru.nl\/asr_nl."},{"key":"ref_35","unstructured":"Oostdijk, N. (June, January 31). The Spoken Dutch Corpus. Overview and First Evaluation. Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, Greece."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"373","DOI":"10.1109\/LSP.2017.2723507","article-title":"Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs","volume":"25","author":"Peddinti","year":"2018","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_37","unstructured":"Tejedor-Garc\u00eda, C., van der Molen, B., van den Heuvel, H., van Hessen, A., and Pieters, T. (2022, January 20\u201325). Towards an Open-Source Dutch Speech Recognition System for the Healthcare Domain. Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Losada, D.E., and Fern\u00e1ndez-Luna, J.M. (2005). Advances in Information Retrieval, Springer. Lecture Notes in Computer Science.","DOI":"10.1007\/b107096"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v067.i01","article-title":"Fitting linear mixed-effects models using lme4","volume":"67","author":"Bates","year":"2015","journal-title":"J. Stat. Softw."},{"key":"ref_40","unstructured":"R Core Team (2023, May 23). R: A Language and Environment for Statistical Computing, 2022. Version 4.2.0. Available online: https:\/\/www.r-project.org\/."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Han, M., de Jong, N.H., and Kager, R. (2023). Relating the prosody of infant-directed speech to children\u2019s vocabulary size. J. Child Lang., 1\u201317.","DOI":"10.1017\/S0305000923000041"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Bain, M., Huh, J., Han, T., and Zisserman, A. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. arXiv.","DOI":"10.21437\/Interspeech.2023-78"},{"key":"ref_43","unstructured":"Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv."},{"key":"ref_44","unstructured":"(2023, May 23). SCTK, the NIST Scoring Toolkit. 2023. Version 2.4.12. Available online: https:\/\/github.com\/usnistgov\/SCTK."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Sluijter, A., and van Heuven, V. (1996, January 3\u20136). Acoustic correlates of linguistic stress and accent in Dutch and American English. Proceedings of the Fourth International Conference on Spoken Language Processing, Philadelphia, PA, USA.","DOI":"10.21437\/ICSLP.1996-159"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1037\/0012-1649.27.2.209","article-title":"Prosody and focus in speech to infants and adults","volume":"27","author":"Fernald","year":"1991","journal-title":"Dev. Psychol."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"617","DOI":"10.1044\/jshr.3503.617","article-title":"Vowel Duration in Mothers\u2019 Speech to Young Children","volume":"35","author":"Swanson","year":"1992","journal-title":"J. Speech Lang. Hear. Res."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"1394","DOI":"10.1044\/jshr.3706.1394","article-title":"Duration of function-word vowels in mothers\u2019 speech to young children","volume":"37","author":"Swanson","year":"1994","journal-title":"J. Speech Hear. Res."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Gustafson, J., and Sj\u00f6lander, K. (2002, January 16\u201320). Voice transformations for improving children\u2019s speech recognition in a publicly available dialogue system. Proceedings of the 7th International Conference on Spoken Language Processing, ISCA, Denver, CO, USA.","DOI":"10.21437\/ICSLP.2002-139"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Van der Klis, A., Adriaans, F., Han, M., and Kager, R. (2020, January 25\u201329). Automatic Recognition of Target Words in Infant-Directed Speech. Proceedings of the Companion Publication of the 2020 International Conference on Multimodal Interaction, Online.","DOI":"10.1145\/3395035.3425184"}],"container-title":["Multimodal Technologies and Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2414-4088\/7\/7\/68\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:05:06Z","timestamp":1760126706000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2414-4088\/7\/7\/68"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,3]]},"references-count":50,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2023,7]]}},"alternative-id":["mti7070068"],"URL":"https:\/\/doi.org\/10.3390\/mti7070068","relation":{},"ISSN":["2414-4088"],"issn-type":[{"type":"electronic","value":"2414-4088"}],"subject":[],"published":{"date-parts":[[2023,7,3]]}}}