{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T05:07:15Z","timestamp":1773551235906,"version":"3.50.1"},"reference-count":53,"publisher":"MDPI AG","issue":"24","license":[{"start":{"date-parts":[[2023,12,6]],"date-time":"2023-12-06T00:00:00Z","timestamp":1701820800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"PROMPT Institute Research","award":["2018"],"award-info":[{"award-number":["2018"]}]},{"name":"PROMPT Institute Research","award":["WANMA2021\/7"],"award-info":[{"award-number":["WANMA2021\/7"]}]},{"name":"WA Near Miss Award","award":["2018"],"award-info":[{"award-number":["2018"]}]},{"name":"WA Near Miss Award","award":["WANMA2021\/7"],"award-info":[{"award-number":["WANMA2021\/7"]}]},{"name":"Department of Health WA and administered through the Future Health Research and Innovation (FHRI) Fund","award":["2018"],"award-info":[{"award-number":["2018"]}]},{"name":"Department of Health WA and administered through the Future Health Research and Innovation (FHRI) Fund","award":["WANMA2021\/7"],"award-info":[{"award-number":["WANMA2021\/7"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Problem: Phonetic transcription is crucial in diagnosing speech sound disorders (SSDs) but is susceptible to transcriber experience and perceptual bias. Current forced alignment (FA) tools, which annotate audio files to determine spoken content and its placement, often require manual transcription, limiting their effectiveness. Method: We introduce a novel, text-independent forced alignment model that autonomously recognises individual phonemes and their boundaries, addressing these limitations. Our approach leverages an advanced, pre-trained wav2vec 2.0 model to segment speech into tokens and recognise them automatically. 
To accurately identify phoneme boundaries, we utilise an unsupervised segmentation tool, UnsupSeg. Labelling of segments employs nearest-neighbour classification with wav2vec 2.0 labels, before connectionist temporal classification (CTC) collapse, determining class labels based on maximum overlap. Additional post-processing, including overfitting cleaning and voice activity detection, is implemented to enhance segmentation. Results: We benchmarked our model against existing methods using the TIMIT dataset for normal speakers and, for the first time, evaluated its performance on the TORGO dataset containing SSD speakers. Our model demonstrated competitive performance, achieving a harmonic mean score of 76.88% on TIMIT and 70.31% on TORGO. Implications: This research presents a significant advancement in the assessment and diagnosis of SSDs, offering a more objective and less biased approach than traditional methods. Our model\u2019s effectiveness, particularly with SSD speakers, opens new avenues for research and clinical application in speech pathology.<\/jats:p>","DOI":"10.3390\/s23249650","type":"journal-article","created":{"date-parts":[[2023,12,6]],"date-time":"2023-12-06T06:01:07Z","timestamp":1701842467000},"page":"9650","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Improving Text-Independent Forced Alignment to Support Speech-Language Pathologists with Phonetic Transcription"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5449-4382","authenticated-orcid":false,"given":"Ying","family":"Li","sequence":"first","affiliation":[{"name":"School of EECMS, Curtin University, Bentley, WA 6102, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-2952-6347","authenticated-orcid":false,"given":"Bryce Johannas","family":"Wohlan","sequence":"additional","affiliation":[{"name":"School of EECMS, Curtin University, Bentley, WA 6102, 
Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4006-7803","authenticated-orcid":false,"given":"Duc-Son","family":"Pham","sequence":"additional","affiliation":[{"name":"School of EECMS, Curtin University, Bentley, WA 6102, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4949-7647","authenticated-orcid":false,"given":"Kit Yan","family":"Chan","sequence":"additional","affiliation":[{"name":"School of EECMS, Curtin University, Bentley, WA 6102, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4488-1199","authenticated-orcid":false,"given":"Roslyn","family":"Ward","sequence":"additional","affiliation":[{"name":"School of Allied Health, Curtin University, Bentley, WA 6102, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1538-7140","authenticated-orcid":false,"given":"Neville","family":"Hennessey","sequence":"additional","affiliation":[{"name":"School of Allied Health, Curtin University, Bentley, WA 6102, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3195-3480","authenticated-orcid":false,"given":"Tele","family":"Tan","sequence":"additional","affiliation":[{"name":"School of EECMS, Curtin University, Bentley, WA 6102, Australia"}]}],"member":"1968","published-online":{"date-parts":[[2023,12,6]]},"reference":[{"key":"ref_1","first-page":"275","article-title":"Diagnostic and statistical manual of mental disorders","volume":"48","author":"Carter","year":"2014","journal-title":"Ther. Recreat. J."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"112","DOI":"10.1097\/TLD.0b013e318217b5dd","article-title":"Subtyping children with speech sound disorders by endophenotypes","volume":"31","author":"Lewis","year":"2011","journal-title":"Top. Lang. Disord."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"578","DOI":"10.1111\/dmcn.12635","article-title":"Speech sound disorder at 4 years: Prevalence, comorbidities, and predictors in a community cohort of children","volume":"57","author":"Eadie","year":"2015","journal-title":"Dev. 
Med. Child Neurol."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1341","DOI":"10.1044\/jshr.3706.1341","article-title":"A 28-year follow-up of adults with a history of moderate phonological disorder: Educational and occupational results","volume":"37","author":"Felsenfeld","year":"1994","journal-title":"J. Speech Lang. Hear. Res."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1016\/j.jcomdis.2012.08.006","article-title":"When he\u2019s around his brothers\u2026he\u2019s not so quiet: The private and public worlds of school-aged children with speech sound disorder","volume":"46","author":"McLeod","year":"2013","journal-title":"J. Commun. Disord."},{"key":"ref_6","unstructured":"Bates, S., and Titterington, J. (2021). Good Practice Guidelines for the Analysis of Child Speech, Ulster University."},{"key":"ref_7","unstructured":"(2023, October 11). Child Speech Disorder Research Network. Available online: https:\/\/www.nbt.nhs.uk\/bristol-speech-language-therapy-research-unit\/bsltru-research\/child-speech-disorder-research-network."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"456","DOI":"10.1044\/jshr.2703.456","article-title":"A procedure for phonetic transcription by consensus","volume":"27","author":"Shriberg","year":"1984","journal-title":"J. Speech Lang. Hear. Res."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1111\/j.1460-6984.2012.00195.x","article-title":"How should children with speech sound disorders be classified? A review and critical evaluation of current classification systems","volume":"48","author":"Waring","year":"2013","journal-title":"Int. J. Lang. Commun. Disord."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1007\/s40474-014-0017-3","article-title":"Differential diagnosis of pediatric speech sound disorder","volume":"1","author":"Dodd","year":"2014","journal-title":"Curr. Dev. Disord. 
Rep."},{"key":"ref_11","unstructured":"Titterington, J., and Bates, S. (2021). Manual of Clinical Phonetics, Routledge."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"225","DOI":"10.3109\/02699209108986113","article-title":"Reliability studies in broad and narrow phonetic transcription","volume":"5","author":"Shriberg","year":"1991","journal-title":"Clin. Linguist. Phon."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1044\/1058-0360.0503.07","article-title":"Hearing and believing: Some limits to the auditory-perceptual assessment of speech and voice disorders","volume":"5","author":"Kent","year":"1996","journal-title":"Am. J.-Speech-Lang. Pathol."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"382","DOI":"10.1044\/jslhr.4202.382","article-title":"Undifferentiated lingual gestures in children with articulation\/phonological disorders","volume":"42","author":"Gibbon","year":"1999","journal-title":"J. Speech Lang. Hear. Res."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1080\/02699206.2016.1174739","article-title":"Electropalatographic (EPG) evidence of covert contrasts in disordered speech","volume":"31","author":"Gibbon","year":"2017","journal-title":"Clin. Linguist. Phon."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1080\/17549507.2018.1477991","article-title":"Automated speech analysis tools for children\u2019s speech production: A systematic literature review","volume":"20","author":"McKechnie","year":"2018","journal-title":"Int. J.-Speech-Lang. Pathol."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Bhardwaj, V., Ben Othman, M.T., Kukreja, V., Belkhier, Y., Bajaj, M., Goud, B.S., Rehman, A.U., Shafiq, M., and Hamam, H. (2022). Automatic speech recognition (asr) systems for children: A systematic literature review. Appl. 
Sci., 12.","DOI":"10.3390\/app12094419"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Attwell, G.A., Bennin, K.E., and Tekinerdogan, B. (2022). A Systematic Review of Online Speech Therapy Systems for Intervention in Childhood Speech Communication Disorders. Sensors, 22.","DOI":"10.3390\/s22249713"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"259","DOI":"10.1016\/0885-2308(91)90010-N","article-title":"A recurrent error propagation network speech recognition system","volume":"5","author":"Robinson","year":"1991","journal-title":"Comput. Speech Lang."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"2673","DOI":"10.1109\/78.650093","article-title":"Bidirectional recurrent neural networks","volume":"45","author":"Schuster","year":"1997","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Graves, A., Fern\u00e1ndez, S., Gomez, F., and Schmidhuber, J. (2006, January 25\u201329). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA.","DOI":"10.1145\/1143844.1143891"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Wang, D., Wang, X., and Lv, S. (2019). An overview of end-to-end automatic speech recognition. Symmetry, 11.","DOI":"10.3390\/sym11081018"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"602","DOI":"10.1016\/j.neunet.2005.06.042","article-title":"Framewise phoneme classification with bidirectional LSTM and other neural network architectures","volume":"18","author":"Graves","year":"2005","journal-title":"Neural Netw."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., and Pallett, D.S. (1993). DARPA TIMIT Acoustic-Phonetic Continous Speech Corpus CD-ROM. 
NIST Speech Disc 1-1.1, NASA STI\/Recon Technical Report n.","DOI":"10.6028\/NIST.IR.4930"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Graves, A., Mohamed, A.R., and Hinton, G. (2013, January 26\u201331). Speech recognition with deep recurrent neural networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_27","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., and Wu, Y. (2020). Conformer: Convolution-augmented transformer for speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., and Gadde, R.T. (2019). Jasper: An end-to-end convolutional neural acoustic model. arXiv.","DOI":"10.21437\/Interspeech.2019-1819"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Kriman, S., Beliaev, S., Ginsburg, B., Huang, J., Kuchaiev, O., Lavrukhin, V., Leary, R., Li, J., and Zhang, Y. (2020, January 4\u20138). Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. 
Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053889"},{"key":"ref_31","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"V\u00e1squez-Correa, J.C., and \u00c1lvarez Muniain, A. (2023). Novel speech recognition systems applied to forensics within child exploitation: Wav2vec2.0 vs. Whisper. Sensors, 23.","DOI":"10.3390\/s23041843"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015, January 19\u201324). Librispeech: An asr corpus based on public domain audio books. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia.","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"20180061","DOI":"10.1515\/lingvan-2018-0061","article-title":"Assessing the accuracy of existing forced alignment software on varieties of British English","volume":"6","author":"MacKenzie","year":"2020","journal-title":"Linguist. Vanguard"},{"key":"ref_35","first-page":"192","article-title":"Prosodylab-aligner: A tool for forced alignment of laboratory speech","volume":"39","author":"Gorman","year":"2011","journal-title":"Can. Acoust."},{"key":"ref_36","unstructured":"Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., and Schwarz, P. (2011, January 11\u201315). The Kaldi speech recognition toolkit. Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, Waikoloa, HI, USA. 
number CONF."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and Sonderegger, M. (2017, January 20\u201324). Montreal forced aligner: Trainable text-speech alignment using kaldi. Proceedings of the Interspeech, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-1386"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Kreuk, F., Sheena, Y., Keshet, J., and Adi, Y. (2020, January 4\u20138). Phoneme boundary detection using learnable segmental features. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053053"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Kreuk, F., Keshet, J., and Adi, Y. (2020). Self-supervised contrastive learning for unsupervised phoneme segmentation. arXiv.","DOI":"10.21437\/Interspeech.2020-2398"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Wohlan, B., Pham, D.S., Chan, K.Y., and Ward, R. (2022, January 5\u20138). A Text-Independent Forced Alignment Method for Automatic Phoneme Segmentation. Proceedings of the Australasian Joint Conference on Artificial Intelligence, Perth, WA, Australia.","DOI":"10.1007\/978-3-031-22695-3_41"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Lhoest, Q., del Moral, A.V., Jernite, Y., Thakur, A., von Platen, P., Patil, S., Chaumond, J., Drame, M., Plu, J., and Tunstall, L. (2021). Datasets: A community library for natural language processing. arXiv.","DOI":"10.18653\/v1\/2021.emnlp-demo.21"},{"key":"ref_42","unstructured":"Gutmann, M., and Hyv\u00e4rinen, A. (2010, January 13\u201315). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. 
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Sardinia, Italy."},{"key":"ref_43","unstructured":"Maas, A.L., Hannun, A.Y., and Ng, A.Y. (2013, January 16\u201321). Rectifier nonlinearities improve neural network acoustic models. Proceedings of the ICML, Atlanta, GA, USA."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"523","DOI":"10.1007\/s10579-011-9145-0","article-title":"The TORGO database of acoustic and articulatory speech from speakers with dysarthria","volume":"46","author":"Rudzicz","year":"2012","journal-title":"Lang. Resour. Eval."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"2213","DOI":"10.1044\/2020_JSLHR-20-00268","article-title":"Performance of forced-alignment algorithms on children\u2019s speech","volume":"64","author":"Mahr","year":"2021","journal-title":"J. Speech Lang. Hear. Res."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Zhu, J., Zhang, C., and Jurgens, D. (2022, January 22\u201327). Phone-to-audio alignment without text: A semi-supervised approach. Proceedings of the ICASSP 2022\u20132022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore.","DOI":"10.1109\/ICASSP43922.2022.9746112"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Lin, Y., Wang, L., Li, S., Dang, J., and Ding, C. (2020, January 25\u201329). Staged Knowledge Distillation for End-to-End Dysarthric Speech Recognition and Speech Attribute Transcription. Proceedings of the INTERSPEECH, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1755"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Fainberg, J., Bell, P., Lincoln, M., and Renals, S. (2016, January 8\u201312). Improving Children\u2019s Speech Recognition Through Out-of-Domain Data Augmentation. 
Proceedings of the Interspeech, San Francisco, CA, USA.","DOI":"10.21437\/Interspeech.2016-1348"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Christensen, H., Aniol, M.B., Bell, P., Green, P.D., Hain, T., King, S., and Swietojanski, P. (2013, January 25\u201329). Combining in-domain and out-of-domain speech data for automatic recognition of disordered speech. Proceedings of the Interspeech, Lyon, France.","DOI":"10.21437\/Interspeech.2013-324"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Smith, D.V., Sneddon, A., Ward, L., Duenser, A., Freyne, J., Silvera-Tawil, D., and Morgan, A. (2017, January 20\u201324). Improving Child Speech Disorder Assessment by Incorporating Out-of-Domain Adult Speech. Proceedings of the Interspeech, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-455"},{"key":"ref_51","unstructured":"Rosenfelder, I., Fruehwald, J., Evanini, K., Seyfarth, S., Gorman, K., Prichard, H., and Yuan, J. (2023, October 15). FAVE (Forced Alignment and Vowel Extraction) Suite Version 1.1.3. Available online: https:\/\/zenodo.org\/records\/9846."},{"key":"ref_52","unstructured":"Ochshorn, R., and Hawkins, M. (2023, October 17). Gentle. Available online: https:\/\/github.com\/lowerquality\/gentle."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Eshky, A., Ribeiro, M.S., Cleland, J., Richmond, K., Roxburgh, Z., Scobbie, J., and Wrench, A. (2019). UltraSuite: A repository of ultrasound and acoustic data from child speech therapy sessions. 
arXiv.","DOI":"10.21437\/Interspeech.2018-1736"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/24\/9650\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T21:34:00Z","timestamp":1760132040000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/24\/9650"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,6]]},"references-count":53,"journal-issue":{"issue":"24","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["s23249650"],"URL":"https:\/\/doi.org\/10.3390\/s23249650","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,6]]}}}