{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T01:38:53Z","timestamp":1775093933825,"version":"3.50.1"},"reference-count":68,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2020,3,25]],"date-time":"2020-03-25T00:00:00Z","timestamp":1585094400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61771173"],"award-info":[{"award-number":["61771173"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Social Science Foundation of China","award":["15BG103"],"award-info":[{"award-number":["15BG103"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Advanced automatic pronunciation error detection (APED) algorithms are usually based on state-of-the-art automatic speech recognition (ASR) techniques. With the development of deep learning technology, end-to-end ASR technology has gradually matured and achieved positive practical results, which provides us with a new opportunity to update the APED algorithm. We first constructed an end-to-end ASR system based on the hybrid connectionist temporal classification and attention (CTC\/attention) architecture. An adaptive parameter was used to enhance the complementarity of the connectionist temporal classification (CTC) model and the attention-based seq2seq model, further improving the performance of the ASR system. After this, the improved ASR system was used in the APED task of Mandarin, and good results were obtained. This new APED method makes force alignment and segmentation unnecessary, and it does not require multiple complex models, such as an acoustic model or a language model. It is convenient and straightforward, and will be a suitable general solution for L1-independent computer-assisted pronunciation training (CAPT). Furthermore, we find that in regards to accuracy metrics, our proposed system based on the improved hybrid CTC\/attention architecture is close to the state-of-the-art ASR system based on the deep neural network\u2013deep neural network (DNN\u2013DNN) architecture, and has a stronger effect on the F-measure metrics, which are especially suitable for the requirements of the APED task.<\/jats:p>","DOI":"10.3390\/s20071809","type":"journal-article","created":{"date-parts":[[2020,3,25]],"date-time":"2020-03-25T13:10:47Z","timestamp":1585141847000},"page":"1809","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":43,"title":["End-to-End Automatic Pronunciation Error Detection Based on Improved Hybrid CTC\/Attention Architecture"],"prefix":"10.3390","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0460-7606","authenticated-orcid":false,"given":"Long","family":"Zhang","sequence":"first","affiliation":[{"name":"College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ziping","family":"Zhao","sequence":"additional","affiliation":[{"name":"College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chunmei","family":"Ma","sequence":"additional","affiliation":[{"name":"College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Linlin","family":"Shan","sequence":"additional","affiliation":[{"name":"College of Fine Arts and Design, Tianjin Normal University, Tianjin 300387, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Huazhi","family":"Sun","sequence":"additional","affiliation":[{"name":"College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lifen","family":"Jiang","sequence":"additional","affiliation":[{"name":"College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shiwen","family":"Deng","sequence":"additional","affiliation":[{"name":"School of Mathematical Sciences, Harbin Normal University, Harbin 150080, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chang","family":"Gao","sequence":"additional","affiliation":[{"name":"School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2020,3,25]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"896","DOI":"10.1016\/j.specom.2009.03.004","article-title":"A new method for mispronunciation detection using Support Vector Machine based on Pronunciation Space Models","volume":"51","author":"Wei","year":"2009","journal-title":"Speech Commun."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"154","DOI":"10.1016\/j.specom.2014.12.008","article-title":"Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers","volume":"67","author":"Hu","year":"2015","journal-title":"Speech Commun."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"52589","DOI":"10.1109\/ACCESS.2019.2912648","article-title":"Mispronunciation Detection Using Deep Convolutional Neural Network Features and Transfer Learning-Based Model for Arabic Phonemes","volume":"7","author":"Nazir","year":"2019","journal-title":"IEEE Access"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1016\/S0167-6393(99)00044-8","article-title":"Phone-level pronunciation scoring and assessment for interactive language learning","volume":"30","author":"Witt","year":"2000","journal-title":"Speech Commun."},{"key":"ref_5","unstructured":"Witt, S.M. (2012, January 6\u20138). Automatic error detection in pronunciation training: Where we are and where we need to go. Proceedings of the International Symposium on Automatic Detection of Errors in Pronunciation Training (IS ADEPT), Stockholm, Sweden."},{"key":"ref_6","unstructured":"Li, J., Wang, X., and Li, Y. (2019, January 12\u201319). The Speech transformer for Large-scale Mandarin Chinese Speech Recognition. Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhou, S., Dong, L., Xu, S., and Xu, B. (2018, January 13\u201316). A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on mandarin Chinese. Proceedings of the International Conference on Neural Information Processing (ICONIP), Siem Reap, Cambodia.","DOI":"10.1007\/978-3-030-04221-9_19"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Zou, W., Jiang, D., Zhao, S., Yang, G., and Li, X. (2018, January 26\u201329). Comparable Study of Modeling Units For End-To-End Mandarin Speech Recognition. Proceedings of the 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei City, Taiwan.","DOI":"10.1109\/ISCSLP.2018.8706661"},{"key":"ref_9","first-page":"193","article-title":"Mispronunciation Detection and Diagnosis in L2 English Speech Using Multi-distribution Deep Neural Networks","volume":"25","author":"Li","year":"2017","journal-title":"IEEE Trans. Audio Speech"},{"key":"ref_10","unstructured":"Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., and Chen, G. (2016, January 20\u201322). Deep speech 2: End-to-end speech recognition in english and mandarin. Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Miao, Y., Gowayyed, M., and Metze, F. (2015, January 13\u201317). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA.","DOI":"10.1109\/ASRU.2015.7404790"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Chan, W., Jaitly, N., Le, Q., and Vinyals, O. (2016, January 20\u201325). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472621"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Chiu, C.-C., Sainath, T.N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R.J., Rao, K., and Gonina, E. (2018, January 15\u201320). State-of-the-art speech recognition with sequence-to-sequence models. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462105"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Kim, S., Hori, T., and Watanabe, S. (2017, January 5\u20139). Joint CTC-attention based end-to-end speech recognition using multi-task learning. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953075"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"1240","DOI":"10.1109\/JSTSP.2017.2763455","article-title":"Hybrid CTC\/attention architecture for end-to-end speech recognition","volume":"11","author":"Watanabe","year":"2017","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Neumeyer, L., Franco, H., Weintraub, M., and Price, P. (1996, January 3\u20136). Automatic text-independent pronunciation scoring of foreign language student speech. Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP\u201996), Philadelphia, PA, USA.","DOI":"10.21437\/ICSLP.1996-372"},{"key":"ref_17","unstructured":"Franco, H., Neumeyer, L., Kim, Y., and Ronen, O. (1997, January 21\u201324). Automatic pronunciation scoring for language instruction. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich (ICASSP), Munich, Germany."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Kim, Y., Franco, H., and Neumeyer, L. (1997, January 22\u201325). Automatic pronunciation scoring of specific phone segments for language instruction. Proceedings of the Fifth European Conference on Speech Communication and Technology, Rhodes, Greece.","DOI":"10.21437\/Eurospeech.1997-230"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Witt, S., and Young, S.J. (1997, January 22\u201325). Language learning based on non-native speech recognition. Proceedings of the Fifth European Conference on Speech Communication and Technology, Rhodes, Greece.","DOI":"10.21437\/Eurospeech.1997-227"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Kanters, S., Cucchiarini, C., and Strik, H. (2009, January 3\u20135). The goodness of pronunciation algorithm: A detailed performance study. Proceedings of the 2009 ISCA International Workshop on Speech and Language Technology in Education (SLaTE), Warwickshire, UK.","DOI":"10.21437\/SLaTE.2009-13"},{"key":"ref_21","unstructured":"Song, Y., Liang, W., and Liu, R. (2010, January 26\u201328). Lattice-based GOP in automatic pronunciation evaluation. Proceedings of the 2nd International Conference on Computer and Automation Engineering (ICCAE), Singapore."},{"key":"ref_22","unstructured":"Zhang, L., Li, H., and Ma, L. (2012, January 11\u201315). An adaptive unsupervised clustering of pronunciation errors for automatic pronunciation error detection. Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan."},{"key":"ref_23","first-page":"6","article-title":"Automatic detection of phoneme error pronunciation","volume":"2","author":"Wang","year":"2009","journal-title":"Bull. Adv. Technol. Res."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Wang, H., Meng, H., and Qian, X. (November, January 29). Predicting gradation of L2 English mispronunciations using ASR with extended recognition network. Proceedings of the 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Kaohsiung, Taiwan.","DOI":"10.1109\/APSIPA.2013.6694165"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wang, H., Qian, X., and Meng, H. (2014, January 4\u20139). Phonological modeling of mispronunciation gradations in L2 English speech of L1 Chinese learners. Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6855101"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Li, X., Mao, S., Wu, X., Li, K., Liu, X., and Meng, H. (2018, January 2\u20136). Unsupervised Discovery of Non-native Phonetic Patterns in L2 English Speech for Mispronunciation Detection and Diagnosis. Proceedings of the 19th Annual Conference of the International Speech Communication Association (INTERSPEECH), Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-2027"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"2462","DOI":"10.1587\/transinf.E92.D.2462","article-title":"Effective Prediction of Errors by Non-native Speakers Using Decision Tree for Speech Recognition-Based CALL System","volume":"92","author":"Wang","year":"2009","journal-title":"IEICE Trans. Inf. Syst."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Stanley, T., Hacioglu, K., and Pellom, B. (2011, January 24\u201326). Statistical Machine Translation Framework for Modeling Phonological Errors in Computer Assisted Pronunciation Training System. Proceedings of the 2011 ISCA International Workshop on Speech and Language Technology in Education (SLaTE), Venice, Italy.","DOI":"10.21437\/SLaTE.2011-32"},{"key":"ref_29","unstructured":"Witt, S.M. (1999). Use of Speech Recognition in Computer-Assisted Language Learning, University of Cambridge."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"875","DOI":"10.1016\/j.specom.2009.05.005","article-title":"A speaker adaptation method for non-native speech using learners\u2019 native utterances for computer-assisted language learning systems","volume":"51","author":"Ohkawa","year":"2009","journal-title":"Speech Commun."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1016\/S1007-0214(11)70029-3","article-title":"Experimental study of discriminative adaptive training and MLLR for automatic pronunciation evaluation","volume":"16","author":"Song","year":"2011","journal-title":"Tsinghua Sci. Technol."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"308","DOI":"10.1587\/transinf.E94.D.308","article-title":"Regularized maximum likelihood linear regression adaptation for computer-assisted language learning systems","volume":"94","author":"Luo","year":"2011","journal-title":"IEICE Trans. Inf. Syst."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1145","DOI":"10.1587\/transinf.E96.D.1145","article-title":"A novel discriminative method for pronunciation quality assessment","volume":"96","author":"Zhang","year":"2013","journal-title":"IEICE Trans. Inf. Syst."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1038\/nature14539","article-title":"Deep learning","volume":"521","author":"Le","year":"2015","journal-title":"Nature"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1016\/j.neunet.2014.09.003","article-title":"Deep learning in neural networks: An overview","volume":"61","author":"Schmidhuber","year":"2015","journal-title":"Neural Netw."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","article-title":"Deep neural networks for acoustic modeling in speech recognition","volume":"29","author":"Hinton","year":"2012","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., and Penn, G. (2012, January 25\u201330). Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan.","DOI":"10.1109\/ICASSP.2012.6288864"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"1533","DOI":"10.1109\/TASLP.2014.2339736","article-title":"Convolutional neural networks for speech recognition","volume":"22","author":"Mohamed","year":"2014","journal-title":"IEEE Trans. Audio Speech Process."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Graves, A., Mohamed, A.-R., and Hinton, G. (2013, January 26\u201331). Speech recognition with deep recurrent neural networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"ref_40","unstructured":"Graves, A., and Jaitly, N. (2014, January 21\u201326). Towards end-to-end speech recognition with recurrent neural networks. Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Li, K., Xu, H., Wang, Y., Povey, D., and Khudanpur, S. (2018, January 2\u20136). Recurrent Neural Network Language Model Adaptation for Conversational Speech Recognition. Proceedings of the 19th Annual Conference of the International Speech Communication Association (INTERSPEECH), Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1413"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Qian, X., Meng, H., and Soong, F.K. (2012, January 9\u201313). The use of DBN-HMMs for mispronunciation detection and diagnosis in L2 English to support computer-aided pronunciation training. Proceedings of the Thirteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), Portland, OR, USA.","DOI":"10.21437\/Interspeech.2012-238"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Hu, W., Qian, Y., and Soong, F.K. (2013, January 25\u201329). A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL). Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH), Lyon, France.","DOI":"10.21437\/Interspeech.2013-458"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"3573","DOI":"10.1007\/s11771-013-1883-2","article-title":"A new formant feature and its application in Mandarin vowel pronunciation quality assessment","volume":"20","author":"Lu","year":"2013","journal-title":"J. Cent. South Univ."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Li, H., Wang, S., Liang, J., Huang, S., and Xu, B. (2009, January 6\u201310). High performance automatic mispronunciation detection method based on neural network and TRAP features. Proceedings of the Tenth Annual Conference of the International Speech Communication Association (INTERSPEECH), Brighton, UK.","DOI":"10.21437\/Interspeech.2009-553"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"691","DOI":"10.1016\/j.specom.2013.01.004","article-title":"On mispronunciation analysis of individual foreign speakers using auditory periphery models","volume":"55","author":"Koniaris","year":"2013","journal-title":"Speech Commun."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Suzuki, M., Qiao, Y., Minematsu, N., and Hirose, K. (2010, January 26\u201330). Integration of multilayer regression analysis with structure-based pronunciation assessment. Proceedings of the Eleventh Annual Conference of the International Speech Communication Association (INTERSPEECH), Makuhari, Japan.","DOI":"10.21437\/Interspeech.2010-229"},{"key":"ref_48","first-page":"435","article-title":"Bhattacharyya Distance between the Formants Structure for Robust Pronunciation Errors Detection","volume":"7","author":"Ru","year":"2011","journal-title":"J. Comput. Inf. Syst."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1080\/09588221.2011.582845","article-title":"Analysis of and feedback on phonetic features in pronunciation training with a virtual teacher","volume":"25","author":"Engwall","year":"2012","journal-title":"Comput. Assist. Lang. Learn."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Iribe, Y., Mori, T., Katsurada, K., Kawai, G., and Nitta, T. (2012, January 9\u201313). Real-time Visualization of English Pronunciation on an IPA Chart Based on Articulatory Feature Extraction. Proceedings of the Thirteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), Portland, OR, USA.","DOI":"10.21437\/Interspeech.2012-253"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Lee, A., and Glass, J. (September, January 30). Pronunciation assessment via a comparison-based system. Proceedings of the 2013 ISCA International Workshop on Speech and Language Technology in Education (SLaTE), Grenoble, France.","DOI":"10.21437\/SLaTE.2013-21"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Lee, A., Zhang, Y., and Glass, J. (2013, January 26\u201331). Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6639269"},{"key":"ref_53","unstructured":"Truong, K., Neri, A., Cucchiarini, C., and Strik, H. (2004, January 17\u201319). Automatic pronunciation error detection: An acoustic-phonetic approach. Proceedings of the 2004 InSTIL\/ICALL Symposiumon on Computer Assisted Learning, Venice, Italy."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Strik, H., Truong, K.P., Wet, F.D., and Cucchiarini, C. (2007, January 27\u201331). Comparing classifiers for pronunciation error detection. Proceedings of the Eighth Annual Conference of the International Speech Communication Association (INTERSPEECH), Antwerp, Belgium.","DOI":"10.21437\/Interspeech.2007-512"},{"key":"ref_55","unstructured":"Patil, V., and Rao, P. (2012, January 8\u201315). Automatic pronunciation assessment for language learners with acoustic-phonetic features. Proceedings of the 2012 International Conference on Computational Linguistics (COLING), Mumbai, India."},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"602","DOI":"10.1016\/j.neunet.2005.06.042","article-title":"Framewise phoneme classification with bidirectional LSTM and other neural network architectures","volume":"18","author":"Graves","year":"2005","journal-title":"Neural Netw."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"2673","DOI":"10.1109\/78.650093","article-title":"Bidirectional recurrent neural networks","volume":"45","author":"Schuster","year":"1997","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_58","unstructured":"Likic, V. (2020, March 24). The Needleman-Wunsch Algorithm for Sequence Alignment. Lecture Given at the 7th Melbourne Bioinformatics Course, Bi021 Molecular Science and Biotechnology Institute, University of Melbourne. Available online: https:\/\/www.cs.sjsu.edu\/~aid\/cs152\/NeedlemanWunsch.pdf."},{"key":"ref_59","unstructured":"Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015, January 7\u201312). Attention-based models for speech recognition. Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Qian, X., Soong, F.K., and Meng, H. (2010, January 26\u201330). Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT). Proceedings of the Eleventh Annual Conference of the International Speech Communication Association (INTERSPEECH), Makuhari, Chiba, Japan.","DOI":"10.21437\/Interspeech.2010-278"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"564","DOI":"10.1109\/TASLP.2014.2387413","article-title":"Supervised detection and unsupervised discovery of pronunciation error patterns for computer-assisted language learning","volume":"23","author":"Wang","year":"2015","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_62","doi-asserted-by":"crossref","first-page":"883","DOI":"10.1016\/j.specom.2009.04.009","article-title":"Automatic scoring of non-native spontaneous speech in tests of spoken English","volume":"51","author":"Zechner","year":"2009","journal-title":"Speech Commun."},{"key":"ref_63","unstructured":"Mohamed, A., Okhonko, D., and Zettlemoyer, L. (2019). Transformers with convolutional context for ASR. arXiv."},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Zhou, S., Dong, L., Xu, S., and Xu, B. (2018). Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin chinese. arXiv.","DOI":"10.21437\/Interspeech.2018-1107"},{"key":"ref_65","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1023\/A:1007379606734","article-title":"Multitask learning","volume":"28","author":"Caruana","year":"1997","journal-title":"Mach. Learn."},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"He, D., Yang, X., Lim, B.P., Liang, Y., Hasegawa-Johnson, M., and Chen, D. (2019, January 12\u201317). When CTC Training Meets Acoustic Landmarks. Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683607"},{"key":"ref_67","doi-asserted-by":"crossref","first-page":"3207","DOI":"10.1121\/1.5039837","article-title":"Acoustic landmarks contain more information about the phone string than other frames for automatic speech recognition with deep neural network acoustic model","volume":"143","author":"He","year":"2018","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_68","doi-asserted-by":"crossref","unstructured":"Niu, C., Zhang, J., Yang, X., and Xie, Y. (2017, January 12\u201315). A study on landmark detection based on CTC and its application to pronunciation error detection. Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia.","DOI":"10.1109\/APSIPA.2017.8282103"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/7\/1809\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:11:35Z","timestamp":1760173895000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/7\/1809"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,25]]},"references-count":68,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2020,4]]}},"alternative-id":["s20071809"],"URL":"https:\/\/doi.org\/10.3390\/s20071809","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,3,25]]}}}