{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T02:44:59Z","timestamp":1774579499394,"version":"3.50.1"},"reference-count":18,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2025,9,8]],"date-time":"2025-09-08T00:00:00Z","timestamp":1757289600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[
{"name":"National Natural Science Foundation of China","award":["62201491"],"award-info":[{"award-number":["62201491"]}]},
{"name":"Natural Science Foundation of Shandong Province","award":["ZR2021QF097"],"award-info":[{"award-number":["ZR2021QF097"]}]},
{"name":"Yantai City 2023 School-Land Integration Development Project Fund","award":["2323013-2023XDRH001"],"award-info":[{"award-number":["2323013-2023XDRH001"]}]},
{"name":"Science and Technology-Based Small and Medium-Sized Enterprise Innovation Capacity Enhancement Project of Shandong Province","award":["2023TSGC0823"],"award-info":[{"award-number":["2023TSGC0823"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Cantonese automatic speech recognition (ASR) faces persistent challenges due to its nine lexical tones, extensive phonological variation, and the scarcity of professionally transcribed corpora.
To address these issues, we propose a lightweight and data-efficient framework that leverages weak phonetic supervision (WPS) in conjunction with two phoneme-aware augmentation strategies. (1) Dynamic Boundary-Aligned Phoneme Dropout progressively removes entire IPA segments according to a curriculum schedule, simulating real-world phenomena such as elision, lenition, and tonal drift while ensuring training stability. (2) Phoneme-Aware SpecAugment confines all time- and frequency-masking operations within phoneme boundaries and prioritizes high-attention regions, thereby preserving intra-phonemic contours and formant integrity. Built on the Whistle encoder\u2014which integrates a Conformer backbone, Connectionist Temporal Classification\u2013Conditional Random Field (CTC-CRF) alignment, and a multilingual phonetic space\u2014the approach requires only a grapheme-to-phoneme lexicon and Montreal Forced Aligner outputs, without any additional manual labeling. Experiments on the Cantonese subset of Common Voice demonstrate consistent gains: Dynamic Dropout alone reduces the phoneme error rate (PER) from 17.8% to 16.7% with 50 h of speech and from 16.4% to 15.1% with 100 h, while the combination of the two augmentations further lowers PER to 15.9%\/14.4%.
These results confirm that structure-aware phoneme-level perturbations provide an effective and low-cost solution for building robust Cantonese ASR systems under low-resource conditions.<\/jats:p>","DOI":"10.3390\/sym17091478","type":"journal-article","created":{"date-parts":[[2025,9,8]],"date-time":"2025-09-08T08:06:32Z","timestamp":1757318792000},"page":"1478","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Phoneme-Aware Augmentation for Robust Cantonese ASR Under Low-Resource Conditions"],"prefix":"10.3390","volume":"17","author":[{"given":"Lusheng","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Physics and Electronic Information, Yantai University, Yantai 264005, China"},{"name":"Shandong Data Open Innovation Application Laboratory of Smart Grid Advanced Technology, Yantai University, Yantai 264005, China"}]},{"given":"Shie","family":"Wu","sequence":"additional","affiliation":[{"name":"School of Physics and Electronic Information, Yantai University, Yantai 264005, China"},{"name":"Shandong Data Open Innovation Application Laboratory of Smart Grid Advanced Technology, Yantai University, Yantai 264005, China"}]},{"given":"Zhongxun","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Physics and Electronic Information, Yantai University, Yantai 264005, China"},{"name":"Shandong Data Open Innovation Application Laboratory of Smart Grid Advanced Technology, Yantai University, Yantai 264005, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,9,8]]},"reference":[
{"key":"ref_1","doi-asserted-by":"crossref","first-page":"23367","DOI":"10.1007\/s11042-023-16438-y","article-title":"A comprehensive survey on automatic speech recognition using neural networks","volume":"83","author":"Dhanjal","year":"2024","journal-title":"Multimed. Tools Appl."},
{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Chen, G., Chai, S., Wang, G., Du, J., Zhang, W.-Q., Weng, C., Su, D., Povey, D., Trmal, J., and Zhang, J. (2021). Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv.","DOI":"10.21437\/Interspeech.2021-1965"},
{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Kang, W., Yang, X., Yao, Z., Kuang, F., Yang, Y., Guo, L., Lin, L., and Povey, D. (2024, January 14\u201319). Libriheavy: A 50,000 hours ASR corpus with punctuation casing and context. Proceedings of the ICASSP 2024\u20132024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10447120"},
{"key":"ref_4","unstructured":"Dehaven, M., and Billa, J. (2022). Improving low-resource speech recognition with pretrained speech models: Continued pretraining vs. semi-supervised training. arXiv."},
{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Zhu, H., Wang, L., Cheng, G., Wang, J., Zhang, P., and Yan, Y. (2021). Wav2vec-S: Semi-supervised pre-training for low-resource ASR. arXiv.","DOI":"10.21437\/Interspeech.2022-909"},
{"key":"ref_6","doi-asserted-by":"crossref","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","article-title":"Hubert: Self-supervised speech representation learning by masked prediction of hidden units","volume":"29","author":"Hsu","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},
{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1211","DOI":"10.1109\/JSTSP.2022.3206084","article-title":"Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge","volume":"16","author":"Dunbar","year":"2022","journal-title":"IEEE J. Sel. Top. Signal Process."},
{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Xiao, Z., Ou, Z., Chu, W., and Lin, H. (2018, January 26\u201329). Hybrid CTC-attention based end-to-end speech recognition using subword units. Proceedings of the 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan.","DOI":"10.1109\/ISCSLP.2018.8706675"},
{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Wu, F., Kim, K., Watanabe, S., Han, K.J., McDonald, R., Weinberger, K.Q., and Artzi, Y. (2023, January 4\u201310). Wav2seq: Pre-training speech-to-text encoder-decoder models using pseudo languages. Proceedings of the ICASSP 2023\u20132023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10096988"},
{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Xu, Q., Baevski, A., and Auli, M. (2021). Simple and effective zero-shot cross-lingual phoneme recognition. arXiv.","DOI":"10.21437\/Interspeech.2022-60"},
{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Feng, S., Tu, M., Xia, R., Huang, C., and Wang, Y. (2023). Language-universal phonetic representation in multilingual speech pretraining for low-resource speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2023-617"},
{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Xue, H., Shao, Q., Chen, P., Guo, P., Xie, L., and Liu, J. (2023). Tranusr: Phoneme-to-word transcoder based unified speech representation learning for cross-lingual speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2023-746"},
{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Martin, K., Gauthier, J., Breiss, C., and Levy, R. (2023). Probing self-supervised speech models for phonetic and phonemic information: A case study in aspiration. arXiv.","DOI":"10.21437\/Interspeech.2023-2359"},
{"key":"ref_14","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},
{"key":"ref_15","doi-asserted-by":"crossref","first-page":"1440","DOI":"10.1109\/TASLPRO.2025.3550683","article-title":"Whistle: Data-efficient multilingual and crosslingual speech recognition via weakly phonetic supervision","volume":"33","author":"Yusuyin","year":"2025","journal-title":"IEEE Trans. Audio Speech Lang. Process."},
{"key":"ref_16","unstructured":"Yao, Z., Guo, L., Yang, X., Kang, W., Kuang, F., Yang, Y., Jin, Z., Lin, L., and Povey, D. (2023). Zipformer: A faster and better encoder for automatic speech recognition. arXiv."},
{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., and Le, Q.V. (2019). Specaugment: A simple data augmentation method for automatic speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2019-2680"},
{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and Sonderegger, M. (2017, January 20\u201324). Montreal forced aligner: Trainable text-speech alignment using kaldi. Proceedings of the Interspeech 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-1386"}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/17\/9\/1478\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:41:41Z","timestamp":1760035301000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/17\/9\/1478"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,8]]},"references-count":18,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2025,9]]}},"alternative-id":["sym17091478"],"URL":"https:\/\/doi.org\/10.3390\/sym17091478","relation":{},"ISSN":["2073-8994"],"issn-type":[{"value":"2073-8994","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,8]]}}}