{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T00:19:37Z","timestamp":1760573977988,"version":"build-2065373602"},"reference-count":37,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,10,15]],"date-time":"2025-10-15T00:00:00Z","timestamp":1760486400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Commun. Netw."],"abstract":"<jats:sec><jats:title>Introduction<\/jats:title><jats:p>We address robustness and efficiency in Chinese automatic speech recognition (ASR), focusing on long-form broadcast speech where sentence-level semantic consistency is often lost.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods<\/jats:title><jats:p>We propose a Conformer-based framework that integrates sentence-level consistency with pre-training knowledge distillation. 
We also construct CH Broadcast ASR, a domain-specific Chinese corpus for the broadcast and television domain, and evaluate on AISHELL-1, AISHELL-3, and CH Broadcast ASR.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>The proposed model consistently outperforms strong baselines (TDNN, DFSMN-T, TCN-Transformer), achieving CER = 3.3% on AISHELL-1, 3.7% on AISHELL-3, and 3.9% on CH Broadcast ASR, while reducing model size by &gt;10%.<\/jats:p><\/jats:sec><jats:sec><jats:title>Discussion<\/jats:title><jats:p>Enforcing sentence-level semantic alignment together with distillation improves robustness for long-form broadcast speech and enhances efficiency for real-time deployment.<\/jats:p><\/jats:sec>","DOI":"10.3389\/frcmn.2025.1662788","type":"journal-article","created":{"date-parts":[[2025,10,15]],"date-time":"2025-10-15T05:43:21Z","timestamp":1760507001000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Sentence-level consistency of conformer based pre-training distillation for Chinese speech recognition"],"prefix":"10.3389","volume":"6","author":[{"given":"Haifang","family":"Li","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chao","family":"Tang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xin","family":"Yue","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xu","family":"Li","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1965","published-online":{"date-parts":[[2025,10,15]]},"reference":[{"key":"B1","first-page":"7694","article-title":"Effectiveness of self-supervised pre-training for asr","volume-title":"ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing 
(ICASSP)","author":"Baevski","year":"2020"},{"key":"B2","first-page":"7124","article-title":"Pyannote. Audio: neural building blocks for speaker diarization","volume-title":"ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP)","author":"Bredin","year":"2020"},{"key":"B3","first-page":"1","article-title":"Aishell-1: an open-source mandarin speech corpus and a speech recognition base-line","volume-title":"2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I\/O systems and assessment (O-COCOSDA)","author":"Bu","year":""},{"key":"B4","first-page":"1","article-title":"Aishell-1: an open-source mandarin speech corpus and a speech recognition baseline","volume-title":"2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I\/O systems and assessment (O-COCOSDA)","author":"Bu","year":""},{"key":"B5","doi-asserted-by":"publisher","first-page":"357","DOI":"10.1109\/tassp.1980.1163420","article-title":"Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences","volume":"28","author":"Davis","year":"1980","journal-title":"IEEE Trans. Acoust. speech, signal Process."},{"key":"B6","doi-asserted-by":"publisher","first-page":"1647","DOI":"10.3233\/ida-230612","article-title":"Sampleformer: an efficient conformer-based neural network for automatic speech recognition","volume":"28","author":"Fan","year":"2024","journal-title":"Intell. Data Anal."},{"key":"B7","doi-asserted-by":"publisher","first-page":"935","DOI":"10.1109\/taslpro.2025.3533359","article-title":"Cross-Modal knowledge distillation with multi-stage adaptive feature fusion for speech separation","volume":"33","author":"Fan","year":"2025","journal-title":"IEEE Trans. Audio, Speech Lang. 
Process."},{"key":"B8","doi-asserted-by":"publisher","first-page":"1803","DOI":"10.1109\/taslp.2024.3350893","article-title":"Advanced long-content speech recognition with factorized neural transducer","volume":"32","author":"Gong","year":"2024","journal-title":"IEEE\/ACM Trans. Audio, Speech, Lang. Process."},{"key":"B9","doi-asserted-by":"publisher","first-page":"5036","DOI":"10.21437\/interspeech.2020-3015","article-title":"Conformer: convolution-augmented transformer for speech recognition","author":"Gulati","year":"2020","journal-title":"Proc. Interspeech"},{"key":"B11","doi-asserted-by":"publisher","first-page":"12","DOI":"10.21437\/interspeech.2018-1423","article-title":"End-to-end speech recognition using lattice-free MMI","author":"Hadian","year":"2018","journal-title":"Proc. Interspeech"},{"key":"B12","doi-asserted-by":"publisher","first-page":"4568","DOI":"10.21437\/interspeech.2024-588","article-title":"Revisiting convolution-free transformer for speech recognition","author":"Hou","year":"2024"},{"key":"B14","doi-asserted-by":"publisher","first-page":"127808","DOI":"10.1016\/j.neucom.2024.127808","article-title":"Sentence salience contrastive learning for abstractive text summarization","volume":"593","author":"Huang","year":"2024","journal-title":"Neurocomputing"},{"key":"B15","doi-asserted-by":"crossref","first-page":"449","DOI":"10.1109\/ASRU46091.2019.9003750","article-title":"A comparative study on transformer vs rnn in speech applications","volume-title":"2019 IEEE automatic speech recognition and understanding workshop (ASRU)","author":"Karita","year":"2019"},{"key":"B17","first-page":"10971","article-title":"Joint end-to-end spoken language understanding and automatic speech recognition training based on unified speech-to-text pre-training","volume-title":"ICASSP 2024-2024 IEEE international conference on acoustics, speech and signal processing 
(ICASSP)","author":"Kim","year":"2024"},{"key":"B18","doi-asserted-by":"crossref","first-page":"465","DOI":"10.1007\/978-3-030-58586-0_28","article-title":"Learning with privileged information for efficient image super-resolution","volume-title":"Computer Vision\u2013ECCV 2020: 16Th european conference, Glasgow, UK, August 23\u201328, 2020, proceedings, Part XXIV 16","author":"Lee","year":"2020"},{"key":"B19","doi-asserted-by":"publisher","first-page":"72707","DOI":"10.1109\/access.2024.3403761","article-title":"Knowledge distillation-based training of speech enhancement for noise-robust automatic speech recognition","volume":"12","author":"Lee","year":"2024","journal-title":"IEEE Access"},{"key":"B20","first-page":"7095","article-title":"The speechtransformer for large-scale mandarin chinese speech recognition","volume-title":"ICASSP 2019 - IEEE international conference on acoustics, speech and signal processing (ICASSP)","author":"Li","year":"2019"},{"key":"B21","author":"Logeswaran","year":"2018"},{"key":"B22","first-page":"1","article-title":"Advancing streaming ASR with chunk-wise attention and trans-chunk selective state spaces","volume-title":"ICASSP 2025-2025 IEEE international conference on acoustics, speech and signal processing (ICASSP)","author":"Mimura","year":""},{"key":"B23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/icassp49660.2025.10889802","article-title":"Advancing streaming ASR with chunk-wise attention and trans-chunk selective state spaces","author":"Mimura","year":"","journal-title":"Proc. 
ICASSP"},{"key":"B24","first-page":"311","article-title":"Bleu: a method for automatic evaluation of machine translation","volume-title":"Proceedings of the 40th annual meeting of the association for computational linguistics","author":"Papineni","year":"2002"},{"key":"B25","first-page":"11557","article-title":"Meta pseudo labels","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Pham","year":"2021"},{"key":"B26","doi-asserted-by":"publisher","first-page":"325","DOI":"10.1109\/taslp.2023.3328283","article-title":"End-to-end speech recognition: a survey","volume":"32","author":"Prabhavalkar","year":"2023","journal-title":"IEEE\/ACM Trans. Audio, Speech, Lang. Process."},{"key":"B27","first-page":"6465","article-title":"The pytorch-kaldi speech recognition toolkit","volume-title":"ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP)","author":"Ravanelli","year":"2019"},{"key":"B28","doi-asserted-by":"publisher","first-page":"1147","DOI":"10.1109\/ojsp.2024.3496819","article-title":"Jep-kd: joint-embedding predictive architecture based knowledge distillation for visual speech recognition","volume":"5","author":"Sun","year":"2024","journal-title":"IEEE Open J. Signal Process."},{"key":"B30","doi-asserted-by":"publisher","first-page":"243","DOI":"10.21437\/interspeech.2022-775","article-title":"Knowledge distillation for CTC-based speech recognition via consistent acoustic representation learning","volume":"2022","author":"Tian","year":"2022","journal-title":"Proc. Interspeech"},{"key":"B31","doi-asserted-by":"publisher","first-page":"5998","DOI":"10.1007\/978-3-031-84300-6_13","article-title":"Attention is all you need","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst. 
(NeurIPS)"},{"key":"B32","first-page":"35","article-title":"Phoneme recognition using time-delay neural networks","volume-title":"Backpropagation: Theory, Architectures, and Applications","author":"Waibel","year":"2013"},{"key":"B33","first-page":"1","article-title":"Pre-training encoder-decoder for minority language speech recognition","volume-title":"2024 international joint conference on neural networks (IJCNN)","author":"Wang","year":"2024"},{"key":"B41","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2024.emnlp-main.1070","article-title":"BLSP-Emo: Towards empathetic large speech-language models","volume-title":"Proceedings of EMNLP 2024","author":"Wang","year":"2024"},{"key":"B34","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/icassp49357.2023.10096988","article-title":"Wav2Seq: pre-Training speech-to-text encoder\u2013decoder models using pseudo languages","author":"Wu","year":"2023","journal-title":"Proc. ICASSP"},{"key":"B35","doi-asserted-by":"publisher","DOI":"10.19734\/j.issn.1001-3695.2021.08.0323","article-title":"Tcn-transformer-ctc for end-to-end speech recognition","volume":"39","author":"Xie","year":"2022","journal-title":"Appl. Res. Computers\/Jisuanji Yingyong Yanjiu"},{"key":"B36","doi-asserted-by":"publisher","first-page":"1626","DOI":"10.1109\/taslp.2021.3071662","article-title":"Tutornet: towards flexible knowledge distillation for end-to-end speech recognition","volume":"29","author":"Yoon","year":"2021","journal-title":"IEEE\/ACM Trans. Audio, Speech, Lang. Process."},{"key":"B37","doi-asserted-by":"publisher","first-page":"1519","DOI":"10.1109\/jstsp.2022.3182537","article-title":"BigSSL: exploring the frontier of large-scale semi-supervised learning for automatic speech recognition","volume":"16","author":"Zhang","year":"2022","journal-title":"IEEE J. Sel. Top. 
Signal Process."},{"key":"B38","first-page":"13647","article-title":"Mm-narrator: narrating long-form videos with multimodal in-context learning","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Zhang","year":""},{"key":"B39","doi-asserted-by":"publisher","first-page":"13647","DOI":"10.1109\/cvpr52733.2024.01295","article-title":"MM-Narrator: narrating long-form videos with multimodal in-context learning","author":"Zhang","year":"","journal-title":"Proc. IEEE\/CVF CVPR"},{"key":"B40","doi-asserted-by":"crossref","DOI":"10.21437\/Interspeech.2022-500","article-title":"Knowledge distillation via module replacing for automatic speech recognition with recurrent neural network transducer","volume-title":"23rd interspeech conference","author":"Zhao","year":"2022"}],"container-title":["Frontiers in Communications and Networks"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frcmn.2025.1662788\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,15]],"date-time":"2025-10-15T05:43:26Z","timestamp":1760507006000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frcmn.2025.1662788\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,15]]},"references-count":37,"alternative-id":["10.3389\/frcmn.2025.1662788"],"URL":"https:\/\/doi.org\/10.3389\/frcmn.2025.1662788","relation":{},"ISSN":["2673-530X"],"issn-type":[{"value":"2673-530X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,15]]},"article-number":"1662788"}}