{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T08:18:04Z","timestamp":1777537084727,"version":"3.51.4"},"reference-count":46,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2019,2,2]],"date-time":"2019-02-02T00:00:00Z","timestamp":1549065600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ministry of Education Humanities and Social Sciences Research Planning Fund Project","award":["16YJAZH072"],"award-info":[{"award-number":["16YJAZH072"]}]},{"name":"Major projects of the National Social Science Fund","award":["14ZDB156"],"award-info":[{"award-number":["14ZDB156"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>To rescue and preserve an endangered language, this paper studied an end-to-end speech recognition model based on sample transfer learning for the low-resource Tujia language. From the perspective of the Tujia language international phonetic alphabet (IPA) label layer, using Chinese corpus as an extension of the Tujia language can effectively solve the problem of an insufficient corpus in the Tujia language, constructing a cross-language corpus and an IPA dictionary that is unified between the Chinese and Tujia languages. The convolutional neural network (CNN) and bi-directional long short-term memory (BiLSTM) network were used to extract the cross-language acoustic features and train shared hidden layer weights for the Tujia language and Chinese phonetic corpus. In addition, the automatic speech recognition function of the Tujia language was realized using the end-to-end method that consists of symmetric encoding and decoding. Furthermore, transfer learning was used to establish the model of the cross-language end-to-end Tujia language recognition system. 
The experimental results showed that the recognition error rate of the proposed model is 46.19%, which is 2.11% lower than that of the model that only used the Tujia language data for training. Therefore, this approach is feasible and effective.<\/jats:p>","DOI":"10.3390\/sym11020179","type":"journal-article","created":{"date-parts":[[2019,2,5]],"date-time":"2019-02-05T11:31:07Z","timestamp":1549366267000},"page":"179","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":19,"title":["Cross-Language End-to-End Speech Recognition Research Based on Transfer Learning for the Low-Resource Tujia Language"],"prefix":"10.3390","volume":"11","author":[{"given":"Chongchong","family":"Yu","sequence":"first","affiliation":[{"name":"College of Computer &amp; Information Engineering, Beijing Technology and Business University, Beijing 100048, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yunbing","family":"Chen","sequence":"additional","affiliation":[{"name":"College of Computer &amp; Information Engineering, Beijing Technology and Business University, Beijing 100048, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yueqiao","family":"Li","sequence":"additional","affiliation":[{"name":"College of Computer &amp; Information Engineering, Beijing Technology and Business University, Beijing 100048, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Meng","family":"Kang","sequence":"additional","affiliation":[{"name":"College of Computer &amp; Information Engineering, Beijing Technology and Business University, Beijing 100048, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shixuan","family":"Xu","sequence":"additional","affiliation":[{"name":"Institute of Ethnology &amp; Anthropology, Chinese Academy of Social Sciences, Beijing 100081, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xueer","family":"Liu","sequence":"additional","affiliation":[{"name":"College of Computer &amp; Information Engineering, Beijing Technology and Business University, Beijing 100048, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2019,2,2]]},"reference":[{"key":"ref_1","unstructured":"Xu, S. (2015). The course and prospect of endangered language studies in China. J. Northwest Univ. Natl. Philos. Soc. Sci., 83\u201390."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Rosenberg, A., Audhkhasi, K., Sethy, A., Ramabhadran, B., and Picheny, M. (2017, January 5\u20139). End-to-end speech recognition and keyword search on low-resource languages. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953164"},{"key":"ref_3","unstructured":"Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., and Coates, A. (arXiv, 2014). Deep Speech: Scaling up end-to-end speech recognition, arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Chan, W., and Jaitly, N. (2017, January 5\u20139). Very Deep Convolutional Networks for End-to-End Speech Recognition. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953077"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Lu, L., Kong, L., Dyer, C., and Smith, N.A. (2017, January 20\u201324). Multitask Learning with CTC and Segmental CRF for Speech Recognition. Proceedings of the INTERSPEECH 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-71"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Ochiai, T., Watanabe, S., Hori, T., and Hershey, J.R. (arXiv, 2017). 
Multichannel End-to-end Speech Recognition, arXiv.","DOI":"10.1109\/ICASSP.2018.8462161"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Parcollet, T., Zhang, Y., Morchid, M., Trabelsi, C., Linar\u00e8s, G., De Mori, R., and Bengio, Y. (arXiv, 2018). Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition, arXiv.","DOI":"10.21437\/Interspeech.2018-1898"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Ghoshal, A., Swietojanski, P., and Renals, S. (2013, January 26\u201331). Multilingual training of deep neural networks. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6639084"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Heigold, G., Vanhoucke, V., Senior, A., Nguyen, P., Ranzato, M.A., Devin, M., and Dean, J. (2013, January 26\u201331). Multilingual acoustic models using distributed deep neural networks. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6639348"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Evgeniou, T., and Pontil, M. (2004, January 22\u201325). Regularized multi-task learning. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA.","DOI":"10.1145\/1014052.1014067"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1345","DOI":"10.1109\/TKDE.2009.191","article-title":"A Survey on Transfer Learning","volume":"22","author":"Pan","year":"2010","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Dalmia, S., Sanabria, R., Metze, F., and Black, A.W. (arXiv, 2018). 
Sequence-based Multi-lingual Low Resource Speech Recognition, arXiv.","DOI":"10.1109\/ICASSP.2018.8461802"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Kim, S., Hori, T., and Watanabe, S. (2017, January 5\u20139). Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953075"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Chen, W., Hasegawa-Johnson, M., and Chen, N.F. (2018, January 2\u20136). Topic and Keyword Identification for Low-resourced Speech Using Cross-Language Transfer Learning. Proceedings of the INTERSPEECH, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1283"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Latif, S., Rana, R., Younis, S., Qadir, J., and Epps, J. (arXiv, 2018). Transfer Learning for Improving Speech Emotion Classification Accuracy, arXiv.","DOI":"10.21437\/Interspeech.2018-1625"},{"key":"ref_16","unstructured":"Deng, L. (2011, January 19\u201322). An Overview of Deep-Structured Learning for Information Processing. Proceedings of the Asian-Pacific Signal and Information Processing-Annual Summit and Conference (APSIPA-ASC), Xi\u2019an, China."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1561\/2200000006","article-title":"Learning Deep Architectures for AI","volume":"2","author":"Bengio","year":"2009","journal-title":"Found. Trends Mach. Learn."},{"key":"ref_18","unstructured":"Mohamed, A.R., Dahl, G., and Hinton, G. (2009). Deep Belief Networks for phone recognition. Nips Workshop on Deep Learning for Speech Recognition and Related Applications, MIT Press."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Lin, H., and Ou, Z. (2006, January 8\u201311). Partial-tied-mixture Auxiliary Chain Models for Speech Recognition Based on Dynamic Bayesian Networks. 
Proceedings of the IEEE International Conference on Systems, Taipei, Taiwan.","DOI":"10.1109\/ICSMC.2006.384829"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Pundak, G., and Sainath, T.N. (2016, January 8\u201312). Lower Frame Rate Neural Network Acoustic Models. Proceedings of the INTERSPEECH, San Francisco, CA, USA.","DOI":"10.21437\/Interspeech.2016-275"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"W\u00f6lfel, M., and McDonough, J. (2009). Speech Feature Extraction. Distant Speech Recognition, John Wiley & Sons, Ltd.","DOI":"10.1002\/9780470714089"},{"key":"ref_22","unstructured":"Lee, J.H., Jung, H.Y., and Lee, T.W. (2000, January 5\u20139). Speech feature extraction using independent component analysis. Proceedings of the IEEE International Conference on Acoustics, Istanbul, Turkey."},{"key":"ref_23","unstructured":"Li, H., Xu, X., Wu, G., Ding, C., and Zhao, X. (2017). Research on speech emotion feature extraction based on MFCC. J. Electron. Meas. Instrum., (In Chinese)."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Yu, D., and Seltzer, M. (2011, January 27\u201331). Improved Bottleneck Features Using Pretrained Deep Neural Networks. Proceedings of the Conference of the International Speech Communication Association, Florence, Italy.","DOI":"10.21437\/Interspeech.2011-91"},{"key":"ref_25","unstructured":"Maimaitiaili, T., and Dai, L. (2015). Deep Neural Network based Uyghur Large Vocabulary Continuous Speech Recognition. J. Data Acquis. Process., 365\u2013371."},{"key":"ref_26","first-page":"1540","article-title":"Keyword Spotting Based on Deep Neural Networks Bottleneck Feature","volume":"36","author":"Liu","year":"2015","journal-title":"J. Chin. Comput. Syst."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Ozeki, M., and Okatani, T. (2014). Understanding Convolutional Neural Networks in Terms of Category-Level Attributes. 
Computer Vision\u2014ACCV 2014, Springer International Publishing.","DOI":"10.1007\/978-3-319-16808-1_25"},{"key":"ref_28","first-page":"818","article-title":"Visualizing and Understanding Convolutional Networks","volume":"Volume 8689","author":"Zeiler","year":"2013","journal-title":"European Conference on Computer Vision"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Abdel-Hamid, O., Mohamed, A.R., Jiang, H., and Penn, G. (2012, January 25\u201330). Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition. Proceedings of the IEEE International Conference on Acoustics, Kyoto, Japan.","DOI":"10.1109\/ICASSP.2012.6288864"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Abdel-Hamid, O., Deng, L., and Yu, D. (2013, January 25\u201329). Exploring Convolutional Neural Network Structures and Optimization Techniques for Speech Recognition. Proceedings of the INTERSPEECH, Lyon, France.","DOI":"10.21437\/Interspeech.2013-744"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Sercu, T., Puhrsch, C., Kingsbury, B., and LeCun, Y. (2016, January 20\u201325). Very deep multilingual convolutional neural networks for LVCSR. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472620"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Sercu, T., and Goel, V. (arXiv, 2016). Advances in Very Deep Convolutional Neural Networks for LVCSR, arXiv.","DOI":"10.21437\/Interspeech.2016-1033"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Graves, A., Mohamed, A.R., and Hinton, G. (2013, January 26\u201331). Speech recognition with deep recurrent neural networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Graves, A. (2012). 
Supervised Sequence Labelling with Recurrent Neural Networks, Springer.","DOI":"10.1007\/978-3-642-24797-2"},{"key":"ref_35","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (arXiv, 2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Graves, A., and Gomez, F. (2006, January 25\u201329). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the International Conference on Machine Learning, Pittsburgh, PA, USA.","DOI":"10.1145\/1143844.1143891"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Miao, Y., Gowayyed, M., and Metze, F. (2016, January 13\u201317). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, Scottsdale, AZ, USA.","DOI":"10.1109\/ASRU.2015.7404790"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R., and Dupoux, E. (arXiv, 2018). End-to-End Speech Recognition from the Raw Waveform, arXiv.","DOI":"10.21437\/Interspeech.2018-2414"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N.E.Y., Heymann, J., Wiesner, M., and Chen, N. (arXiv, 2018). ESPnet: End-to-End Speech Processing Toolkit, arXiv.","DOI":"10.21437\/Interspeech.2018-1456"},{"key":"ref_40","first-page":"35","article-title":"Grammatical and semantic representation of spatial concepts in the Tujia language","volume":"1","author":"Xu","year":"2013","journal-title":"J. Minor. Lang. China"},{"key":"ref_41","unstructured":"Xu, S. (2012). Features of Change in the Structure of Endangered Languages: A Case Study of the South Tujia Language. J. Yunnan Natl. Univ. (Soc. 
Sci.), 29, (In Chinese)."},{"key":"ref_42","unstructured":"Wang, D., and Zhang, X. (arXiv, 2015). THCHS-30: A Free Chinese Speech Corpus, arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Wu, C., and Wang, B. (2017, January 26\u201329). Extracting Topics Based on Word2Vec and Improved Jaccard Similarity Coefficient. Proceedings of the IEEE Second International Conference on Data Science in Cyberspace, Shenzhen, China.","DOI":"10.1109\/DSC.2017.70"},{"key":"ref_44","unstructured":"Inc, G. (2015, January 19\u201324). Convolutional, long short-term memory, fully connected deep neural networks. Proceedings of the IEEE International Conference on Acoustics, Brisbane, Australia."},{"key":"ref_45","unstructured":"Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., and Chen, G. (arXiv, 2015). Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, arXiv."},{"key":"ref_46","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 6\u201311). Batch normalization: Accelerating deep network training by reducing internal covariate shift. 
Proceedings of the International Conference on Machine Learning, Lille, France."}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/11\/2\/179\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T12:30:48Z","timestamp":1760185848000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/11\/2\/179"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,2,2]]},"references-count":46,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2019,2]]}},"alternative-id":["sym11020179"],"URL":"https:\/\/doi.org\/10.3390\/sym11020179","relation":{},"ISSN":["2073-8994"],"issn-type":[{"value":"2073-8994","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,2,2]]}}}