{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T16:26:34Z","timestamp":1774715194952,"version":"3.50.1"},"reference-count":31,"publisher":"MDPI AG","issue":"24","license":[{"start":{"date-parts":[[2022,12,16]],"date-time":"2022-12-16T00:00:00Z","timestamp":1671148800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"JSPS KAKENHI","award":["JP18K11431"],"award-info":[{"award-number":["JP18K11431"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The most effective automatic speech recognition (ASR) approaches are based on artificial neural networks (ANN). ANNs need to be trained with an adequate amount of matched conditioned data. Therefore, performing training adaptation of an ASR model using augmented data of matched condition as the real environment gives better results for real data. Real-world speech recordings can vary in different acoustic aspects depending on the recording channels and environment such as the Long Term Evolution (LTE) channel of mobile telephones, where data are transmitted with voice over LTE (VoLTE) technology, wireless pin mics in a classroom condition, etc. Acquiring data with such variation is costly. Therefore, we propose training ASR models with simulated augmented data and fine-tune them for domain adaptation using deep neural network (DNN)-based simulated data along with re-recorded data. DNN-based feature transformation creates realistic speech features from recordings of clean conditions. In this research, a comparative investigation is performed for different recording channel adaptation methods for real-world speech recognition. 
The proposed method yields a 27.0% character error rate reduction (CERR) for the DNN\u2013hidden Markov model (DNN-HMM) hybrid ASR approach and a 36.4% CERR for the target domain of the LTE channel of telephone speech with the end-to-end ASR approach.<\/jats:p>","DOI":"10.3390\/s22249945","type":"journal-article","created":{"date-parts":[[2022,12,19]],"date-time":"2022-12-19T09:31:01Z","timestamp":1671442261000},"page":"9945","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Domain Adaptation with Augmented Data by Deep Neural Network Based Method Using Re-Recorded Speech for Automatic Speech Recognition in Real Environment"],"prefix":"10.3390","volume":"22","author":[{"given":"Raufun","family":"Nahar","sequence":"first","affiliation":[{"name":"Graduate School of Science and Technology, Shizuoka University, Hamamatsu 432-8561, Japan"}]},{"given":"Shogo","family":"Miwa","sequence":"additional","affiliation":[{"name":"Graduate School of Integrated Science and Technology, Shizuoka University, Hamamatsu 432-8561, Japan"}]},{"given":"Atsuhiko","family":"Kai","sequence":"additional","affiliation":[{"name":"Graduate School of Science and Technology, Shizuoka University, Hamamatsu 432-8561, Japan"}]}],"member":"1968","published-online":{"date-parts":[[2022,12,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015, January 6\u201310). Audio augmentation for speech recognition. Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-711"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Hsiao, R., Ma, J., Hartmann, W., Karafi\u00e1t, M., Gr\u00e9zl, F., Burget, L., Sz\u00f6ke, I., \u010cernock\u00fd, J.H., Watanabe, S., and Chen, Z. (2015, January 13\u201317). 
Robust speech recognition in unknown reverberant and noisy conditions. Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA.","DOI":"10.1109\/ASRU.2015.7404841"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Ko, T., Peddinti, V., Povey, D., Seltzer, M.L., and Khudanpur, S. (2017, January 5\u20139). A study on data augmentation of reverberant speech for robust speech recognition. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953152"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1469","DOI":"10.1109\/TASLP.2015.2438544","article-title":"Data augmentation for deep neural network acoustic modeling","volume":"23","author":"Cui","year":"2015","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Khokhlov, Y., Zatvornitskiy, A., Medennikov, I., Sorokin, I., Prisyach, T., Romanenko, A., Mitrofanov, A., Bataev, V., Andrusenko, A., and Korenevskaya, M. (2019, January 15\u201319). R-vectors: New Technique for Adaptation to Room Acoustics. Proceedings of the INTERSPEECH, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-2645"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1109\/TASL.2011.2109382","article-title":"Acoustic Modeling using Deep Belief Networks","volume":"20","author":"Mohamed","year":"2012","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"30","DOI":"10.1109\/TASL.2011.2134090","article-title":"Context-dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition","volume":"20","author":"Dahl","year":"2012","journal-title":"IEEE Trans. Audio Speech Lang. 
Process."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","article-title":"Deep Neural Network for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups","volume":"29","author":"Hinton","year":"2012","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1527","DOI":"10.1162\/neco.2006.18.7.1527","article-title":"A Fast Learning Algorithm for Deep Belief Nets","volume":"18","author":"Hinton","year":"2006","journal-title":"Neural Comput."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Christensen, H., Cunningham, S., Fox, C., Green, P., and Hain, T. (2012, January 9\u201313). A comparative study of adaptive, automatic recognition of disordered speech. Proceedings of the INTERSPEECH, Portland, OR, USA.","DOI":"10.21437\/Interspeech.2012-484"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Hsu, W.-N., Zhang, Y., and Glass, J. (2017, January 16\u201320). Unsupervised domain adaptation for robust speech recognition via autoencoder-based data augmentation. Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop, Okinawa, Japan.","DOI":"10.1109\/ASRU.2017.8268911"},{"key":"ref_12","first-page":"1","article-title":"Environment-dependent denoising autoencoder for distant-talking speech recognition","volume":"92","author":"Ueda","year":"2015","journal-title":"EURASIP J. Adv. Signal Process"},{"key":"ref_13","unstructured":"Zhang, Y., Qin, J., Park, S.D., Han, W., Chiu, C.C., Pang, R., Le, V.Q., and Wu, Y. (2020). Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Graves, A., Fernandez, S., Gomez, F., and Huber, J.S. (2006, January 25\u201329). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. 
Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA.","DOI":"10.1145\/1143844.1143891"},{"key":"ref_15","unstructured":"Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1240","DOI":"10.1109\/JSTSP.2017.2763455","article-title":"Hybrid CTC\/Attention Architecture for End-to-End Speech Recognition","volume":"11","author":"Watanabe","year":"2017","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Nakamura, A., Saito, T., Ikeda, D., Ohta, K., Mineno, H., and Nishimura, M. (2021). Automatic Detection of Chewing and Swallowing. Sensors, 21.","DOI":"10.3390\/s21103378"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Peddinti, V., Povey, D., and Khudanpur, S. (2015, January 6\u201310). A time delay neural network architecture for efficient modeling of long temporal contexts. Proceedings of the INTERSPEECH, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-647"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"328","DOI":"10.1109\/29.21701","article-title":"Phoneme recognition using time-delay neural networks","volume":"37","author":"Waibel","year":"1989","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1162\/neco.1989.1.1.39","article-title":"Modular construction of time-delay neural networks for speech recognition","volume":"1","author":"Waibel","year":"1989","journal-title":"Neural Comput."},{"key":"ref_21","unstructured":"Jankowski, C., Kalyanswamy, A., Basson, S., and Spitz, J. (1990, January 3\u20136). NTIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database. 
Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, USA."},{"key":"ref_22","unstructured":"Brown, K.L., and George, E.B. (1995, January 9\u201312). CTIMIT: A speech corpus for the cellular environment with applications to automatic speech recognition. Proceedings of the 1995 International Conference on Acoustics Speech, and Signal Processing, Detroit, MI, USA."},{"key":"ref_23","unstructured":"Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., and Dahlgren, N. (1990). DARPA, TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM, National Institute of Standards and Technology."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"367","DOI":"10.1109\/TASSP.1980.1163421","article-title":"Distortion measures for speech processing","volume":"28","author":"Gray","year":"1980","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_25","unstructured":"(2022, September 01). Corpus of Spontaneous Japanese. Available online: https:\/\/clrd.ninjal.ac.jp\/csj\/en\/index.html."},{"key":"ref_26","unstructured":"(2022, September 01). Report: \u201cConstruction of the Corpus of Spontaneous Japanese\u201d, Chapter 2: Transcriptions. Available online: https:\/\/clrd.ninjal.ac.jp\/csj\/en\/document.html."},{"key":"ref_27","unstructured":"(2022, September 01). Electronic Noise Database. (In Japanese)."},{"key":"ref_28","unstructured":"(1996). ITU Recommendation G.712. Transmission Performance Characteristics of Pulse Code Modulation Channels."},{"key":"ref_29","unstructured":"Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., and Schwarz, P. (2011, January 11\u201315). The Kaldi speech recognition toolkit. 
Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., and Le, Q.V. (2019, January 15\u201319). SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Proceedings of the INTERSPEECH, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Enrique Yalta Soplin, N., Heymann, J., Wiesner, M., and Chen, N. (2018, January 2\u20136). ESPnet: End-to-End Speech Processing Toolkit. Proceedings of the INTERSPEECH, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1456"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/24\/9945\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:42:58Z","timestamp":1760146978000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/24\/9945"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,16]]},"references-count":31,"journal-issue":{"issue":"24","published-online":{"date-parts":[[2022,12]]}},"alternative-id":["s22249945"],"URL":"https:\/\/doi.org\/10.3390\/s22249945","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,12,16]]}}}