{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T02:50:32Z","timestamp":1760151032444,"version":"build-2065373602"},"reference-count":30,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2022,2,15]],"date-time":"2022-02-15T00:00:00Z","timestamp":1644883200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Kyungpook National University Research Fund","award":["N\/A"],"award-info":[{"award-number":["N\/A"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Successful applications of deep learning technologies in the natural language processing domain have improved text-based intent classifications. However, in practical spoken dialogue applications, the users\u2019 articulation styles and background noises cause automatic speech recognition (ASR) errors, and these may lead language models to misclassify users\u2019 intents. To overcome the limited performance of the intent classification task in the spoken dialogue system, we propose a novel approach that jointly uses both recognized text obtained by the ASR model and a given labeled text. In the evaluation phase, only the fine-tuned recognized language model (RLM) is used. The experimental results show that the proposed scheme is effective at classifying intents in the spoken dialogue system containing ASR errors.<\/jats:p>","DOI":"10.3390\/s22041509","type":"journal-article","created":{"date-parts":[[2022,2,15]],"date-time":"2022-02-15T22:44:47Z","timestamp":1644965087000},"page":"1509","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Improved Spoken Language Representation for Intent Understanding in a Task-Oriented Dialogue System"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0111-300X","authenticated-orcid":false,"given":"June-Woo","family":"Kim","sequence":"first","affiliation":[{"name":"Department of Artificial Intelligence, Graduate School, Kyungpook National University, Daegu 41566, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1507-5967","authenticated-orcid":false,"given":"Hyekyung","family":"Yoon","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, Graduate School, Kyungpook National University, Daegu 41566, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0398-831X","authenticated-orcid":false,"given":"Ho-Young","family":"Jung","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence, Graduate School, Kyungpook National University, Daegu 41566, Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,2,15]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Liu, B., and Lane, I. (2016). Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv.","DOI":"10.21437\/Interspeech.2016-1352"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Goo, C.W., Gao, G., Hsu, Y.K., Huo, C.L., Chen, T.C., Hsu, K.W., and Chen, Y.N. (2018, January 1\u20136). Slot-gated modeling for joint slot filling and intent prediction. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Short Papers), New Orleans, LA, USA.","DOI":"10.18653\/v1\/N18-2118"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Zhang, C., Li, Y., Du, N., Fan, W., and Yu, P.S. (2018). Joint slot filling and intent detection via capsule neural networks. arXiv.","DOI":"10.18653\/v1\/P19-1519"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Qin, L., Che, W., Li, Y., Wen, H., and Liu, T. (2019). A stack-propagation framework with token-level intent detection for spoken language understanding. arXiv.","DOI":"10.18653\/v1\/D19-1214"},{"key":"ref_5","unstructured":"Chen, Q., Zhuo, Z., and Wang, W. (2019). Bert for joint intent classification and slot filling. arXiv."},{"key":"ref_6","unstructured":"Niu, P., Chen, Z., and Song, M. (2019). A novel bi-directional interrelated model for joint intent detection and slot filling. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Lugosch, L., Ravanelli, M., Ignoto, P., Tomar, V.S., and Bengio, Y. (2019). Speech model pre-training for end-to-end spoken language understanding. arXiv.","DOI":"10.21437\/Interspeech.2019-2396"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Wang, P., Wei, L., Cao, Y., Xie, J., and Nie, Z. (2020, January 4\u20138). Large-scale unsupervised pre-training for end-to-end spoken language understanding. Proceedings of the ICASSP 2020\u20132020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053163"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Huang, C.W., and Chen, Y.N. (2020, January 4\u20138). Learning asr-robust contextualized embeddings for spoken language understanding. Proceedings of the ICASSP 2020\u20132020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054689"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Cao, J., Wang, J., Hamza, W., Vanee, K., and Li, S.W. (2020). Style attuned pre-training and parameter efficient fine-tuning for spoken language understanding. arXiv.","DOI":"10.21437\/Interspeech.2020-2907"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Kim, S., Kim, G., Shin, S., and Lee, S. (2020). Two-stage textual knowledge distillation to speech encoder for spoken language understanding. arXiv.","DOI":"10.1109\/ICASSP39728.2021.9414619"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Kim, M., Kim, G., Lee, S.W., and Ha, J.W. (2020). ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding. arXiv.","DOI":"10.1109\/ICASSP39728.2021.9414558"},{"key":"ref_13","unstructured":"Lai, C.I., Cao, J., Bodapati, S., and Li, S.W. (2020). Towards Semi-Supervised Semantics Understanding from Speech. arXiv."},{"key":"ref_14","first-page":"12449","article-title":"wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations","volume":"33","author":"Baevski","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_15","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_16","unstructured":"Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. arXiv."},{"key":"ref_17","first-page":"5753","article-title":"Xlnet: Generalized autoregressive pretraining for language understanding","volume":"32","author":"Yang","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_18","unstructured":"Clark, K., Luong, M.T., Le, Q.V., and Manning, C.D. (2020). Electra: Pre-training text encoders as discriminators rather than generators. arXiv."},{"key":"ref_19","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv."},{"key":"ref_20","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_21","unstructured":"Lafferty, J., McCallum, A., and Pereira, F.C. (2021, December 31). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Available online: https:\/\/repository.upenn.edu\/cis_papers\/159\/?ref=https:\/\/githubhelp.com."},{"key":"ref_22","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, MIT Press."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Chen, Y.P., Price, R., and Bangalore, S. (2018, January 15\u201320). Spoken language understanding without speech recognition. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461718"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Haghani, P., Narayanan, A., Bacchiani, M., Chuang, G., Gaur, N., Moreno, P., Prabhavalkar, R., Qu, Z., and Waters, A. (2018, January 18\u201321). From audio to semantics: Approaches to end-to-end spoken language understanding. Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece.","DOI":"10.1109\/SLT.2018.8639043"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Chung, Y.A., Zhu, C., and Zeng, M. (2020). SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding. arXiv.","DOI":"10.18653\/v1\/2021.naacl-main.152"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Qian, Y., Bianv, X., Shi, Y., Kanda, N., Shen, L., Xiao, Z., and Zeng, M. (2021, January 6\u201311). Speech-language pre-training for end-to-end spoken language understanding. Proceedings of the ICASSP 2021\u20142021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9414900"},{"key":"ref_27","unstructured":"Coucke, A., Saade, A., Ball, A., Bluche, T., Caulier, A., Leroy, D., Doumouro, C., Gisselbrecht, T., Caltagirone, F., and Lavril, T. (2018). Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Wolf, T., Chaumond, J., Debut, L., Sanh, V., Delangue, C., Moi, A., Cistac, P., Funtowicz, M., Davison, J., and Shleifer, S. (2020, January 16\u201320). Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online.","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"ref_29","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_30","unstructured":"Guo, H., Mao, Y., and Zhang, R. (2019). Augmenting data with mixup for sentence classification: An empirical study. arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/4\/1509\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:20:09Z","timestamp":1760134809000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/4\/1509"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,15]]},"references-count":30,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2022,2]]}},"alternative-id":["s22041509"],"URL":"https:\/\/doi.org\/10.3390\/s22041509","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2022,2,15]]}}}