{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T15:45:56Z","timestamp":1761061556844,"version":"3.37.3"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T00:00:00Z","timestamp":1638316800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,12,11]],"date-time":"2021-12-11T00:00:00Z","timestamp":1639180800000},"content-version":"vor","delay-in-days":10,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We present an unsupervised domain adaptation (UDA) method for a lip-reading model that is an image-based speech recognition model. Most of conventional UDA methods cannot be applied when the adaptation data consists of an unknown class, such as out-of-vocabulary words. In this paper, we propose a cross-modal knowledge distillation (KD)-based domain adaptation method, where we use the intermediate layer output in the audio-based speech recognition model as a teacher for the unlabeled adaptation data. Because the audio signal contains more information for recognizing speech than lip images, the knowledge of the audio-based model can be used as a powerful teacher in cases where the unlabeled adaptation data consists of audio-visual parallel data. In addition, because the proposed intermediate-layer-based KD can express the teacher as the sub-class (sub-word)-level representation, this method allows us to use the data of unknown classes for the adaptation. Through experiments on an image-based word recognition task, we demonstrate that the proposed approach can not only improve the UDA performance but can also use the unknown-class adaptation data.<\/jats:p>","DOI":"10.1186\/s13636-021-00232-5","type":"journal-article","created":{"date-parts":[[2021,12,11]],"date-time":"2021-12-11T09:02:51Z","timestamp":1639213371000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Unsupervised domain adaptation for lip reading based on cross-modal knowledge distillation"],"prefix":"10.1186","volume":"2021","author":[{"given":"Yuki","family":"Takashima","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9808-0250","authenticated-orcid":false,"given":"Ryoichi","family":"Takashima","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ryota","family":"Tsunoda","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ryo","family":"Aihara","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tetsuya","family":"Takiguchi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yasuo","family":"Ariki","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nobuaki","family":"Motoyama","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2021,12,11]]},"reference":[{"key":"232_CR1","doi-asserted-by":"publisher","first-page":"746","DOI":"10.1038\/264746a0","volume":"264","author":"H. McGurk","year":"1976","unstructured":"H. McGurk, J. MacDonald, Hearing lips and seeing voices. Nature. 264:, 746\u2013748 (1976).","journal-title":"Nature"},{"key":"232_CR2","unstructured":"M. J. Tomlinson, M. J. Russell, N. M. Brooke, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Integrating audio and visual information to provide highly robust speech recognition, (1996), pp. 821\u2013824."},{"key":"232_CR3","unstructured":"A. Verma, T. Faruquie, C. Neti, S. Basu, A. Senior, in Proc. IEEE Automatic Speech Recognition and Understanding (ASRU), 1. Late integration in audio-visual continuous speech recognition, (1999), pp. 71\u201374."},{"key":"232_CR4","doi-asserted-by":"crossref","unstructured":"K. Palecek, J. Chaloupka, in Proc. International Conference on Telecommunications and Signal Processing (TSP). Audio-visual speech recognition in noisy audio environments, (2013), pp. 484\u2013487.","DOI":"10.1109\/TSP.2013.6613979"},{"key":"232_CR5","doi-asserted-by":"crossref","unstructured":"J. S. Chung, A. Zisserman, in Proc. Asian Conference on Computer Vision (ACCV). Lip reading in the wild, (2016), pp. 87\u2013103.","DOI":"10.1007\/978-3-319-54184-6_6"},{"key":"232_CR6","doi-asserted-by":"crossref","unstructured":"J. S. Chung, A. Senior, O. Vinyals, A. Zisserman, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Lip reading sentences in the wild, (2017), pp. 3444\u20133453.","DOI":"10.1109\/CVPR.2017.367"},{"key":"232_CR7","doi-asserted-by":"crossref","unstructured":"J. Yu, S. Zhang, J. Wu, S. Ghorbani, B. Wu, S. Kang, S. Liu, X. Liu, H. Meng, D. Yu, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Audio-visual recognition of overlapped speech for the LRS2 dataset, (2020), pp. 6984\u20136988.","DOI":"10.1109\/ICASSP40776.2020.9054127"},{"key":"232_CR8","unstructured":"Y. Ganin, V. S. Lempitsky, in Proc. International Conference on Machine Learning (ICML). Unsupervised domain adaptation by backpropagation, (2015), pp. 1180\u20131189."},{"key":"232_CR9","doi-asserted-by":"crossref","unstructured":"M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, W. Li, in Proc. European Conference on Computer Vision (ECCV), 9908. Deep reconstruction-classification networks for unsupervised domain adaptation, (2016), pp. 597\u2013613.","DOI":"10.1007\/978-3-319-46493-0_36"},{"key":"232_CR10","doi-asserted-by":"crossref","unstructured":"K. Saito, K. Watanabe, Y. Ushiku, T. Harada, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Maximum classifier discrepancy for unsupervised domain adaptation, (2018), pp. 3723\u20133732.","DOI":"10.1109\/CVPR.2018.00392"},{"key":"232_CR11","doi-asserted-by":"crossref","unstructured":"P. P. Busto, J. Gall, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Open set domain adaptation, (2017), pp. 754\u2013763.","DOI":"10.1109\/ICCV.2017.88"},{"key":"232_CR12","unstructured":"G. Hinton, O. Vinyals, J. Dean, in Proc. NIPS Deep Learning Workshop. Distilling the knowledge in a neural network, (2014)."},{"key":"232_CR13","doi-asserted-by":"crossref","unstructured":"T. Asami, R. Masumura, Y. Yamaguchi, H. Masataki, Y. Aono, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Domain adaptation of DNN acoustic models using knowledge distillation, (2017), pp. 5185\u20135189.","DOI":"10.1109\/ICASSP.2017.7953145"},{"key":"232_CR14","unstructured":"G. Chen, W. Choi, X. Yu, T. X. Han, M. Chandraker, in NIPS. Learning efficient object detection models with knowledge distillation, (2017), pp. 742\u2013751."},{"key":"232_CR15","doi-asserted-by":"crossref","unstructured":"W. Li, S. Wang, M. Lei, S. M. Siniscalchi, C. H. Lee, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Improving audio-visual speech recognition performance with cross-modal student-teacher training, (2019), pp. 6560\u20136564.","DOI":"10.1109\/ICASSP.2019.8682868"},{"key":"232_CR16","doi-asserted-by":"crossref","unstructured":"H. Ninomiya, N. Kitaoka, S. Tamura, Y. Iribe, K. Takeda, in Proc. ISCA Interspeech. Integration of deep bottleneck features for audio-visual speech recognition, (2015), pp. 563\u2013567.","DOI":"10.21437\/Interspeech.2015-204"},{"key":"232_CR17","unstructured":"Y. M. Assael, B. Shillingford, S. Whiteson, N. de Freitas, LipNet: Sentence-level lipreading (2016). arXiv preprint arXiv:1611.01599."},{"key":"232_CR18","doi-asserted-by":"crossref","unstructured":"A. Graves, S. Fern\u00e1ndez, F. J. Gomez, J. Schmidhuber, in Proc. International Conference on Machine Learning (ICML). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, (2006), pp. 369\u2013376.","DOI":"10.1145\/1143844.1143891"},{"key":"232_CR19","doi-asserted-by":"crossref","unstructured":"A. Koumparoulis, G. Potamianos, in Proc. ISCA Interspeech. MobiLipNet: Resource-efficient deep learning based lipreading, (2019), pp. 2763\u20132767.","DOI":"10.21437\/Interspeech.2019-2618"},{"key":"232_CR20","unstructured":"I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, Y. Bengio, in NIPS. Generative adversarial nets, (2014), pp. 2672\u20132680."},{"key":"232_CR21","doi-asserted-by":"crossref","unstructured":"M. Wand, J. Schmidhuber, in Proc. ISCA Interspeech. Improving speaker-independent lipreading with domain-adversarial training, (2017), pp. 3662\u20133666.","DOI":"10.21437\/Interspeech.2017-421"},{"key":"232_CR22","doi-asserted-by":"crossref","unstructured":"D. A. B. Oliveira, A. B. Mattos, E. D. S. Morais, in Proc. European Conference on Computer Vision (ECCV) Workshops. Improving viseme recognition using GAN-based frontal view mapping, (2018), pp. 2148\u20132155.","DOI":"10.1109\/CVPRW.2018.00289"},{"key":"232_CR23","doi-asserted-by":"crossref","unstructured":"S. Gupta, J. Hoffman, J. Malik, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cross modal distillation for supervision transfer, (2016), pp. 2827\u20132836.","DOI":"10.1109\/CVPR.2016.309"},{"key":"232_CR24","doi-asserted-by":"crossref","unstructured":"T. Afouras, J. S. Chung, A. Zisserman, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). ASR is all you need: Cross-modal distillation for lip reading, (2020), pp. 2143\u20132147.","DOI":"10.1109\/ICASSP40776.2020.9054253"},{"key":"232_CR25","doi-asserted-by":"crossref","unstructured":"Y. Zhao, R. Xu, X. Wang, P. Hou, H. Tang, M. Song, in Proc. The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI). Hearing lips: Improving lip reading by distilling speech recognizers, (2020), pp. 6917\u20136924.","DOI":"10.1609\/aaai.v34i04.6174"},{"key":"232_CR26","unstructured":"E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, T. Darrell, Deep domain confusion: Maximizing for domain invariance (2014). arXiv preprint arXiv:1412.3474."},{"key":"232_CR27","doi-asserted-by":"crossref","unstructured":"E. Tzeng, J. Hoffman, K. Saenko, T. Darrell, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Adversarial discriminative domain adaptation, (2017), pp. 2962\u20132971.","DOI":"10.1109\/CVPR.2017.316"},{"key":"232_CR28","unstructured":"R. Shu, H. H. Bui, H. Narui, S. Ermon, in Proc. International Conference on Learning Representations (ICLR). A DIRT-T approach to unsupervised domain adaptation, (2018)."},{"key":"232_CR29","doi-asserted-by":"crossref","unstructured":"K. Sohn, S. Liu, G. Zhong, X. Yu, M. -H. Yang, M. Chandraker, in Proc. IEEE International Conference on Computer Vision (ICCV). Unsupervised domain adaptation for face recognition in unlabeled videos, (2017), pp. 5917\u20135925.","DOI":"10.1109\/ICCV.2017.630"},{"issue":"3","key":"232_CR30","doi-asserted-by":"publisher","first-page":"952","DOI":"10.1109\/TSA.2005.857790","volume":"14","author":"A. Mouchtaris","year":"2006","unstructured":"A. Mouchtaris, J. V. der Spiegel, P. Mueller, Nonparallel training for voice conversion based on a parameter adaptation approach. IEEE Trans. Audio Speech Lang. Process.14(3), 952\u2013963 (2006).","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"232_CR31","doi-asserted-by":"crossref","unstructured":"S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, M. Pantic, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). End-to-end audiovisual speech recognition, (2018), pp. 6548\u20136552.","DOI":"10.1109\/ICASSP.2018.8461326"},{"key":"232_CR32","doi-asserted-by":"crossref","unstructured":"T. Stafylakis, G. Tzimiropoulos, in Proc. ISCA Interspeech. Combining residual networks with LSTMs for lipreading, (2017), pp. 3652\u20133656.","DOI":"10.21437\/Interspeech.2017-85"},{"key":"232_CR33","unstructured":"D. P. Kingma, J. Ba, in Proc. International Conference on Learning Representations (ICLR). Adam: A method for stochastic optimization, (2015)."},{"key":"232_CR34","unstructured":"A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications (2017). arXiv preprint arXiv:1704.04861."},{"key":"232_CR35","doi-asserted-by":"crossref","unstructured":"M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, L. -C. Chen, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). MobileNetV2: Inverted residuals and linear bottlenecks, (2018), pp. 4510\u20134520.","DOI":"10.1109\/CVPR.2018.00474"}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00232-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-021-00232-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-021-00232-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,12,11]],"date-time":"2021-12-11T09:16:48Z","timestamp":1639214208000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-021-00232-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12]]},"references-count":35,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["232"],"URL":"https:\/\/doi.org\/10.1186\/s13636-021-00232-5","relation":{},"ISSN":["1687-4722"],"issn-type":[{"type":"electronic","value":"1687-4722"}],"subject":[],"published":{"date-parts":[[2021,12]]},"assertion":[{"value":"8 July 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 November 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 December 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"44"}}