{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T19:02:58Z","timestamp":1757617378016,"version":"3.44.0"},"reference-count":77,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2025,2,15]],"date-time":"2025-02-15T00:00:00Z","timestamp":1739577600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,2,15]],"date-time":"2025-02-15T00:00:00Z","timestamp":1739577600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100003359","name":"Generalitat Valenciana","doi-asserted-by":"publisher","award":["CIACIF\/2021\/295"],"award-info":[{"award-number":["CIACIF\/2021\/295"]}],"id":[{"id":"10.13039\/501100003359","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004837","name":"Ministerio de Ciencia e Innovaci\u00f3n","doi-asserted-by":"publisher","award":["PID2021-124719OB-I00"],"award-info":[{"award-number":["PID2021-124719OB-I00"]}],"id":[{"id":"10.13039\/501100004837","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Lang Resources &amp; Evaluation"],"published-print":{"date-parts":[[2025,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Visual speech recognition remains an open research problem where different challenges must be considered by dispensing with the auditory sense, such as visual ambiguities, the inter-personal variability among speakers, and the complex modeling of silence. Nonetheless, recent remarkable results have been achieved in the field thanks to the availability of large-scale databases and the use of powerful attention mechanisms. Besides, multiple languages apart from English are nowadays a focus of interest. This paper presents noticeable advances in automatic continuous lipreading for Spanish. First, an end-to-end system based on the hybrid CTC\/Attention architecture is presented. Experiments are conducted on two corpora of disparate nature, reaching state-of-the-art results that significantly improve the best performance obtained to date for both databases. In addition, a thorough ablation study is carried out, where it is studied how the different components that form the architecture influence the quality of speech recognition. Then, a rigorous error analysis is carried out to investigate the different factors that could affect the learning of the automatic system. Finally, a new Spanish lipreading benchmark is consolidated. 
Code and trained models are available at https:\/\/github.com\/david-gimeno\/evaluating-end2end-spanish-lipreading.<\/jats:p>","DOI":"10.1007\/s10579-025-09809-4","type":"journal-article","created":{"date-parts":[[2025,2,15]],"date-time":"2025-02-15T12:18:46Z","timestamp":1739621926000},"page":"2365-2386","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Evaluation of end-to-end continuous spanish lipreading in different data conditions"],"prefix":"10.1007","volume":"59","author":[{"given":"David","family":"Gimeno-G\u00f3mez","sequence":"first","affiliation":[]},{"given":"Carlos-D.","family":"Mart\u00ednez-Hinarejos","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,2,15]]},"reference":[{"key":"9809_CR1","unstructured":"Acosta-Triana, J.-M. , Gimeno-G\u00f3mez, D., Mart\u00ednez-Hinarejos, & C.-D. (2024). AnnoTheia: A semi-automatic annotation toolkit for audio-visual speech technologies. Proceedings of LREC-COLING (pp. 1260\u20131269)."},{"key":"9809_CR2","doi-asserted-by":"publisher","unstructured":"Afouras, T. , Chung, J.-S., & Zisserman, A. (2018). LRS3-TED: A large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496, https:\/\/doi.org\/10.48550\/arXiv.1809.00496","DOI":"10.48550\/arXiv.1809.00496"},{"key":"9809_CR3","doi-asserted-by":"crossref","unstructured":"Anwar, M. , Shi, B. , Goswami, V. , Hsu, W. , Pino, J., & Wang, C. (2023). MuAViC: A multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation. Interspeech (pp. 4064\u20134068).","DOI":"10.21437\/Interspeech.2023-2279"},{"key":"9809_CR4","unstructured":"Ardila, R. , Branson, M. , Davis, K. , Kohler, M. , Meyer, J. , Henretty, M., & Weber, G. (2020). Common voice: A massively-multilingual speech corpus. Proceedings LREC (pp. 4218\u20134222). https:\/\/aclanthology.org\/2020.lrec-1.520"},{"key":"9809_CR5","doi-asserted-by":"publisher","DOI":"10.5555\/3495724.3496768","author":"A Baevski","year":"2020","unstructured":"Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems. https:\/\/doi.org\/10.5555\/3495724.3496768","journal-title":"Advances in Neural Information Processing Systems"},{"key":"9809_CR6","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472029","author":"H Bear","year":"2016","unstructured":"Bear, H., & Harvey, R. (2016). Decoding visemes: Improving machine lip-reading. ICASSP. https:\/\/doi.org\/10.1109\/ICASSP.2016.7472029","journal-title":"ICASSP"},{"key":"9809_CR7","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2014.7025274","author":"H Bear","year":"2014","unstructured":"Bear, H., Harvey, R., Theobald, B., & Lan, Y. (2014). Resolution limits on visual speech recognition. ICIP. https:\/\/doi.org\/10.1109\/ICIP.2014.7025274","journal-title":"ICIP"},{"key":"9809_CR8","doi-asserted-by":"publisher","unstructured":"Bear, H. , Harvey, R. , Theobald, B., & Lan, Y. (2014b). Which phoneme-to-viseme maps best improve visual-only computer lip-reading? International symposium on visual computing (pp. 230\u2013239). 
https:\/\/doi.org\/10.1007\/978-3-319-14364-4_22","DOI":"10.1007\/978-3-319-14364-4_22"},{"key":"9809_CR9","doi-asserted-by":"publisher","first-page":"2225","DOI":"10.1111\/j.1460-9568.2004.03670.x","volume":"208","author":"J Besle","year":"2004","unstructured":"Besle, J., Fort, A., Delpuech, C., & Giard, M.-H. (2004). Bimodal speech: Early suppressive visual effects in human auditory cortex. European Journal of Neuroscience, 208, 2225\u20132234.","journal-title":"European Journal of Neuroscience"},{"key":"9809_CR10","doi-asserted-by":"publisher","unstructured":"Bisani, M., & Ney, H. (2004). Bootstrap estimates for confidence intervals in ASR performance evaluation. ICASSP (Vol.\u00a01, pp. 409\u2013412). https:\/\/doi.org\/10.1109\/ICASSP.2004.1326009","DOI":"10.1109\/ICASSP.2004.1326009"},{"key":"9809_CR11","doi-asserted-by":"crossref","unstructured":"Bowden, R. , Cox, S. , Harvey, R. , Lan, Y. , Ong, E.-J. , Owen, G., & Theobald, B.-J. (2013). Recent developments in automated lip-reading. Optics and photonics for counterterrorism, crime fighting and defence IX; and Optical materials and biomaterials in security and defence systems technology X, 89(01), 179\u2013191,","DOI":"10.1117\/12.2029464"},{"key":"9809_CR12","doi-asserted-by":"publisher","unstructured":"Bulat, A., & Tzimiropoulos, G. (2017). How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). ICCV (p.\u00a01021-1030). https:\/\/doi.org\/10.1109\/ICCV.2017.116","DOI":"10.1109\/ICCV.2017.116"},{"issue":"1493","key":"9809_CR13","doi-asserted-by":"publisher","first-page":"1001","DOI":"10.1098\/rstb.2007.2155","volume":"363","author":"R Campbell","year":"2008","unstructured":"Campbell, R. (2008). The processing of audio-visual speech: Empirical and neural bases. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1493), 1001\u20131010.","journal-title":"Philosophical Transactions of the Royal Society B: Biological Sciences"},{"key":"9809_CR14","doi-asserted-by":"crossref","unstructured":"Chang, O. , Liao, H. , Serdyuk, D. , Shah, A., & Siohan, O. (2024). Conformer is all you need for visual speech recognition. ICASSP (p.\u00a010136-10140).","DOI":"10.1109\/ICASSP48485.2024.10446532"},{"key":"9809_CR15","doi-asserted-by":"crossref","unstructured":"Chung, J., & Zisserman, A. (2017). Lip reading in the wild. 13th Asian conference on computer vision (pp. 87\u2013103).","DOI":"10.1007\/978-3-319-54184-6_6"},{"key":"9809_CR16","unstructured":"Cox, S.J. , Harvey, R.W. , Lan, Y. , Newman, J.L., & Theobald, B.-J. (2008). The challenge of multispeaker lip-reading. AVSP (pp. 179\u2013184). https:\/\/www.isca-speech.org\/archive_open\/avsp08\/av08_179.html"},{"key":"9809_CR17","doi-asserted-by":"crossref","unstructured":"Dai, Z. , Yang, Z. , Yang, Y. , Carbonell, J. , Le, Q., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. Proceedings of the 57th ACL (pp. 2978\u20132988). ACL.","DOI":"10.18653\/v1\/P19-1285"},{"key":"9809_CR18","doi-asserted-by":"publisher","unstructured":"Deng, J. , Guo, J. , Ververas, E. , Kotsia, I., & Zafeiriou, S. (2020). Retinaface: Single-shot multi-level face localisation in the wild. CVPR (p.\u00a05202-5211). https:\/\/doi.org\/10.1109\/CVPR42600.2020.00525","DOI":"10.1109\/CVPR42600.2020.00525"},{"key":"9809_CR19","doi-asserted-by":"publisher","unstructured":"Dungan, L. , Karaali, A., & Harte, N. (2018). The impact of reduced video quality on visual speech recognition. 
ICIP (pp.\u00a02560-2564). https:\/\/doi.org\/10.1109\/ICIP.2018.8451754","DOI":"10.1109\/ICIP.2018.8451754"},{"key":"9809_CR20","unstructured":"Egorov, E. , Kostyumov, V. , Konyk, M., & Kolesnikov, S. (2021). LRWR: Large-scale benchmark for lip reading in Russian language. arXiv preprint arXiv:2109.06692,"},{"key":"9809_CR21","doi-asserted-by":"publisher","first-page":"55354","DOI":"10.1109\/ACCESS.2020.2982359","volume":"8","author":"M Ezz","year":"2020","unstructured":"Ezz, M., Mostafa, A. M., & Nasr, A. A. (2020). A silent password recognition framework based on lip analysis. IEEE Access, 8, 55354\u201355371.","journal-title":"IEEE Access"},{"key":"9809_CR22","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-16-5172-4","author":"Z Feng","year":"2023","unstructured":"Feng, Z. (2023). Formal analysis for natural language processing: A handbook. Springer Nature. https:\/\/doi.org\/10.1007\/978-981-16-5172-4","journal-title":"Springer Nature"},{"key":"9809_CR23","doi-asserted-by":"crossref","unstructured":"Fernandez-Lopez, A. , Chen, H. , Ma, P. , Haliassos, A. , Petridis, S., & Pantic, M. (2023). Sparsevsr: Lightweight and noise robust visual speech recognition. arXiv preprint arXiv:2307.04552,","DOI":"10.21437\/Interspeech.2023-462"},{"key":"9809_CR24","doi-asserted-by":"publisher","unstructured":"Fernandez-Lopez, A. , Martinez, O., & Sukno, F.M. (2017). Towards estimating the upper bound of visual-speech recognition: The visual lip-reading feasibility database. 12th fg (pp. 208\u2013215). https:\/\/doi.org\/10.1109\/FG.2017.34","DOI":"10.1109\/FG.2017.34"},{"key":"9809_CR25","doi-asserted-by":"publisher","unstructured":"Fern\u00e1ndez-L\u00f3pez, A., & Sukno, F. (2017). Optimizing phoneme-to-viseme mapping for continuous lip-reading in spanish. International joint conference on computer vision, imaging and computer graphics (pp. 305\u2013328). https:\/\/doi.org\/10.1007\/978-3-030-12209-6_15","DOI":"10.1007\/978-3-030-12209-6_15"},{"key":"9809_CR26","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1016\/j.imavis.2018.07.002","volume":"78","author":"A Fernandez-Lopez","year":"2018","unstructured":"Fernandez-Lopez, A., & Sukno, F. M. (2018). Survey on automatic lip-reading in the era of deep learning. Image and Vision Computing, 78, 53\u201372. https:\/\/doi.org\/10.1016\/j.imavis.2018.07.002","journal-title":"Image and Vision Computing"},{"key":"9809_CR27","doi-asserted-by":"publisher","first-page":"2076","DOI":"10.1109\/TASLP.2022.3182274","volume":"30","author":"A Fernandez-Lopez","year":"2022","unstructured":"Fernandez-Lopez, A., & Sukno, F. M. (2022). End-to-end lip-reading without large-scale data. IEEE\/ACM TASLP, 30, 2076\u20132090. https:\/\/doi.org\/10.1109\/TASLP.2022.3182274","journal-title":"IEEE\/ACM TASLP"},{"key":"9809_CR28","doi-asserted-by":"publisher","unstructured":"Gales, M., & Young, S. (2008). The application of hidden Markov models in speech recognition. Now Publishers Inc. https:\/\/doi.org\/10.1561\/2000000004","DOI":"10.1561\/2000000004"},{"key":"9809_CR29","unstructured":"Gimeno-G\u00f3mez, D., & Mart\u00ednez-Hinarejos, C.-D. (2022). LIP-RTVE: An audiovisual database for continuous Spanish in the wild. Proceedings LREC (pp. 2750\u20132758). ELRA. https:\/\/aclanthology.org\/2022.lrec-1.294"},{"issue":"1","key":"9809_CR30","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1186\/s13636-024-00345-7","volume":"2024","author":"D Gimeno-G\u00f3mez","year":"2024","unstructured":"Gimeno-G\u00f3mez, D., & Mart\u00ednez-Hinarejos, C.-D. (2024). 
Continuous lipreading based on acoustic temporal alignments. EURASIP Journal on Audio, Speech, and Music Processing, 2024(1), 25.","journal-title":"EURASIP Journal on Audio, Speech, and Music Processing"},{"key":"9809_CR31","doi-asserted-by":"publisher","unstructured":"Graves, A. , Fern\u00e1ndez, S. , Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. 23rd ICML (pp.\u00a0369-376). ACM. https:\/\/doi.org\/10.1145\/1143844.1143891","DOI":"10.1145\/1143844.1143891"},{"key":"9809_CR32","doi-asserted-by":"publisher","unstructured":"Gulati, A. , Qin, J. , Chiu, C.C. , Parmar, N. , Zhang, Y. , Yu, J. , & Pang, R. (2020). Conformer: Convolution-augmented transformer for speech recognition. Proceedings interspeech (pp. 5036\u20135040). https:\/\/doi.org\/10.21437\/Interspeech.2020-3015","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"9809_CR33","doi-asserted-by":"crossref","unstructured":"Haliassos, A. , Zinonos, A. , Mira, R. , Petridis, S., & Pantic, M. (2024). BRAVEn: Improving self-supervised pre-training for visual and auditory speech recognition. ICASSP (pp.\u00a011431-11435).","DOI":"10.1109\/ICASSP48485.2024.10448473"},{"issue":"5","key":"9809_CR34","doi-asserted-by":"publisher","first-page":"603","DOI":"10.1109\/TMM.2015.2407694","volume":"17","author":"N Harte","year":"2015","unstructured":"Harte, N., & Gillen, E. (2015). TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 17(5), 603\u2013615. https:\/\/doi.org\/10.1109\/TMM.2015.2407694","journal-title":"IEEE Transactions on Multimedia"},{"key":"9809_CR35","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90","author":"K He","year":"2016","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR. https:\/\/doi.org\/10.1109\/CVPR.2016.90","journal-title":"CVPR"},{"key":"9809_CR36","doi-asserted-by":"crossref","unstructured":"Higuchi, Y. , Inaguma, H. , Watanabe, S. , Ogawa, T., & Kobayashi, T. (2021). Improved mask-CTC for non-autoregressive end-to-end ASR. ICASSP (pp.\u00a08363-8367).","DOI":"10.1109\/ICASSP39728.2021.9414198"},{"key":"9809_CR37","doi-asserted-by":"publisher","unstructured":"Ivanko, D. , Ryumin, D., & Karpov, A. (2019). Automatic lip-reading of hearing impaired people. The international archives of the photogrammetry, remote sensing and spatial information sciences, XLII-2\/W12, 97\u2013101, https:\/\/doi.org\/10.5194\/isprs-archives-XLII-2-W12-97-2019","DOI":"10.5194\/isprs-archives-XLII-2-W12-97-2019"},{"key":"9809_CR38","doi-asserted-by":"publisher","first-page":"217","DOI":"10.1007\/s00138-019-01006-y","volume":"30","author":"A Jha","year":"2019","unstructured":"Jha, A., Namboodiri, V. P., & Jawahar, C. V. (2019). Spotting words in silent speech videos: A retrieval-based approach. Machine Vision and Applications, 30, 217\u2013229.","journal-title":"Machine Vision and Applications"},{"key":"9809_CR39","doi-asserted-by":"crossref","unstructured":"Kim, M. , Yeo, J.H. , Choi, J., & Ro, Y.M. (2023). Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge. ICCV (pp. 15359\u201315371).","DOI":"10.1109\/ICCV51070.2023.01409"},{"key":"9809_CR40","doi-asserted-by":"publisher","first-page":"108","DOI":"10.1016\/j.cviu.2015.09.013","volume":"141","author":"O Koller","year":"2015","unstructured":"Koller, O., Forster, J., & Ney, H. (2015). 
Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding, 141, 108\u2013125. https:\/\/doi.org\/10.1016\/j.cviu.2015.09.013","journal-title":"Computer Vision and Image Understanding"},{"key":"9809_CR41","doi-asserted-by":"publisher","first-page":"1928","DOI":"10.1038\/s41598-022-26155-5","volume":"13","author":"H Laux","year":"2023","unstructured":"Laux, H., Hallawa, A., Assis, J. C. S., Schmeink, A., Martin, L., & Peine, A. (2023). Two-stage visual speech recognition for intensive care patients. Scientific Reports, 13, 1928.","journal-title":"Scientific Reports"},{"key":"9809_CR42","doi-asserted-by":"crossref","unstructured":"Lee, J., & Watanabe, S. (2021). Intermediate loss regularization for CTC-based speech recognition. ICASSP (p.\u00a06224-6228).","DOI":"10.1109\/ICASSP39728.2021.9414594"},{"key":"9809_CR43","doi-asserted-by":"crossref","unstructured":"Liao, J. , Duan, H. , Feng, K. , Zhao, W. , Yang, Y., & Chen, L. (2023). A light weight model for active speaker detection. Proceedings of the IEEE\/CVF CVPR (pp.\u00a022932-22941).","DOI":"10.1109\/CVPR52729.2023.02196"},{"key":"9809_CR44","doi-asserted-by":"crossref","unstructured":"Liu, X. , Lakomkin, E. , Vougioukas, K. , Ma, P. , Chen, H. , Xie, R., & Fuegen, C. (2023). SynthVSR: Scaling up visual speech recognition with synthetic supervision. CVPR (pp. 18806\u201318815).","DOI":"10.1109\/CVPR52729.2023.01803"},{"key":"9809_CR45","unstructured":"Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. ICLR. https:\/\/openreview.net\/pdf?id=Bkg6RiCqY7"},{"key":"9809_CR46","doi-asserted-by":"crossref","unstructured":"Ma, P. , Haliassos, A. , Fernandez-Lopez, A. , Chen, H. , Petridis, S., & Pantic, M. (2023). Auto-avsr: Audio-visual speech recognition with automatic labels. ICASSP (pp.\u00a01-5).","DOI":"10.1109\/ICASSP49357.2023.10096889"},{"key":"9809_CR47","doi-asserted-by":"publisher","unstructured":"Ma, P. , Petridis, S., & Pantic, M. (2021). End-to-end audio-visual speech recognition with conformers. ICASSP (pp.\u00a07613-7617). https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9414567","DOI":"10.1109\/ICASSP39728.2021.9414567"},{"issue":"11","key":"9809_CR48","doi-asserted-by":"publisher","first-page":"930","DOI":"10.1038\/s42256-022-00550-z","volume":"4","author":"P Ma","year":"2022","unstructured":"Ma, P., Petridis, S., & Pantic, M. (2022). Visual speech recognition for multiple languages in the wild. Nature Machine Intelligence, 4(11), 930\u2013939. https:\/\/doi.org\/10.1038\/s42256-022-00550-z","journal-title":"Nature Machine Intelligence"},{"key":"9809_CR49","doi-asserted-by":"publisher","unstructured":"Manaris, B. , Pellicoro, L. , Pothering, G., & Hodges, H. (2006). Investigating Esperanto\u2019s statistical proportions relative to other languages using neural networks and Zipf\u2019s law. Proceedings of the 24th IASTED international conference on artificial intelligence and applications (pp.\u00a0102\u2013108). USAACTA Press. https:\/\/doi.org\/10.5555\/1166890.1166908","DOI":"10.5555\/1166890.1166908"},{"key":"9809_CR50","doi-asserted-by":"publisher","first-page":"746","DOI":"10.1038\/264746a0","volume":"264","author":"H McGurk","year":"1976","unstructured":"McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746\u2013748. 
https:\/\/doi.org\/10.1038\/264746a0","journal-title":"Nature"},{"key":"9809_CR51","doi-asserted-by":"publisher","first-page":"277","DOI":"10.1186\/s13054-023-04420-x","volume":"27","author":"M Musalia","year":"2023","unstructured":"Musalia, M., Laha, S., Cazalilla-Chica, J., Allan, J., Roach, L., Twamley, J., & McAuley, D. F. (2023). A user evaluation of speech\/phrase recognition software in critically ill patients: A decide-AI feasibility study. Critical Care, 27, 277.","journal-title":"Critical Care"},{"key":"9809_CR52","doi-asserted-by":"publisher","unstructured":"Ott, M. , Edunov, S. , Grangier, D., & Auli, M. (2018). Scaling neural machine translation. Proceedings of the 3rd conference on machine translation (pp. 1\u20139). ACL. https:\/\/doi.org\/10.18653\/v1\/W18-6301","DOI":"10.18653\/v1\/W18-6301"},{"key":"9809_CR53","doi-asserted-by":"crossref","unstructured":"Park, S.J. , Kim, C.W. , Rha, H. , Kim, M. , Hong, J. , Yeo, J. , & Ro, Y.M. (2024). Let\u2019s go real talk: Spoken dialogue model for face-to-face conversation. Proceedings of the 62nd ACL (pp. 16334\u201316348).","DOI":"10.18653\/v1\/2024.acl-long.860"},{"key":"9809_CR54","doi-asserted-by":"publisher","first-page":"1112","DOI":"10.3758\/s13423-014-0585-6","volume":"21","author":"ST Piantadosi","year":"2014","unstructured":"Piantadosi, S. T. (2014). Zipf\u2019s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21, 1112\u20131130. https:\/\/doi.org\/10.3758\/s13423-014-0585-6","journal-title":"Psychonomic Bulletin & Review"},{"issue":"9","key":"9809_CR55","doi-asserted-by":"publisher","first-page":"1306","DOI":"10.1109\/JPROC.2003.817150","volume":"91","author":"G Potamianos","year":"2003","unstructured":"Potamianos, G., Neti, C., Gravier, G., Garg, A., & Senior, A. (2003). Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9), 1306\u20131326. https:\/\/doi.org\/10.1109\/JPROC.2003.817150","journal-title":"Proceedings of the IEEE"},{"key":"9809_CR56","doi-asserted-by":"crossref","unstructured":"Prajwal, K.R. , Afouras, T., Zisserman, A. (2022). Sub-word level lip reading with visual attention. CVPR (pp.\u00a05162\u20135172). https:\/\/openaccess.thecvf.com\/content\/CVPR2022\/html\/Prajwal_Sub-Word_Level_Lip_Reading_With_Visual_Attention_CVPR_2022_paper.html","DOI":"10.1109\/CVPR52688.2022.00510"},{"key":"9809_CR57","doi-asserted-by":"crossref","unstructured":"Prajwal, K. , Mukhopadhyay, R. , Namboodiri, V., & Jawahar, C.V. (2020). A lip sync expert is all you need for speech to lip generation in the wild. Proceedings of the 28th ACM international conference on multimedia (pp. 484\u2013492).","DOI":"10.1145\/3394171.3413532"},{"key":"9809_CR58","doi-asserted-by":"publisher","unstructured":"Pratap, V. , Xu, Q. , Sriram, A. , Synnaeve, G., & Collobert, R. (2020). MLS: A large-scale multilingual dataset for speech research. Proceedings interspeech (pp. 2757\u20132761). https:\/\/doi.org\/10.21437\/Interspeech.2020-2826","DOI":"10.21437\/Interspeech.2020-2826"},{"key":"9809_CR59","unstructured":"Ramachandran, P. , Zoph, B., & Le, Q.V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941, https:\/\/arxiv.org\/abs\/1710.05941"},{"key":"9809_CR60","doi-asserted-by":"publisher","unstructured":"Salesky, E. , Wiesner, M. , Bremerman, J. , Cattoni, R. , Negri, M. , Turchi, M., & Post, M. (2021). The multilingual TEDx corpus for speech recognition and translation. Proceedings interspeech (pp. 
3655\u20133659). https:\/\/doi.org\/10.21437\/Interspeech.2021-11","DOI":"10.21437\/Interspeech.2021-11"},{"key":"9809_CR61","doi-asserted-by":"publisher","unstructured":"Shi, B. , Hsu, W.N. , Lakhotia, K., & Mohamed, A. (2022). Learning audio-visual speech representation by masked multimodal cluster prediction. arXiv preprint arXiv:2201.02184, https:\/\/doi.org\/10.48550\/arXiv.2201.02184","DOI":"10.48550\/arXiv.2201.02184"},{"key":"9809_CR62","doi-asserted-by":"publisher","unstructured":"Smith, L.N., & Topin, N. (2019). Super-convergence: Very fast training of neural networks using large learning rates. AI and ML for multi-domain operations applications (Vol. 11006, pp. 369\u2013386). https:\/\/doi.org\/10.1117\/12.2520589","DOI":"10.1117\/12.2520589"},{"key":"9809_CR63","doi-asserted-by":"crossref","unstructured":"Son\u00a0Chung, J. , Senior, A. , Vinyals, O., & Zisserman, A. (2017). Lip reading sentences in the wild. CVPR (pp. 6447\u20136456). https:\/\/openaccess.thecvf.com\/content_cvpr_2017\/html\/Chung_Lip_Reading_Sentences_CVPR_2017_paper.html","DOI":"10.1109\/CVPR.2017.367"},{"key":"9809_CR64","doi-asserted-by":"crossref","unstructured":"Stafylakis, T., & Tzimiropoulos, G. (2018). Zero-shot keyword spotting for visual speech recognition in-the-wild. Proceedings of ECCV (pp. 513\u2013529).","DOI":"10.1007\/978-3-030-01225-0_32"},{"key":"9809_CR65","doi-asserted-by":"crossref","unstructured":"Tao, R. , Pan, Z. , Das, R. , Qian, X. , Shou, M., & Li, h. (2021). Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection. Proceedings of the 29th ACM international conference on multimedia (pp.\u00a03927\u20133935). Association for computing machinery.","DOI":"10.1145\/3474085.3475587"},{"key":"9809_CR66","unstructured":"Thangthai, K. (2018). Computer lipreading via hybrid deep neural network hidden Markov models University of East Anglia. https:\/\/ueaeprints.uea.ac.uk\/id\/eprint\/69215"},{"key":"9809_CR67","doi-asserted-by":"crossref","unstructured":"Theobald, B.J. , Harvey, R. , Cox, S.J. , Lewis, C., & Owen, G.P. (2006). Lip-reading enhancement for law enforcement. Optics and photonics for counterterrorism and crime fighting ii (Vol. 6402, pp. 24\u201332).","DOI":"10.1117\/12.689960"},{"key":"9809_CR68","unstructured":"Vaswani, A. , Shazeer, N. , Parmar, N. , Uszkoreit, J. , Jones, L. , Gomez, A.N., & Polosukhin, I. (2017). Attention is all you need. Neur IPS 30: 6000\u20136010, https:\/\/dl.acm.org\/doi\/10.5555\/3295222.3295349"},{"key":"9809_CR69","doi-asserted-by":"publisher","unstructured":"Watanabe, S. , Hori, T. , Karita, S. , Hayashi, T. , Nishitoba, J. , Unno, Y., & Ochiai, T. (2018). ESPnet: End-to-end speech processing toolkit. Proceedings interspeech (pp. 2207\u20132211). https:\/\/doi.org\/10.21437\/Interspeech.2018-1456","DOI":"10.21437\/Interspeech.2018-1456"},{"issue":"8","key":"9809_CR70","doi-asserted-by":"publisher","first-page":"1240","DOI":"10.1109\/JSTSP.2017.2763455","volume":"11","author":"S Watanabe","year":"2017","unstructured":"Watanabe, S., Hori, T., Kim, S., Hershey, J. R., & Hayashi, T. (2017). Hybrid CTC\/attention architecture for end-to-end speech recognition. IEEE JSTSP, 11(8), 1240\u20131253. https:\/\/doi.org\/10.1109\/JSTSP.2017.2763455","journal-title":"IEEE JSTSP"},{"key":"9809_CR71","unstructured":"Wei, G. , Duan, Z. , Li, S. , Yang, G. , Yu, X., & Li, J. (2023). Sim-T: Simplify the transformer network by multiplexing technique for speech recognition. 
arXiv preprint arXiv:2304.04991"},{"key":"9809_CR72","doi-asserted-by":"crossref","unstructured":"Yang, S. , Zhang, Y. , Feng, D. , Yang, M. , Wang, C. , Xiao, J. , & Chen, X. (2019). LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. 14th IEEE international conference on automatic face & gesture recognition (pp.\u00a01\u20138).","DOI":"10.1109\/FG.2019.8756582"},{"key":"9809_CR73","doi-asserted-by":"crossref","unstructured":"Yeo, J.H. , Kim, M. , Watanabe, S., & Ro, Y.M. (2024). Visual speech recognition for languages with limited labeled data using automatic labels from whisper. ICASSP (pp.\u00a010471\u201310475).","DOI":"10.1109\/ICASSP48485.2024.10446720"},{"key":"9809_CR74","doi-asserted-by":"publisher","unstructured":"Zadeh, A.B. , Cao, Y. , Hessner, S. , Liang, P.P. , Poria, S., & Morency, L.-P. (2020). CMU-MOSEAS: A multimodal language dataset for spanish, portuguese, german and french. EMNLP (pp. 1801\u20131812). https:\/\/doi.org\/10.18653\/v1\/2020.emnlp-main.141","DOI":"10.18653\/v1\/2020.emnlp-main.141"},{"key":"9809_CR75","doi-asserted-by":"crossref","unstructured":"Zhang, Y. , Yang, S. , Xiao, J. , Shan, S., & Chen, X. (2020). Can we read speech beyond the lips? Rethinking ROI selection for deep visual speech recognition. 15th IEEE fg (pp.\u00a0356\u2013363).","DOI":"10.1109\/FG47880.2020.00134"},{"key":"9809_CR76","volume-title":"The psychobiology of language","author":"GK Zipf","year":"1936","unstructured":"Zipf, G. K. (1936). The psychobiology of language. Houghton."},{"key":"9809_CR77","volume-title":"Human behavior and the principle of least effort","author":"GK Zipf","year":"1949","unstructured":"Zipf, G. K. (1949). Human behavior and the principle of least effort. Addison-Wesley Press."}],"container-title":["Language Resources and Evaluation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-025-09809-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10579-025-09809-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-025-09809-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T05:47:01Z","timestamp":1757137621000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10579-025-09809-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,15]]},"references-count":77,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,9]]}},"alternative-id":["9809"],"URL":"https:\/\/doi.org\/10.1007\/s10579-025-09809-4","relation":{},"ISSN":["1574-020X","1574-0218"],"issn-type":[{"type":"print","value":"1574-020X"},{"type":"electronic","value":"1574-0218"}],"subject":[],"published":{"date-parts":[[2025,2,15]]},"assertion":[{"value":"13 January 2025","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 February 2025","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no Conflict of interest. 
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}}]}}
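The record above is a standard Crossref REST API "work" payload and can be re-fetched or post-processed programmatically. Below is a minimal Python sketch (not part of the original deposit) that retrieves the same record from the public endpoint https://api.crossref.org/works/{DOI}; the DOI is taken from the record itself, while the `mailto` contact address is a placeholder that Crossref uses to route requests to its polite pool and should be replaced with your own.

```python
import requests

# DOI taken from the record above.
DOI = "10.1007/s10579-025-09809-4"

# Fetch the Crossref work record; the response mirrors the JSON above:
# {"status": "ok", "message-type": "work", "message": {...}}.
resp = requests.get(
    f"https://api.crossref.org/works/{DOI}",
    params={"mailto": "you@example.org"},  # placeholder contact for the polite pool
    timeout=30,
)
resp.raise_for_status()
work = resp.json()["message"]

print(work["title"][0])            # article title
print(work["container-title"][0])  # journal name
print(len(work.get("reference", [])), "references deposited")
```

Any JSON-capable HTTP client would work equally well; `requests` is used here only for brevity.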