{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T05:35:55Z","timestamp":1775626555056,"version":"3.50.1"},"reference-count":83,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2024,5,18]],"date-time":"2024-05-18T00:00:00Z","timestamp":1715990400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,5,18]],"date-time":"2024-05-18T00:00:00Z","timestamp":1715990400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Joint Fund for Regional Innovation and Development of NSFC","award":["U19A2083"],"award-info":[{"award-number":["U19A2083"]}]},{"name":"Science and Technology Research and Major Achievements Transformation Project of Strategic Emerging Industries in Hunan Province","award":["2019 GK4007"],"award-info":[{"award-number":["2019 GK4007"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Conformer-based models have proven highly effective in Audio-visual Speech Recognition, integrating auditory and visual inputs to significantly enhance speech recognition accuracy. However, the widely utilized softmax attention mechanism within conformer models encounters scalability issues, with its spatial and temporal complexity escalating quadratically with sequence length. To address these challenges, this paper introduces the Shifted Linear Attention Conformer, an evolved iteration of the conformer architecture. Shifted Linear Attention Conformer adopts shifted linear attention as a scalable alternative to softmax attention. We conducted a thorough analysis of the factors constraining the efficiency of linear attention. To mitigate these issues, we propose the utilization of a straightforward yet potent mapping function and an efficient rank restoration module, enhancing the effectiveness of self-attention while maintaining low computational complexity. Furthermore, we integrate an advanced attention-shifting technique facilitating token manipulation within attentional mechanisms, thereby enhancing information flow across various groups. This three-part approach enhances cognitive computations, particularly beneficial for processing longer sequences. Our model achieves exceptional Word Error Rates of 1.9% and 1.5% on the Lip Reading Sentences 2 and Lip Reading Sentences 3 datasets, respectively, showcasing its state-of-the-art performance in audio-visual speech recognition tasks.<\/jats:p>","DOI":"10.1007\/s40747-024-01451-x","type":"journal-article","created":{"date-parts":[[2024,5,18]],"date-time":"2024-05-18T07:01:30Z","timestamp":1716015690000},"page":"5721-5741","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Sla-former: conformer using shifted linear attention for audio-visual speech recognition"],"prefix":"10.1007","volume":"10","author":[{"given":"Yewei","family":"Xiao","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0009-0006-1898-3756","authenticated-orcid":false,"given":"Jian","family":"Huang","sequence":"additional","affiliation":[]},{"given":"Xuanming","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Aosu","family":"Zhu","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,5,18]]},"reference":[{"key":"1451_CR1","doi-asserted-by":"publisher","first-page":"401","DOI":"10.1016\/j.inffus.2023.02.014","volume":"95","author":"DK Jain","year":"2023","unstructured":"Jain DK, Zhao X, Gonz\u00e1lez-Almagro G, Gan C, Kotecha K (2023) Multimodal pedestrian detection using metaheuristics with deep convolutional neural network in crowded scenes. Inform Fusion 95:401\u2013414","journal-title":"Inform Fusion"},{"issue":"2","key":"1451_CR2","doi-asserted-by":"publisher","first-page":"181","DOI":"10.3390\/jpm13020181","volume":"13","author":"SZ Kurdi","year":"2023","unstructured":"Kurdi SZ, Ali MH, Jaber MM, Saba T, Rehman A, Dama\u0161evi\u010dius R (2023) Brain tumor classification using meta-heuristic optimized convolutional neural networks. J Personal Med 13(2):181","journal-title":"J Personal Med"},{"issue":"22","key":"1451_CR3","doi-asserted-by":"publisher","first-page":"3798","DOI":"10.3390\/electronics11223798","volume":"11","author":"M Zivkovic","year":"2022","unstructured":"Zivkovic M, Bacanin N, Antonijevic M, Nikolic B, Kvascev G, Marjanovic M, Savanovic N (2022) Hybrid CNN and XGBoost model tuned by modified arithmetic optimization algorithm for COVID-19 early diagnostics from X-ray images. Electronics 11(22):3798","journal-title":"Electronics"},{"issue":"21","key":"1451_CR4","doi-asserted-by":"publisher","first-page":"14616","DOI":"10.3390\/su142114616","volume":"14","author":"L Jovanovic","year":"2022","unstructured":"Jovanovic L, Jovanovic D, Bacanin N, Jovancai Stakic A, Antonijevic M, Magd H, Zivkovic M (2022) Multi-step crude oil price prediction based on lstm approach tuned by salp swarm algorithm with disputation operator. Sustainability 14(21):14616","journal-title":"Sustainability"},{"issue":"1","key":"1451_CR5","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1016\/j.petlm.2021.12.008","volume":"9","author":"CSW Ng","year":"2023","unstructured":"Ng CSW, Ghahfarokhi AJ, Amar MN (2023) Production optimization under waterflooding with Long Short-Term Memory and metaheuristic algorithm. Petroleum 9(1):53\u201360","journal-title":"Petroleum"},{"key":"1451_CR6","doi-asserted-by":"crossref","unstructured":"Zhou P, Shi W, Tian J, Qi Z, Li B, Hao H, Xu B (2016) Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 2: Short papers) (pp. 207-212)","DOI":"10.18653\/v1\/P16-2034"},{"issue":"2","key":"1451_CR7","doi-asserted-by":"publisher","first-page":"1539","DOI":"10.1109\/TIE.2017.2733438","volume":"65","author":"R Zhao","year":"2017","unstructured":"Zhao R, Wang D, Yan R, Mao K, Shen F, Wang J (2017) Machine health monitoring using local feature-based gated recurrent unit networks. IEEE Trans Indusl Electron 65(2):1539\u20131548","journal-title":"IEEE Trans Indusl Electron"},{"key":"1451_CR8","doi-asserted-by":"crossref","unstructured":"Noda K, Yamaguchi Y, Nakadai K, Okuno HG, Ogata T (2014, September). Lipreading using convolutional neural network. In Interspeech (Vol. 1, p. 3)","DOI":"10.21437\/Interspeech.2014-293"},{"key":"1451_CR9","doi-asserted-by":"crossref","unstructured":"Stafylakis T, Tzimiropoulos G (2017) Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105","DOI":"10.21437\/Interspeech.2017-85"},{"key":"1451_CR10","doi-asserted-by":"crossref","unstructured":"Shillingford B, Assael Y, Hoffman MW, Paine T, Hughes C, Prabhu U, de Freitas N (2018) Large-scale visual speech recognition. arXiv preprint arXiv:1807.05162","DOI":"10.21437\/Interspeech.2019-1669"},{"issue":"12","key":"1451_CR11","doi-asserted-by":"publisher","first-page":"8717","DOI":"10.1109\/TPAMI.2018.2889052","volume":"44","author":"T Afouras","year":"2018","unstructured":"Afouras T, Chung JS, Senior A, Vinyals O, Zisserman A (2018) Deep audio-visual speech recognition. IEEE Trans Pattern Anal Mach Intell 44(12):8717\u20138727","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"1451_CR12","doi-asserted-by":"crossref","unstructured":"Son Chung J, Senior A, Vinyals O, Zisserman A (2017) Lip reading sentences in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6447-6456)","DOI":"10.1109\/CVPR.2017.367"},{"key":"1451_CR13","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778)","DOI":"10.1109\/CVPR.2016.90"},{"key":"1451_CR14","doi-asserted-by":"crossref","unstructured":"Koonce B, Koonce B (2021) ResNet 50. Convolutional neural networks with swift for tensorflow: image recognition and dataset categorization, 63-72","DOI":"10.1007\/978-1-4842-6168-2_6"},{"key":"1451_CR15","unstructured":"Hayou S, Clerico E, He B, Deligiannidis G, Doucet A, Rousseau J (2021, March) Stable resnet. In International Conference on Artificial Intelligence and Statistics (pp. 1324-1332). PMLR"},{"key":"1451_CR16","unstructured":"Targ S, Almeida D, Lyman K (2016) Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029"},{"key":"1451_CR17","doi-asserted-by":"crossref","unstructured":"Lea C, Flynn MD, Vidal R, Reiter A, Hager GD (2017) Temporal convolutional networks for action segmentation and detection. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 156-165)","DOI":"10.1109\/CVPR.2017.113"},{"key":"1451_CR18","doi-asserted-by":"crossref","unstructured":"Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132-7141)","DOI":"10.1109\/CVPR.2018.00745"},{"key":"1451_CR19","doi-asserted-by":"crossref","unstructured":"Ma P, Wang Y, Shen J, Petridis S, Pantic M (2021) Lip-reading with densely connected temporal convolutional networks. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (pp. 2857-2866)","DOI":"10.1109\/WACV48630.2021.00290"},{"key":"1451_CR20","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you need. Advances in neural information processing systems, 30"},{"key":"1451_CR21","unstructured":"Beltagy I, Peters ME, Cohan A (2020) Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150"},{"key":"1451_CR22","unstructured":"Kitaev N, Kaiser \u0141, Levskaya A (2020) Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451"},{"key":"1451_CR23","unstructured":"Wang C, Wu Y, Qian Y, Kumatani K, Liu S, Wei F, Huang X (2021, July) Unispeech: Unified speech representation learning with labeled and unlabeled data. In International Conference on Machine Learning (pp. 10937-10947). PMLR"},{"key":"1451_CR24","doi-asserted-by":"crossref","unstructured":"Chen W, Xing X, Xu X, Yang J, Pang J (2022, May) Key-sparse transformer for multimodal speech emotion recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6897-6901). IEEE","DOI":"10.1109\/ICASSP43922.2022.9746598"},{"key":"1451_CR25","doi-asserted-by":"crossref","unstructured":"Gulati A, Qin J, Chiu CC, Parmar N, Zhang Y, Yu J, Pang R (2020) Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"1451_CR26","unstructured":"Graves A, Jaitly N (2014, June) Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning (pp. 1764-1772). PMLR"},{"key":"1451_CR27","doi-asserted-by":"crossref","unstructured":"Pascual S, Bonafonte A, Serra J (2017) SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452","DOI":"10.21437\/Interspeech.2017-1428"},{"key":"1451_CR28","doi-asserted-by":"crossref","unstructured":"Subakan C, Ravanelli M, Cornell S, Bronzi M, Zhong J (2021, June) Attention is all you need in speech separation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 21-25). IEEE","DOI":"10.1109\/ICASSP39728.2021.9413901"},{"key":"1451_CR29","unstructured":"Choromanski K, Likhosherstov V, Dohan D, Song X, Gane A, Sarlos T, Weller A (2020) Rethinking attention with performers. arXiv preprint arXiv:2009.14794"},{"key":"1451_CR30","unstructured":"Qin Z, Sun W, Deng H, Li D, Wei Y, Lv B, Zhong Y (2022) cosformer: Rethinking softmax in attention. arXiv preprint arXiv:2202.08791"},{"key":"1451_CR31","doi-asserted-by":"crossref","unstructured":"Huang Z, Shi X, Zhang C, Wang Q, Cheung KC, Qin H, Li H (2022, October) Flowformer: A transformer architecture for optical flow. In European conference on computer vision (pp. 668-685). Cham: Springer Nature Switzerland","DOI":"10.1007\/978-3-031-19790-1_40"},{"key":"1451_CR32","unstructured":"Wang S, Li BZ, Khabsa M, Fang H, Ma H (2020) Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768"},{"key":"1451_CR33","doi-asserted-by":"crossref","unstructured":"Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, Zhang W (2021, May) Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence (Vol. 35, No. 12, pp. 11106-11115)","DOI":"10.1609\/aaai.v35i12.17325"},{"issue":"3","key":"1451_CR34","doi-asserted-by":"publisher","first-page":"141","DOI":"10.1109\/6046.865479","volume":"2","author":"S Dupont","year":"2000","unstructured":"Dupont S, Luettin J (2000) Audio-visual speech modeling for continuous speech recognition. IEEE Trans Multimedia 2(3):141\u2013151","journal-title":"IEEE Trans Multimedia"},{"issue":"3","key":"1451_CR35","doi-asserted-by":"publisher","first-page":"361","DOI":"10.1016\/S0959-440X(96)80056-X","volume":"6","author":"SR Eddy","year":"1996","unstructured":"Eddy SR (1996) Hidden markov models. Curr Opin Struct Biol 6(3):361\u2013365","journal-title":"Curr Opin Struct Biol"},{"key":"1451_CR36","doi-asserted-by":"crossref","unstructured":"Petridis S, Stafylakis T, Ma P, Tzimiropoulos G, Pantic M (2018, December) Audio-visual speech recognition with a hybrid ctc\/attention architecture. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 513-520). IEEE","DOI":"10.1109\/SLT.2018.8639643"},{"issue":"8","key":"1451_CR37","doi-asserted-by":"publisher","first-page":"1240","DOI":"10.1109\/JSTSP.2017.2763455","volume":"11","author":"S Watanabe","year":"2017","unstructured":"Watanabe S, Hori T, Kim S, Hershey JR, Hayashi T (2017) Hybrid CTC\/attention architecture for end-to-end speech recognition. IEEE J Sel Top Signal Process 11(8):1240\u20131253","journal-title":"IEEE J Sel Top Signal Process"},{"key":"1451_CR38","doi-asserted-by":"crossref","unstructured":"Makino T, Liao H, Assael Y, Shillingford B, Garcia B, Braga O, Siohan O (2019, December) Recurrent neural network transducer for audio-visual speech recognition. In 2019 IEEE automatic speech recognition and understanding workshop (ASRU) (pp. 905-912). IEEE","DOI":"10.1109\/ASRU46091.2019.9004036"},{"key":"1451_CR39","doi-asserted-by":"crossref","unstructured":"Graves A (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711","DOI":"10.1007\/978-3-642-24797-2"},{"key":"1451_CR40","doi-asserted-by":"crossref","unstructured":"Xu B, Lu C, Guo Y, Wang J (2020) Discriminative multi-modality speech recognition. In Proceedings of the IEEE\/CVF conference on Computer Vision and Pattern Recognition (pp. 14433-14442)","DOI":"10.1109\/CVPR42600.2020.01444"},{"key":"1451_CR41","doi-asserted-by":"crossref","unstructured":"Li W, Wang S, Lei M, Siniscalchi SM, Lee CH (2019, May) Improving audio-visual speech recognition performance with cross-modal student-teacher training. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6560-6564). IEEE","DOI":"10.1109\/ICASSP.2019.8682868"},{"key":"1451_CR42","doi-asserted-by":"crossref","unstructured":"Paraskevopoulos G, Parthasarathy S, Khare A, Sundaram S (2020) Multiresolution and multimodal speech recognition with transformers. arXiv preprint arXiv:2004.14840","DOI":"10.18653\/v1\/2020.acl-main.216"},{"key":"1451_CR43","doi-asserted-by":"crossref","unstructured":"Shukla A, Vougioukas K, Ma P, Petridis S, Pantic M (2020, May) Visually guided self supervised learning of speech representations. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6299-6303). IEEE","DOI":"10.1109\/ICASSP40776.2020.9053415"},{"key":"1451_CR44","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/TMM.2020.2975922","volume":"23","author":"F Tao","year":"2020","unstructured":"Tao F, Busso C (2020) End-to-end audiovisual speech recognition system with multitask learning. IEEE Trans Multimedia 23:1\u201311","journal-title":"IEEE Trans Multimedia"},{"key":"1451_CR45","doi-asserted-by":"crossref","unstructured":"Martinez B, Ma P, Petridis S, Pantic M (2020, May) Lipreading using temporal convolutional networks. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6319-6323). IEEE","DOI":"10.1109\/ICASSP40776.2020.9053841"},{"issue":"64\u201367","key":"1451_CR46","first-page":"2","volume":"5","author":"LR Medsker","year":"2001","unstructured":"Medsker LR, Jain L (2001) Recurrent neural networks. Design Appl 5(64\u201367):2","journal-title":"Design Appl"},{"key":"1451_CR47","doi-asserted-by":"crossref","unstructured":"Burchi M, Vielzeuf V (2021, December) Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 8-15). IEEE","DOI":"10.1109\/ASRU51503.2021.9687874"},{"key":"1451_CR48","doi-asserted-by":"crossref","unstructured":"Chen S, Wu Y, Chen Z, Wu J, Li J, Yoshioka T, Zhou M (2021, June) Continuous speech separation with conformer. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5749-5753). IEEE","DOI":"10.1109\/ICASSP39728.2021.9413423"},{"key":"1451_CR49","doi-asserted-by":"crossref","unstructured":"Deng J, Xie X, Wang T, Cui M, Xue B, Jin Z, Meng H (2022) Confidence score based conformer speaker adaptation for speech recognition. arXiv preprint arXiv:2206.12045","DOI":"10.21437\/Interspeech.2022-680"},{"key":"1451_CR50","doi-asserted-by":"crossref","unstructured":"Burchi M, Timofte R (2023) Audio-visual efficient conformer for robust speech recognition. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (pp. 2258-2267)","DOI":"10.1109\/WACV56688.2023.00229"},{"issue":"13\u201314","key":"1451_CR51","doi-asserted-by":"publisher","first-page":"6676","DOI":"10.1016\/j.apm.2016.02.014","volume":"40","author":"V Stojanovic","year":"2016","unstructured":"Stojanovic V, Nedic N, Prsic D, Dubonjic L (2016) Optimal experiment design for identification of ARX models with constrained output in non-Gaussian noise. Appl Math Modell 40(13\u201314):6676\u20136689","journal-title":"Appl Math Modell"},{"issue":"3","key":"1451_CR52","doi-asserted-by":"publisher","first-page":"445","DOI":"10.1002\/rnc.3319","volume":"26","author":"V Stojanovic","year":"2016","unstructured":"Stojanovic V, Nedic N (2016) Robust Kalman filtering for nonlinear multivariable stochastic systems in the presence of non-Gaussian noise. Int J Robust Nonlinear Control 26(3):445\u2013460","journal-title":"Int J Robust Nonlinear Control"},{"issue":"14","key":"1451_CR53","doi-asserted-by":"publisher","first-page":"3058","DOI":"10.1002\/rnc.3490","volume":"26","author":"V Stojanovic","year":"2016","unstructured":"Stojanovic V, Nedic N (2016) Joint state and parameter robust estimation of stochastic nonlinear systems. Int J Robust Nonlinear Control 26(14):3058\u20133074","journal-title":"Int J Robust Nonlinear Control"},{"key":"1451_CR54","doi-asserted-by":"crossref","unstructured":"Zhang Y, Lv Z, Wu H, Zhang S, Hu P, Wu Z, Meng H (2022) Mfa-conformer: Multi-scale feature aggregation conformer for automatic speaker verification. arXiv preprint arXiv:2203.15249","DOI":"10.21437\/Interspeech.2022-563"},{"key":"1451_CR55","doi-asserted-by":"crossref","unstructured":"Andrusenko A, Nasretdinov R, Romanenko A (2023, June) Uconv-conformer: High reduction of input sequence length for end-to-end speech recognition. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE","DOI":"10.1109\/ICASSP49357.2023.10095430"},{"key":"1451_CR56","doi-asserted-by":"crossref","unstructured":"Hernandez SM, Zhao D, Ding S, Bruguier A, Prabhavalkar R, Sainath TN, McGraw I (2023, June) Sharing low rank conformer weights for tiny always-on ambient speech recognition models. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE","DOI":"10.1109\/ICASSP49357.2023.10095006"},{"key":"1451_CR57","doi-asserted-by":"crossref","unstructured":"Chang O, Liao H, Serdyuk D, Shah A, Siohan O (2023) Conformers are All You Need for Visual Speech Recognition. arXiv preprint arXiv:2302.10915","DOI":"10.1109\/ICASSP48485.2024.10446532"},{"key":"1451_CR58","unstructured":"Shen Z, Zhang M, Zhao H, Yi S, Li H (2021) Efficient attention: Attention with linear complexities. In Proceedings of the IEEE\/CVF winter conference on applications of computer vision (pp. 3531-3539)"},{"key":"1451_CR59","doi-asserted-by":"crossref","unstructured":"Bolya D, Fu CY, Dai X, Zhang P, Hoffman J (2022, October) Hydra attention: Efficient attention with many heads. In European Conference on Computer Vision (pp. 35-49). Cham: Springer Nature Switzerland","DOI":"10.1007\/978-3-031-25082-8_3"},{"key":"1451_CR60","doi-asserted-by":"crossref","unstructured":"Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R (2019) Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860","DOI":"10.18653\/v1\/P19-1285"},{"key":"1451_CR61","doi-asserted-by":"crossref","unstructured":"Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R (2019) Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860","DOI":"10.18653\/v1\/P19-1285"},{"key":"1451_CR62","unstructured":"He P, Liu X, Gao J, Chen W (2020) Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654"},{"key":"1451_CR63","doi-asserted-by":"crossref","unstructured":"Guo M, Ainslie J, Uthus D, Ontanon S, Ni J, Sung YH, Yang Y (2021) LongT5: Efficient text-to-text transformer for long sequences. arXiv preprint arXiv:2112.07916","DOI":"10.18653\/v1\/2022.findings-naacl.55"},{"key":"1451_CR64","unstructured":"Gehring J, Auli M, Grangier D, Yarats D, Dauphin YN (2017, July) Convolutional sequence to sequence learning. In International conference on machine learning (pp. 1243-1252). PMLR"},{"key":"1451_CR65","doi-asserted-by":"crossref","unstructured":"Shaw P, Uszkoreit J, Vaswani A (2018) Self-attention with relative position representations. arXiv preprint arXiv:1803.02155","DOI":"10.18653\/v1\/N18-2074"},{"key":"1451_CR66","unstructured":"Bhojanapalli S, Yun C, Rawat AS, Reddi S, Kumar S (2020, November) Low-rank bottleneck in multi-head attention models. In International conference on machine learning (pp. 864-873). PMLR"},{"key":"1451_CR67","doi-asserted-by":"crossref","unstructured":"Chung JS, Zisserman A (2017) Lip reading in the wild. In Computer Vision-ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13 (pp. 87-103). Springer International Publishing","DOI":"10.1007\/978-3-319-54427-4"},{"key":"1451_CR68","unstructured":"Afouras T, Chung JS, Zisserman A (2018) LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496"},{"key":"1451_CR69","doi-asserted-by":"crossref","unstructured":"Deng J, Guo J, Ververas E, Kotsia I, Zafeiriou S (2020) Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 5203-5212)","DOI":"10.1109\/CVPR42600.2020.00525"},{"key":"1451_CR70","doi-asserted-by":"crossref","unstructured":"Bulat A, Tzimiropoulos G (2017) How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE international conference on computer vision (pp. 1021-1030)","DOI":"10.1109\/ICCV.2017.116"},{"key":"1451_CR71","doi-asserted-by":"crossref","unstructured":"Kudo T, Richardson J (2018) Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226","DOI":"10.18653\/v1\/D18-2012"},{"key":"1451_CR72","doi-asserted-by":"crossref","unstructured":"Park DS, Zhang Y, Chiu CC, Chen Y, Li B, Chan W, Wu Y (2020, May) Specaugment on large scale datasets. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6879-6883). IEEE","DOI":"10.1109\/ICASSP40776.2020.9053205"},{"issue":"11","key":"1451_CR73","doi-asserted-by":"publisher","first-page":"930","DOI":"10.1038\/s42256-022-00550-z","volume":"4","author":"P Ma","year":"2022","unstructured":"Ma P, Petridis S, Pantic M (2022) Visual speech recognition for multiple languages in the wild. Nat Mach Intell 4(11):930\u2013939","journal-title":"Nat Mach Intell"},{"key":"1451_CR74","doi-asserted-by":"crossref","unstructured":"Prajwal KR, Afouras T, Zisserman A (2022) Sub-word level lip reading with visual attention. In Proceedings of the IEEE\/CVF conference on Computer Vision and Pattern Recognition (pp. 5162-5172)","DOI":"10.1109\/CVPR52688.2022.00510"},{"issue":"1","key":"1451_CR75","doi-asserted-by":"publisher","first-page":"6302","DOI":"10.1038\/s41598-022-09744-2","volume":"12","author":"N Bacanin","year":"2022","unstructured":"Bacanin N, Zivkovic M, Al-Turjman F, Venkatachalam K, Trojovsk\u00fd P, Strumberger I, Bezdan T (2022) Hybridized sine cosine algorithm with convolutional neural networks dropout regularization application. Sci Rep 12(1):6302","journal-title":"Sci Rep"},{"key":"1451_CR76","unstructured":"Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980"},{"key":"1451_CR77","doi-asserted-by":"crossref","unstructured":"Zhang X, Cheng F, Wang S (2019) Spatio-temporal fusion based convolutional sequence learning for lip reading. In Proceedings of the IEEE\/CVF International conference on Computer Vision (pp. 713-722)","DOI":"10.1109\/ICCV.2019.00080"},{"key":"1451_CR78","doi-asserted-by":"crossref","unstructured":"Afouras T, Chung JS, Zisserman A (2020, May) Asr is all you need: Cross-modal distillation for lip reading. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2143-2147). IEEE","DOI":"10.1109\/ICASSP40776.2020.9054253"},{"key":"1451_CR79","doi-asserted-by":"crossref","unstructured":"Yu J, Zhang SX, Wu J, Ghorbani S, Wu B, Kang S, Yu D (2020, May) Audio-visual recognition of overlapped speech for the lrs2 dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6984-6988). IEEE","DOI":"10.1109\/ICASSP40776.2020.9054127"},{"key":"1451_CR80","doi-asserted-by":"crossref","unstructured":"Ma P, Petridis S, Pantic M (2021, June) End-to-end audio-visual speech recognition with conformers. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7613-7617). IEEE","DOI":"10.1109\/ICASSP39728.2021.9414567"},{"key":"1451_CR81","doi-asserted-by":"crossref","unstructured":"Zhao Y, Xu R, Wang X, Hou P, Tang H, Song M (2020, April) Hearing lips: Improving lip reading by distilling speech recognizers. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 6917-6924)","DOI":"10.1609\/aaai.v34i04.6174"},{"key":"1451_CR82","doi-asserted-by":"crossref","unstructured":"Serdyuk D, Braga O, Siohan O (2021, December) Audio-visual speech recognition is worth $$32\\times 32\\times 8$$ voxels. In 2021 IEEE automatic speech recognition and understanding workshop (ASRU) (pp. 796-802). IEEE","DOI":"10.1109\/ASRU51503.2021.9688191"},{"key":"1451_CR83","doi-asserted-by":"crossref","unstructured":"Shi B, Hsu WN, Mohamed A (2022) Robust self-supervised audio-visual speech recognition. arXiv preprint arXiv:2201.01763","DOI":"10.21437\/Interspeech.2022-99"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01451-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-024-01451-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01451-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,17]],"date-time":"2024-07-17T17:28:18Z","timestamp":1721237298000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-024-01451-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,18]]},"references-count":83,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["1451"],"URL":"https:\/\/doi.org\/10.1007\/s40747-024-01451-x","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,18]]},"assertion":[{"value":"2 January 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 April 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 May 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no potential Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}