{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T17:31:27Z","timestamp":1776101487929,"version":"3.50.1"},"reference-count":213,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2025,4,16]],"date-time":"2025-04-16T00:00:00Z","timestamp":1744761600000},"content-version":"vor","delay-in-days":105,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,4,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Simultaneous speech-to-text translation (SimulST) translates source-language speech into target-language text concurrently with the speaker\u2019s speech, ensuring low latency for better user comprehension. Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges. This narrow focus, coupled with widespread terminological inconsistencies, is limiting the applicability of research outcomes to real-world applications, ultimately hindering progress in the field. Our extensive literature review of 110 papers not only reveals these critical issues in current research but also serves as the foundation for our key contributions. 
We: 1) define the steps and core components of a SimulST system, proposing a standardized terminology and taxonomy; 2) conduct a thorough analysis of community trends; and 3) offer concrete recommendations and future directions to bridge the gaps in existing literature, from evaluation frameworks to system architectures, for advancing the field towards more realistic and effective SimulST solutions.<\/jats:p>","DOI":"10.1162\/tacl_a_00740","type":"journal-article","created":{"date-parts":[[2025,4,16]],"date-time":"2025-04-16T14:22:46Z","timestamp":1744813366000},"page":"281-313","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":2,"title":["How \u201cReal\u201d is Your Real-Time Simultaneous Speech-to-Text Translation System?"],"prefix":"10.1162","volume":"13","author":[{"given":"Sara","family":"Papi","sequence":"first","affiliation":[{"name":"Fondazione Bruno Kessler, Trento, Italy. spapi@fbk.eu"}]},{"given":"Peter","family":"Pol\u00e1k","sequence":"additional","affiliation":[{"name":"Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Praha, Czech Republic. polak@ufal.mff.cuni.cz"}]},{"given":"Dominik","family":"Mach\u00e1\u010dek","sequence":"additional","affiliation":[{"name":"Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Praha, Czech Republic. machacek@ufal.mff.cuni.cz"}]},{"given":"Ond\u0159ej","family":"Bojar","sequence":"additional","affiliation":[{"name":"Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Praha, Czech Republic. 
bojar@ufal.mff.cuni.cz"}]}],"member":"281","published-online":{"date-parts":[[2025,4,3]]},"reference":[{"key":"2025051914251602900_bib1","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/2023.iwslt-1.1","article-title":"Findings of the IWSLT 2023 evaluation campaign","volume-title":"Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)","author":"Agarwal","year":"2023"},{"key":"2025051914251602900_bib2","first-page":"31","article-title":"Contextual handling in neural machine translation: Look behind, ahead and on both sides","volume-title":"Proceedings of the 21st Annual Conference of the European Association for Machine Translation","author":"Agrawal","year":"2018"},{"key":"2025051914251602900_bib3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/2024.iwslt-1.1","article-title":"Findings of the IWSLT 2024 evaluation campaign","volume-title":"Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)","author":"Ahmad","year":"2024"},{"key":"2025051914251602900_bib4","doi-asserted-by":"publisher","first-page":"14","DOI":"10.18653\/v1\/2023.calcs-1.2","article-title":"Towards real-world streaming speech translation for code-switched speech","volume-title":"Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching","author":"Alastruey","year":"2023"},{"key":"2025051914251602900_bib5","first-page":"203","article-title":"Don\u2019t discard fixed-window audio segmentation in speech-to-text translation","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Amrhein","year":"2022"},{"key":"2025051914251602900_bib6","doi-asserted-by":"publisher","first-page":"98","DOI":"10.18653\/v1\/2022.iwslt-1.10","article-title":"Findings of the IWSLT 2022 evaluation campaign","volume-title":"Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 
2022)","author":"Anastasopoulos","year":"2022"},{"key":"2025051914251602900_bib7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/2021.iwslt-1.1","article-title":"Findings of the IWSLT 2021 evaluation campaign","volume-title":"Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)","author":"Anastasopoulos","year":"2021"},{"key":"2025051914251602900_bib8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/2020.iwslt-1.1","article-title":"Findings of the IWSLT 2020 evaluation campaign","volume-title":"Proceedings of the 17th International Conference on Spoken Language Translation","author":"Ansari","year":"2020"},{"key":"2025051914251602900_bib9","doi-asserted-by":"publisher","first-page":"71","DOI":"10.18653\/v1\/2021.eacl-demos.9","article-title":"SLTEV: Comprehensive evaluation of spoken language translation","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations","author":"Ansari","year":"2021"},{"key":"2025051914251602900_bib10","doi-asserted-by":"publisher","first-page":"220","DOI":"10.18653\/v1\/2020.iwslt-1.27","article-title":"Re-translation versus streaming for simultaneous translation","volume-title":"Proceedings of the 17th International Conference on Spoken Language Translation","author":"Arivazhagan","year":"2020"},{"key":"2025051914251602900_bib11","doi-asserted-by":"publisher","first-page":"7919","DOI":"10.1109\/ICASSP40776.2020.9054585","article-title":"Re-translation strategies for long form, simultaneous, spoken language translation","volume-title":"ICASSP 2020 \u2013 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Arivazhagan","year":"2020"},{"key":"2025051914251602900_bib12","article-title":"Method and system for evaluating and improving live translation captioning 
systems","author":"Arkhangorodsky","year":"2023"},{"issue":"3","key":"2025051914251602900_bib13","doi-asserted-by":"publisher","first-page":"201","DOI":"10.1109\/TASSP.1976.1162800","article-title":"A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition","volume":"24","author":"Atal","year":"1976","journal-title":"IEEE Transactions on Acoustics, Speech, and Signal Processing"},{"key":"2025051914251602900_bib14","doi-asserted-by":"publisher","first-page":"792","DOI":"10.1109\/ASRU46091.2019.9003774","article-title":"A comparative study on end-to-end speech to text translation","volume-title":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","author":"Bahar","year":"2019"},{"key":"2025051914251602900_bib15","doi-asserted-by":"publisher","first-page":"44","DOI":"10.18653\/v1\/2020.iwslt-1.3","article-title":"Start-before-end and end-to-end: Neural speech translation by AppTek and RWTH Aachen University","volume-title":"Proceedings of the 17th International Conference on Spoken Language Translation","author":"Bahar","year":"2020"},{"key":"2025051914251602900_bib16","doi-asserted-by":"publisher","first-page":"52","DOI":"10.18653\/v1\/2021.iwslt-1.5","article-title":"Without further ado: Direct and simultaneous speech translation by AppTek in 2021","volume-title":"Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)","author":"Bahar","year":"2021"},{"key":"2025051914251602900_bib17","first-page":"437","article-title":"Real-time incremental speech-to-speech translation of dialogs","volume-title":"Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Bangalore","year":"2012"},{"key":"2025051914251602900_bib18","doi-asserted-by":"publisher","first-page":"1298","DOI":"10.21437\/Interspeech.2018-1326","article-title":"Low-resource speech-to-text 
translation","volume-title":"Proceedings of Interspeech 2018","author":"Bansal","year":"2018"},{"key":"2025051914251602900_bib19","doi-asserted-by":"publisher","first-page":"58","DOI":"10.18653\/v1\/N19-1006","article-title":"Pre-training on high-resource speech recognition improves low-resource speech-to-text translation","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Bansal","year":"2019"},{"key":"2025051914251602900_bib20","article-title":"Seamless: Multilingual expressive and streaming speech translation","author":"Barrault","year":"2023","journal-title":"arXiv preprint arXiv:2312.05187"},{"key":"2025051914251602900_bib21","doi-asserted-by":"publisher","first-page":"2873","DOI":"10.18653\/v1\/2021.acl-long.224","article-title":"Cascade versus direct speech translation: Do the differences still make a difference?","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Bentivogli","year":"2021"},{"key":"2025051914251602900_bib22","article-title":"Listen and translate: A proof of concept for end-to-end speech-to-text translation","volume-title":"NIPS Workshop on end-to-end learning for speech and audio processing","author":"B\u00e9rard","year":"2016"},{"key":"2025051914251602900_bib23","doi-asserted-by":"publisher","first-page":"280","DOI":"10.18653\/v1\/2020.iwslt-1.34","article-title":"How human is machine translationese? 
Comparing human and machine translations of text and speech","volume-title":"Proceedings of the 17th International Conference on Spoken Language Translation","author":"Bizzoni","year":"2020"},{"key":"2025051914251602900_bib24","first-page":"23","article-title":"Operating a complex SLT system with speakers and human interpreters","volume-title":"Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW)","author":"Bojar","year":"2021"},{"key":"2025051914251602900_bib25","first-page":"1877","article-title":"Language models are few-shot learners","volume-title":"Advances in Neural Information Processing Systems","author":"Brown","year":"2020"},{"key":"2025051914251602900_bib26","doi-asserted-by":"publisher","first-page":"613\u2013616 vol.1","DOI":"10.1109\/ICASSP.2001.940906","article-title":"Speech-to-speech translation based on finite-state transducers","volume-title":"2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 
01CH37221)","author":"Casacuberta","year":"2001"},{"key":"2025051914251602900_bib27","doi-asserted-by":"publisher","first-page":"5175","DOI":"10.21437\/Interspeech.2022-10627","article-title":"Exploring continuous integrate-and-fire for adaptive simultaneous speech translation","volume-title":"Proceedings Interspeech 2022","author":"Chang","year":"2022"},{"key":"2025051914251602900_bib28","doi-asserted-by":"publisher","first-page":"4298","DOI":"10.1109\/ICASSP43922.2022.9747755","article-title":"Noise-robust speech recognition with 10 minutes unparalleled in-domain data","volume-title":"ICASSP 2022 \u2013 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Chen","year":"2022"},{"key":"2025051914251602900_bib29","doi-asserted-by":"publisher","first-page":"4618","DOI":"10.18653\/v1\/2021.findings-acl.406","article-title":"Direct simultaneous speech-to-text translation assisted by synchronized streaming ASR","volume-title":"Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021","author":"Chen","year":"2021"},{"key":"2025051914251602900_bib30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/ASRU57964.2023.10389709","article-title":"Improving stability in simultaneous speech translation: A revision-controllable decoding approach","volume-title":"2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","author":"Chen","year":"2023"},{"issue":"16","key":"2025051914251602900_bib31","doi-asserted-by":"publisher","first-page":"17799","DOI":"10.1609\/aaai.v38i16.29733","article-title":"Divergence-guided simultaneous speech translation","volume":"38","author":"Chen","year":"2024","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"2025051914251602900_bib32","article-title":"Thinking slow about latency evaluation for simultaneous machine translation","author":"Cherry","year":"2019","journal-title":"arXiv preprint 
arXiv:1906.00048"},{"key":"2025051914251602900_bib33","doi-asserted-by":"publisher","first-page":"889","DOI":"10.1109\/ASRU46091.2019.9003854","article-title":"A comparison of end-to-end models for long-form speech recognition","volume-title":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","author":"Chiu","year":"2019"},{"key":"2025051914251602900_bib34","doi-asserted-by":"publisher","first-page":"3473","DOI":"10.21437\/Interspeech.2013-612","article-title":"A real-world system for simultaneous translation of German lectures","volume-title":"Proceedings of Interspeech 2013","author":"Cho","year":"2013"},{"key":"2025051914251602900_bib35","first-page":"173","article-title":"Punctuation insertion for real-time spoken language translation","volume-title":"Proceedings of the 12th International Workshop on Spoken Language Translation: Papers","author":"Cho","year":"2015"},{"key":"2025051914251602900_bib36","doi-asserted-by":"publisher","first-page":"2645","DOI":"10.21437\/Interspeech.2017-1320","article-title":"NMT-based segmentation and punctuation insertion for real-time spoken language translation","volume-title":"Proceedings Interspeech 2017","author":"Cho","year":"2017"},{"key":"2025051914251602900_bib37","article-title":"Can neural machine translation do simultaneous translation?","author":"Cho","year":"2016","journal-title":"arXiv preprint arXiv:1606.02012"},{"key":"2025051914251602900_bib38","doi-asserted-by":"publisher","first-page":"2978","DOI":"10.18653\/v1\/P19-1285","article-title":"Transformer-XL: Attentive language models beyond a fixed-length context","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Dai","year":"2019"},{"key":"2025051914251602900_bib39","doi-asserted-by":"publisher","first-page":"493","DOI":"10.18653\/v1\/N18-2079","article-title":"Incremental decoding and training methods for simultaneous translation in neural machine 
translation","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Dalvi","year":"2018"},{"key":"2025051914251602900_bib40","doi-asserted-by":"publisher","first-page":"1746","DOI":"10.21437\/Interspeech.2022-933","article-title":"Blockwise streaming transformer for spoken language understanding and simultaneous speech translation","volume-title":"Proceedings of Interspeech 2022","author":"Deng","year":"2022"},{"key":"2025051914251602900_bib41","doi-asserted-by":"publisher","first-page":"8235","DOI":"10.18653\/v1\/2024.acl-long.448","article-title":"Label-synchronous neural transducer for E2E simultaneous speech translation","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Deng","year":"2024"},{"key":"2025051914251602900_bib42","first-page":"89","article-title":"KIT lecture translator: Multilingual speech translation with one-shot learning","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations","author":"Dessloch","year":"2018"},{"key":"2025051914251602900_bib43","doi-asserted-by":"publisher","first-page":"1299","DOI":"10.18653\/v1\/2021.acl-long.104","article-title":"Diverse pretrained context encodings improve document translation","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Donato","year":"2021"},{"key":"2025051914251602900_bib44","doi-asserted-by":"publisher","first-page":"680","DOI":"10.18653\/v1\/2022.acl-long.50","article-title":"Learning when to translate for streaming speech","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long 
Papers)","author":"Dong","year":"2022"},{"key":"2025051914251602900_bib45","doi-asserted-by":"publisher","first-page":"1461","DOI":"10.21437\/Interspeech.2020-1241","article-title":"Efficient wait-k models for simultaneous machine translation","volume-title":"Proceedings of Interspeech 2020","author":"Elbayad","year":"2020"},{"key":"2025051914251602900_bib46","doi-asserted-by":"publisher","first-page":"35","DOI":"10.18653\/v1\/2020.iwslt-1.2","article-title":"ON-TRAC consortium for end-to-end and simultaneous speech translation challenge tasks at IWSLT 2020","volume-title":"Proceedings of the 17th International Conference on Spoken Language Translation","author":"Elbayad","year":"2020"},{"key":"2025051914251602900_bib47","doi-asserted-by":"publisher","first-page":"245","DOI":"10.18653\/v1\/2021.iwslt-1.29","article-title":"Towards the evaluation of automatic simultaneous speech translation from a communicative perspective","volume-title":"Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)","author":"Fantinuoli","year":"2021"},{"key":"2025051914251602900_bib48","first-page":"327","article-title":"Exploring the correlation between human and machine evaluation of simultaneous speech translation","volume-title":"Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)","author":"Fantinuoli","year":"2024"},{"key":"2025051914251602900_bib49","doi-asserted-by":"publisher","first-page":"6467","DOI":"10.18653\/v1\/2021.acl-long.505","article-title":"Measuring and increasing context usage in context-aware machine translation","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Fernandes","year":"2021"},{"key":"2025051914251602900_bib50","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2003.1198854","article-title":"A 
prosody-based approach to end-of-utterance detection that does not require speech recognition","volume-title":"2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP\u201903)","author":"Ferrer","year":"2003"},{"key":"2025051914251602900_bib51","doi-asserted-by":"publisher","first-page":"16600","DOI":"10.18653\/v1\/2023.emnlp-main.1033","article-title":"Adapting offline speech translation models for streaming with future-aware distillation and inference","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Biao","year":"2023"},{"key":"2025051914251602900_bib52","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2006.1660084","article-title":"Open domain speech recognition & translation: Lectures and speeches","volume-title":"2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings","author":"F\u00fcgen","year":"2006"},{"key":"2025051914251602900_bib53","first-page":"81","article-title":"Open domain speech translation: From seminars and speeches to lectures","volume-title":"TC-STAR Workshop on Speech to Speech Translation, Barcelona, Spain","author":"F\u00fcgen","year":"2006"},{"key":"2025051914251602900_bib54","doi-asserted-by":"publisher","first-page":"209","DOI":"10.1007\/s10590-008-9047-0","article-title":"Simultaneous translation of lectures and speeches","volume":"21","author":"F\u00fcgen","year":"2007","journal-title":"Machine Translation"},{"key":"2025051914251602900_bib55","unstructured":"Christian F\u00fcgen. 2009. A System for Simultaneous Translation of Lectures and Speeches. Ph.D. thesis, Universit\u00e4t Karlsruhe (TH). 
10.5445\/IR\/1000013594"},{"key":"2025051914251602900_bib56","doi-asserted-by":"publisher","first-page":"3487","DOI":"10.21437\/Interspeech.2013-615","article-title":"Simple, lexicalized choice of translation timing for simultaneous speech translation","volume-title":"Proceedings of Interspeech 2013","author":"Fujita","year":"2013"},{"key":"2025051914251602900_bib57","doi-asserted-by":"publisher","first-page":"286","DOI":"10.18653\/v1\/2022.iwslt-1.25","article-title":"NAIST simultaneous speech-to-text translation system for IWSLT 2022","volume-title":"Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)","author":"Fukuda","year":"2022"},{"key":"2025051914251602900_bib58","doi-asserted-by":"publisher","first-page":"330","DOI":"10.18653\/v1\/2023.iwslt-1.31","article-title":"NAIST simultaneous speech-to-speech translation system for IWSLT 2023","volume-title":"Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)","author":"Fukuda","year":"2023"},{"key":"2025051914251602900_bib59","doi-asserted-by":"publisher","first-page":"121","DOI":"10.21437\/Interspeech.2022-11382","article-title":"Speech segmentation optimization using segmented bilingual speech corpus for end-to-end speech translation","volume-title":"Proceedings of Interspeech 2022","author":"Fukuda","year":"2022"},{"key":"2025051914251602900_bib60","doi-asserted-by":"publisher","first-page":"1471","DOI":"10.21437\/Interspeech.2020-2860","article-title":"Contextualized translation of automatically segmented speech","volume-title":"Proceedings of Interspeech 2020","author":"Gaido","year":"2020"},{"key":"2025051914251602900_bib61","first-page":"55","article-title":"Beyond voice activity detection: Hybrid audio segmentation for direct speech translation","volume-title":"Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 
2021)","author":"Gaido","year":"2021"},{"key":"2025051914251602900_bib62","doi-asserted-by":"publisher","first-page":"177","DOI":"10.18653\/v1\/2022.iwslt-1.13","article-title":"Efficient yet competitive speech translation: FBK@IWSLT2022","volume-title":"Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)","author":"Gaido","year":"2022"},{"key":"2025051914251602900_bib63","doi-asserted-by":"publisher","first-page":"14760","DOI":"10.18653\/v1\/2024.acl-long.789","article-title":"Speech translation with speech foundation models and large language models: What is there and what is missing?","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Gaido","year":"2024"},{"key":"2025051914251602900_bib64","doi-asserted-by":"publisher","first-page":"47","DOI":"10.21437\/Interspeech.2023-1767","article-title":"Joint speech translation and named entity recognition","volume-title":"Proceedings of INTERSPEECH 2023","author":"Gaido","year":"2023"},{"issue":"12","key":"2025051914251602900_bib65","doi-asserted-by":"publisher","first-page":"1333","DOI":"10.1177\/0301006616657097","article-title":"The interaction between vision and eye movements","volume":"45","author":"Gegenfurtner","year":"2016","journal-title":"Perception"},{"key":"2025051914251602900_bib66","doi-asserted-by":"publisher","first-page":"369","DOI":"10.1145\/1143844.1143891","article-title":"Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks","volume-title":"Proceedings of the 23rd International Conference on Machine Learning","author":"Graves","year":"2006"},{"key":"2025051914251602900_bib67","doi-asserted-by":"publisher","first-page":"1342","DOI":"10.3115\/v1\/D14-1140","article-title":"Don\u2019t until the final verb wait: Reinforcement learning for simultaneous machine translation","volume-title":"Proceedings of the 2014 Conference on 
Empirical Methods in Natural Language Processing (EMNLP)","author":"Grissom","year":"2014"},{"key":"2025051914251602900_bib68","doi-asserted-by":"publisher","first-page":"216","DOI":"10.18653\/v1\/2022.iwslt-1.17","article-title":"The xiaomi text-to-text simultaneous speech translation system for IWSLT 2022","volume-title":"Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)","author":"Guo","year":"2022"},{"key":"2025051914251602900_bib69","doi-asserted-by":"publisher","first-page":"376","DOI":"10.18653\/v1\/2023.iwslt-1.35","article-title":"The HW-TSC\u2019s simultaneous speech-to-text translation system for IWSLT 2023 evaluation","volume-title":"Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)","author":"Guo","year":"2023"},{"key":"2025051914251602900_bib70","article-title":"R-bi: Regularized batched inputs enhance incremental decoding framework for low-latency simultaneous speech translation","author":"Guo","year":"2024","journal-title":"arXiv preprint arXiv:2401.05700"},{"key":"2025051914251602900_bib71","doi-asserted-by":"publisher","first-page":"62","DOI":"10.18653\/v1\/2020.iwslt-1.5","article-title":"End-to-end simultaneous translation system for IWSLT2020 using modality agnostic meta-learning","volume-title":"Proceedings of the 17th International Conference on Spoken Language Translation","author":"Han","year":"2020"},{"key":"2025051914251602900_bib72","doi-asserted-by":"publisher","first-page":"411","DOI":"10.18653\/v1\/2023.iwslt-1.39","article-title":"The xiaomi AI lab\u2019s speech translation systems for IWSLT 2023 offline task, simultaneous task and speech-to-speech task","volume-title":"Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)","author":"Huang","year":"2023"},{"key":"2025051914251602900_bib73","doi-asserted-by":"publisher","first-page":"4995","DOI":"10.21437\/Interspeech.2022-38","article-title":"E2e segmenter: 
Joint segmenting and decoding for long-form ASR","volume-title":"Interspeech 2022","author":"Ronny Huang","year":"2022"},{"key":"2025051914251602900_bib74","doi-asserted-by":"publisher","first-page":"12","DOI":"10.18653\/v1\/2023.emnlp-demo.2","article-title":"End-to-end evaluation for low-latency simultaneous speech translation","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Huber","year":"2023"},{"key":"2025051914251602900_bib75","article-title":"Code-switching without switching: Language agnostic end-to-end speech translation","author":"Huber","year":"2022","journal-title":"arXiv preprint arXiv:2210.01512"},{"key":"2025051914251602900_bib76","doi-asserted-by":"publisher","first-page":"15524","DOI":"10.18653\/v1\/2024.findings-acl.917","article-title":"Textless acoustic model with self-supervised distillation for noise-robust expressive speech-to-speech translation","volume-title":"Findings of the Association for Computational Linguistics ACL 2024","author":"Hwang","year":"2024"},{"key":"2025051914251602900_bib77","doi-asserted-by":"publisher","first-page":"100","DOI":"10.18653\/v1\/2021.iwslt-1.10","article-title":"ESPnet-ST IWSLT 2021 offline speech translation system","volume-title":"Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)","author":"Inaguma","year":"2021"},{"key":"2025051914251602900_bib78","doi-asserted-by":"publisher","first-page":"38","DOI":"10.18653\/v1\/2022.naacl-main.3","article-title":"Language model augmented monotonic attention for simultaneous translation","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Indurthi","year":"2022"},{"key":"2025051914251602900_bib79","doi-asserted-by":"publisher","first-page":"2599","DOI":"10.18653\/v1\/2020.emnlp-main.206","article-title":"Direct 
segmentation models for streaming speech translation","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Iranzo-S\u00e1nchez","year":"2020"},{"key":"2025051914251602900_bib80","doi-asserted-by":"publisher","first-page":"255","DOI":"10.18653\/v1\/2022.iwslt-1.22","article-title":"MLLP-VRAIN UPV systems for the IWSLT 2022 simultaneous speech translation and speech-to-speech translation tasks","volume-title":"Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)","author":"Iranzo-S\u00e1nchez","year":"2022"},{"key":"2025051914251602900_bib81","doi-asserted-by":"publisher","first-page":"1104","DOI":"10.1162\/tacl_a_00691","article-title":"Segmentation-free streaming machine translation","volume":"12","author":"Iranzo-S\u00e1nchez","year":"2024","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2025051914251602900_bib82","doi-asserted-by":"publisher","first-page":"303","DOI":"10.1016\/j.neunet.2021.05.013","article-title":"Streaming cascade-based speech translation leveraged by a direct segmentation model","volume":"142","author":"Iranzo-S\u00e1nchez","year":"2021","journal-title":"Neural Networks"},{"key":"2025051914251602900_bib83","doi-asserted-by":"publisher","DOI":"10.16875\/stem.2021.22.4.59","article-title":"Student insights related to the use of simultaneous speech translation for video lectures in a university english course","author":"Irvin","year":"2021","journal-title":"STEM Journal"},{"key":"2025051914251602900_bib84","first-page":"154","article-title":"Continuous rating as reliable human evaluation of simultaneous speech translation","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Javorsk\u00fd","year":"2022"},{"key":"2025051914251602900_bib85","doi-asserted-by":"publisher","first-page":"4469","DOI":"10.21437\/Interspeech.2023-933","article-title":"Average token 
delay: A latency metric for simultaneous translation","volume-title":"Proceedings of INTERSPEECH 2023","author":"Kano","year":"2023"},{"key":"2025051914251602900_bib86","doi-asserted-by":"publisher","first-page":"209","DOI":"10.18653\/v1\/2020.iwslt-1.26","article-title":"Is 42 the answer to everything in subtitling-oriented speech translation?","volume-title":"Proceedings of the 17th International Conference on Spoken Language Translation","author":"Karakanta","year":"2020"},{"key":"2025051914251602900_bib87","first-page":"35","article-title":"Simultaneous speech translation for live subtitling: From delay to display","volume-title":"Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW)","author":"Karakanta","year":"2021"},{"key":"2025051914251602900_bib88","doi-asserted-by":"publisher","first-page":"24","DOI":"10.18653\/v1\/D19-6503","article-title":"When and why is document-level context useful in neural machine translation?","volume-title":"Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)","author":"Kim","year":"2019"},{"key":"2025051914251602900_bib89","doi-asserted-by":"publisher","first-page":"363","DOI":"10.18653\/v1\/2023.iwslt-1.34","article-title":"Tagged end-to-end simultaneous speech translation training using simultaneous interpretation data","volume-title":"Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)","author":"Ko","year":"2023"},{"key":"2025051914251602900_bib90","doi-asserted-by":"publisher","first-page":"170","DOI":"10.18653\/v1\/2024.iwslt-1.23","article-title":"NAIST simultaneous speech translation system for IWSLT 2024","volume-title":"Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)","author":"Ko","year":"2024"},{"key":"2025051914251602900_bib91","doi-asserted-by":"publisher","first-page":"1999","DOI":"10.18653\/v1\/2024.acl-long.110","article-title":"Navigating the 
metrics maze: Reconciling score magnitudes and accuracies","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Kocmi","year":"2024"},{"key":"2025051914251602900_bib92","first-page":"174","article-title":"Simultaneous German-English lecture translation","volume-title":"Proceedings of the 5th International Workshop on Spoken Language Translation: Papers","author":"Kolss","year":"2008"},{"key":"2025051914251602900_bib93","volume-title":"Real-time Systems Design and Analysis: An Engineer\u2019s Handbook","author":"Laplante","year":"1992"},{"key":"2025051914251602900_bib94","article-title":"Sparks of large audio models: A survey and outlook","author":"Latif","year":"2023","journal-title":"arXiv preprint arXiv:2308.12792"},{"key":"2025051914251602900_bib95","doi-asserted-by":"publisher","first-page":"18","DOI":"10.18653\/v1\/2022.autosimtrans-1.3","article-title":"System description on automatic simultaneous translation workshop","volume-title":"Proceedings of the Third Workshop on Automatic Simultaneous Translation","author":"Li","year":"2022"},{"key":"2025051914251602900_bib96","doi-asserted-by":"publisher","first-page":"30","DOI":"10.18653\/v1\/2021.iwslt-1.2","article-title":"The USTC-NELSLIP systems for simultaneous speech translation task at IWSLT 2021","volume-title":"Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)","author":"Liu","year":"2021"},{"key":"2025051914251602900_bib97","doi-asserted-by":"publisher","first-page":"39","DOI":"10.18653\/v1\/2021.emnlp-main.4","article-title":"Cross attention augmented transducer networks for simultaneous translation","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Liu","year":"2021"},{"key":"2025051914251602900_bib98","doi-asserted-by":"publisher","first-page":"8142","DOI":"10.24963\/ijcai.2024\/900","article-title":"Recent 
advances in end-to-end simultaneous speech translation","volume-title":"Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24","author":"Liu","year":"2024"},{"key":"2025051914251602900_bib99","first-page":"177","article-title":"Better punctuation prediction with dynamic conditional random fields","volume-title":"Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing","author":"Wei","year":"2010"},{"key":"2025051914251602900_bib100","article-title":"Input length matters: Improving rnn-t and mwer training for long-form telephony speech recognition","author":"Zhiyun","year":"2021","journal-title":"arXiv preprint arXiv:2110.03841"},{"key":"2025051914251602900_bib101","article-title":"Multi-task sequence to sequence learning","volume-title":"International Conference on Learning Representations","author":"Luong","year":"2016"},{"issue":"1","key":"2025051914251602900_bib102","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1080\/0907676X.2018.1498531","article-title":"Is consecutive interpreting easier than simultaneous interpreting? 
\u2013 A corpus-based study of lexical simplification in interpretation","volume":"27","author":"Lv","year":"2019","journal-title":"Perspectives"},{"key":"2025051914251602900_bib103","doi-asserted-by":"publisher","first-page":"3025","DOI":"10.18653\/v1\/P19-1289","article-title":"STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Ma","year":"2019"},{"key":"2025051914251602900_bib104","doi-asserted-by":"publisher","first-page":"144","DOI":"10.18653\/v1\/2020.emnlp-demos.19","article-title":"SIMULEVAL: An evaluation toolkit for simultaneous translation","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Ma","year":"2020"},{"key":"2025051914251602900_bib105","doi-asserted-by":"publisher","first-page":"582","DOI":"10.18653\/v1\/2020.aacl-main.58","article-title":"SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation","volume-title":"Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing","author":"Ma","year":"2020"},{"key":"2025051914251602900_bib106","article-title":"Efficient monotonic multihead attention","author":"Ma","year":"2023","journal-title":"arXiv preprint arXiv:2312.04515"},{"key":"2025051914251602900_bib107","doi-asserted-by":"publisher","first-page":"7523","DOI":"10.1109\/ICASSP39728.2021.9414897","article-title":"Streaming simultaneous speech translation with augmented memory transformer","volume-title":"ICASSP 2021 \u2013 2021 IEEE International Conference on Acoustics, Speech and Signal Processing 
(ICASSP)","author":"Ma","year":"2021"},{"key":"2025051914251602900_bib108","doi-asserted-by":"publisher","first-page":"1557","DOI":"10.18653\/v1\/2024.acl-long.85","article-title":"A non-autoregressive generation framework for end-to-end simultaneous speech-to-any translation","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Ma","year":"2024"},{"key":"2025051914251602900_bib109","doi-asserted-by":"publisher","first-page":"169","DOI":"10.18653\/v1\/2023.iwslt-1.12","article-title":"MT metrics correlate with human ratings of simultaneous speech translation","volume-title":"Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)","author":"Mach\u00e1\u010dek","year":"2023"},{"key":"2025051914251602900_bib110","doi-asserted-by":"publisher","first-page":"200","DOI":"10.18653\/v1\/2020.iwslt-1.25","article-title":"ELITR non-native speech translation at IWSLT 2020","volume-title":"Proceedings of the 17th International Conference on Spoken Language Translation","author":"Mach\u00e1\u010dek","year":"2020"},{"key":"2025051914251602900_bib111","first-page":"32","article-title":"Presenting simultaneous translation in limited space","volume-title":"Proceedings of the 20th Conference Information Technologies - Applications and Theory (ITAT 2020)","author":"Mach\u00e1\u010dek","year":"2020"},{"key":"2025051914251602900_bib112","doi-asserted-by":"publisher","first-page":"V\u2013V","DOI":"10.1109\/ICASSP.2006.1661501","article-title":"Integrating speech recognition and machine translation: Where do we stand?","volume-title":"2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings","author":"Matusov","year":"2006"},{"key":"2025051914251602900_bib113","article-title":"Evaluating machine translation output with automatic sentence segmentation","volume-title":"Proceedings of the Second International Workshop on Spoken Language 
Translation","author":"Matusov","year":"2005"},{"key":"2025051914251602900_bib114","doi-asserted-by":"publisher","first-page":"82","DOI":"10.18653\/v1\/W19-5209","article-title":"Customizing neural machine translation for subtitling","volume-title":"Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)","author":"Matusov","year":"2019"},{"key":"2025051914251602900_bib115","doi-asserted-by":"publisher","first-page":"6074","DOI":"10.1109\/ICASSP40776.2020.9054476","article-title":"Streaming automatic speech recognition with the transformer model","volume-title":"ICASSP 2020 \u2013 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Moritz","year":"2020"},{"key":"2025051914251602900_bib116","doi-asserted-by":"publisher","first-page":"82","DOI":"10.18653\/v1\/N16-3017","article-title":"Lecture translator - speech translation framework for simultaneous lecture translation","volume-title":"Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations","author":"M\u00fcller","year":"2016"},{"key":"2025051914251602900_bib117","doi-asserted-by":"publisher","first-page":"920","DOI":"10.1109\/ASRU46091.2019.9003913","article-title":"Recognizing long-form speech using streaming end-to-end models","volume-title":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","author":"Narayanan","year":"2019"},{"key":"2025051914251602900_bib118","doi-asserted-by":"publisher","first-page":"7528","DOI":"10.1109\/ICASSP39728.2021.9414276","article-title":"An empirical study of end-to-end simultaneous speech translation decoding strategies","volume-title":"ICASSP 2021 \u2013 2021 IEEE International Conference on Acoustics, Speech and Signal Processing 
(ICASSP)","author":"Ha","year":"2021"},{"key":"2025051914251602900_bib119","doi-asserted-by":"publisher","first-page":"2371","DOI":"10.21437\/Interspeech.2021-608","article-title":"Impact of encoding and segmentation strategies on end-to-end simultaneous speech translation","volume-title":"Interspeech 2021","author":"Ha","year":"2021"},{"key":"2025051914251602900_bib120","first-page":"2","article-title":"The IWSLT 2018 evaluation campaign","volume-title":"Proceedings of the 15th International Conference on Spoken Language Translation","author":"Niehues","year":"2018"},{"key":"2025051914251602900_bib121","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-154","article-title":"The IWSLT 2019 evaluation campaign","volume-title":"Proceedings of the 16th International Conference on Spoken Language Translation","author":"Niehues","year":"2019"},{"key":"2025051914251602900_bib122","doi-asserted-by":"publisher","first-page":"2513","DOI":"10.21437\/Interspeech.2018-1055","article-title":"Dynamic Transcription for Low-Latency Speech Translation","volume-title":"Proceedings of Interspeech 2016","author":"Niehues","year":"2016"},{"key":"2025051914251602900_bib123","doi-asserted-by":"crossref","first-page":"1293","DOI":"10.21437\/Interspeech.2018-1055","article-title":"Low-latency neural speech translation","volume-title":"Interspeech 2018","author":"Niehues","year":"2018"},{"key":"2025051914251602900_bib124","doi-asserted-by":"publisher","first-page":"3784","DOI":"10.21437\/Interspeech.2022-260","article-title":"Improving asr robustness in noisy condition through vad integration","volume-title":"Interspeech 2022","author":"Novitasari","year":"2022"},{"issue":"12","key":"2025051914251602900_bib125","doi-asserted-by":"publisher","first-page":"2195","DOI":"10.1587\/transinf.2021EDP7014","article-title":"Neural incremental speech recognition toward real-time machine speech translation","volume":"E104.D","author":"Novitasari","year":"2021","journal-title":"IEICE 
Transactions on Information and Systems"},{"key":"2025051914251602900_bib126","doi-asserted-by":"publisher","first-page":"551","DOI":"10.3115\/v1\/P14-2090","article-title":"Optimizing segmentation strategies for simultaneous speech translation","volume-title":"Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Oda","year":"2014"},{"key":"2025051914251602900_bib127","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/ICASSP49357.2023.10095896","article-title":"Align, write, re-order: Explainable end-to-end speech translation via operation sequence generation","volume-title":"ICASSP 2023 \u2013 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Omachi","year":"2023"},{"key":"2025051914251602900_bib128","doi-asserted-by":"publisher","first-page":"159","DOI":"10.18653\/v1\/2023.iwslt-1.11","article-title":"Direct models for simultaneous translation and automatic subtitling: FBK@IWSLT2023","volume-title":"Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)","author":"Papi","year":"2023"},{"key":"2025051914251602900_bib129","doi-asserted-by":"publisher","first-page":"72","DOI":"10.18653\/v1\/2024.iwslt-1.11","article-title":"SimulSeamless: FBK at IWSLT 2024 simultaneous speech translation","volume-title":"Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)","author":"Papi","year":"2024"},{"key":"2025051914251602900_bib130","doi-asserted-by":"publisher","first-page":"3692","DOI":"10.18653\/v1\/2024.acl-long.202","article-title":"StreamAtt: Direct streaming speech-to-text translation with attention-based audio history selection","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long 
Papers)","author":"Papi","year":"2024"},{"key":"2025051914251602900_bib131","doi-asserted-by":"publisher","first-page":"141","DOI":"10.18653\/v1\/2022.findings-emnlp.11","article-title":"Does simultaneous speech translation need simultaneous models?","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2022","author":"Papi","year":"2022"},{"key":"2025051914251602900_bib132","doi-asserted-by":"publisher","first-page":"12","DOI":"10.18653\/v1\/2022.autosimtrans-1.2","article-title":"Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation","volume-title":"Proceedings of the Third Workshop on Automatic Simultaneous Translation","author":"Papi","year":"2022"},{"key":"2025051914251602900_bib133","doi-asserted-by":"publisher","first-page":"480","DOI":"10.18653\/v1\/2022.aacl-short.59","article-title":"Dodging the data bottleneck: Automatic subtitling with automatically segmented ST corpora","volume-title":"Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)","author":"Papi","year":"2022"},{"key":"2025051914251602900_bib134","article-title":"Visualization: The missing factor in simultaneous speech translation","volume-title":"CEUR Workshop Proceedings","author":"Papi","year":"2021"},{"key":"2025051914251602900_bib135","doi-asserted-by":"publisher","first-page":"13340","DOI":"10.18653\/v1\/2023.acl-long.745","article-title":"Attention as a guide for simultaneous speech translation","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Papi","year":"2023"},{"key":"2025051914251602900_bib136","doi-asserted-by":"publisher","first-page":"3974","DOI":"10.21437\/Interspeech.2023-170","article-title":"AlignAtt: Using attention-based audio-translation alignments as a 
guide for simultaneous speech translation","volume-title":"Proceedings of INTERSPEECH 2023","author":"Papi","year":"2023"},{"key":"2025051914251602900_bib137","doi-asserted-by":"publisher","first-page":"10381","DOI":"10.1109\/ICASSP48485.2024.10447565","article-title":"Leveraging timestamp information for serialized joint streaming recognition and translation","volume-title":"ICASSP 2024 \u2013 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Papi","year":"2024"},{"key":"2025051914251602900_bib138","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/ASRU57964.2023.10389715","article-title":"Token-level serialized output training for joint streaming asr and st leveraging textual alignments","volume-title":"2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","author":"Papi","year":"2023"},{"key":"2025051914251602900_bib139","doi-asserted-by":"publisher","first-page":"311","DOI":"10.3115\/1073083.1073135","article-title":"Bleu: A method for automatic evaluation of machine translation","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni","year":"2002"},{"key":"2025051914251602900_bib140","doi-asserted-by":"publisher","first-page":"101317","DOI":"10.1016\/j.csl.2021.101317","article-title":"A review of speaker diarization: Recent advances with deep learning","volume":"72","author":"Park","year":"2022","journal-title":"Computer Speech & Language"},{"key":"2025051914251602900_bib141","doi-asserted-by":"publisher","first-page":"2534","DOI":"10.21437\/Interspeech.2010-680","article-title":"Rapid development of speech translation using consecutive interpretation","volume-title":"Proceedings of Interspeech 2010","author":"Paulik","year":"2010"},{"issue":"3","key":"2025051914251602900_bib142","doi-asserted-by":"publisher","first-page":"243","DOI":"10.1080\/15213269.2010.502873","article-title":"The cognitive effectiveness of 
subtitle processing","volume":"13","author":"Perego","year":"2010","journal-title":"Media Psychology"},{"key":"2025051914251602900_bib143","doi-asserted-by":"publisher","first-page":"64","DOI":"10.18653\/v1\/2023.ijcnlp-srw.9","article-title":"Long-form simultaneous speech translation: Thesis proposal","volume-title":"Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop","author":"Pol\u00e1k","year":"2023"},{"key":"2025051914251602900_bib144","doi-asserted-by":"publisher","DOI":"10.1109\/SLT61566.2024.10832264","article-title":"Long-form end-to-end speech translation via latent alignment segmentation","author":"Pol\u00e1k","year":"2023","journal-title":"arXiv preprint arXiv: 2309.11384"},{"key":"2025051914251602900_bib145","doi-asserted-by":"publisher","first-page":"389","DOI":"10.18653\/v1\/2023.iwslt-1.37","article-title":"Towards efficient simultaneous speech translation: CUNI-KIT system for simultaneous track at IWSLT 2023","volume-title":"Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)","author":"Pol\u00e1k","year":"2023"},{"key":"2025051914251602900_bib146","doi-asserted-by":"publisher","first-page":"277","DOI":"10.18653\/v1\/2022.iwslt-1.24","article-title":"CUNI-KIT system for simultaneous speech translation task at IWSLT 2022","volume-title":"Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)","author":"Pol\u00e1k","year":"2022"},{"key":"2025051914251602900_bib147","doi-asserted-by":"publisher","first-page":"3979","DOI":"10.21437\/Interspeech.2023-2225","article-title":"Incremental blockwise beam search for simultaneous speech translation with controllable quality-latency tradeoff","volume-title":"Proceedings of INTERSPEECH 
2023","author":"Pol\u00e1k","year":"2023"},{"key":"2025051914251602900_bib148","doi-asserted-by":"publisher","first-page":"89","DOI":"10.18653\/v1\/2020.iwslt-1.9","article-title":"SRPOL\u2019s system for the IWSLT 2020 end-to-end speech translation task","volume-title":"Proceedings of the 17th International Conference on Spoken Language Translation","author":"Potapczyk","year":"2020"},{"key":"2025051914251602900_bib149","first-page":"28492","article-title":"Robust speech recognition via large-scale weak supervision","volume-title":"International Conference on Machine Learning","author":"Radford","year":"2023"},{"key":"2025051914251602900_bib150","doi-asserted-by":"publisher","first-page":"12900","DOI":"10.18653\/v1\/2023.findings-acl.816","article-title":"Implicit memory transformer for computationally efficient simultaneous speech translation","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Raffel","year":"2023"},{"key":"2025051914251602900_bib151","article-title":"Shiftable context: Addressing training-inference context mismatch in simultaneous speech translation","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"Raffel","year":"2023"},{"issue":"1","key":"2025051914251602900_bib152","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1080\/0907676X.2012.722651","article-title":"Effects of text chunking on subtitling: A quantitative and qualitative examination","volume":"21","author":"Rajendran","year":"2013","journal-title":"Perspectives"},{"key":"2025051914251602900_bib153","first-page":"230","article-title":"Segmentation strategies for streaming speech translation","volume-title":"Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Rangarajan Sridhar","year":"2013"},{"key":"2025051914251602900_bib154","first-page":"578","article-title":"COMET-22: Unbabel-IST 2022 
submission for the metrics shared task","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Rei","year":"2022"},{"key":"2025051914251602900_bib155","doi-asserted-by":"publisher","first-page":"2685","DOI":"10.18653\/v1\/2020.emnlp-main.213","article-title":"COMET: A neural framework for MT evaluation","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Rei","year":"2020"},{"key":"2025051914251602900_bib156","doi-asserted-by":"publisher","first-page":"3787","DOI":"10.18653\/v1\/2020.acl-main.350","article-title":"SimulSpeech: End-to-end simultaneous speech to text translation","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Yi","year":"2020"},{"key":"2025051914251602900_bib157","doi-asserted-by":"publisher","first-page":"175","DOI":"10.1163\/9789042031814_014","article-title":"Standing on quicksand: Hearing viewers\u2019 comprehension and reading patterns of respoken subtitles for the news","volume-title":"New insights into audiovisual translation and media accessibility","author":"Romero-Fresco","year":"2010"},{"key":"2025051914251602900_bib158","volume-title":"Subtitling through Speech Recognition: Respeaking","author":"Romero-Fresco","year":"2011"},{"key":"2025051914251602900_bib159","doi-asserted-by":"publisher","first-page":"683","DOI":"10.3115\/1273073.1273161","article-title":"Simultaneous English-Japanese spoken language translation based on incremental dependency parsing and transfer","volume-title":"Proceedings of the COLING\/ACL 2006 Main Conference Poster Sessions","author":"Ryu","year":"2006"},{"key":"2025051914251602900_bib160","article-title":"Evaluation of a simultaneous interpretation system and analysis of speech log for user experience assessment","volume-title":"Proceedings of the 10th International Workshop on Spoken Language Translation: 
Papers","author":"Sakamoto","year":"2013"},{"key":"2025051914251602900_bib161","doi-asserted-by":"publisher","first-page":"228","DOI":"10.18653\/v1\/2020.iwslt-1.28","article-title":"Towards stream translation: Adaptive computation time for simultaneous machine translation","volume-title":"Proceedings of the 17th International Conference on Spoken Language Translation","author":"Schneider","year":"2020"},{"key":"2025051914251602900_bib162","article-title":"Simultaneous translation for unsegmented input: A sliding window approach","author":"Sen","year":"2022","journal-title":"arXiv preprint arXiv:2210.09754"},{"key":"2025051914251602900_bib163","first-page":"217","article-title":"Learning segmentations that balance latency versus quality in spoken language translation","volume-title":"Proceedings of the 12th International Workshop on Spoken Language Translation: Papers","author":"Shavarani","year":"2015"},{"key":"2025051914251602900_bib164","article-title":"Constructing a speech translation system using simultaneous interpretation data","volume-title":"Proceedings of the 10th International Workshop on Spoken Language Translation: Papers","author":"Shimizu","year":"2013"},{"key":"2025051914251602900_bib165","first-page":"154","article-title":"Simultaneous translation using optimized segmentation","volume-title":"Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)","author":"Siahbani","year":"2018"},{"key":"2025051914251602900_bib166","doi-asserted-by":"publisher","first-page":"2351","DOI":"10.21437\/Interspeech.2014-511","article-title":"A semi-Markov model for speech segmentation with an utterance-break prior","volume-title":"Proceedings of Interspeech 2014","author":"Sinclair","year":"2014"},{"issue":"1","key":"2025051914251602900_bib167","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/97.736233","article-title":"A statistical model-based voice activity 
detection","volume":"6","author":"Sohn","year":"1999","journal-title":"IEEE Signal Processing Letters"},{"key":"2025051914251602900_bib168","doi-asserted-by":"publisher","first-page":"7409","DOI":"10.18653\/v1\/2020.acl-main.661","article-title":"Speech translation and the end-to-end promise: Taking stock of where we are","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Sperber","year":"2020"},{"issue":"2","key":"2025051914251602900_bib169","first-page":"116","article-title":"Machine translation of speech","volume":"6","author":"Stentiford","year":"1988","journal-title":"British Telecom Technology Journal"},{"key":"2025051914251602900_bib170","article-title":"Multilingual simultaneous speech translation","author":"Subramanya","year":"2022","journal-title":"arXiv preprint arXiv:2203.14835"},{"key":"2025051914251602900_bib171","article-title":"Streaming sequence transduction through dynamic compression","author":"Tan","year":"2024","journal-title":"arXiv preprint arXiv: 2402.01172"},{"key":"2025051914251602900_bib172","doi-asserted-by":"publisher","first-page":"12441","DOI":"10.18653\/v1\/2023.acl-long.695","article-title":"Hybrid transducer and attention based encoder-decoder modeling for speech-to-text tasks","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Tang","year":"2023"},{"issue":"6","key":"2025051914251602900_bib173","doi-asserted-by":"publisher","DOI":"10.1145\/3530811","article-title":"Efficient transformers: A survey","volume":"55","author":"Yi","year":"2022","journal-title":"ACM Computing Surveys"},{"key":"2025051914251602900_bib174","doi-asserted-by":"publisher","first-page":"82","DOI":"10.18653\/v1\/W17-4811","article-title":"Neural machine translation with extended context","volume-title":"Proceedings of the Third Workshop on Discourse in Machine 
Translation","author":"Tiedemann","year":"2017"},{"key":"2025051914251602900_bib175","doi-asserted-by":"publisher","first-page":"106","DOI":"10.21437\/Interspeech.2022-59","article-title":"SHAS: Approaching optimal segmentation for end-to-end speech translation","volume-title":"Proceedings of Interspeech 2022","author":"Tsiamas","year":"2022"},{"key":"2025051914251602900_bib176","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2004-156","article-title":"Speech translation: Past, present and future","volume-title":"Interspeech","author":"Waibel","year":"2004"},{"key":"2025051914251602900_bib177","doi-asserted-by":"publisher","first-page":"793","DOI":"10.1109\/ICASSP.1991.150456","article-title":"Janus: A speech-to-speech translation system using connectionist and symbolic processing strategies","volume":"2","author":"Waibel","year":"1991","journal-title":"[Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing"},{"key":"2025051914251602900_bib178","doi-asserted-by":"publisher","first-page":"6977","DOI":"10.1109\/ICASSP43922.2022.9746873","article-title":"Vadoi: Voice-activity-detection overlapping inference for end-to-end long-form speech recognition","volume-title":"ICASSP 2022 \u2013 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Wang","year":"2022"},{"key":"2025051914251602900_bib179","doi-asserted-by":"publisher","first-page":"247","DOI":"10.18653\/v1\/2022.iwslt-1.21","article-title":"The HW-TSC\u2019s simultaneous speech translation system for IWSLT 2022 evaluation","volume-title":"Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)","author":"Wang","year":"2022"},{"key":"2025051914251602900_bib180","doi-asserted-by":"publisher","first-page":"57","DOI":"10.21437\/Interspeech.2023-2004","article-title":"LAMASSU: A streaming language-agnostic multilingual speech recognition and translation model using neural 
transducers","volume-title":"Proceedings of INTERSPEECH 2023","author":"Wang","year":"2023"},{"key":"2025051914251602900_bib181","first-page":"139","article-title":"An efficient and effective online sentence segmenter for simultaneous interpretation","volume-title":"Proceedings of the 3rd Workshop on Asian Translation (WAT2016)","author":"Wang","year":"2016"},{"key":"2025051914251602900_bib182","first-page":"1","article-title":"Online sentence segmentation for simultaneous interpretation using multi-shifted recurrent neural network","volume-title":"Proceedings of Machine Translation Summit XVII: Research Track","author":"Wang","year":"2019"},{"key":"2025051914251602900_bib183","doi-asserted-by":"publisher","first-page":"2625","DOI":"10.21437\/Interspeech.2017-503","article-title":"Sequence-to-sequence models can directly translate foreign speech","volume-title":"Interspeech 2017","author":"Weiss","year":"2017"},{"key":"2025051914251602900_bib184","doi-asserted-by":"publisher","first-page":"2533","DOI":"10.18653\/v1\/2021.eacl-main.216","article-title":"Streaming models for joint speech recognition and translation","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume","author":"Weller","year":"2021"},{"key":"2025051914251602900_bib185","doi-asserted-by":"publisher","first-page":"1435","DOI":"10.18653\/v1\/2022.findings-acl.113","article-title":"End-to-end speech translation for code switched speech","volume-title":"Findings of the Association for Computational Linguistics: ACL 2022","author":"Weller","year":"2022"},{"key":"2025051914251602900_bib186","doi-asserted-by":"publisher","first-page":"237","DOI":"10.18653\/v1\/2020.iwslt-1.29","article-title":"Neural simultaneous speech translation using alignment-based chunking","volume-title":"Proceedings of the 17th International Conference on Spoken Language 
Translation","author":"Wilken","year":"2020"},{"key":"2025051914251602900_bib187","doi-asserted-by":"publisher","first-page":"233","DOI":"10.1109\/SLT.2008.4777883","article-title":"Simultaneous machine translation of german lectures into english: Investigating research challenges for the future","volume-title":"2008 IEEE Spoken Language Technology Workshop","author":"Wolfel","year":"2008"},{"key":"2025051914251602900_bib188","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1007\/978-3-319-05951-8_11","article-title":"Real-time statistical speech translation","volume-title":"New Perspectives in Information Systems and Technologies, Volume 1","author":"Wo\u0142k","year":"2014"},{"key":"2025051914251602900_bib189","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1007\/3-540-49478-2_3","article-title":"A modular approach to spoken language translation for large domains","volume-title":"Machine Translation and the Information Soup: Third Conference of the Association for Machine Translation in the Americas AMTA\u201998 Langhorne, PA, USA, October 28\u201331, 1998 Proceedings 3","author":"Woszczyna","year":"1998"},{"key":"2025051914251602900_bib190","doi-asserted-by":"publisher","first-page":"2132","DOI":"10.21437\/Interspeech.2020-2079","article-title":"Streaming transformer-based acoustic models using self-attention with augmented memory","volume-title":"Proceedings of Interspeech 2020","author":"Chunyang","year":"2020"},{"key":"2025051914251602900_bib191","article-title":"Dutongchuan: Context-aware translation model for simultaneous interpreting","author":"Xiong","year":"2019","journal-title":"arXiv preprint arXiv:1907.12984"},{"key":"2025051914251602900_bib192","doi-asserted-by":"publisher","first-page":"3263","DOI":"10.21437\/Interspeech.2022-10953","article-title":"Large-scale streaming end-to-end speech translation with neural transducers","volume-title":"Proceedings of Interspeech 
2022","author":"Xue","year":"2022"},{"key":"2025051914251602900_bib193","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/ASRU57964.2023.10389799","article-title":"A weakly-supervised streaming multilingual speech model with truly zero-shot capability","volume-title":"2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","author":"Xue","year":"2023"},{"key":"2025051914251602900_bib194","doi-asserted-by":"publisher","first-page":"235","DOI":"10.18653\/v1\/2023.iwslt-1.20","article-title":"CMU\u2019s IWSLT 2023 simultaneous speech translation system","volume-title":"Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)","author":"Yan","year":"2023"},{"key":"2025051914251602900_bib195","doi-asserted-by":"publisher","first-page":"10866","DOI":"10.1109\/ICASSP48485.2024.10446050","article-title":"Diarist: Streaming speech translation with speaker diarization","volume-title":"ICASSP 2024 \u2013 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Yang","year":"2024"},{"key":"2025051914251602900_bib196","first-page":"123","article-title":"Dynamic masking for improved stability in online spoken language translation","volume-title":"Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)","author":"Yao","year":"2020"},{"key":"2025051914251602900_bib197","first-page":"1032","article-title":"Incremental segmentation and decoding strategies for simultaneous translation","volume-title":"Proceedings of the Sixth International Joint Conference on Natural Language Processing","author":"Yarmohammadi","year":"2013"},{"key":"2025051914251602900_bib198","doi-asserted-by":"publisher","first-page":"6999","DOI":"10.1109\/ICASSP40776.2020.9054358","article-title":"End-to-end automatic speech recognition integrated with CTC-based voice activity detection","volume-title":"ICASSP 2020 \u2013 2020 IEEE 
International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Yoshimura","year":"2020"},{"key":"2025051914251602900_bib199","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-10617","article-title":"Decision attentive regularization to improve simultaneous speech translation systems","author":"Zaidi","year":"2021","journal-title":"arXiv preprint arXiv:2110.15729"},{"key":"2025051914251602900_bib200","doi-asserted-by":"publisher","first-page":"116","DOI":"10.21437\/Interspeech.2022-10617","article-title":"Cross-modal decision regularization for simultaneous speech translation","volume-title":"Proceedings of Interspeech 2022","author":"Zaidi","year":"2022"},{"key":"2025051914251602900_bib201","doi-asserted-by":"publisher","first-page":"2461","DOI":"10.18653\/v1\/2021.findings-acl.218","article-title":"RealTranS: End-to-end simultaneous speech translation with convolutional weighted-shrinking transformer","volume-title":"Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021","author":"Zeng","year":"2021"},{"key":"2025051914251602900_bib202","doi-asserted-by":"publisher","first-page":"25","DOI":"10.18653\/v1\/2022.autosimtrans-1.5","article-title":"End-to-end simultaneous speech translation with pretraining and distillation: Huawei Noah\u2019s system for AutoSimTranS 2022","volume-title":"Proceedings of the Third Workshop on Automatic Simultaneous Translation","author":"Zeng","year":"2022"},{"key":"2025051914251602900_bib203","doi-asserted-by":"publisher","first-page":"2566","DOI":"10.18653\/v1\/2021.acl-long.200","article-title":"Beyond sentence-level end-to-end speech translation: Context helps","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long 
Papers)","author":"Zhang","year":"2021"},{"key":"2025051914251602900_bib204","doi-asserted-by":"publisher","first-page":"7814","DOI":"10.18653\/v1\/2023.emnlp-main.484","article-title":"Training simultaneous speech translation with robust and random wait-k-tokens strategy","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Zhang","year":"2023"},{"key":"2025051914251602900_bib205","doi-asserted-by":"publisher","first-page":"7829","DOI":"10.1109\/ICASSP40776.2020.9053896","article-title":"Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss","volume-title":"ICASSP 2020 \u2013 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Zhang","year":"2020"},{"key":"2025051914251602900_bib206","doi-asserted-by":"publisher","first-page":"7862","DOI":"10.18653\/v1\/2022.acl-long.542","article-title":"Learning adaptive segmentation policy for end-to-end simultaneous translation","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Zhang","year":"2022"},{"key":"2025051914251602900_bib207","doi-asserted-by":"publisher","first-page":"8964","DOI":"10.18653\/v1\/2024.acl-long.485","article-title":"StreamSpeech: Simultaneous speech-to-speech translation with multi-task learning","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Zhang","year":"2024"},{"key":"2025051914251602900_bib208","doi-asserted-by":"publisher","first-page":"992","DOI":"10.18653\/v1\/2022.emnlp-main.65","article-title":"Information-transport-based policy for simultaneous translation","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language 
Processing","author":"Zhang","year":"2022"},{"key":"2025051914251602900_bib209","doi-asserted-by":"publisher","first-page":"7659","DOI":"10.18653\/v1\/2023.findings-acl.485","article-title":"End-to-end simultaneous speech translation with differentiable segmentation","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Zhang","year":"2023"},{"key":"2025051914251602900_bib210","article-title":"Unified segment-to-segment framework for simultaneous sequence generation","volume":"36","author":"Zhang","year":"2024","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025051914251602900_bib211","article-title":"Google USM: Scaling automatic speech recognition beyond 100 languages","author":"Zhang","year":"2023","journal-title":"arXiv preprint arXiv:2303.01037"},{"key":"2025051914251602900_bib212","doi-asserted-by":"publisher","first-page":"437","DOI":"10.18653\/v1\/2020.acl-main.42","article-title":"Opportunistic decoding with timely correction for simultaneous translation","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Zheng","year":"2020"},{"key":"2025051914251602900_bib213","doi-asserted-by":"publisher","first-page":"208","DOI":"10.18653\/v1\/2022.iwslt-1.16","article-title":"The AISP-SJTU simultaneous translation system for IWSLT 2022","volume-title":"Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)","author":"Zhu","year":"2022"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00740\/2513610\/tacl_a_00740.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00740\/2513610\/tacl_a_00740.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,19]],"date-time":"2025-05-19T18:25:35Z","timestamp":1747679135000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00740\/128861\/How-Real-is-Your-Real-Time-Simultaneous-Speech-to"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":213,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00740","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]}}}