{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,31]],"date-time":"2025-10-31T07:51:01Z","timestamp":1761897061462},"reference-count":28,"publisher":"MIT Press - Journals","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Transactions of the Association for Computational Linguistics"],"published-print":{"date-parts":[[2019,11]]},"abstract":"<jats:p> Speech translation has traditionally been approached through cascaded models consisting of a speech recognizer trained on a corpus of transcribed speech, and a machine translation system trained on parallel texts. Several recent works have shown the feasibility of collapsing the cascade into a single, direct model that can be trained in an end-to-end fashion on a corpus of translated speech. However, experiments are inconclusive on whether the cascade or the direct model is stronger, and have only been conducted under the unrealistic assumption that both are trained on equal amounts of data, ignoring other available speech recognition and machine translation corpora. <\/jats:p><jats:p> In this paper, we demonstrate that direct speech translation models require more data to perform well than cascaded models, and although they allow including auxiliary data through multi-task training, they are poor at exploiting such data, putting them at a severe disadvantage. As a remedy, we propose the use of end-to-end trainable models with two attention mechanisms, the first establishing source speech to source text alignments, the second modeling source to target text alignment. We show that such models naturally decompose into multi-task\u2013trainable recognition and translation tasks and propose an attention-passing technique that alleviates error propagation issues in a previous formulation of a model with two attention stages. 
Our proposed model outperforms all examined baselines and is able to exploit auxiliary training data much more effectively than direct attentional models. <\/jats:p>","DOI":"10.1162\/tacl_a_00270","type":"journal-article","created":{"date-parts":[[2019,6,19]],"date-time":"2019-06-19T17:40:02Z","timestamp":1560966002000},"page":"313-325","source":"Crossref","is-referenced-by-count":35,"title":["Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation"],"prefix":"10.1162","volume":"7","author":[{"given":"Matthias","family":"Sperber","sequence":"first","affiliation":[{"name":"Karlsruhe Institute of Technology, Germany."}]},{"given":"Graham","family":"Neubig","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, USA."}]},{"given":"Jan","family":"Niehues","sequence":"additional","affiliation":[{"name":"Karlsruhe Institute of Technology, Germany"}]},{"given":"Alex","family":"Waibel","sequence":"additional","affiliation":[{"name":"Karlsruhe Institute of Technology, Germany"},{"name":"Carnegie Mellon University, USA."}]}],"member":"281","reference":[{"key":"bib1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00109"},{"key":"bib2","volume-title":"North American Chapter of the Association for Computational Linguistics (NAACL)","author":"Anastasopoulos Antonios","year":"2018"},{"key":"bib3","volume-title":"International Conference on Representation Learning (ICLR)","author":"Bahdanau Dzmitry","year":"2015"},{"key":"bib4","volume-title":"Annual Conference of the International Speech Communication Association (InterSpeech)","author":"Bansal Sameer","year":"2018"},{"key":"bib5","volume-title":"North American Chapter of the Association for Computational Linguistics (NAACL)","author":"Bansal Sameer","year":"2019"},{"key":"bib6","volume-title":"International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"B\u00e9rard Alexandre","year":"2018"},{"key":"bib7","volume-title":"Acoustics, Speech and Signal 
Processing (ICASSP)","author":"Chan William","year":"2016"},{"key":"bib8","first-page":"577","volume-title":"Advances in Neural Information Processing Systems (NIPS)","author":"Chorowski Jan K.","year":"2015"},{"key":"bib9","first-page":"69","volume-title":"Language Resources and Evaluation (LREC)","author":"Cieri Christopher","year":"2004"},{"key":"bib10","first-page":"949","volume-title":"North American Chapter of the Association for Computational Linguistics (NAACL)","author":"Duong Long","year":"2016"},{"key":"bib11","first-page":"1019","volume-title":"Neural Information Processing Systems Conference (NIPS)","author":"Gal Yarin","year":"2016"},{"key":"bib12","first-page":"2630","volume-title":"Annual Conference of the International Speech Communication Association (InterSpeech)","author":"Kano Takatomo","year":"2017"},{"key":"bib13","volume-title":"International Conference on Learning Representations (ICLR)","author":"Kingma Diederik P.","year":"2014"},{"key":"bib14","volume-title":"Language Resources and Evaluation (LREC)","author":"Kocabiyikoglu Ali Can","year":"2018"},{"key":"bib15","first-page":"923","volume-title":"Conference on Language Resources and Evaluation (LREC)","author":"Lison Pierre","year":"2016"},{"key":"bib16","volume-title":"Conference of the Association for Machine Translation in the Americas (AMTA) Open Source Software Showcase","author":"Neubig Graham","year":"2018"},{"key":"bib17","volume-title":"North American Chapter of the Association for Computational Linguistics (NAACL)","author":"Nguyen Toan Q.","year":"2018"},{"key":"bib18","doi-asserted-by":"crossref","first-page":"80","DOI":"10.18653\/v1\/W17-4708","volume-title":"Conference on Machine Translation (WMT)","author":"Niehues Jan","year":"2017"},{"key":"bib19","first-page":"5206","volume-title":"Acoustics, Speech and Signal Processing (ICASSP)","author":"Panayotov Vassil","year":"2015"},{"key":"bib20","volume-title":"International Workshop on Spoken Language Translation 
(IWSLT)","author":"Post Matt","year":"2013"},{"key":"bib21","volume-title":"International Workshop on Spoken Language Translation (IWSLT)","author":"Sperber Matthias","year":"2017"},{"key":"bib22","first-page":"2818","volume-title":"Computer Vision and Pattern Recognition (CVPR)","author":"Szegedy Christian","year":"2016"},{"key":"bib23","first-page":"431","volume-title":"International Joint Conference on Natural Language Processing (IJCNLP)","author":"Tjandra Andros","year":"2017"},{"key":"bib24","volume-title":"Annual Conference of the International Speech Communication Association (InterSpeech)","author":"Toshniwal Shubham","year":"2017"},{"key":"bib25","volume-title":"Conference on Artificial Intelligence (AAAI)","author":"Zhaopeng Tu","year":"2017"},{"key":"bib26","volume-title":"Annual Conference of the International Speech Communication Association (InterSpeech)","author":"Weiss Ron J.","year":"2017"},{"key":"bib27","volume-title":"Neural Information Processing Systems Conference (NIPS)","author":"Xia Yingce","year":"2017"},{"key":"bib28","volume-title":"International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Zhang Yu","year":"2017"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mitpressjournals.org\/doi\/pdf\/10.1162\/tacl_a_00270","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,3,12]],"date-time":"2021-03-12T21:39:23Z","timestamp":1615585163000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/43517"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11]]},"references-count":28,"alternative-id":["10.1162\/tacl_a_00270"],"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00270","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,11]]}}}