{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T19:30:56Z","timestamp":1771961456996,"version":"3.50.1"},"reference-count":38,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,7,13]],"date-time":"2024-07-13T00:00:00Z","timestamp":1720828800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,13]],"date-time":"2024-07-13T00:00:00Z","timestamp":1720828800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61976236"],"award-info":[{"award-number":["61976236"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>End-to-end speech to text translation aims to directly translate speech from one language into text in another, posing a challenging cross-modal task particularly in scenarios of limited data. Multi-task learning serves as an effective strategy for knowledge sharing between speech translation and machine translation, which allows models to leverage extensive machine translation data to learn the mapping between source and target languages, thereby improving the performance of speech translation. However, in multi-task learning, finding a set of weights that balances various tasks is challenging and computationally expensive. 
We propose an adaptive multi-task learning method that dynamically adjusts multi-task weights based on the proportional losses incurred during training, enabling an adaptive balance in multi-task learning for speech to text translation. Moreover, inherent representation disparities across different modalities impede speech translation models from harnessing textual data effectively. To bridge this modality gap, we propose applying optimal transport at the input of the end-to-end model to find the alignment between speech and text sequences and to learn shared representations between them. Experimental results show that our method effectively improves performance on the Tibetan-Chinese, English-German, and English-French speech translation datasets.<\/jats:p>","DOI":"10.1186\/s13636-024-00359-1","type":"journal-article","created":{"date-parts":[[2024,7,13]],"date-time":"2024-07-13T13:01:50Z","timestamp":1720875710000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Adaptive multi-task learning for speech to text translation"],"prefix":"10.1186","volume":"2024","author":[{"given":"Xin","family":"Feng","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7831-5721","authenticated-orcid":false,"given":"Yue","family":"Zhao","sequence":"additional","affiliation":[]},{"given":"Wei","family":"Zong","sequence":"additional","affiliation":[]},{"given":"Xiaona","family":"Xu","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,7,13]]},"reference":[{"issue":"2","key":"359_CR1","first-page":"116","volume":"6","author":"FWM Stentiford","year":"1988","unstructured":"F.W.M. Stentiford, M.G. Steer, Machine translation of speech. Br. Telecom Technol. J. 6(2), 116\u2013122 (1988)","journal-title":"Br. Telecom Technol. J."},{"key":"359_CR2","doi-asserted-by":"crossref","unstructured":"A. Waibel, A.N. 
Jain, ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing, JANUS: A speech-to-speech translation system using connectionist and symbolic processing strategies (Toronto, 1991), pp. 793\u2013796","DOI":"10.1109\/ICASSP.1991.150456"},{"key":"359_CR3","unstructured":"A. B\u00e9rard, O. Pietquin, C. Servan, Listen and translate: A proof of concept for end-to-end speech-to-text translation. CoRR. (2016). arXiv:1612.01744"},{"key":"359_CR4","doi-asserted-by":"crossref","unstructured":"L. Duong, A. Anastasopoulos, Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies, An Attentional Model for Speech Translation without Transcription (Association for Computational Linguistics (ACL),\u00a0Stroudsburg, 2016), pp. 949\u2013959","DOI":"10.18653\/v1\/N16-1109"},{"key":"359_CR5","doi-asserted-by":"crossref","unstructured":"A. Kendall, Y. Gal, Proceedings of the IEEE conference on computer vision and pattern recognition, Multi-task learning using uncertainty to weigh losses for scene geometry and semantics (IEEE Computer Society, Washington, DC, 2018), pp. 7482\u20137491","DOI":"10.1109\/CVPR.2018.00781"},{"issue":"7","key":"359_CR6","first-page":"3614","volume":"44","author":"W Vandenhende","year":"2021","unstructured":"W. Vandenhende, S. Georgoulis, Multi-task learning for dense prediction tasks: A survey. IEEE Trans. Pattern. Anal. Mach. Intel. 44(7), 3614\u20133633 (2021)","journal-title":"IEEE Trans. Pattern. Anal. Mach. Intel."},{"key":"359_CR7","unstructured":"Z. Chen, V. Badrinarayanan, in Proceedings of the International Conference on Machine Learning, Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks (Proceedings of Machine Learning Research (PMLR),\u00a0Cambridge, 2018), pp. 14\u201316"},{"key":"359_CR8","doi-asserted-by":"crossref","unstructured":"S. Liu, E. 
Johns, Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, End-to-end multi-task learning with attention (IEEE Computer Society,\u00a0Washington, DC, 2019), pp. 1871\u20131880","DOI":"10.1109\/CVPR.2019.00197"},{"key":"359_CR9","doi-asserted-by":"crossref","unstructured":"R. Ye, M. Wang, L. LI, Cross-modal contrastive learning for speech translation. Phys. Lett. 5099\u20135113 (2022). arXiv:2205.02444","DOI":"10.18653\/v1\/2022.naacl-main.376"},{"key":"359_CR10","first-page":"2053","volume":"28","author":"C Frogner","year":"2015","unstructured":"C. Frogner, C. Zhang, Learning with a Wasserstein loss. Adv. Neural Inf. Process. Syst. 28, 2053\u20132061 (2015)","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"359_CR11","doi-asserted-by":"crossref","unstructured":"A. Alinejad, A. Sarkar, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing(EMNLP), Effectively pretraining a speech translation decoder with machine translation data (Association for Computational Linguistics (ACL),\u00a0Stroudsburg, 2020), pp. 8014\u20138020","DOI":"10.18653\/v1\/2020.emnlp-main.644"},{"key":"359_CR12","unstructured":"R. Zheng, J. Chen, International Conference on Machine Learning, Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation (PMLR, 2021). pp. 12736\u201312746"},{"key":"359_CR13","doi-asserted-by":"publisher","unstructured":"C. Xu, B. Hu, Stacked acoustic-and-textual encoding: Integrating the pre-trained models into speech translation encoders (2021), pp. 2619\u20132630. https:\/\/doi.org\/10.18653\/v1\/2021.acl-long.204","DOI":"10.18653\/v1\/2021.acl-long.204"},{"key":"359_CR14","doi-asserted-by":"crossref","unstructured":"H. Le, J. Pino, C. Wang, Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation (2020), pp. 3520\u20133533. 
arXiv:2011.00747","DOI":"10.18653\/v1\/2020.coling-main.314"},{"key":"359_CR15","doi-asserted-by":"crossref","unstructured":"H.K. Vydana, M. Karafi\u00e1t, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jointly trained transformers models for spoken language translation (IEEE Signal Processing Society,\u00a0Piscataway, 2021), pp. 7513\u20137517","DOI":"10.1109\/ICASSP39728.2021.9414159"},{"key":"359_CR16","doi-asserted-by":"crossref","unstructured":"Y. Tang, J. Pino, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), A general multi-task learning framework to leverage text data for speech to text tasks (IEEE Signal Processing Society,\u00a0Piscataway, 2021), pp. 6209\u20136213","DOI":"10.1109\/ICASSP39728.2021.9415058"},{"key":"359_CR17","doi-asserted-by":"crossref","unstructured":"M. Gaido, M.A. Di Gangi, M. Negri, End-to-end speech translation with knowledge distillation: FBK@ IWSLT2020 (2020), pp. 80\u201388. arXiv:2006.02965","DOI":"10.18653\/v1\/2020.iwslt-1.8"},{"key":"359_CR18","doi-asserted-by":"crossref","unstructured":"H. Inaguma, T. Kawahara, in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation (Association for\u00a0Computational Linguistics (ACL), Stroudsburg, 2020), pp. 1872\u20131881","DOI":"10.18653\/v1\/2021.naacl-main.150"},{"key":"359_CR19","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-71050-9","volume-title":"Optimal Transport: Old and New","author":"C Villani","year":"2009","unstructured":"C. Villani, Optimal transport: Old and new (Springer, Berlin, 2009)"},{"key":"359_CR20","doi-asserted-by":"crossref","unstructured":"Y.C. Chen, L. Li, L. Yu, European conference on computer vision, Uniter: Universal image-text representation learning (Springer International Publishing, Cham, 2020), pp. 
104\u2013120","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"359_CR21","doi-asserted-by":"crossref","unstructured":"S. Gu, Y. Feng, in findings of the Association for Computational Linguistics: EMNLP, Improving zero-shot multilingual translation with universal representations and cross-mappings (Association for Computational Linguistics (ACL), Stroudsburg, 2022), pp. 6492\u20136504","DOI":"10.18653\/v1\/2022.findings-emnlp.485"},{"key":"359_CR22","doi-asserted-by":"crossref","unstructured":"Y. Zhou, Q. Fang, in proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, CMOT: Cross-modal mixup via optimal transport for speech translation (Association for Computational Linguistics (ACL),\u00a0Stroudsburg, 2023), pp. 7873\u20137887","DOI":"10.18653\/v1\/2023.acl-long.436"},{"key":"359_CR23","unstructured":"Y. Liu, J. Zhu, Bridging the modality gap for speech-to-text translation (2020). arXiv:2010.14920"},{"key":"359_CR24","doi-asserted-by":"crossref","unstructured":"Q. Fang, R. Ye, in proceedings of the 60th Annual Meeting of the Association for Computational Lingui stics, Stemm: Self-learning with speech-text manifold mixup for speech translation (Association for\u00a0Computational Linguistics (ACL), Stroudsburg, 2022), pp. 7050\u20137062","DOI":"10.18653\/v1\/2022.acl-long.486"},{"key":"359_CR25","doi-asserted-by":"crossref","unstructured":"C. Han, M. Wang, H. Ji, Learning shared semantic space for speech-to-text translation. CoRR. 2214\u20132225 (2021). arXiv:2105.03095","DOI":"10.18653\/v1\/2021.findings-acl.195"},{"key":"359_CR26","doi-asserted-by":"publisher","unstructured":"R. Ye, M. Wang, End-to-end speech translation via cross-modal progressive training 2267\u20132271 (2021). https:\/\/doi.org\/10.21437\/INTERSPEECH.2021-1065","DOI":"10.21437\/INTERSPEECH.2021-1065"},{"key":"359_CR27","first-page":"5998","volume":"30","author":"A Vaswani","year":"2017","unstructured":"A. Vaswani, N. Shazeer, Attention is all you need. Adv. 
Neural Inf. Process. Syst. 30, 5998\u20136008 (2017)","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"359_CR28","unstructured":"A. Baevski, Y. Zhou, Wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449\u201312460 (2020)"},{"key":"359_CR29","doi-asserted-by":"crossref","unstructured":"G. Peyr\u00e9, M. Cuturi, Computational optimal transport: With applications to data science. Found. Trends\u00ae Mach. Learn. 11(5\u20136), 355\u2013607 (2019)","DOI":"10.1561\/2200000073"},{"key":"359_CR30","first-page":"2292","volume":"26","author":"M Cuturi","year":"2013","unstructured":"M. Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 26, 2292\u20132300 (2013)","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"359_CR31","doi-asserted-by":"crossref","unstructured":"Y. Zhao, X. XU, An open speech resource for Tibetan multi-dialect and multitask recognition (OpenSLR, 2020). http:\/\/www.openslr.org\/124\/. Accessed 22 June 2023","DOI":"10.1504\/IJCSE.2020.107351"},{"key":"359_CR32","unstructured":"M.A. Di Gangi, R. Cattoni, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Must-c: A Multilingual Speech Translation Corpus (ELSEVIER SCI LTD, Oxon, 2019), pp. 2012\u20132017"},{"key":"359_CR33","unstructured":"O. Bojar, R. Chatterjee, First conference on machine translation, Findings of the 2016 Conference on Machine Translation (wmt16) (Association for\u00a0Computational Linguistics (ACL), Stroudsburg, 2016), pp. 131\u2013198"},{"key":"359_CR34","unstructured":"C. Wang, Y. Tang, Fairseq S2T: Fast speech-to-text modeling with Fairseq (2020), pp. 33\u201339. arXiv:2010.05171"},{"key":"359_CR35","doi-asserted-by":"crossref","unstructured":"T. Kudo, J. 
Richardson, in Proceedings of the 2018 Conference on Empirical Methods in Nat ural Language Processing: System Demonstrations, Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing (2018), pp. 66\u201371 2018.\u00a0eprint arXiv:1808.06226,cs.CL","DOI":"10.18653\/v1\/D18-2012"},{"key":"359_CR36","unstructured":"D.P. Kingma, J. Ba, Adam: A method for stochastic optimization (2014). arXiv:1412.6980"},{"key":"359_CR37","doi-asserted-by":"crossref","unstructured":"M. Post, Proceedings of the Third Conference on Machine Translation, A call for clarity in reporting BLEU scores (2018), pp. 186\u2013191. eprint arXiv:1804.08771, cs.CL","DOI":"10.18653\/v1\/W18-6319"},{"key":"359_CR38","doi-asserted-by":"crossref","unstructured":"Y. Tang, J. Pino, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Improving speech translation by understanding and learning from the auxiliary text translation task (Association for Computational Linguistics (ACL), Stroudsburg, 2021), pp. 
4252\u20134261","DOI":"10.18653\/v1\/2021.acl-long.328"}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-024-00359-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-024-00359-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-024-00359-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,24]],"date-time":"2024-11-24T01:55:03Z","timestamp":1732413303000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-024-00359-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,13]]},"references-count":38,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["359"],"URL":"https:\/\/doi.org\/10.1186\/s13636-024-00359-1","relation":{},"ISSN":["1687-4722"],"issn-type":[{"value":"1687-4722","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,13]]},"assertion":[{"value":"18 April 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 June 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 July 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing 
interests"}}],"article-number":"36"}}