{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T08:19:47Z","timestamp":1771575587620,"version":"3.50.1"},"reference-count":52,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2022,8,1]],"date-time":"2022-08-01T00:00:00Z","timestamp":1659312000000},"content-version":"vor","delay-in-days":212,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,7,27]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In Neural Machine Translation, it is typically assumed that the sentence with the highest estimated probability should also be the translation with the highest quality as measured by humans. In this work, we question this assumption and show that model estimates and translation quality only vaguely correlate. We apply Minimum Bayes Risk (MBR) decoding on unbiased samples to optimize diverse automated metrics of translation quality as an alternative inference strategy to beam search. Instead of targeting the hypotheses with the highest model probability, MBR decoding extracts the hypotheses with the highest estimated quality. Our experiments show that the combination of a neural translation model with a neural reference-based metric, Bleurt, results in significant improvement in human evaluations. 
This improvement is obtained with translations different from classical beam-search output: These translations have much lower model likelihood and are less favored by surface metrics like Bleu.<\/jats:p>","DOI":"10.1162\/tacl_a_00491","type":"journal-article","created":{"date-parts":[[2022,8,1]],"date-time":"2022-08-01T13:45:09Z","timestamp":1659361509000},"page":"811-825","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":7,"title":["High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics"],"prefix":"10.1162","volume":"10","author":[{"given":"Markus","family":"Freitag","sequence":"first","affiliation":[{"name":"Google Research, USA. freitag@google.com"}]},{"given":"David","family":"Grangier","sequence":"additional","affiliation":[{"name":"Google Research, USA. grangier@google.com"}]},{"given":"Qijun","family":"Tan","sequence":"additional","affiliation":[{"name":"Google Research, USA. qijuntan@google.com"}]},{"given":"Bowen","family":"Liang","sequence":"additional","affiliation":[{"name":"Google Research, USA. 
bowenl@google.com"}]}],"member":"281","published-online":{"date-parts":[[2022,7,27]]},"reference":[{"key":"2022080113445134900_bib1","article-title":"An actor-critic algorithm for sequence prediction","volume-title":"5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings","author":"Bahdanau","year":"2017"},{"key":"2022080113445134900_bib2","article-title":"Neural machine translation by jointly learning to align and translate","volume-title":"3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings","author":"Bahdanau","year":"2015"},{"key":"2022080113445134900_bib3","first-page":"65","article-title":"METEOR: An automatic metric for MT evaluation with improved correlation with human judgments","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","author":"Banerjee","year":"2005"},{"key":"2022080113445134900_bib4","volume-title":"Proceedings of the Sixth Conference on Machine Translation","author":"Barrault","year":"2021"},{"key":"2022080113445134900_bib5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/W19-5301","article-title":"Findings of the 2019 conference on machine translation (WMT19)","volume-title":"Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)","author":"Barrault","year":"2019"},{"key":"2022080113445134900_bib6","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4757-4286-2","volume-title":"Statistical Decision Theory and Bayesian Analysis","author":"Berger","year":"1985","edition":"2nd ed."},{"key":"2022080113445134900_bib7","article-title":"Mathematical statistics: Basic ideas and selected topics","author":"Bickel","year":"1977","journal-title":"Holder-Day Series in Probability and Statistics, Holder-Day, San 
Francisco"},{"key":"2022080113445134900_bib8","article-title":"The University of Edinburgh\u2019s English-German and English-Hausa submissions to the WMT21 news translation task","volume-title":"Proceedings of the Sixth Conference on Machine Translation","author":"Chen","year":"2021"},{"key":"2022080113445134900_bib9","article-title":"Rethinking embedding coupling in pre-trained language models","author":"Chung","year":"2020"},{"key":"2022080113445134900_bib10","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00417","article-title":"A statistical analysis of summarization evaluation metrics using resampling methods","author":"Deutsch","year":"2021","journal-title":"arXiv preprint arXiv:2104.00054"},{"key":"2022080113445134900_bib11","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2022080113445134900_bib12","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1033","article-title":"Classical structured prediction losses for sequence to sequence learning","volume-title":"Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)","author":"Edunov","year":"2018"},{"key":"2022080113445134900_bib13","doi-asserted-by":"publisher","first-page":"4506","DOI":"10.18653\/v1\/2020.coling-main.398","article-title":"Is MAP decoding all you need? 
The inadequacy of the mode in neural machine translation","volume-title":"Proceedings of the 28th International Conference on Computational Linguistics","author":"Eikema","year":"2020"},{"key":"2022080113445134900_bib14","article-title":"Sampling-based minimum Bayes risk decoding for neural machine translation","author":"Eikema","year":"2021"},{"key":"2022080113445134900_bib15","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00437","article-title":"Experts, errors, and context: A large-scale study of human evaluation for machine translation","author":"Freitag","year":"2021"},{"key":"2022080113445134900_bib16","doi-asserted-by":"publisher","first-page":"61","DOI":"10.18653\/v1\/2020.emnlp-main.5","article-title":"BLEU might be guilty but references are not innocent","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Freitag","year":"2020"},{"key":"2022080113445134900_bib17","article-title":"Results of the WMT21 metric shared task","volume-title":"Proceedings of the Sixth Conference on Machine Translation","author":"Freitag","year":"2021"},{"key":"2022080113445134900_bib18","first-page":"1243","article-title":"Convolutional sequence to sequence learning","volume-title":"Proceedings of the 34th International Conference on Machine Learning","author":"Gehring","year":"2017"},{"issue":"2","key":"2022080113445134900_bib19","doi-asserted-by":"publisher","first-page":"115","DOI":"10.1006\/csla.2000.0138","article-title":"Minimum bayes-risk automatic speech recognition","volume":"14","author":"Goel","year":"2000","journal-title":"Computer Speech & Language"},{"key":"2022080113445134900_bib20","doi-asserted-by":"publisher","first-page":"177","DOI":"10.3115\/981863.981887","article-title":"Parsing algorithms and metrics","volume-title":"Proceedings of the 34th Annual Meeting on Association for Computational 
Linguistics","author":"Goodman","year":"1996"},{"key":"2022080113445134900_bib21","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-2012","article-title":"Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing","author":"Kudo","year":"2018","journal-title":"arXiv preprint arXiv:1808.06226"},{"key":"2022080113445134900_bib22","doi-asserted-by":"publisher","first-page":"140","DOI":"10.3115\/1118693.1118712","article-title":"Minimum bayes-risk word alignments of bilingual texts","volume-title":"Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)","author":"Kumar","year":"2002"},{"key":"2022080113445134900_bib23","first-page":"169","article-title":"Minimum Bayes-risk decoding for statistical machine translation","volume-title":"Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004","author":"Kumar","year":"2004"},{"key":"2022080113445134900_bib24","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.662","article-title":"Machine translation decoding beyond beam search","author":"Leblond","year":"2021"},{"key":"2022080113445134900_bib25","doi-asserted-by":"crossref","first-page":"501","DOI":"10.3115\/1220355.1220427","article-title":"Orange: A method for evaluating automatic evaluation metrics for machine translation","volume-title":"COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics","author":"Lin","year":"2004"},{"key":"2022080113445134900_bib26","doi-asserted-by":"crossref","first-page":"507","DOI":"10.18653\/v1\/W19-5358","article-title":"YiSi\u2014A unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources","volume-title":"Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 
1)","author":"Lo","year":"2019"},{"key":"2022080113445134900_bib27","first-page":"895","article-title":"Extended study on using pretrained language models and YiSi-1 for machine translation evaluation","volume-title":"Proceedings of the Fifth Conference on Machine Translation","author":"Lo","year":"2020"},{"key":"2022080113445134900_bib28","doi-asserted-by":"publisher","first-page":"455","DOI":"10.5565\/rev\/tradumatica.77","article-title":"Multidimensional Quality Metrics (MQM) : A Framework for Declaring and Describing Translation Quality Metrics","author":"Lommel","year":"2014","journal-title":"Tradum\u00e0tica"},{"key":"2022080113445134900_bib29","first-page":"688","article-title":"Results of the WMT20 metrics shared task","volume-title":"Proceedings of the Fifth Conference on Machine Translation","author":"Mathur","year":"2020"},{"key":"2022080113445134900_bib30","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.22","article-title":"Understanding the properties of minimum Bayes risk decoding in neural machine translation","volume-title":"Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)","author":"M\u00fcller","year":"2021"},{"key":"2022080113445134900_bib31","doi-asserted-by":"publisher","first-page":"3956","DOI":"10.18653\/v1\/W18-6301","article-title":"Analyzing uncertainty in neural machine translation","volume-title":"Proceedings of the 35th International Conference on Machine Learning","author":"Ott","year":"2018"},{"key":"2022080113445134900_bib32","doi-asserted-by":"publisher","first-page":"311","DOI":"10.3115\/1073083.1073135","article-title":"BLEU: A method for automatic evaluation of machine translation","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational 
Linguistics","author":"Papineni","year":"2002"},{"key":"2022080113445134900_bib33","doi-asserted-by":"publisher","first-page":"392","DOI":"10.18653\/v1\/W15-3049","article-title":"chrF: character n-gram F-score for automatic MT evaluation","volume-title":"Proceedings of the Tenth Workshop on Statistical Machine Translation","author":"Popovi\u0107","year":"2015"},{"key":"2022080113445134900_bib34","doi-asserted-by":"publisher","first-page":"186","DOI":"10.18653\/v1\/W18-6319","article-title":"A call for clarity in reporting BLEU scores","volume-title":"Proceedings of the Third Conference on Machine Translation: Research Papers","author":"Post","year":"2018"},{"key":"2022080113445134900_bib35","article-title":"Are references really needed? Unbabel-ist 2021 submission for the metrics shared task","volume-title":"Proceedings of the Sixth Conference on Machine Translation","author":"Rei","year":"2021"},{"key":"2022080113445134900_bib36","doi-asserted-by":"crossref","first-page":"2685","DOI":"10.18653\/v1\/2020.emnlp-main.213","article-title":"COMET: A neural framework for MT evaluation","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Rei","year":"2020"},{"key":"2022080113445134900_bib37","doi-asserted-by":"publisher","first-page":"7881","DOI":"10.18653\/v1\/2020.acl-main.704","article-title":"BLEURT: Learning robust metrics for text generation","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Sellam","year":"2020"},{"key":"2022080113445134900_bib38","article-title":"Learning to evaluate translation beyond english: Bleurt submissions to the WMT metrics 2020 shared task","author":"Sellam","year":"2020","journal-title":"arXiv preprint arXiv: 2010.04297"},{"key":"2022080113445134900_bib39","article-title":"Lingvo: A modular and scalable framework for sequence-to-sequence 
modeling","author":"Shen","year":"2019","journal-title":"CoRR"},{"key":"2022080113445134900_bib40","first-page":"183","article-title":"On maximizing metrics for syntactic disambiguation","volume-title":"Proceedings of the Eighth International Conference on Parsing Technologies","author":"Sima\u2019an","year":"2003"},{"key":"2022080113445134900_bib41","doi-asserted-by":"publisher","first-page":"787","DOI":"10.3115\/1273073.1273174","article-title":"Minimum risk annealing for training log-linear models","volume-title":"Proceedings of the COLING\/ACL 2006 Main Conference Poster Sessions","author":"Smith","year":"2006"},{"key":"2022080113445134900_bib42","doi-asserted-by":"publisher","first-page":"202","DOI":"10.3115\/v1\/D14-1025","article-title":"Fitting sentence level translation evaluation with many dense features","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Stanojevi\u0107","year":"2014"},{"key":"2022080113445134900_bib43","doi-asserted-by":"crossref","DOI":"10.21437\/Eurospeech.1997-68","article-title":"Explicit word error minimization in n-best list rescoring","volume-title":"Fifth European Conference on Speech Communication and Technology","author":"Stolcke","year":"1997"},{"key":"2022080113445134900_bib44","first-page":"3104","article-title":"Sequence to sequence learning with neural networks","volume-title":"Advances in Neural Information Processing Systems","author":"Sutskever","year":"2014"},{"key":"2022080113445134900_bib45","first-page":"185","article-title":"Reassessing claims of human parity and super-human performance in machine translation at WMT 2019","volume-title":"Proceedings of the 22nd Annual Conference of the European Association for Machine Translation","author":"Toral","year":"2020"},{"key":"2022080113445134900_bib46","article-title":"Facebook AI\u2019s WMT21 news translation task submission","volume-title":"Proceedings of the Sixth Conference on Machine 
Translation","author":"Tran","year":"2021"},{"key":"2022080113445134900_bib47","doi-asserted-by":"publisher","first-page":"620","DOI":"10.3115\/1613715.1613792","article-title":"Lattice minimum Bayes-risk decoding for statistical machine translation","volume-title":"Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing","author":"Tromble","year":"2008"},{"key":"2022080113445134900_bib48","first-page":"12","article-title":"Multidimensional quality metrics: A new unified paradigm for human and machine translation quality assessment","author":"Uszkoreit","year":"2013","journal-title":"Localization World, London"},{"key":"2022080113445134900_bib49","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2022080113445134900_bib50","doi-asserted-by":"publisher","first-page":"133","DOI":"10.18653\/v1\/W18-6314","article-title":"Denoising neural machine translation training with trusted data and online data selection","volume-title":"Proceedings of the Third Conference on Machine Translation: Research Papers","author":"Wang","year":"2018"},{"key":"2022080113445134900_bib51","article-title":"Google\u2019s neural machine translation system: Bridging the gap between human and machine translation","author":"Yonghui","year":"2016","journal-title":"arXiv preprint arXiv:1609.08144"},{"key":"2022080113445134900_bib52","article-title":"Bertscore: Evaluating text generation with BERT","volume-title":"International Conference on Learning Representations","author":"Zhang","year":"2020"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00491\/2037127\/tacl_a_00491.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00491\/2037127\/tacl_a_00491.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,13]],"date-time":"2023-02-13T06:52:29Z","timestamp":1676271149000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00491\/112497\/High-Quality-Rather-than-High-Model-Probability"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"references-count":52,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00491","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022]]},"published":{"date-parts":[[2022]]}}}