{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T00:57:19Z","timestamp":1776128239619,"version":"3.50.1"},"reference-count":61,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T00:00:00Z","timestamp":1732147200000},"content-version":"vor","delay-in-days":3,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,11,18]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Several uncertainty estimation methods have been recently proposed for machine translation evaluation. While these methods can provide a useful indication of when not to trust model predictions, we show in this paper that the majority of them tend to underestimate model uncertainty, and as a result, they often produce misleading confidence intervals that do not cover the ground truth. We propose as an alternative the use of conformal prediction, a distribution-free method to obtain confidence intervals with a theoretically established guarantee on coverage. First, we demonstrate that split conformal prediction can \u201ccorrect\u201d the confidence intervals of previous methods to yield a desired coverage level, and we demonstrate these findings across multiple machine translation evaluation metrics and uncertainty quantification methods. Further, we highlight biases in estimated confidence intervals, reflected in imbalanced coverage for different attributes, such as the language and the quality of translations. We address this by applying conditional conformal prediction techniques to obtain calibration subsets for each data subgroup, leading to equalized coverage. 
Overall, we show that, provided access to a calibration set, conformal prediction can help identify the most suitable uncertainty quantification methods and adapt the predicted confidence intervals to ensure fairness with respect to different attributes.<\/jats:p>","DOI":"10.1162\/tacl_a_00711","type":"journal-article","created":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T19:15:54Z","timestamp":1732216554000},"page":"1460-1478","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":2,"title":["Conformalizing Machine Translation Evaluation"],"prefix":"10.1162","volume":"12","author":[{"given":"Chrysoula","family":"Zerva","sequence":"first","affiliation":[{"name":"Instituto de Telecomunica\u00e7\u00f5es, Portugal"},{"name":"Instituto Superior T\u00e9cnico, Universidade de Lisboa, Portugal. chrysoula.zerva@tecnico.ulisboa.pt"},{"name":"ELLIS Unit Lisbon, Portugal"}]},{"given":"Andr\u00e9 F. T.","family":"Martins","sequence":"additional","affiliation":[{"name":"Instituto de Telecomunica\u00e7\u00f5es, Portugal"},{"name":"Instituto Superior T\u00e9cnico, Universidade de Lisboa, Portugal. 
andre.t.martins@tecnico.ulisboa.pt"},{"name":"ELLIS Unit Lisbon, Portugal"},{"name":"Unbabel, Portugal"}]}],"member":"281","published-online":{"date-parts":[[2024,11,18]]},"reference":[{"key":"2024112119154778400_bib1","first-page":"593","article-title":"Quality estimation via backtranslation at the wmt 2022 quality estimation task","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Agrawal","year":"2022"},{"key":"2024112119154778400_bib2","first-page":"14927","article-title":"Deep evidential regression","volume":"33","author":"Amini","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2024112119154778400_bib3","article-title":"A gentle introduction to conformal prediction and distribution-free uncertainty quantification","author":"Angelopoulos","year":"2021","journal-title":"arXiv preprint arXiv:2107.07511"},{"key":"2024112119154778400_bib4","doi-asserted-by":"publisher","first-page":"208","DOI":"10.18653\/v1\/K16-1021","article-title":"Exploring prediction uncertainty in machine translation quality estimation","volume-title":"Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning","author":"Beck","year":"2016"},{"key":"2024112119154778400_bib5","first-page":"114","article-title":"Mondrian conformal regressors","volume-title":"Conformal and Probabilistic Prediction and Applications","author":"Bostr\u00f6m","year":"2020"},{"key":"2024112119154778400_bib6","first-page":"24","article-title":"Mondrian conformal predictive distributions","volume-title":"Conformal and Probabilistic Prediction and Applications","author":"Bostr\u00f6m","year":"2021"},{"key":"2024112119154778400_bib7","article-title":"How the washington post estimates outstanding votes for the 2020 presidential election","author":"Cherian","year":"2020"},{"key":"2024112119154778400_bib8","article-title":"Statistical inference for fairness 
auditing","author":"Cherian","year":"2023","journal-title":"arXiv preprint arXiv:2305.03712"},{"key":"2024112119154778400_bib9","first-page":"179","article-title":"Funzione caratteristica di un fenomeno aleatorio","volume-title":"Atti del Congresso Internazionale dei Matematici: Bologna del 3 al 10 de settembre di 1928","author":"De Finetti","year":"1929"},{"key":"2024112119154778400_bib10","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00010","article-title":"Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops","author":"Ding","year":"2020"},{"key":"2024112119154778400_bib11","first-page":"3329","article-title":"Few-shot conformal prediction with auxiliary tasks","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"Fisch","year":"2021"},{"key":"2024112119154778400_bib12","first-page":"6514","article-title":"Conformal prediction sets with limited false positives","volume-title":"International Conference on Machine Learning","author":"Fisch","year":"2022"},{"key":"2024112119154778400_bib13","first-page":"1050","article-title":"Dropout as a Bayesian approximation: Representing model uncertainty in deep learning","volume-title":"International Conference on Machine Learning","author":"Gal","year":"2016"},{"key":"2024112119154778400_bib14","article-title":"Evaluating machine translation quality with conformal predictive distributions","author":"Giovannotti","year":"2023"},{"key":"2024112119154778400_bib15","doi-asserted-by":"publisher","first-page":"3920","DOI":"10.18653\/v1\/2021.findings-emnlp.330","article-title":"Uncertainty-aware machine translation evaluation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 
2021","author":"Glushkova","year":"2021"},{"key":"2024112119154778400_bib16","article-title":"Continuous measurement scales in human evaluation of machine translation","author":"Graham","year":"2013"},{"key":"2024112119154778400_bib17","first-page":"3309","article-title":"Dangers of Bayesian model averaging under covariate shift","volume":"34","author":"Izmailov","year":"2021","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2024112119154778400_bib18","article-title":"Selection by prediction with conformal p-values","author":"Jin","year":"2022","journal-title":"arXiv preprint arXiv:2210.01408"},{"key":"2024112119154778400_bib19","doi-asserted-by":"publisher","first-page":"461","DOI":"10.1109\/BigData.2014.7004263","article-title":"Regression trees for streaming data with local performance guarantees","volume-title":"2014 IEEE International Conference on Big Data (Big Data)","author":"Johansson","year":"2014"},{"key":"2024112119154778400_bib20","article-title":"What uncertainties do we need in Bayesian deep learning for computer vision?","volume":"30","author":"Kendall","year":"2017","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2024112119154778400_bib21","article-title":"What uncertainties do we need in Bayesian deep learning for computer vision?","volume-title":"NIPS","author":"Kendall","year":"2017"},{"key":"2024112119154778400_bib22","doi-asserted-by":"publisher","first-page":"33","DOI":"10.2307\/1913643","article-title":"Regression quantiles","author":"Koenker","year":"1978","journal-title":"Econometrica: Journal of the Econometric Society"},{"issue":"4","key":"2024112119154778400_bib23","doi-asserted-by":"publisher","first-page":"143","DOI":"10.1257\/jep.15.4.143","article-title":"Quantile regression","volume":"15","author":"Koenker","year":"2001","journal-title":"Journal of Economic Perspectives"},{"key":"2024112119154778400_bib24","first-page":"11683","article-title":"Calibrated and sharp 
uncertainties in deep learning via density estimation","volume-title":"Proceedings of the 39th International Conference on Machine Learning","author":"Kuleshov","year":"2022"},{"key":"2024112119154778400_bib25","first-page":"2796","article-title":"Accurate uncertainties for deep learning using calibrated regression","volume-title":"International Conference on Machine Learning","author":"Kuleshov","year":"2018"},{"key":"2024112119154778400_bib26","first-page":"2796","article-title":"Accurate uncertainties for deep learning using calibrated regression","volume-title":"Proceedings of the 35th International Conference on Machine Learning","author":"Kuleshov","year":"2018"},{"key":"2024112119154778400_bib27","article-title":"Conformal prediction with large language models for multi-choice question answering","author":"Kumar","year":"2023","journal-title":"arXiv preprint arXiv:2305.18404"},{"key":"2024112119154778400_bib28","article-title":"Deup: Direct epistemic uncertainty prediction","author":"Lahlou","year":"2021","journal-title":"arXiv preprint arXiv:2102.08501"},{"key":"2024112119154778400_bib29","article-title":"Simple and scalable predictive uncertainty estimation using deep ensembles","volume-title":"Advances in Neural Information Processing Systems","author":"Lakshminarayanan","year":"2017"},{"key":"2024112119154778400_bib30","doi-asserted-by":"publisher","first-page":"489","DOI":"10.1145\/1102351.1102413","article-title":"Heteroscedastic gaussian process regression","volume-title":"Proceedings of the 22nd International Conference on Machine Learning","author":"Le","year":"2005"},{"issue":"15","key":"2024112119154778400_bib31","doi-asserted-by":"publisher","first-page":"5540","DOI":"10.3390\/s22155540","article-title":"Evaluating and calibrating uncertainty prediction in regression tasks","volume":"22","author":"Levi","year":"2022","journal-title":"Sensors"},{"key":"2024112119154778400_bib32","article-title":"Introspective planning: Guiding language-enabled 
agents to refine their own uncertainty","author":"Liang","year":"2024"},{"issue":"12","key":"2024112119154778400_bib33","doi-asserted-by":"publisher","first-page":"455","DOI":"10.5565\/rev\/tradumatica.77","article-title":"Multidimensional quality metrics (mqm): A framework for declaring and describing translation quality metrics","author":"Lommel","year":"2014","journal-title":"Tradum\u00e0tica"},{"key":"2024112119154778400_bib34","doi-asserted-by":"publisher","first-page":"12008","DOI":"10.1609\/aaai.v36i11.21459","article-title":"Fair conformal predictors for applications in medical imaging","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Charles","year":"2022"},{"key":"2024112119154778400_bib35","doi-asserted-by":"publisher","first-page":"671","DOI":"10.18653\/v1\/W18-6450","article-title":"Results of the wmt18 metrics shared task: Both characters and embeddings achieve good performance","volume-title":"Proceedings of the Third Conference on Machine Translation: Shared Task Papers","author":"Ma","year":"2018"},{"key":"2024112119154778400_bib36","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-5302","article-title":"Results of the wmt19 metrics shared task: Segment-level and strong mt systems pose big challenges","author":"Ma","year":"2019"},{"key":"2024112119154778400_bib37","first-page":"269","article-title":"Bert-based conformal predictor for sentiment analysis","volume-title":"Conformal and Probabilistic Prediction and Applications","author":"Maltoudoglou","year":"2020"},{"key":"2024112119154778400_bib38","first-page":"688","article-title":"Results of the wmt20 metrics shared task","volume-title":"Proceedings of the Fifth Conference on Machine Translation","author":"Mathur","year":"2020"},{"key":"2024112119154778400_bib39","first-page":"346","article-title":"Automatically predicting sentence translation difficulty","volume-title":"Proceedings of the 51st Annual Meeting of the Association for Computational 
Linguistics (Volume 2: Short Papers)","author":"Mishra","year":"2013"},{"key":"2024112119154778400_bib40","first-page":"91","article-title":"Revisiting round-trip translation for quality estimation","volume-title":"22nd Annual Conference of the European Association for Machine Translation","author":"Moon","year":"2020"},{"key":"2024112119154778400_bib41","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v29i1.9602","article-title":"Obtaining well calibrated probabilities using Bayesian binning","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Naeini","year":"2015"},{"key":"2024112119154778400_bib42","doi-asserted-by":"publisher","DOI":"10.5772\/6078","article-title":"Inductive conformal prediction: Theory and application to neural networks","volume-title":"Tools in Artificial Intelligence","author":"Papadopoulos","year":"2008"},{"key":"2024112119154778400_bib43","doi-asserted-by":"publisher","first-page":"815","DOI":"10.1613\/jair.3198","article-title":"Regression conformal prediction with nearest neighbours","volume":"40","author":"Papadopoulos","year":"2011","journal-title":"Journal of Artificial Intelligence Research"},{"key":"2024112119154778400_bib44","article-title":"Conformal nucleus sampling","author":"Ravfogel","year":"2023","journal-title":"arXiv preprint arXiv:2305.02633"},{"key":"2024112119154778400_bib45","doi-asserted-by":"publisher","first-page":"1030","DOI":"10.18653\/v1\/2020.emnlp-main.213","article-title":"Are references really needed? 
Unbabel-ist 2021 submission for the metrics shared task","volume-title":"Proceedings of the Sixth Conference on Machine Translation","author":"Rei","year":"2021"},{"key":"2024112119154778400_bib46","doi-asserted-by":"publisher","first-page":"2685","DOI":"10.18653\/v1\/2020.emnlp-main.213","article-title":"Comet: A neural framework for MT evaluation","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Rei","year":"2020"},{"key":"2024112119154778400_bib47","article-title":"Robots that ask for help: Uncertainty alignment for large language model planners","volume-title":"7th Annual Conference on Robot Learning","author":"Ren","year":"2023"},{"issue":"2","key":"2024112119154778400_bib48","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1162\/99608f92.03f00592","article-title":"With malice toward none: Assessing uncertainty via equalized coverage","volume":"2","author":"Romano","year":"2020","journal-title":"Harvard Data Science Review"},{"key":"2024112119154778400_bib49","article-title":"Conformalized quantile regression","volume":"32","author":"Romano","year":"2019","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2024112119154778400_bib50","doi-asserted-by":"publisher","first-page":"7881","DOI":"10.18653\/v1\/2020.acl-main.704","article-title":"BleuRT: Learning robust metrics for text generation","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Sellam","year":"2020"},{"key":"2024112119154778400_bib51","article-title":"Single-model uncertainties for deep learning","volume":"32","author":"Tagasovska","year":"2019","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2024112119154778400_bib52","article-title":"Conformal prediction under covariate shift","volume":"32","author":"Tibshirani","year":"2019","journal-title":"Advances in Neural Information Processing 
Systems"},{"key":"2024112119154778400_bib53","first-page":"1909","article-title":"Non-exchangeable conformal language generation with nearest neighbors","volume-title":"Findings of the Association for Computational Linguistics: EACL 2024","author":"Ulmer","year":"2024"},{"key":"2024112119154778400_bib54","article-title":"Prior and posterior networks: A survey on evidential deep learning methods for uncertainty estimation","author":"Ulmer","year":"2023","journal-title":"Transactions on Machine Learning Research"},{"key":"2024112119154778400_bib55","volume-title":"Algorithmic Learning in a Random World","author":"Vovk","year":"2005"},{"key":"2024112119154778400_bib56","doi-asserted-by":"publisher","first-page":"19","DOI":"10.1007\/978-3-031-06649-8_2","article-title":"Conformal prediction: General case and regression","volume-title":"Algorithmic Learning in a Random World","author":"Vovk","year":"2022"},{"key":"2024112119154778400_bib57","first-page":"444","article-title":"Machine-learning applications of algorithmic randomness","volume-title":"Proceedings of the Sixteenth International Conference on Machine Learning","author":"Vovk","year":"1999"},{"key":"2024112119154778400_bib58","doi-asserted-by":"publisher","first-page":"8117","DOI":"10.18653\/v1\/2022.acl-long.558","article-title":"UniTE: Unified translation evaluation","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Wan","year":"2022"},{"key":"2024112119154778400_bib59","doi-asserted-by":"publisher","first-page":"680","DOI":"10.1162\/tacl_a_00483","article-title":"Uncertainty estimation and reduction of pre-trained models for text regression","volume":"10","author":"Wang","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024112119154778400_bib60","doi-asserted-by":"publisher","first-page":"8622","DOI":"10.18653\/v1\/2022.emnlp-main.591","article-title":"Disentangling 
uncertainty in machine translation evaluation","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Zerva","year":"2022"},{"key":"2024112119154778400_bib61","first-page":"961","article-title":"Ist-unbabel 2021 submission for the quality estimation shared task","volume-title":"Proceedings of the Sixth Conference on Machine Translation","author":"Zerva","year":"2021"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00711\/2480401\/tacl_a_00711.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00711\/2480401\/tacl_a_00711.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T19:16:04Z","timestamp":1732216564000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00711\/125277\/Conformalizing-Machine-Translation-Evaluation"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,18]]},"references-count":61,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00711","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,18]]}}}