{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T16:45:38Z","timestamp":1777394738813,"version":"3.51.4"},"reference-count":50,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T00:00:00Z","timestamp":1725408000000},"content-version":"vor","delay-in-days":247,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,9,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Widely used learned metrics for machine translation evaluation, such as Comet and Bleurt, estimate the quality of a translation hypothesis by providing a single sentence-level score. As such, they offer little insight into translation errors (e.g., what are the errors and what is their severity). On the other hand, generative large language models (LLMs) are amplifying the adoption of more granular strategies to evaluation, attempting to detail and categorize translation errors. In this work, we introduce xcomet, an open-source learned metric designed to bridge the gap between these approaches. xcomet integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). Moreover, it does so while highlighting and categorizing error spans, thus enriching the quality assessment. 
We also provide a robustness analysis with stress tests, and show that xcomet is largely capable of identifying localized critical errors and hallucinations.<\/jats:p>","DOI":"10.1162\/tacl_a_00683","type":"journal-article","created":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T13:35:13Z","timestamp":1725456913000},"page":"979-995","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":41,"title":["<b>x<scp>comet<\/scp>\n                  <\/b>: Transparent Machine Translation Evaluation through Fine-grained Error Detection"],"prefix":"10.1162","volume":"12","author":[{"given":"Nuno M.","family":"Guerreiro","sequence":"first","affiliation":[{"name":"Unbabel Lisbon Portugal. nuno.guerreiro@unbabel.com"},{"name":"Instituto de Telecomunica\u00e7\u00f5es, Lisbon, Portugal"},{"name":"MICS, CentraleSup\u00e9lec, Universit\u00e9 Paris-Saclay, France"},{"name":"Instituto Superior T\u00e9cnico, University of Lisbon, Portugal"}]},{"given":"Ricardo","family":"Rei","sequence":"additional","affiliation":[{"name":"Unbabel Lisbon Portugal. ricardo.rei@unbabel.com"},{"name":"INESC-ID, Lisbon, Portugal"},{"name":"Instituto Superior T\u00e9cnico, University of Lisbon, Portugal"}]},{"given":"Daan van","family":"Stigt","sequence":"additional","affiliation":[{"name":"Unbabel Lisbon Portugal"}]},{"given":"Luisa","family":"Coheur","sequence":"additional","affiliation":[{"name":"INESC-ID, Lisbon, Portugal"},{"name":"Instituto Superior T\u00e9cnico, University of Lisbon, Portugal"}]},{"given":"Pierre","family":"Colombo","sequence":"additional","affiliation":[{"name":"MICS, CentraleSup\u00e9lec, Universit\u00e9 Paris-Saclay, France"}]},{"given":"Andr\u00e9 F. 
T.","family":"Martins","sequence":"additional","affiliation":[{"name":"Unbabel Lisbon Portugal"},{"name":"Instituto de Telecomunica\u00e7\u00f5es, Lisbon, Portugal"},{"name":"Instituto Superior T\u00e9cnico, University of Lisbon, Portugal"}]}],"member":"281","published-online":{"date-parts":[[2024,9,4]]},"reference":[{"key":"2024090413342479100_bib1","first-page":"469","article-title":"Robust MT evaluation with sentence-level multilingual augmentation","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Alves","year":"2022"},{"key":"2024090413342479100_bib2","first-page":"1125","article-title":"Identifying weaknesses in machine translation metrics through minimum Bayes risk decoding: A case study for COMET","volume-title":"Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Amrhein","year":"2022"},{"key":"2024090413342479100_bib3","article-title":"Towards fine-grained information: Identifying the type and location of translation errors","author":"Bao","year":"2023"},{"key":"2024090413342479100_bib4","doi-asserted-by":"publisher","first-page":"169","DOI":"10.18653\/v1\/W17-4717","article-title":"Findings of the 2017 conference on machine translation (WMT17)","volume-title":"Proceedings of the Second Conference on Machine Translation","author":"Bojar","year":"2017"},{"key":"2024090413342479100_bib5","doi-asserted-by":"publisher","first-page":"1132","DOI":"10.1162\/tacl_a_00417","article-title":"A statistical analysis of summarization evaluation metrics using resampling methods","volume":"9","author":"Deutsch","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024090413342479100_bib6","doi-asserted-by":"publisher","first-page":"12914","DOI":"10.18653\/v1\/2023.emnlp-main.798","article-title":"Ties matter: 
Meta-evaluating modern metrics with pairwise accuracy and tie calibration","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Deutsch","year":"2023"},{"key":"2024090413342479100_bib7","doi-asserted-by":"publisher","first-page":"994","DOI":"10.18653\/v1\/2023.wmt-1.96","article-title":"Training and meta-evaluating machine translation evaluation metrics at the paragraph level","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Deutsch","year":"2023"},{"key":"2024090413342479100_bib8","doi-asserted-by":"publisher","first-page":"878","DOI":"10.18653\/v1\/2022.acl-long.62","article-title":"Language-agnostic BERT sentence embedding","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Feng","year":"2022"},{"key":"2024090413342479100_bib9","doi-asserted-by":"publisher","first-page":"1064","DOI":"10.18653\/v1\/2023.wmt-1.100","article-title":"The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Fernandes","year":"2023"},{"key":"2024090413342479100_bib10","doi-asserted-by":"publisher","first-page":"1396","DOI":"10.18653\/v1\/2022.naacl-main.100","article-title":"Quality-aware decoding for neural machine translation","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Fernandes","year":"2022"},{"key":"2024090413342479100_bib11","first-page":"4963","article-title":"MLQE-PE: A multilingual quality estimation and post-editing dataset","volume-title":"Proceedings of the Thirteenth Language Resources and Evaluation 
Conference","author":"Fomicheva","year":"2022"},{"key":"2024090413342479100_bib12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/W19-5401","article-title":"Findings of the WMT 2019 shared tasks on quality estimation","volume-title":"Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)","author":"Fonseca","year":"2019"},{"key":"2024090413342479100_bib13","doi-asserted-by":"publisher","first-page":"1460","DOI":"10.1162\/tacl_a_00437","article-title":"Experts, errors, and context: A large-scale study of human evaluation for machine translation","volume":"9","author":"Freitag","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024090413342479100_bib14","doi-asserted-by":"publisher","first-page":"578","DOI":"10.18653\/v1\/2023.wmt-1.51","article-title":"Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Freitag","year":"2023"},{"key":"2024090413342479100_bib15","first-page":"46","article-title":"Results of WMT22 metrics shared task: Stop using BLEU \u2013 neural metrics are better and more robust","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Freitag","year":"2022"},{"key":"2024090413342479100_bib16","first-page":"733","article-title":"Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain","volume-title":"Proceedings of the Sixth Conference on Machine Translation","author":"Freitag","year":"2021"},{"key":"2024090413342479100_bib17","doi-asserted-by":"publisher","first-page":"29","DOI":"10.18653\/v1\/2021.repl4nlp-1.4","article-title":"Larger-scale transformers for multilingual masked language modeling","volume-title":"Proceedings of the 6th Workshop on Representation Learning for NLP 
(RepL4NLP-2021)","author":"Goyal","year":"2021"},{"key":"2024090413342479100_bib18","first-page":"33","article-title":"Continuous measurement scales in human evaluation of machine translation","volume-title":"Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse","author":"Graham","year":"2013"},{"key":"2024090413342479100_bib19","doi-asserted-by":"publisher","first-page":"1059","DOI":"10.18653\/v1\/2023.eacl-main.75","article-title":"Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation","volume-title":"Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics","author":"Guerreiro","year":"2023"},{"key":"2024090413342479100_bib20","article-title":"Reinforced self-training (rest) for language modeling","author":"Gulcehre","year":"2023"},{"key":"2024090413342479100_bib21","doi-asserted-by":"publisher","first-page":"756","DOI":"10.18653\/v1\/2023.wmt-1.63","article-title":"MetricX-23: The Google submission to the WMT 2023 metrics shared task","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Juraska","year":"2023"},{"key":"2024090413342479100_bib22","doi-asserted-by":"publisher","first-page":"9540","DOI":"10.18653\/v1\/2022.emnlp-main.649","article-title":"DEMETR: Diagnosing evaluation metrics for translation","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Karpinska","year":"2022"},{"key":"2024090413342479100_bib23","doi-asserted-by":"publisher","first-page":"768","DOI":"10.18653\/v1\/2023.wmt-1.64","article-title":"GEMBA-MQM: Detecting translation quality error spans with GPT-4","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Kocmi","year":"2023"},{"key":"2024090413342479100_bib24","first-page":"193","article-title":"Large language models are state-of-the-art evaluators of translation 
quality","volume-title":"Proceedings of the 24th Annual Conference of the European Association for Machine Translation","author":"Kocmi","year":"2023"},{"key":"2024090413342479100_bib25","first-page":"478","article-title":"To ship or not to ship: An extensive evaluation of automatic metrics for machine translation","volume-title":"Proceedings of the Sixth Conference on Machine Translation","author":"Kocmi","year":"2021"},{"key":"2024090413342479100_bib26","article-title":"Towards explainable evaluation metrics for machine translation","author":"Leiter","year":"2023","journal-title":"ArXiv"},{"key":"2024090413342479100_bib27","doi-asserted-by":"publisher","first-page":"455","DOI":"10.5565\/rev\/tradumatica.77","article-title":"Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics","volume":"0","author":"Lommel","year":"2014","journal-title":"Tradum\u00e0tica: Tecnologies de la traducci\u00f3"},{"key":"2024090413342479100_bib28","doi-asserted-by":"publisher","first-page":"7297","DOI":"10.18653\/v1\/2021.acl-long.566","article-title":"Scientific credibility of machine translation research: A meta-evaluation of 769 papers","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Marie","year":"2021"},{"key":"2024090413342479100_bib29","unstructured":"OpenAI. 2023. 
Gpt-4 technical report."},{"key":"2024090413342479100_bib30","doi-asserted-by":"publisher","first-page":"311","DOI":"10.3115\/1073083.1073135","article-title":"Bleu: A method for automatic evaluation of machine translation","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni","year":"2002"},{"key":"2024090413342479100_bib31","first-page":"569","article-title":"MaTESe: Machine translation evaluation as a sequence tagging problem","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Perrella","year":"2022"},{"key":"2024090413342479100_bib32","doi-asserted-by":"publisher","first-page":"392","DOI":"10.18653\/v1\/W15-3049","article-title":"chrF: Character n-gram F-score for automatic MT evaluation","volume-title":"Proceedings of the Tenth Workshop on Statistical Machine Translation","author":"Popovi\u0107","year":"2015"},{"key":"2024090413342479100_bib33","doi-asserted-by":"publisher","first-page":"751","DOI":"10.18653\/v1\/2021.emnlp-main.58","article-title":"Learning compact metrics for MT","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Amy","year":"2021"},{"key":"2024090413342479100_bib34","doi-asserted-by":"publisher","first-page":"1172","DOI":"10.18653\/v1\/2021.naacl-main.92","article-title":"The curious case of hallucinations in neural machine translation","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Raunak","year":"2021"},{"key":"2024090413342479100_bib35","doi-asserted-by":"publisher","first-page":"5163","DOI":"10.18653\/v1\/2022.findings-emnlp.379","article-title":"SALTED: A framework for SAlient long-tail translation error detection","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 
2022","author":"Raunak","year":"2022"},{"key":"2024090413342479100_bib36","first-page":"578","article-title":"COMET-22: Unbabel-IST 2022 submission for the metrics shared task","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Rei","year":"2022"},{"key":"2024090413342479100_bib37","doi-asserted-by":"publisher","first-page":"839","DOI":"10.18653\/v1\/2023.wmt-1.73","article-title":"Scaling up cometkiwi: Unbabel-ist 2023 submission for the quality estimation shared task","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Rei","year":"2023"},{"key":"2024090413342479100_bib38","doi-asserted-by":"publisher","first-page":"1089","DOI":"10.18653\/v1\/2023.acl-short.94","article-title":"The inside story: Towards better understanding of machine translation neural evaluation metrics","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Rei","year":"2023"},{"key":"2024090413342479100_bib39","doi-asserted-by":"publisher","first-page":"2685","DOI":"10.18653\/v1\/2020.emnlp-main.213","article-title":"COMET: A neural framework for MT evaluation","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Rei","year":"2020"},{"key":"2024090413342479100_bib40","first-page":"634","article-title":"CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Rei","year":"2022"},{"key":"2024090413342479100_bib41","doi-asserted-by":"publisher","first-page":"14210","DOI":"10.18653\/v1\/2023.acl-long.795","article-title":"IndicMT eval: A dataset to meta-evaluate machine translation metrics for Indian languages","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Ananya 
Sai","year":"2023"},{"key":"2024090413342479100_bib42","doi-asserted-by":"publisher","first-page":"7881","DOI":"10.18653\/v1\/2020.acl-main.704","article-title":"BLEURT: Learning robust metrics for text generation","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Sellam","year":"2020"},{"key":"2024090413342479100_bib43","first-page":"223","article-title":"A study of translation edit rate with targeted human annotation","volume-title":"Proceedings of Association for Machine Translation in the Americas","author":"Snover","year":"2006"},{"key":"2024090413342479100_bib44","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1007\/s10590-010-9077-2","article-title":"Machine translation evaluation versus quality estimation","volume":"24","author":"Specia","year":"2010","journal-title":"Machine Translation"},{"key":"2024090413342479100_bib45","article-title":"Llama: Open and efficient foundation language models","author":"Touvron","year":"2023"},{"key":"2024090413342479100_bib46","doi-asserted-by":"publisher","first-page":"3788","DOI":"10.18653\/v1\/2022.findings-emnlp.277","article-title":"A unified dialogue user simulator for few-shot data augmentation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2022","author":"Wan","year":"2022"},{"key":"2024090413342479100_bib47","doi-asserted-by":"publisher","first-page":"8117","DOI":"10.18653\/v1\/2022.acl-long.558","article-title":"UniTE: Unified translation evaluation","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Wan","year":"2022"},{"key":"2024090413342479100_bib48","article-title":"Instructscore: Towards explainable text generation evaluation with automatic 
feedback","author":"Wenda","year":"2023"},{"key":"2024090413342479100_bib49","doi-asserted-by":"publisher","first-page":"483","DOI":"10.18653\/v1\/2021.naacl-main.41","article-title":"mT5: A massively multilingual pre-trained text-to-text transformer","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Xue","year":"2021"},{"key":"2024090413342479100_bib50","first-page":"69","article-title":"Findings of the WMT 2022 shared task on quality estimation","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Zerva","year":"2022"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00683\/2468704\/tacl_a_00683.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00683\/2468704\/tacl_a_00683.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T13:35:26Z","timestamp":1725456926000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00683\/124263\/xcomet-Transparent-Machine-Translation-Evaluation"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":50,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00683","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}