{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,25]],"date-time":"2026-04-25T06:38:37Z","timestamp":1777099117142,"version":"3.51.4"},"reference-count":63,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,10,2]],"date-time":"2024-10-02T00:00:00Z","timestamp":1727827200000},"content-version":"vor","delay-in-days":275,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,9,30]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Despite the recent success of automatic metrics for assessing translation quality, their application in evaluating the quality of machine-translated chats has been limited. Unlike more structured texts like news, chat conversations are often unstructured, short, and heavily reliant on contextual information. This poses questions about the reliability of existing sentence-level metrics in this domain as well as the role of context in assessing the translation quality. Motivated by this, we conduct a meta-evaluation of existing automatic metrics, primarily designed for structured domains such as news, to assess the quality of machine-translated chats. We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings. We then investigate how incorporating conversational contextual information in these metrics for sentence-level evaluation affects their performance. Our findings show that augmenting neural learned metrics with contextual information helps improve correlation with human judgments in the reference-free scenario and when evaluating translations in out-of-English settings. Finally, we propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model (LLM) and further validate that adding context helps even for LLM-based evaluation metrics.<\/jats:p>","DOI":"10.1162\/tacl_a_00700","type":"journal-article","created":{"date-parts":[[2024,10,2]],"date-time":"2024-10-02T17:18:16Z","timestamp":1727889496000},"page":"1250-1267","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":1,"title":["Assessing the Role of Context in Chat Translation Evaluation: Is Context Helpful and Under What Conditions?"],"prefix":"10.1162","volume":"12","author":[{"given":"Sweta","family":"Agrawal","sequence":"first","affiliation":[{"name":"Instituto de Telecomunica\u00e7\u00f5es, Portugal. swetaagrawal20@gmail.com"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Amin","family":"Farajian","sequence":"additional","affiliation":[{"name":"Unbabel, Portugal"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Patrick","family":"Fernandes","sequence":"additional","affiliation":[{"name":"Instituto de Telecomunica\u00e7\u00f5es, Portugal"},{"name":"Instituto Superior T\u00e9cnico & Universidade de Lisboa, Portugal"},{"name":"Carnegie Mellon University, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ricardo","family":"Rei","sequence":"additional","affiliation":[{"name":"Instituto de Telecomunica\u00e7\u00f5es, Portugal"},{"name":"Unbabel, Portugal"},{"name":"INESC-ID, Portugal"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andr\u00e9 F. T.","family":"Martins","sequence":"additional","affiliation":[{"name":"Instituto de Telecomunica\u00e7\u00f5es, Portugal"},{"name":"Unbabel, Portugal"},{"name":"ELLIS Unit Lisbon, Portugal"},{"name":"Instituto Superior T\u00e9cnico & Universidade de Lisboa, Portugal"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","published-online":{"date-parts":[[2024,9,30]]},"reference":[{"key":"2024100217180871200_bib1","article-title":"Gpt-4 technical report","author":"Achiam","year":"2023","journal-title":"arXiv preprint arXiv: 2303.08774"},{"key":"2024100217180871200_bib2","doi-asserted-by":"publisher","first-page":"629","DOI":"10.18653\/v1\/2023.wmt-1.52","article-title":"Findings of the WMT 2023 shared task on quality estimation","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Blain","year":"2023"},{"key":"2024100217180871200_bib3","doi-asserted-by":"publisher","first-page":"489","DOI":"10.18653\/v1\/W17-4755","article-title":"Results of the WMT17 metrics shared task","volume-title":"Proceedings of the Second Conference on Machine Translation","author":"Bojar","year":"2017"},{"key":"2024100217180871200_bib4","doi-asserted-by":"publisher","first-page":"199","DOI":"10.18653\/v1\/W16-2302","article-title":"Results of the WMT16 metrics shared task","volume-title":"Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers","author":"Bojar","year":"2016"},{"key":"2024100217180871200_bib5","doi-asserted-by":"publisher","first-page":"362","DOI":"10.3115\/v1\/W14-3346","article-title":"A systematic comparison of smoothing techniques for sentence-level BLEU","volume-title":"Proceedings of the Ninth Workshop on Statistical Machine Translation","author":"Chen","year":"2014"},{"key":"2024100217180871200_bib6","doi-asserted-by":"publisher","first-page":"3576","DOI":"10.18653\/v1\/2021.naacl-main.280","article-title":"InfoXLM: An information-theoretic framework for cross-lingual language model pre-training","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Chi","year":"2021"},{"key":"2024100217180871200_bib7","doi-asserted-by":"publisher","first-page":"8440","DOI":"10.18653\/v1\/2020.acl-main.747","article-title":"Unsupervised cross-lingual representation learning at scale","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Conneau","year":"2020"},{"key":"2024100217180871200_bib8","doi-asserted-by":"publisher","first-page":"12914","DOI":"10.18653\/v1\/2023.emnlp-main.798","article-title":"Ties matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Deutsch","year":"2023"},{"key":"2024100217180871200_bib9","doi-asserted-by":"publisher","first-page":"4171","DOI":"10.18653\/v1\/N19-1423","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2024100217180871200_bib10","first-page":"724","article-title":"Findings of the WMT 2022 shared task on chat translation","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Farinha","year":"2022"},{"key":"2024100217180871200_bib11","doi-asserted-by":"publisher","first-page":"1066","DOI":"10.18653\/v1\/2023.wmt-1.100","article-title":"The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Fernandes","year":"2023"},{"key":"2024100217180871200_bib12","doi-asserted-by":"publisher","first-page":"606","DOI":"10.18653\/v1\/2023.acl-long.36","article-title":"When does translation require context? A data-driven, multilingual exploration","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Fernandes","year":"2023"},{"key":"2024100217180871200_bib13","doi-asserted-by":"publisher","first-page":"6467","DOI":"10.18653\/v1\/2021.acl-long.505","article-title":"Measuring and increasing context usage in context-aware machine translation","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Fernandes","year":"2021"},{"key":"2024100217180871200_bib14","first-page":"4963","article-title":"MLQE-PE: A multilingual quality estimation and post-editing dataset","volume-title":"Proceedings of the Thirteenth Language Resources and Evaluation Conference","author":"Fomicheva","year":"2022"},{"key":"2024100217180871200_bib15","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/W19-5401","article-title":"Findings of the WMT 2019 shared tasks on quality estimation","volume-title":"Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)","author":"Fonseca","year":"2019"},{"key":"2024100217180871200_bib16","doi-asserted-by":"publisher","first-page":"578","DOI":"10.18653\/v1\/2023.wmt-1.51","article-title":"Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Freitag","year":"2023"},{"key":"2024100217180871200_bib17","first-page":"46","article-title":"Results of WMT22 metrics shared task: Stop using BLEU \u2013 neural metrics are better and more robust","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Freitag","year":"2022"},{"key":"2024100217180871200_bib18","first-page":"733","article-title":"Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain","volume-title":"Proceedings of the Sixth Conference on Machine Translation","author":"Freitag","year":"2021"},{"key":"2024100217180871200_bib19","doi-asserted-by":"publisher","first-page":"852","DOI":"10.1145\/2675133.2675197","article-title":"Two is better than one: Improving multilingual collaboration by giving two machine translation outputs","volume-title":"Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing","author":"Ge","year":"2015"},{"key":"2024100217180871200_bib20","first-page":"201","article-title":"Agent and user-generated content and its impact on customer support MT","volume-title":"Proceedings of the 23rd Annual Conference of the European Association for Machine Translation","author":"Gon\u00e7alves","year":"2022"},{"key":"2024100217180871200_bib21","article-title":"xcomet: Transparent machine translation evaluation through fine-grained error detection","author":"Guerreiro","year":"2023","journal-title":"arXiv preprint arXiv: 2310.10482"},{"key":"2024100217180871200_bib22","first-page":"15","article-title":"Translation quality assessment: A brief survey on manual and automatic methods","volume-title":"Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age","author":"Han","year":"2021"},{"key":"2024100217180871200_bib23","article-title":"Does neural machine translation benefit from larger context?","author":"Jean","year":"2017","journal-title":"ArXiv"},{"key":"2024100217180871200_bib24","doi-asserted-by":"publisher","first-page":"1550","DOI":"10.18653\/v1\/2022.naacl-main.111","article-title":"BlonDe: An automatic evaluation metric for document-level machine translation","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Jiang","year":"2022"},{"key":"2024100217180871200_bib25","doi-asserted-by":"publisher","first-page":"756","DOI":"10.18653\/v1\/2023.wmt-1.63","article-title":"MetricX-23: The Google submission to the WMT 2023 metrics shared task","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Juraska","year":"2023"},{"key":"2024100217180871200_bib26","doi-asserted-by":"publisher","first-page":"768","DOI":"10.18653\/v1\/2023.wmt-1.64","article-title":"GEMBA-MQM: Detecting translation quality error spans with GPT-4","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Kocmi","year":"2023"},{"key":"2024100217180871200_bib27","first-page":"193","article-title":"Large language models are state-of-the-art evaluators of translation quality","volume-title":"Proceedings of the 24th Annual Conference of the European Association for Machine Translation","author":"Kocmi","year":"2023"},{"key":"2024100217180871200_bib28","doi-asserted-by":"publisher","first-page":"316","DOI":"10.18653\/v1\/W15-3037","article-title":"QUality estimation from ScraTCH (QUETCH): Deep learning for word-level translation quality estimation","volume-title":"Proceedings of the Tenth Workshop on Statistical Machine Translation","author":"Kreutzer","year":"2015"},{"key":"2024100217180871200_bib29","doi-asserted-by":"crossref","first-page":"88","DOI":"10.18653\/v1\/2022.eval4nlp-1.9","article-title":"Chat translation error detection for assisting cross-lingual communications","volume-title":"Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems","author":"Li","year":"2022"},{"key":"2024100217180871200_bib30","doi-asserted-by":"publisher","first-page":"726","DOI":"10.1162\/tacl_a_00343","article-title":"Multilingual denoising pre-training for neural machine translation","volume":"8","author":"Liu","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"issue":"12","key":"2024100217180871200_bib31","doi-asserted-by":"publisher","first-page":"0455","DOI":"10.5565\/rev\/tradumatica.77","article-title":"Multidimensional quality metrics (mqm): A framework for declaring and describing translation quality metrics","author":"Lommel","year":"2014","journal-title":"Tradum\u00e0tica"},{"key":"2024100217180871200_bib32","article-title":"Error analysis prompting enables human-like translation evaluation in large language models: A case study on chatgpt","author":"Qingyu","year":"2023","journal-title":"arXiv preprint arXiv:2303.13809"},{"key":"2024100217180871200_bib33","doi-asserted-by":"crossref","first-page":"671","DOI":"10.18653\/v1\/W18-6450","article-title":"Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance","volume-title":"Proceedings of the Third Conference on Machine Translation: Shared Task Papers","author":"Ma","year":"2018"},{"key":"2024100217180871200_bib34","doi-asserted-by":"crossref","first-page":"62","DOI":"10.18653\/v1\/W19-5302","article-title":"Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges","volume-title":"Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)","author":"Ma","year":"2019"},{"key":"2024100217180871200_bib35","doi-asserted-by":"publisher","first-page":"293","DOI":"10.3115\/v1\/W14-3336","article-title":"Results of the WMT14 metrics shared task","volume-title":"Proceedings of the Ninth Workshop on Statistical Machine Translation","author":"Mach\u00e1\u010dek","year":"2014"},{"key":"2024100217180871200_bib36","first-page":"495","article-title":"Project MAIA: Multilingual AI agent assistant","volume-title":"Proceedings of the 22nd Annual Conference of the European Association for Machine Translation","author":"Martins","year":"2020"},{"key":"2024100217180871200_bib37","doi-asserted-by":"publisher","first-page":"101","DOI":"10.18653\/v1\/W18-6311","article-title":"Contextual neural model for translating bilingual multi-speaker conversations","volume-title":"Proceedings of the Third Conference on Machine Translation: Research Papers","author":"Maruf","year":"2018"},{"key":"2024100217180871200_bib38","doi-asserted-by":"publisher","first-page":"3092","DOI":"10.18653\/v1\/N19-1313","article-title":"Selective attention for context-aware neural machine translation","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Maruf","year":"2019"},{"key":"2024100217180871200_bib39","first-page":"688","article-title":"Results of the WMT20 metrics shared task","volume-title":"Proceedings of the Fifth Conference on Machine Translation","author":"Mathur","year":"2020"},{"key":"2024100217180871200_bib40","first-page":"286","article-title":"A context-aware annotation framework for customer support live chat machine translation","volume-title":"Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track","author":"Menezes","year":"2023"},{"key":"2024100217180871200_bib41","article-title":"LLM evaluators recognize and favor their own generations","author":"Panickssery","year":"2024","journal-title":"arXiv preprint arXiv:2404.13076"},{"key":"2024100217180871200_bib42","doi-asserted-by":"publisher","first-page":"311","DOI":"10.3115\/1073083.1073135","article-title":"BLEU: A method for automatic evaluation of machine translation","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni","year":"2002"},{"key":"2024100217180871200_bib43","doi-asserted-by":"publisher","first-page":"392","DOI":"10.18653\/v1\/W15-3049","article-title":"chrf: character n-gram f-score for automatic mt evaluation","volume-title":"Proceedings of the Tenth Workshop on Statistical Machine Translation","author":"Popovi\u0107","year":"2015"},{"key":"2024100217180871200_bib44","doi-asserted-by":"publisher","first-page":"186","DOI":"10.18653\/v1\/W18-6319","article-title":"A call for clarity in reporting BLEU scores","volume-title":"Proceedings of the Third Conference on Machine Translation: Research Papers","author":"Post","year":"2018"},{"key":"2024100217180871200_bib45","doi-asserted-by":"publisher","first-page":"812","DOI":"10.18653\/v1\/2023.wmt-1.68","article-title":"Evaluating metrics for document-context evaluation in machine translation","volume-title":"Proceedings of the Eighth Conference on Machine Translation","author":"Raunak","year":"2023"},{"key":"2024100217180871200_bib46","first-page":"578","article-title":"COMET-22: Unbabel-IST 2022 submission for the metrics shared task","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Rei","year":"2022"},{"key":"2024100217180871200_bib47","doi-asserted-by":"crossref","first-page":"2685","DOI":"10.18653\/v1\/2020.emnlp-main.213","article-title":"COMET: A neural framework for MT evaluation","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Rei","year":"2020"},{"key":"2024100217180871200_bib48","first-page":"911","article-title":"Unbabel\u2019s participation in the WMT20 metrics shared task","volume-title":"Proceedings of the Fifth Conference on Machine Translation","author":"Rei","year":"2020"},{"key":"2024100217180871200_bib49","first-page":"634","article-title":"CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Rei","year":"2022"},{"key":"2024100217180871200_bib50","doi-asserted-by":"publisher","first-page":"2223","DOI":"10.1145\/3531146.3534638","article-title":"Understanding and being understood: User strategies for identifying and recovering from mistranslations in machine translation-mediated chat","volume-title":"Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency","author":"Robertson","year":"2022"},{"key":"2024100217180871200_bib51","doi-asserted-by":"publisher","first-page":"7881","DOI":"10.18653\/v1\/2020.acl-main.704","article-title":"Bleurt: Learning robust metrics for text generation","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Sellam","year":"2020"},{"key":"2024100217180871200_bib52","first-page":"223","article-title":"A study of translation edit rate with targeted human annotation","volume-title":"Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers","author":"Snover","year":"2006"},{"key":"2024100217180871200_bib53","first-page":"684","article-title":"Findings of the WMT 2021 shared task on quality estimation","volume-title":"Proceedings of the Sixth Conference on Machine Translation","author":"Specia","year":"2021"},{"key":"2024100217180871200_bib54","doi-asserted-by":"publisher","first-page":"256","DOI":"10.18653\/v1\/W15-3031","article-title":"Results of the WMT15 metrics shared task","volume-title":"Proceedings of the Tenth Workshop on Statistical Machine Translation","author":"Stanojevi\u0107","year":"2015"},{"key":"2024100217180871200_bib55","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-4811","article-title":"Neural machine translation with extended context","volume-title":"DiscoMT@EMNLP","author":"Tiedemann","year":"2017"},{"key":"2024100217180871200_bib56","first-page":"118","article-title":"Embarrassingly easy document-level MT metrics: How to convert any pretrained metric into a document-level metric","volume-title":"Proceedings of the Seventh Conference on Machine Translation (WMT)","author":"Vernikos","year":"2022"},{"key":"2024100217180871200_bib57","doi-asserted-by":"publisher","first-page":"1198","DOI":"10.18653\/v1\/P19-1116","article-title":"When a good translation is wrong in context: Context-aware machine translation improves on deixis, ellipsis, and lexical cohesion","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Voita","year":"2019"},{"key":"2024100217180871200_bib58","first-page":"483","article-title":"mT5: A massively multilingual pre-trained text-to-text transformer","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Xue","year":"2021"},{"key":"2024100217180871200_bib59","doi-asserted-by":"publisher","first-page":"679","DOI":"10.1145\/1518701.1518807","article-title":"Difficulties in establishing common ground in multiparty groups using machine translation","volume-title":"Proceedings of the SIGCHI Conference on Human Factors in Computing Systems","author":"Yamashita","year":"2009"},{"key":"2024100217180871200_bib60","doi-asserted-by":"publisher","DOI":"10.1002\/0470011815.b2a15150","article-title":"Spearman rank correlation","volume":"7","author":"Zar","year":"2005","journal-title":"Encyclopedia of Biostatistics"},{"key":"2024100217180871200_bib61","article-title":"Bertscore: Evaluating text generation with BERT","volume-title":"International Conference on Learning Representations","author":"Zhang","year":"2019"},{"key":"2024100217180871200_bib62","doi-asserted-by":"publisher","first-page":"563","DOI":"10.18653\/v1\/D19-1053","article-title":"MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Zhao","year":"2019"},{"key":"2024100217180871200_bib63","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2024.acl-short.45","article-title":"Fine-tuned machine translation metrics struggle in unseen domains","author":"Zouhar","year":"2024"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00700\/2473652\/tacl_a_00700.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00700\/2473652\/tacl_a_00700.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,29]],"date-time":"2024-11-29T00:44:39Z","timestamp":1732841079000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00700\/124628\/Assessing-the-Role-of-Context-in-Chat-Translation"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":63,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00700","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}