{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,14]],"date-time":"2026-05-14T20:06:45Z","timestamp":1778789205242,"version":"3.51.4"},"reference-count":84,"publisher":"Cambridge University Press (CUP)","issue":"2","license":[{"start":{"date-parts":[[2019,9,11]],"date-time":"2019-09-11T00:00:00Z","timestamp":1568160000000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2020,3]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This article presents the most up-to-date, influential automated, semiautomated and human metrics used to evaluate the quality of machine translation (MT) output and provides the necessary background for MT evaluation projects. Evaluation is, as repeatedly admitted, highly relevant for the improvement of MT. This article is divided into three parts: the first one is dedicated to automated metrics; the second, to human metrics; and the last, to the challenges posed by neural machine translation (NMT) regarding its evaluation. The first part includes reference translation\u2013based metrics; confidence or quality estimation (QE) metrics, which are used as alternatives for quality assessment; and diagnostic evaluation based on linguistic checkpoints. Human evaluation metrics are classified according to the criterion of whether human judges directly express a so-called subjective evaluation judgment, such as \u2018good\u2019 or \u2018better than\u2019, or not, as is the case in error classification. The former methods are based on directly expressed judgment (DEJ); therefore, they are called \u2018DEJ-based evaluation methods\u2019, while the latter are called \u2018non-DEJ-based evaluation methods\u2019. 
In the DEJ-based evaluation section, tasks such as fluency and adequacy annotation, ranking and direct assessment (DA) are presented, whereas in the non-DEJ-based evaluation section, tasks such as error classification and postediting are detailed, with definitions and guidelines, thus rendering this article a useful guide for evaluation projects. Following the detailed presentation of the previously mentioned metrics, the specificities of NMT are set forth along with suggestions for its evaluation, according to the latest studies. As human translators are the most adequate judges of the quality of a translation, emphasis is placed on the human metrics seen from a translator-judge perspective to provide useful methodology tools for interdisciplinary research groups that evaluate MT systems.<\/jats:p>","DOI":"10.1017\/s1351324919000469","type":"journal-article","created":{"date-parts":[[2019,9,11]],"date-time":"2019-09-11T06:56:57Z","timestamp":1568185017000},"page":"137-161","source":"Crossref","is-referenced-by-count":67,"title":["How to evaluate machine translation: A review of automated and human metrics"],"prefix":"10.1017","volume":"26","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4628-6089","authenticated-orcid":false,"given":"Eirini","family":"Chatzikoumi","sequence":"first","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2019,9,11]]},"reference":[{"key":"S1351324919000469_ref78","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W18-6312"},{"key":"S1351324919000469_ref77","unstructured":"Tom\u00e1s, J. , Mas, J.A. and Casacuberta, F. (2003). A quantitative method for machine translation evaluation. In Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: Are Evaluation Methods, Metrics and Resources Reusable?, Budapest, Hungary."},{"key":"S1351324919000469_ref76","unstructured":"Temnikova, I. (2010). A cognitive evaluation approach for a controlled language post-editing experiment. 
In Proceedings of International Conference Language Resources and Evaluation (LREC2010), Valletta, Malta."},{"key":"S1351324919000469_ref75","unstructured":"Sutskever, I. , Vinyals, O. and Le, Q. (2014). Sequence to sequence learning with neural networks. In Proceedings of Advances in Neural Information Processing Systems, Montreal, Canada, pp. 3104\u20133112 ."},{"key":"S1351324919000469_ref73","unstructured":"Specia, L. , Shah, K. , De Souza, J.G.C. and Cohn, T. (2013). QuEst \u2013 A translation quality estimation framework. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, pp. 79\u201384."},{"key":"S1351324919000469_ref72","doi-asserted-by":"publisher","DOI":"10.1007\/s10590-010-9077-2"},{"key":"S1351324919000469_ref70","unstructured":"Sennrich, R. (2017). How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. arXiv:1612.04629v3."},{"key":"S1351324919000469_ref69","first-page":"806","volume-title":"Handbook of Natural Language Processing and Machine Translation. DARPA Global Autonomous Language Exploitation","author":"Sanders","year":"2011"},{"key":"S1351324919000469_ref66","unstructured":"Quirk, C.B. (2004). Training a sentence-level machine translation confidence measure. In Proceedings of the 4th Conference on Language Resources and Evaluation, Lisbon, Portugal, pp. 825\u2013828."},{"key":"S1351324919000469_ref65","doi-asserted-by":"publisher","DOI":"10.1007\/s10590-009-9065-6"},{"key":"S1351324919000469_ref64","volume-title":"Handbook of Natural Language Processing and Machine Translation. DARPA Global Autonomous Language Exploitation","author":"Przybocki","year":"2011"},{"key":"S1351324919000469_ref62","doi-asserted-by":"publisher","DOI":"10.3115\/1626355.1626362"},{"key":"S1351324919000469_ref80","unstructured":"Ueffing, N. and Ney, H. (2005). 
Application of word-level confidence measures in interactive statistical machine translation. In Proceedings of the 10th Conference of the European Association for Machine Translation, Budapest, Hungary, pp. 262\u2013270."},{"key":"S1351324919000469_ref61","author":"Popovi\u0107","year":"2018"},{"key":"S1351324919000469_ref58","unstructured":"Nord, C. (1997). Translating as a Purposeful Activity: Functionalist Approaches Explained. Manchester: St. Jerome."},{"key":"S1351324919000469_ref57","unstructured":"Niessen, S. , Och, F. , Leusch, G. and Ney, H. (2000). An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece."},{"key":"S1351324919000469_ref56","volume-title":"A Textbook of Translation.","author":"Newmark","year":"1988"},{"key":"S1351324919000469_ref55","doi-asserted-by":"publisher","DOI":"10.1145\/375360.375365"},{"key":"S1351324919000469_ref52","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00056"},{"key":"S1351324919000469_ref51","unstructured":"Lommel, A. , Popovi\u0107, M. and Burchardt, A. (2014). Assessing inter-annotator agreement for translation error annotation. In Proceedings of LREC Workshop on Automatic and manual Metrics for Operational Translation Evaluation, Reykjavik, Iceland."},{"key":"S1351324919000469_ref49","doi-asserted-by":"publisher","DOI":"10.3115\/1218955.1219032"},{"key":"S1351324919000469_ref54","doi-asserted-by":"publisher","DOI":"10.3115\/1073483.1073504"},{"key":"S1351324919000469_ref48","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions, and reversals","volume":"10","author":"Levenshtein","year":"1966","journal-title":"Soviet Physics \u2013 Doklady"},{"key":"S1351324919000469_ref47","unstructured":"Leusch, G. , Ueffing, N. and Ney, H. (2006). CDER: Efficient MT evaluation using block movements. 
In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy."},{"key":"S1351324919000469_ref46","unstructured":"Leusch, G. , Ueffing, N. and Ney, H. (2003). A novel string-to-string distance measure with applications to machine translation evaluation. In Proceedings of MT Summit IX, New Orleans, LA, USA."},{"key":"S1351324919000469_ref45","volume-title":"Proceedings of the 13th MT Summit","author":"Lavie","year":"2011"},{"key":"S1351324919000469_ref42","doi-asserted-by":"publisher","DOI":"10.3115\/1654650.1654666"},{"key":"S1351324919000469_ref40","volume-title":"Statistical Machine Translation","author":"Koehn","year":"2010"},{"key":"S1351324919000469_ref39","doi-asserted-by":"publisher","DOI":"10.5565\/rev\/tradumatica.76"},{"key":"S1351324919000469_ref38","unstructured":"Klubi\u010dka, F. , Toral, A. and S\u00e1nchez-Cartagena, V. (2018). Quantitative fine-grained human evaluation of machine translation systems: A case study on English to Croatian. arXiv:1802.01451v1."},{"key":"S1351324919000469_ref36","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1263"},{"key":"S1351324919000469_ref34","unstructured":"Hassan, H. , Aue, A. , Chen, C. , Chowdhary, V. , Clark, J. , Federmann, C. , Huang, X. , Junczys-Dowmunt, M. , Lewis, W. , Li, M. , Liu, S. , Liu, T. , Luo, R. , Menezes, A. , Qin, T. , Seide, F. , Tan, X. , Tian, F. , Wu, L. , Wu, S. , Xia, Y. , Zhang, D. , Zhang, Z. and Zhou, M. (2018). Achieving human parity on automatic Chinese to English news translation. arXiv:1803.05567."},{"key":"S1351324919000469_ref32","unstructured":"Han, A.L.F. , Wong, D.F. and Chao, L.S. (2012). LEPOR: A robust evaluation metric for machine translation with augmented factors. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, Mumbai, India, pp. 
441\u2013450."},{"key":"S1351324919000469_ref31","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324915000339"},{"key":"S1351324919000469_ref30","first-page":"33","volume-title":"Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse","author":"Graham","year":"2013"},{"key":"S1351324919000469_ref28","volume-title":"Asiya. An open toolkit for automatic machine translation (meta-)evaluation","author":"Gonz\u00e0lez","year":"2014"},{"key":"S1351324919000469_ref26","volume-title":"Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL (CONLL)","author":"Gandrabur","year":"2003"},{"key":"S1351324919000469_ref24","volume-title":"Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC)","author":"Federmann","year":"2010"},{"key":"S1351324919000469_ref8","first-page":"2158","volume-title":"Proceedings of the Eighth International Conference on Language Resources and Evaluation","author":"Berka","year":"2012"},{"key":"S1351324919000469_ref63","doi-asserted-by":"publisher","DOI":"10.1162\/COLI_a_00072"},{"key":"S1351324919000469_ref22","first-page":"162","volume-title":"Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Dreyer","year":"2012"},{"key":"S1351324919000469_ref5","volume-title":"Proceedings of ICLR 2015","author":"Bahdanau","year":"2015"},{"key":"S1351324919000469_ref50","doi-asserted-by":"publisher","DOI":"10.3115\/1220575.1220668"},{"key":"S1351324919000469_ref81","doi-asserted-by":"publisher","DOI":"10.1007\/s10590-013-9141-9"},{"key":"S1351324919000469_ref37","first-page":"1700","volume-title":"Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing","author":"Kalchbrenner","year":"2013"},{"key":"S1351324919000469_ref59","unstructured":"Papineni, K. , Roukos, S. , Ward, T. and Zhu, W.J. (2002). 
BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL-2002: 40th Annual meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311\u2013318. CiteSeerX: 10.1.1.19.9416"},{"key":"S1351324919000469_ref15","first-page":"249","volume-title":"Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL)","author":"Callison-Burch","year":"2006"},{"key":"S1351324919000469_ref16","volume-title":"Proceedings of the Annual Conference of the European Association for Machine Translation","author":"Carl","year":"2010"},{"key":"S1351324919000469_ref44","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1512"},{"key":"S1351324919000469_ref10","volume-title":"Proceedings of the 20th International Conference on Computational Linguistics (COLING)","author":"Blatz","year":"2004"},{"key":"S1351324919000469_ref68","unstructured":"S\u00e1nchez-Gij\u00f3n, P. and Torres-Hostench, O. (2014). MT post-editing into the mother tongue or into a foreign language? Spanish-to-English MT translation output post-edited by translation trainees. In Proceedings of the Third Workshop on Post-Editing Technology and Practice, 11th Conference of the Association for Machine Translation in the Americas (AMTA), Vancouver, Canada, pp. 5\u201319."},{"key":"S1351324919000469_ref71","volume-title":"Proceedings of the 7th Conference of the Association for Machine Translation in the Americas","author":"Snover","year":"2006"},{"key":"S1351324919000469_ref41","unstructured":"Koehn, P. and Knowles, R. (2017). Six challenges for neural machine translation. 
arXiv:1706.03872v1."},{"key":"S1351324919000469_ref83","doi-asserted-by":"publisher","DOI":"10.3115\/1610075.1610087"},{"key":"S1351324919000469_ref25","first-page":"86","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations","author":"Federmann","year":"2018"},{"key":"S1351324919000469_ref27","first-page":"120","volume-title":"Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations","author":"Girardi","year":"2014"},{"key":"S1351324919000469_ref35","doi-asserted-by":"crossref","DOI":"10.4324\/9781315752839","volume-title":"Translation Quality Assessment: Past and Present","author":"House","year":"2014"},{"key":"S1351324919000469_ref20","first-page":"128","volume-title":"Proceedings of the 2nd Human Language Technologies Conference (HLT-02)","author":"Doddington","year":"2002"},{"key":"S1351324919000469_ref43","volume-title":"Proceedings of the Third Workshop on Post-Editing Technology and Practice. 11th Conference of the Association for Machine Translation in the Americas","author":"Lacruz","year":"2014"},{"key":"S1351324919000469_ref33","unstructured":"Han, L. (2018). Machine translation evaluation resources and methods: A survey. arXiv:1605.04515v8. Cornell University Library."},{"key":"S1351324919000469_ref84","doi-asserted-by":"publisher","DOI":"10.3115\/1599081.1599222"},{"key":"S1351324919000469_ref82","unstructured":"Wu, Y. , Schuster, M. , Chen, Z. , Le, Q.V. , Norouzi, M. , Macherey, W. , Krikun, M. , Cao, Y. , Gao, Q. , Macherey, K. , Klingner, J. , Shah, A. , Johnson, M. , Liu, X. , Kaiser, L. , Gouws, S. , Kato, Y. , Kudo, T. , Kazawa, H. , Stevens, K. , Kurian, G. , Patil, N. , Wang, W. , Young, C. , Smith, J. , Riesa, J. , Rudnick, A. , Vinyals, O. , Corrado, G. , Hughes, M. and Dean, J. (2016). Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs\/1609.08144. 
Available at: http:\/\/arxiv.org\/abs\/1609.08144."},{"key":"S1351324919000469_ref29","unstructured":"G\u00f6r\u00f6g, A. (2014). Quantifying and benchmarking quality: The TAUS Dynamic Quality Framework. Revista Tradum\u00e0tica: tecnologies de la traducci\u00f3, Traducci\u00f3 i qualitat 12. ISSN: 1578\u20137559. Available at http:\/\/revistes.uab.cat\/tradumatica."},{"key":"S1351324919000469_ref9","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1134"},{"key":"S1351324919000469_ref7","unstructured":"Bentivogli, L. , Cettolo, M. , Federico, M. and Federmann, C. (2018). Machine translation human evaluation: An investigation of evaluation based on Post-editing and its relation with Direct Assessment. In Proceedings of the International Workshop on Spoken Language Translation, Bruges, Belgium, pp. 62\u201369."},{"key":"S1351324919000469_ref74","unstructured":"Specia, L. , Turchi, M. , Cancedda, N. , Dymetman, M. and Cristianini, N. (2009). Estimating the Sentence Level Quality of Machine Translation Systems. In EAMT09, Barcelona, Spain, pp. 28\u201337."},{"key":"S1351324919000469_ref53","unstructured":"Massardo, I. , Van der Meer, J. , O\u2019Brien, S. , Hollowood, F. , Aranberri, N. and Drescher, K. (2016). MT Post-Editing Guidelines. 
The Netherlands: TAUS Signature Editions."},{"key":"S1351324919000469_ref1","first-page":"228","volume-title":"Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Abend","year":"2013"},{"key":"S1351324919000469_ref60","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W15-3049"},{"key":"S1351324919000469_ref4","volume-title":"Proceedings of ACL (Association for Computational Linguistics)","author":"Babych","year":"2004"},{"key":"S1351324919000469_ref67","volume-title":"Sur la traduction","author":"Ricoeur","year":"2003"},{"key":"S1351324919000469_ref13","first-page":"286","volume-title":"Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing","author":"Callison-Burch","year":"2009"},{"key":"S1351324919000469_ref2","first-page":"137","volume-title":"Proceedings of the Conference of the European Association for Machine Translation","author":"Ageeva","year":"2015"},{"key":"S1351324919000469_ref3","unstructured":"Amigo, E. , Gim\u00e9nez, J. , Gonzalo, J. and M\u00e0rquez, L. (2006). MT evaluation: Human-like vs. human acceptable. In Proceedings of COLING-ACL06, Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics, Sydney, Australia."},{"key":"S1351324919000469_ref6","first-page":"65","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and\/or Summarization","author":"Banerjee","year":"2005"},{"key":"S1351324919000469_ref11","doi-asserted-by":"publisher","DOI":"10.2478\/v10108-011-0005-2"},{"key":"S1351324919000469_ref12","unstructured":"Bojar, O. , Federmann, C. , Haddow, B. , Koehn, P. , Post, M. and Specia, L. (2016). Ten years of WMT evaluation campaigns: Lessons learnt. In Proceedings of the LREC 2016 Workshop \u201cTranslation Evaluation \u2013 From Fragmented Tools and Data Sets to an Integrated Ecosystem\u201d. 
Available at http:\/\/www.cracking-the-language-barrier.eu\/wp-content\/uploads\/Bojar-Federmann-etal.pdf."},{"key":"S1351324919000469_ref18","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-4012"},{"key":"S1351324919000469_ref79","doi-asserted-by":"publisher","DOI":"10.1093\/mind\/LIX.236.433"},{"key":"S1351324919000469_ref14","doi-asserted-by":"publisher","DOI":"10.3115\/1626355.1626373"},{"key":"S1351324919000469_ref17","volume-title":"Corpora, Language, Teaching and Resources: From Theory to Practice","author":"Castagnoli","year":"2010"},{"key":"S1351324919000469_ref19","first-page":"63","volume-title":"Proceedings of MT Summit IX","author":"Coughlin","year":"2003"},{"key":"S1351324919000469_ref21","first-page":"801","volume-title":"Handbook of Natural Language Processing and Machine Translation. DARPA Global Autonomous Language Exploitation","author":"Dorr","year":"2011"},{"key":"S1351324919000469_ref23","unstructured":"Euromatrix (2007). Survey of machine translation evaluation. 
Statistical and Hybrid Machine Translation Between All European Languages, IST 034291, Deliverable 1.3."}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324919000469","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,1,20]],"date-time":"2021-01-20T00:56:31Z","timestamp":1611104191000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324919000469\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,9,11]]},"references-count":84,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2020,3]]}},"alternative-id":["S1351324919000469"],"URL":"https:\/\/doi.org\/10.1017\/s1351324919000469","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,9,11]]}}}