{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T22:44:31Z","timestamp":1773787471388,"version":"3.50.1"},"reference-count":63,"publisher":"MIT Press - Journals","issue":"4","license":[{"start":{"date-parts":[[2021,8,13]],"date-time":"2021-08-13T00:00:00Z","timestamp":1628812800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,12,23]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>In order to simplify sentences, several rewriting operations can be performed, such as replacing complex words per simpler synonyms, deleting unnecessary information, and splitting long sentences. Despite this multi-operation nature, evaluation of automatic simplification systems relies on metrics that moderately correlate with human judgments on the simplicity achieved by executing specific operations (e.g., simplicity gain based on lexical replacements). In this article, we investigate how well existing metrics can assess sentence-level simplifications where multiple operations may have been applied and which, therefore, require more general simplicity judgments. For that, we first collect a new and more reliable data set for evaluating the correlation of metrics and human judgments of overall simplicity. Second, we conduct the first meta-evaluation of automatic metrics in Text Simplification, using our new data set (and other existing data) to analyze the variation of the correlation between metrics\u2019 scores and human judgments across three dimensions: the perceived simplicity level, the system type, and the set of references used for computation. 
We show that these three aspects affect the correlations and, in particular, highlight the limitations of commonly used operation-specific metrics. Finally, based on our findings, we propose a set of recommendations for automatic evaluation of multi-operation simplifications, suggesting which metrics to compute and how to interpret their scores.<\/jats:p>","DOI":"10.1162\/coli_a_00418","type":"journal-article","created":{"date-parts":[[2021,8,13]],"date-time":"2021-08-13T14:46:18Z","timestamp":1628865978000},"page":"861-889","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":26,"title":["The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification"],"prefix":"10.1162","volume":"47","author":[{"given":"Fernando","family":"Alva-Manchego","sequence":"first","affiliation":[{"name":"Department of Computer Science University of Sheffield, UK. f.alva@sheffield.ac.uk"}]},{"given":"Carolina","family":"Scarton","sequence":"additional","affiliation":[{"name":"Department of Computer Science University of Sheffield, UK. c.scarton@sheffield.ac.uk"}]},{"given":"Lucia","family":"Specia","sequence":"additional","affiliation":[{"name":"Department of Computing Imperial College London, UK. 
l.specia@imperial.ac.uk"}]}],"member":"281","published-online":{"date-parts":[[2021,12,23]]},"reference":[{"key":"2022010319054990700_bib1","first-page":"228","article-title":"Universal conceptual cognitive annotation (UCCA)","volume-title":"Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Abend","year":"2013"},{"key":"2022010319054990700_bib2","doi-asserted-by":"publisher","first-page":"15","DOI":"10.1145\/1456536.145654","article-title":"A corpus analysis of simple account texts and the proposal of simplification strategies: First steps towards text simplification systems","volume-title":"Proceedings of the 26th Annual ACM International Conference on Design of Communication","author":"Alu\u00edsio","year":"2008"},{"key":"2022010319054990700_bib3","unstructured":"Alva-Manchego, Fernando\n          . 2020. Automatic Sentence Simplification with Multiple Rewriting Transformations. Ph.D. thesis, University of Sheffield, Sheffield, UK."},{"key":"2022010319054990700_bib4","first-page":"295","article-title":"Learning how to simplify from explicit labeling of complex-simplified text pairs","volume-title":"Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Alva-Manchego","year":"2017"},{"key":"2022010319054990700_bib5","doi-asserted-by":"publisher","first-page":"4668","DOI":"10.18653\/v1\/2020.acl-main.424","article-title":"ASSET: A data set for tuning and evaluation of sentence simplification models with multiple rewriting transformations","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Alva-Manchego","year":"2020"},{"key":"2022010319054990700_bib6","doi-asserted-by":"publisher","first-page":"49","DOI":"10.18653\/v1\/D19-3009","article-title":"EASSE: Easier automatic sentence simplification evaluation","volume-title":"Proceedings of the 2019 Conference on 
Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations","author":"Alva-Manchego","year":"2019"},{"issue":"1","key":"2022010319054990700_bib7","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1162\/coli_a_00370","article-title":"Data-driven sentence simplification: Survey and benchmark","volume":"46","author":"Alva-Manchego","year":"2020","journal-title":"Computational Linguistics"},{"key":"2022010319054990700_bib8","doi-asserted-by":"publisher","first-page":"344","DOI":"10.18653\/v1\/W19-8642","article-title":"Agreement is overrated: A plea for correlation to assess human evaluation reliability","volume-title":"Proceedings of the 12th International Conference on Natural Language Generation","author":"Amidei","year":"2019"},{"key":"2022010319054990700_bib9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/W19-5301","article-title":"Findings of the 2019 conference on machine translation (WMT19)","volume-title":"Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)","author":"Barrault","year":"2019"},{"key":"2022010319054990700_bib10","doi-asserted-by":"publisher","first-page":"272","DOI":"10.18653\/v1\/W18-6401","article-title":"Findings of the 2018 conference on machine translation (WMT18)","volume-title":"Proceedings of the Third Conference on Machine Translation: Shared Task Papers","author":"Bojar","year":"2018"},{"key":"2022010319054990700_bib11","doi-asserted-by":"publisher","first-page":"489","DOI":"10.18653\/v1\/W17-4755","article-title":"Results of the WMT17 metrics shared task","volume-title":"Proceedings of the Second Conference on Machine Translation","author":"Bojar","year":"2017"},{"key":"2022010319054990700_bib12","doi-asserted-by":"publisher","first-page":"199","DOI":"10.18653\/v1\/W16-2302","article-title":"Results of the WMT16 metrics shared 
task","volume-title":"Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers","author":"Bojar","year":"2016"},{"key":"2022010319054990700_bib13","first-page":"87","article-title":"Spanish text simplification: An exploratory study","volume":"47","author":"Bott","year":"2011","journal-title":"Procesamiento del Lenguaje Natural"},{"key":"2022010319054990700_bib14","first-page":"5588","article-title":"CombiNMT: An exploration into neural text simplification models","volume-title":"Proceedings of the 12th Language Resources and Evaluation Conference","author":"Cooper","year":"2020"},{"key":"2022010319054990700_bib15","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2022010319054990700_bib16","doi-asserted-by":"publisher","first-page":"3393","DOI":"10.18653\/v1\/P19-1331","article-title":"EditNTS: An neural programmer-interpreter model for sentence simplification through explicit editing","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Dong","year":"2019"},{"key":"2022010319054990700_bib17","first-page":"1","article-title":"Sentence simplification as tree transduction","volume-title":"Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations","author":"Feblowitz","year":"2013"},{"issue":"3","key":"2022010319054990700_bib18","doi-asserted-by":"publisher","first-page":"515","DOI":"10.1162\/coli_a_00356","article-title":"Taking MT evaluation metrics to extremes: Beyond correlation with human judgments","volume":"45","author":"Fomicheva","year":"2019","journal-title":"Computational 
Linguistics"},{"key":"2022010319054990700_bib19","doi-asserted-by":"publisher","first-page":"61","DOI":"10.18653\/v1\/2020.emnlp-main.5","article-title":"BLEU might be guilty but references are not innocent","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Freitag","year":"2020"},{"key":"2022010319054990700_bib20","first-page":"758","article-title":"PPDB: The paraphrase database","volume-title":"Proceedings of NAACL-HLT","author":"Ganitkevitch","year":"2013"},{"key":"2022010319054990700_bib21","first-page":"33","article-title":"Continuous measurement scales in human evaluation of machine translation","volume-title":"Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse","author":"Graham","year":"2013"},{"issue":"1","key":"2022010319054990700_bib22","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1017\/S1351324915000339","article-title":"Can machine translation systems be evaluated by the crowd alone?","volume":"23","author":"Graham","year":"2017","journal-title":"Natural Language Engineering"},{"key":"2022010319054990700_bib23","first-page":"462","article-title":"Dynamic multi-level multi-task learning for sentence simplification","volume-title":"Proceedings of the 27th International Conference on Computational Linguistics","author":"Guo","year":"2018"},{"key":"2022010319054990700_bib24","doi-asserted-by":"publisher","first-page":"1627","DOI":"10.18653\/v1\/2020.findings-emnlp.147","article-title":"SimAlign: High quality word alignments without parallel training data using static and contextualized embeddings","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Jalili Sabet","year":"2020"},{"key":"2022010319054990700_bib25","doi-asserted-by":"publisher","first-page":"7943","DOI":"10.18653\/v1\/2020.acl-main.709","article-title":"Neural CRF model for sentence alignment in text 
simplification","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Jiang","year":"2020"},{"key":"2022010319054990700_bib26","doi-asserted-by":"crossref","unstructured":"Kincaid, J. P., R. P. Fishburne, R. L. Rogers, and B. S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel. Technical Report 8\u201375, Chief of Naval Technical Training: Naval Air Station Memphis. 10.21236\/ADA006655","DOI":"10.21236\/ADA006655"},{"key":"2022010319054990700_bib27","doi-asserted-by":"crossref","first-page":"177","DOI":"10.3115\/1557769.1557821","article-title":"Moses: Open source toolkit for statistical machine translation","volume-title":"Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions","author":"Koehn","year":"2007"},{"key":"2022010319054990700_bib28","doi-asserted-by":"publisher","first-page":"3137","DOI":"10.18653\/v1\/N19-1317","article-title":"Complexity-weighted loss and diverse reranking for sentence simplification","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Kriz","year":"2019"},{"key":"2022010319054990700_bib29","doi-asserted-by":"publisher","first-page":"7918","DOI":"10.18653\/v1\/2020.acl-main.707","article-title":"Iterative edit-based unsupervised sentence simplification","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Kumar","year":"2020"},{"issue":"1","key":"2022010319054990700_bib30","doi-asserted-by":"publisher","first-page":"159","DOI":"10.2307\/2529310","article-title":"The measurement of observer agreement for categorical 
data","volume":"33","author":"Landis","year":"1977","journal-title":"Biometrics"},{"key":"2022010319054990700_bib31","doi-asserted-by":"publisher","first-page":"671","DOI":"10.18653\/v1\/W18-6450","article-title":"Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance","volume-title":"Proceedings of the Third Conference on Machine Translation: Shared Task Papers","author":"Ma","year":"2018"},{"key":"2022010319054990700_bib32","doi-asserted-by":"publisher","first-page":"62","DOI":"10.18653\/v1\/W19-5302","article-title":"Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges","volume-title":"Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)","author":"Ma","year":"2019"},{"key":"2022010319054990700_bib33","first-page":"3536","article-title":"Controllable text simplification with explicit paraphrasing","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Maddela","year":"2021"},{"key":"2022010319054990700_bib34","doi-asserted-by":"publisher","first-page":"4689","DOI":"10.18653\/v1\/2021.naacl-main.277","article-title":"Controllable sentence simplification","volume-title":"Proceedings of the 12th Language Resources and Evaluation Conference","author":"Martin","year":"2020"},{"key":"2022010319054990700_bib35","doi-asserted-by":"publisher","first-page":"4984","DOI":"10.18653\/v1\/2020.acl-main.448","article-title":"Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Mathur","year":"2020"},{"key":"2022010319054990700_bib36","first-page":"688","article-title":"Results of the WMT20 metrics shared task","volume-title":"Proceedings of the Fifth Conference on Machine 
Translation","author":"Mathur","year":"2020"},{"key":"2022010319054990700_bib37","doi-asserted-by":"publisher","first-page":"435","DOI":"10.3115\/v1\/P14-1041","article-title":"Hybrid simplification using deep semantics and machine translation","volume-title":"Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Narayan","year":"2014"},{"key":"2022010319054990700_bib38","doi-asserted-by":"publisher","first-page":"111","DOI":"10.18653\/v1\/W16-6620","article-title":"Unsupervised sentence simplification using deep semantics","volume-title":"Proceedings of the 9th International Natural Language Generation Conference","author":"Narayan","year":"2016"},{"key":"2022010319054990700_bib39","doi-asserted-by":"publisher","first-page":"85","DOI":"10.18653\/v1\/P17-2014","article-title":"Exploring neural text simplification models","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Nisioi","year":"2017"},{"key":"2022010319054990700_bib40","doi-asserted-by":"publisher","first-page":"311","DOI":"10.3115\/1073083.1073135","article-title":"BLEU: A method for automatic evaluation of machine translation","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni","year":"2002"},{"key":"2022010319054990700_bib41","doi-asserted-by":"crossref","first-page":"143","DOI":"10.18653\/v1\/P16-2024","article-title":"Simple PPDB: A paraphrase database for simplification","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Pavlick","year":"2016"},{"key":"2022010319054990700_bib42","unstructured":"Petersen, Sarah E.\n          \n          2007. Natural Language Processing Tools for Reading Level Assessment and Text Simplification for Bilingual Education. Ph.D. thesis. 
University of Washington, Seattle, WA, USA. AAI3275902."},{"key":"2022010319054990700_bib43","doi-asserted-by":"publisher","first-page":"186","DOI":"10.18653\/v1\/W18-6319","article-title":"A call for clarity in reporting BLEU scores","volume-title":"Proceedings of the Third Conference on Machine Translation: Research Papers","author":"Post","year":"2018"},{"issue":"3","key":"2022010319054990700_bib44","doi-asserted-by":"crossref","first-page":"393","DOI":"10.1162\/coli_a_00322","article-title":"A structured review of the validity of BLEU","volume":"44","author":"Reiter","year":"2018","journal-title":"Computational Linguistics"},{"issue":"4","key":"2022010319054990700_bib45","doi-asserted-by":"publisher","first-page":"37","DOI":"10.1300\/J079v21n04_02","article-title":"Qualitative descriptors of strength of association and effect size","volume":"21","author":"Rosenthal","year":"1996","journal-title":"Journal of Social Service Research"},{"issue":"2","key":"2022010319054990700_bib46","doi-asserted-by":"publisher","first-page":"420","DOI":"10.1037\/0033-2909.86.2.420","article-title":"Intraclass correlations: Uses in assessing rater reliability","volume":"86","author":"Shrout","year":"1979","journal-title":"Psychological Bulletin"},{"key":"2022010319054990700_bib47","doi-asserted-by":"publisher","first-page":"738","DOI":"10.18653\/v1\/D18-1081","article-title":"BLEU is not suitable for the evaluation of text simplification","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Sulem","year":"2018"},{"key":"2022010319054990700_bib48","doi-asserted-by":"publisher","first-page":"685","DOI":"10.18653\/v1\/N18-1063","article-title":"Semantic structural evaluation for text simplification","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long 
Papers)","author":"Sulem","year":"2018"},{"key":"2022010319054990700_bib49","doi-asserted-by":"publisher","first-page":"162","DOI":"10.18653\/v1\/P18-1016","article-title":"Simple and effective text simplification using semantic and neural methods","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Sulem","year":"2018"},{"key":"2022010319054990700_bib50","doi-asserted-by":"publisher","first-page":"38","DOI":"10.21105\/joss.01026","article-title":"Joint learning of a dual SMT system for paraphrase generation","volume-title":"Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2","author":"Sun","year":"2012"},{"key":"2022010319054990700_bib51","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3115\/v1\/W14-1201","article-title":"One step closer to automatic evaluation of text simplification systems","volume-title":"Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)","author":"\u0160tajner","year":"2014"},{"issue":"31","key":"2022010319054990700_bib52","doi-asserted-by":"crossref","first-page":"1026","DOI":"10.21105\/joss.01026","article-title":"Pingouin: Statistics in Python","volume":"3","author":"Vallat","year":"2018","journal-title":"Journal of Open Source Software"},{"key":"2022010319054990700_bib53","first-page":"5998","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems 30","author":"Vaswani","year":"2017"},{"key":"2022010319054990700_bib54","doi-asserted-by":"publisher","first-page":"79","DOI":"10.18653\/v1\/N18-2013","article-title":"Sentence simplification with memory-augmented neural networks","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short 
Papers)","author":"Vu","year":"2018"},{"key":"2022010319054990700_bib55","volume-title":"Regression Analysis","author":"Williams","year":"1959"},{"key":"2022010319054990700_bib56","first-page":"409","article-title":"Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming","volume-title":"Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing","author":"Woodsend","year":"2011"},{"key":"2022010319054990700_bib57","first-page":"1015","article-title":"Sentence simplification by monolingual machine translation","volume-title":"Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1","author":"Wubben","year":"2012"},{"key":"2022010319054990700_bib58","doi-asserted-by":"publisher","first-page":"283","DOI":"10.1162\/tacl_a_00139","article-title":"Problems in current text simplification research: New data can help","volume":"3","author":"Xu","year":"2015","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2022010319054990700_bib59","doi-asserted-by":"publisher","first-page":"401","DOI":"10.1162\/tacl_a_00107","article-title":"Optimizing statistical machine translation for text simplification","volume":"4","author":"Xu","year":"2016","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2022010319054990700_bib60","first-page":"43","article-title":"BERTScore: Evaluating text generation with BERT","volume-title":"International Conference on Learning Representations","author":"Zhang","year":"2020"},{"key":"2022010319054990700_bib61","doi-asserted-by":"publisher","first-page":"595","DOI":"10.18653\/v1\/D17-1062","article-title":"Sentence simplification with deep reinforcement learning","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language 
Processing","author":"Zhang","year":"2017"},{"key":"2022010319054990700_bib62","doi-asserted-by":"publisher","first-page":"3164","DOI":"10.18653\/v1\/D18-1355","article-title":"Integrating transformer and paraphrase rules for sentence simplification","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Zhao","year":"2018"},{"key":"2022010319054990700_bib63","first-page":"1353","article-title":"A monolingual tree-based translation model for sentence simplification","volume-title":"Proceedings of the 23rd International Conference on Computational Linguistics","author":"Zhu","year":"2010"}],"container-title":["Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/coli\/article-pdf\/47\/4\/861\/1979827\/coli_a_00418.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/coli\/article-pdf\/47\/4\/861\/1979827\/coli_a_00418.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,1,3]],"date-time":"2022-01-03T19:06:16Z","timestamp":1641236776000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/47\/4\/861\/106930\/The-Un-Suitability-of-Automatic-Evaluation-Metrics"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12]]},"references-count":63,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2021,12,23]]},"published-print":{"date-parts":[[2021,12,23]]}},"URL":"https:\/\/doi.org\/10.1162\/coli_a_00418","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"value":"0891-2017","type":"print"},{"value":"1530-9312","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,12]]},"published":{"date-parts":[[2021,12]]}}}