{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T16:18:14Z","timestamp":1778084294862,"version":"3.51.4"},"reference-count":54,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T00:00:00Z","timestamp":1776988800000},"content-version":"vor","delay-in-days":113,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,4,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Current automated fact-checking (AFC) approaches typically evaluate evidence either implicitly via the predicted verdicts or through exact matches with predefined closed knowledge sources, such as Wikipedia. However, these methods are limited due to their reliance on evaluation metrics originally designed for other purposes and constraints from closed knowledge sources. In this work, we introduce Ev2R which combines the strengths of reference-based evaluation and verdict-level proxy scoring. Ev2R jointly assesses how well the evidence aligns with the gold references and how reliably it supports the verdict, addressing the shortcomings of prior methods. We evaluate Ev2R against three types of evidence evaluation approaches: reference-based, proxy-reference, and reference-less baselines. Assessments against human ratings and adversarial tests demonstrate that Ev2R consistently outperforms existing scoring approaches in accuracy and robustness. 
It achieves stronger correlation with human judgments and greater robustness to adversarial perturbations, establishing it as a reliable metric for evidence evaluation in AFC.<\/jats:p>","DOI":"10.1162\/tacl.a.647","type":"journal-article","created":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T16:58:43Z","timestamp":1777049923000},"page":"530-561","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":1,"title":["Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking"],"prefix":"10.1162","volume":"14","author":[{"given":"Mubashara","family":"Akhtar","sequence":"first","affiliation":[{"name":"Department of Computer Science, ETH Zurich, Switzerland"},{"name":"ETH AI Center, Switzerland. mubashara.akhtar@ai.ethz.ch"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Schlichtkrull","sequence":"additional","affiliation":[{"name":"School of Electronic Engineering and Computer Science, Queen Mary University of London, UK. m.schlichtkrull@qmul.ac.uk"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andreas","family":"Vlachos","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, University of Cambridge, UK. 
av308@cam.ac.uk"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","published-online":{"date-parts":[[2026,4,15]]},"reference":[{"key":"2026042412583696400_bib1","doi-asserted-by":"publisher","first-page":"5430","DOI":"10.18653\/v1\/2023.findings-emnlp.361","article-title":"Multimodal automated fact-checking: A survey","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2023","author":"Akhtar","year":"2023"},{"key":"2026042412583696400_bib2","doi-asserted-by":"publisher","first-page":"13921","DOI":"10.18653\/v1\/2024.findings-acl.828","article-title":"ChartCheck: Explainable fact-checking over real-world chart images","volume-title":"Findings of the Association for Computational Linguistics: ACL 2024","author":"Akhtar","year":"2024"},{"key":"2026042412583696400_bib3","doi-asserted-by":"publisher","first-page":"4685","DOI":"10.18653\/v1\/D19-1475","article-title":"MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Augenstein","year":"2019"},{"key":"2026042412583696400_bib4","first-page":"65","article-title":"METEOR: An automatic metric for MT evaluation with improved correlation with human judgments","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","author":"Banerjee","year":"2005"},{"key":"2026042412583696400_bib5","article-title":"Overview of the CLEF-2018 checkthat! lab on automatic identification and verification of political claims. 
task 2: Factuality","volume-title":"Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10\u201314, 2018","author":"Barr\u00f3n-Cede\u00f1o","year":"2018"},{"key":"2026042412583696400_bib6","article-title":"Evaluation of text generation: A survey","author":"Celikyilmaz","year":"2020","journal-title":"CoRR"},{"key":"2026042412583696400_bib7","doi-asserted-by":"publisher","first-page":"3569","DOI":"10.18653\/v1\/2024.naacl-long.196","article-title":"Complex claim verification with evidence retrieved in the wild","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)","author":"Chen","year":"2024"},{"key":"2026042412583696400_bib8","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2026042412583696400_bib9","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2407.21783","article-title":"The llama 3 herd of models","author":"Dubey","year":"2024","journal-title":"CoRR"},{"key":"2026042412583696400_bib10","doi-asserted-by":"publisher","first-page":"5055","DOI":"10.18653\/v1\/2020.acl-main.454","article-title":"FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Durmus","year":"2020"},{"key":"2026042412583696400_bib11","doi-asserted-by":"publisher","first-page":"3938","DOI":"10.18653\/v1\/N19-1395","article-title":"Question answering as an automatic evaluation metric for news article summarization","volume-title":"Proceedings of the 
2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Eyal","year":"2019"},{"issue":"5","key":"2026042412583696400_bib12","doi-asserted-by":"publisher","first-page":"378","DOI":"10.1037\/h0031619","article-title":"Measuring nominal scale agreement among several raters","volume":"76","author":"Fleiss","year":"1971","journal-title":"Psychological Bulletin"},{"key":"2026042412583696400_bib13","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2302.04166","article-title":"Gptscore: Evaluate as you desire","author":"Jinlan","year":"2023","journal-title":"CoRR"},{"key":"2026042412583696400_bib14","doi-asserted-by":"publisher","first-page":"166","DOI":"10.1145\/3292500.3330955","article-title":"Assessing the factual accuracy of generated text","volume-title":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4\u20138, 2019","author":"Goodrich","year":"2019"},{"key":"2026042412583696400_bib15","unstructured":"L. Graves. 2018. 
Understanding the promise and limits of automated fact-checking."},{"issue":"3","key":"2026042412583696400_bib16","doi-asserted-by":"publisher","first-page":"518","DOI":"10.1111\/cccr.12163","article-title":"Anatomy of a fact check: Objective practice and the contested epistemology of fact checking","volume":"10","author":"Graves","year":"2017","journal-title":"Communication, Culture and Critique"},{"key":"2026042412583696400_bib17","doi-asserted-by":"publisher","first-page":"178","DOI":"10.1162\/tacl_a_00454","article-title":"A survey on automated fact-checking","volume":"10","author":"Guo","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2026042412583696400_bib18","article-title":"DeBERTa: Decoding-enhanced BERT with disentangled attention","volume-title":"9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\u20137, 2021","author":"He","year":"2021"},{"key":"2026042412583696400_bib19","doi-asserted-by":"publisher","first-page":"518","DOI":"10.1609\/icwsm.v12i1.14982","article-title":"Sampling the news producers: A large news and feature data set for the study of the complex media landscape","volume-title":"Proceedings of the Twelfth International Conference on Web and Social Media, ICWSM 2018, Stanford, California, USA, June 25\u201328, 2018","author":"Horne","year":"2018"},{"key":"2026042412583696400_bib20","first-page":"87","article-title":"Evaluation of review summaries via question-answering","volume-title":"Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association","author":"Huang","year":"2021"},{"issue":"12","key":"2026042412583696400_bib21","doi-asserted-by":"publisher","first-page":"248:1\u2013248:38","DOI":"10.1145\/3571730","article-title":"Survey of hallucination in natural language generation","volume":"55","author":"Ji","year":"2023","journal-title":"ACM Computing 
Surveys"},{"key":"2026042412583696400_bib22","doi-asserted-by":"publisher","first-page":"3441","DOI":"10.18653\/v1\/2020.findings-emnlp.309","article-title":"HoVer: A dataset for many-hop fact extraction and claim verification","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Jiang","year":"2020"},{"key":"2026042412583696400_bib23","first-page":"28","article-title":"NUBIA: NeUral based interchangeability assessor for text generation","volume-title":"Proceedings of the 1st Workshop on Evaluating NLG Evaluation","author":"Kane","year":"2020"},{"key":"2026042412583696400_bib24","article-title":"Content analysis: An introduction to its methodology","author":"Krippendorff","year":"2004"},{"key":"2026042412583696400_bib25","first-page":"74","article-title":"ROUGE: A package for automatic evaluation of summaries","volume-title":"Text Summarization Branches Out","author":"Lin","year":"2004"},{"key":"2026042412583696400_bib26","doi-asserted-by":"publisher","first-page":"6826","DOI":"10.18653\/v1\/2022.findings-emnlp.508","article-title":"WANLI: Worker and AI collaboration for natural language inference dataset creation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2022","author":"Liu","year":"2022"},{"key":"2026042412583696400_bib27","doi-asserted-by":"publisher","first-page":"3428","DOI":"10.18653\/v1\/P19-1334","article-title":"Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"McCoy","year":"2019"},{"key":"2026042412583696400_bib28","doi-asserted-by":"publisher","first-page":"12076","DOI":"10.18653\/v1\/2023.emnlp-main.741","article-title":"FActScore: Fine-grained atomic evaluation of factual precision in long form text generation","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language 
Processing","author":"Min","year":"2023"},{"key":"2026042412583696400_bib29","doi-asserted-by":"publisher","first-page":"3950","DOI":"10.18653\/v1\/D18-1429","article-title":"Towards a better metric for evaluating question generation systems","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Nema","year":"2018"},{"key":"2026042412583696400_bib30","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016859","article-title":"Combining fact extraction and verification with neural semantic matching networks","volume-title":"Association for the Advancement of Artificial Intelligence (AAAI)","author":"Nie","year":"2019"},{"key":"2026042412583696400_bib31","doi-asserted-by":"publisher","first-page":"4885","DOI":"10.18653\/v1\/2020.acl-main.441","article-title":"Adversarial NLI: A new benchmark for natural language understanding","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Nie","year":"2020"},{"key":"2026042412583696400_bib32","first-page":"422","article-title":"DanFEVER: Claim verification dataset for Danish","volume-title":"Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)","author":"N\u00f8rregaard","year":"2021"},{"key":"2026042412583696400_bib33","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2303.08774","article-title":"GPT-4 technical report","author":"OpenAI","year":"2023","journal-title":"CoRR"},{"key":"2026042412583696400_bib34","unstructured":"GPT4o OpenAI. 2024. Blog post: Hello gpt-4o. https:\/\/openai.com\/index\/hello-gpt-4o\/. 
Accessed: 2024-10-23."},{"key":"2026042412583696400_bib35","doi-asserted-by":"publisher","first-page":"311","DOI":"10.3115\/1073083.1073135","article-title":"Bleu: A method for automatic evaluation of machine translation","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni","year":"2002"},{"key":"2026042412583696400_bib36","doi-asserted-by":"publisher","first-page":"4886","DOI":"10.18653\/v1\/2021.findings-emnlp.421","article-title":"Does putting a linguist in the loop improve NLU data collection?","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Parrish","year":"2021"},{"issue":"187","key":"2026042412583696400_bib37","doi-asserted-by":"publisher","first-page":"253","DOI":"10.1098\/rsta.1896.0007","article-title":"Vii. mathematical contributions to the theory of evolution.\u2013iii. Regression, heredity, and panmixia","author":"Pearson","year":"1896","journal-title":"Philosophical Transactions of the Royal Society of London. 
Series A, Containing Papers of a Mathematical or Physical Character"},{"key":"2026042412583696400_bib38","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2403.05530","article-title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","author":"Reid","year":"2024","journal-title":"CoRR"},{"key":"2026042412583696400_bib39","doi-asserted-by":"publisher","first-page":"4902","DOI":"10.18653\/v1\/2020.acl-main.442","article-title":"Beyond accuracy: Behavioral testing of NLP models with CheckList","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Ribeiro","year":"2020"},{"issue":"2","key":"2026042412583696400_bib40","doi-asserted-by":"publisher","first-page":"26:1\u201326:39","DOI":"10.1145\/3485766","article-title":"A survey of evaluation metrics used for NLG systems","volume":"55","author":"Sai","year":"2023","journal-title":"ACM Computing Surveys"},{"key":"2026042412583696400_bib41","first-page":"6874","article-title":"Automated fact-checking of claims from Wikipedia","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation Conference","author":"Sathe","year":"2020"},{"key":"2026042412583696400_bib42","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/2024.fever-1.1","article-title":"The automated verification of textual claims (AVeriTeC) shared task","volume-title":"Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)","author":"Schlichtkrull","year":"2024"},{"key":"2026042412583696400_bib43","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2305.13117","article-title":"Averitec: A dataset for real-world claim verification with evidence from the Web","author":"Schlichtkrull","year":"2023","journal-title":"CoRR"},{"key":"2026042412583696400_bib44","doi-asserted-by":"publisher","first-page":"624","DOI":"10.18653\/v1\/2021.naacl-main.52","article-title":"Get your vitamin C! 
robust fact verification with contrastive evidence","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Schuster","year":"2021"},{"key":"2026042412583696400_bib45","doi-asserted-by":"publisher","first-page":"3246","DOI":"10.18653\/v1\/D19-1320","article-title":"Answers unite! unsupervised metrics for reinforced summarization models","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Scialom","year":"2019"},{"key":"2026042412583696400_bib46","doi-asserted-by":"publisher","first-page":"7881","DOI":"10.18653\/v1\/2020.acl-main.704","article-title":"BLEURT: Learning robust metrics for text generation","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Sellam","year":"2020"},{"key":"2026042412583696400_bib47","article-title":"Fakecovid- A multilingual cross-domain fact check news dataset for COVID-19","volume-title":"Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, ICWSM 2020 Workshops, Atlanta, Georgia, USA [virtual], June 8, 2020","author":"Shahi","year":"2020"},{"issue":"3\/4","key":"2026042412583696400_bib48","doi-asserted-by":"publisher","first-page":"441","DOI":"10.2307\/1422689","article-title":"The proof and measurement of association between two things","volume":"100","author":"Spearman","year":"1987","journal-title":"The American Journal of Psychology"},{"key":"2026042412583696400_bib49","doi-asserted-by":"publisher","first-page":"809","DOI":"10.18653\/v1\/N18-1074","article-title":"FEVER: A large-scale dataset for fact extraction and VERification","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language 
Technologies, Volume 1 (Long Papers)","author":"Thorne","year":"2018"},{"key":"2026042412583696400_bib50","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/W18-5501","article-title":"The fact extraction and VERification (FEVER) shared task","volume-title":"Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)","author":"Thorne","year":"2018"},{"key":"2026042412583696400_bib51","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume-title":"Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28\u2013December 9, 2022","author":"Wei","year":"2022"},{"key":"2026042412583696400_bib52","doi-asserted-by":"publisher","first-page":"1112","DOI":"10.18653\/v1\/N18-1101","article-title":"A broad-coverage challenge corpus for sentence understanding through inference","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Williams","year":"2018"},{"key":"2026042412583696400_bib53","article-title":"BERTScore: Evaluating text generation with BERT","volume-title":"8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26\u201330, 2020","author":"Zhang","year":"2020"},{"key":"2026042412583696400_bib54","doi-asserted-by":"publisher","first-page":"563","DOI":"10.18653\/v1\/D19-1053","article-title":"MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Zhao","year":"2019"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/TACL.a.647\/2596847\/tacl.a.647.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/TACL.a.647\/2596847\/tacl.a.647.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T16:58:47Z","timestamp":1777049927000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/TACL.a.647\/136433\/Ev2R-Evaluating-Evidence-Retrieval-in-Automated"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026]]},"references-count":54,"URL":"https:\/\/doi.org\/10.1162\/tacl.a.647","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2026]]},"published":{"date-parts":[[2026]]}}}