{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,5]],"date-time":"2026-04-05T21:26:34Z","timestamp":1775424394518,"version":"3.50.1"},"reference-count":55,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,5,23]],"date-time":"2024-05-23T00:00:00Z","timestamp":1716422400000},"content-version":"vor","delay-in-days":143,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,5,16]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Instruction-following models are attractive alternatives to fine-tuned approaches for question answering (QA). By simply prepending relevant documents and an instruction to their input, these models can be adapted to various information domains and tasks without additional training. However, these models tend to produce verbose responses with supplementary information, which makes traditional QA metrics like exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we evaluate instruction-following models along two fronts: 1) how well they satisfy user\u2019s information need (correctness), and 2) whether they disseminate information supported by the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness and propose simple token-overlap metrics that correlate highly with human judgments. Our analysis reveals that for correctness, instruction-following models perform comparably to models specifically fine-tuned for that task. However, they struggle to accurately judge the relevance of the provided knowledge and often hallucinate in their responses. We hope our work encourages more holistic evaluation of instruction-following models for QA. 
Our code and human annotation data is available at https:\/\/github.com\/McGill-NLP\/instruct-qa.<\/jats:p>","DOI":"10.1162\/tacl_a_00667","type":"journal-article","created":{"date-parts":[[2024,5,23]],"date-time":"2024-05-23T17:48:40Z","timestamp":1716486520000},"page":"681-699","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":58,"title":["Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering"],"prefix":"10.1162","volume":"12","author":[{"given":"Vaibhav","family":"Adlakha","sequence":"first","affiliation":[{"name":"Mila, McGill University, Canada. vaibhav.adlakha@mila.quebec"},{"name":"ServiceNow Research, Canada"}]},{"given":"Parishad","family":"BehnamGhader","sequence":"additional","affiliation":[{"name":"Mila, McGill University, Canada. parishad.behnamghader@mila.quebec"}]},{"given":"Xing Han","family":"Lu","sequence":"additional","affiliation":[{"name":"Mila, McGill University, Canada. xing-han.lu@mila.quebec"}]},{"given":"Nicholas","family":"Meade","sequence":"additional","affiliation":[{"name":"Mila, McGill University, Canada. nicholas.meade@mila.quebec"}]},{"given":"Siva","family":"Reddy","sequence":"additional","affiliation":[{"name":"Mila, McGill University, Canada. 
siva.reddy@mila.quebec"},{"name":"ServiceNow Research, Canada"},{"name":"Facebook CIFAR AI Chair, Canada"}]}],"member":"281","published-online":{"date-parts":[[2024,5,16]]},"reference":[{"key":"2024052819541901700_bib1","doi-asserted-by":"publisher","first-page":"468","DOI":"10.1162\/tacl_a_00471","article-title":"TopiOCQA: Open-domain conversational question answering with topic switching","volume":"10","author":"Adlakha","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024052819541901700_bib2","first-page":"65","article-title":"METEOR: An automatic metric for MT evaluation with improved correlation with human judgments","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","author":"Banerjee","year":"2005"},{"key":"2024052819541901700_bib3","first-page":"1877","article-title":"Language models are few-shot learners","volume-title":"Advances in Neural Information Processing Systems","author":"Brown","year":"2020"},{"key":"2024052819541901700_bib4","doi-asserted-by":"publisher","first-page":"291","DOI":"10.18653\/v1\/2022.emnlp-main.20","article-title":"Tomayto, tomahto. 
beyond token-level answer equivalence for question answering evaluation","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Bulian","year":"2022"},{"key":"2024052819541901700_bib5","doi-asserted-by":"publisher","first-page":"15607","DOI":"10.18653\/v1\/2023.acl-long.870","article-title":"Can large language models be an alternative to human evaluations?","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Chiang","year":"2023"},{"key":"2024052819541901700_bib6","article-title":"Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality","author":"Chiang","year":"2023"},{"key":"2024052819541901700_bib7","doi-asserted-by":"publisher","first-page":"947","DOI":"10.18653\/v1\/2023.findings-acl.60","article-title":"The dangers of trusting stochastic parrots: Faithfulness and trust in open-domain conversational question answering","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Chiesurin","year":"2023"},{"key":"2024052819541901700_bib8","article-title":"Deep reinforcement learning from human preferences","volume-title":"Advances in Neural Information Processing Systems","author":"Christiano","year":"2017"},{"key":"2024052819541901700_bib9","article-title":"Scaling instruction-finetuned language models","author":"Chung","year":"2022","journal-title":"arXiv preprint arXiv:2210.11416"},{"key":"2024052819541901700_bib10","doi-asserted-by":"publisher","first-page":"1473","DOI":"10.1162\/tacl_a_00529","article-title":"FaithDial: A faithful benchmark for information-seeking dialogue","volume":"10","author":"Dziri","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024052819541901700_bib11","article-title":"Faith and fate: Limits of transformers on compositionality","author":"Dziri","year":"2023","journal-title":"arXiv 
preprint arXiv:2305.18654"},{"key":"2024052819541901700_bib12","doi-asserted-by":"publisher","first-page":"5271","DOI":"10.18653\/v1\/2022.naacl-main.387","article-title":"On the origin of hallucinations in conversational models: Is it the datasets or the models?","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Dziri","year":"2022"},{"key":"2024052819541901700_bib13","doi-asserted-by":"publisher","first-page":"1066","DOI":"10.1162\/tacl_a_00506","article-title":"Evaluating attribution in dialogue systems: The BEGIN benchmark","volume":"10","author":"Dziri","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024052819541901700_bib14","article-title":"News summarization and evaluation in the era of GPT-3","author":"Goyal","year":"2022","journal-title":"arXiv preprint arXiv:2209.12356"},{"key":"2024052819541901700_bib15","doi-asserted-by":"publisher","first-page":"7856","DOI":"10.18653\/v1\/2021.emnlp-main.619","article-title":"Q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Or","year":"2021"},{"key":"2024052819541901700_bib16","article-title":"OPT-IML: Scaling language model instruction meta learning through the lens of generalization","author":"Iyer","year":"2022","journal-title":"arXiv preprint arXiv:2212.12017"},{"key":"2024052819541901700_bib17","doi-asserted-by":"publisher","first-page":"874","DOI":"10.18653\/v1\/2021.eacl-main.74","article-title":"Leveraging passage retrieval with generative models for open domain question answering","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main 
Volume","author":"Izacard","year":"2021"},{"key":"2024052819541901700_bib18","doi-asserted-by":"publisher","first-page":"5591","DOI":"10.18653\/v1\/2023.acl-long.307","article-title":"Evaluating open-domain question answering in the era of large language models","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Kamalloo","year":"2023"},{"key":"2024052819541901700_bib19","doi-asserted-by":"publisher","first-page":"6769","DOI":"10.18653\/v1\/2020.emnlp-main.550","article-title":"Dense passage retrieval for open-domain question answering","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Karpukhin","year":"2020"},{"issue":"3","key":"2024052819541901700_bib20","doi-asserted-by":"publisher","first-page":"239","DOI":"10.1093\/biomet\/33.3.239","article-title":"The treatment of ties in ranking problems","volume":"33","author":"Kendall","year":"1945","journal-title":"Biometrika"},{"key":"2024052819541901700_bib21","doi-asserted-by":"publisher","first-page":"452","DOI":"10.1162\/tacl_a_00276","article-title":"Natural questions: A benchmark for question answering research","volume":"7","author":"Kwiatkowski","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024052819541901700_bib22","article-title":"Internet-augmented language models through few-shot prompting for open-domain question answering","author":"Lazaridou","year":"2022","journal-title":"arXiv preprint arXiv:2203.05115"},{"key":"2024052819541901700_bib23","doi-asserted-by":"publisher","first-page":"6086","DOI":"10.18653\/v1\/P19-1612","article-title":"Latent retrieval for weakly supervised open domain question answering","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational 
Linguistics","author":"Lee","year":"2019"},{"key":"2024052819541901700_bib24","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive NLP tasks","volume-title":"Advances in Neural Information Processing Systems","author":"Lewis","year":"2020"},{"key":"2024052819541901700_bib25","first-page":"74","article-title":"ROUGE: A package for automatic evaluation of summaries","volume-title":"Text Summarization Branches Out","author":"Lin","year":"2004"},{"key":"2024052819541901700_bib26","article-title":"Lost in the middle: How language models use long contexts","author":"Liu","year":"2023","journal-title":"arXiv preprint arXiv:2307.03172"},{"key":"2024052819541901700_bib27","doi-asserted-by":"publisher","first-page":"2511","DOI":"10.18653\/v1\/2023.emnlp-main.153","article-title":"G-eval: NLG evaluation using gpt-4 with better human alignment","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Liu","year":"2023"},{"key":"2024052819541901700_bib28","doi-asserted-by":"publisher","first-page":"9802","DOI":"10.18653\/v1\/2023.acl-long.546","article-title":"When not to trust language models: Investigating effectiveness of parametric and non-parametric memories","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Mallen","year":"2023"},{"key":"2024052819541901700_bib29","first-page":"86","article-title":"NeurIPS 2020 EfficientQA Competition: Systems, analyses and lessons learned","volume-title":"Proceedings of the NeurIPS 2020 Competition and Demonstration Track","author":"Min","year":"2021"},{"key":"2024052819541901700_bib30","doi-asserted-by":"publisher","first-page":"3470","DOI":"10.18653\/v1\/2022.acl-long.244","article-title":"Cross-task generalization via natural language crowdsourcing instructions","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational 
Linguistics (Volume 1: Long Papers)","author":"Mishra","year":"2022"},{"key":"2024052819541901700_bib31","article-title":"Gpt-4 technical report","author":"OpenAI","year":"2023","journal-title":"arXiv preprint arXiv:2303.08774"},{"key":"2024052819541901700_bib32","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume-title":"Advances in Neural Information Processing Systems","author":"Ouyang","year":"2022"},{"key":"2024052819541901700_bib33","article-title":"Instruction tuning with gpt-4","author":"Peng","year":"2023","journal-title":"arXiv preprint arXiv:2304.03277"},{"issue":"140","key":"2024052819541901700_bib34","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"2024052819541901700_bib35","doi-asserted-by":"publisher","first-page":"2383","DOI":"10.18653\/v1\/D16-1264","article-title":"SQuAD: 100,000+ questions for machine comprehension of text","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing","author":"Rajpurkar","year":"2016"},{"key":"2024052819541901700_bib36","article-title":"Measuring attribution in natural language generation models","author":"Rashkin","year":"2021","journal-title":"ArXiv"},{"key":"2024052819541901700_bib37","doi-asserted-by":"publisher","first-page":"704","DOI":"10.18653\/v1\/2021.acl-long.58","article-title":"Increasing faithfulness in knowledge-grounded dialogue with controllable features","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long 
Papers)","author":"Rashkin","year":"2021"},{"key":"2024052819541901700_bib38","doi-asserted-by":"publisher","first-page":"249","DOI":"10.1162\/tacl_a_00266","article-title":"CoQA: A conversational question answering challenge","volume":"7","author":"Reddy","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024052819541901700_bib39","first-page":"109","article-title":"Okapi at trec-3","volume":"109","author":"Robertson","year":"1995","journal-title":"Nist Special Publication Sp"},{"key":"2024052819541901700_bib40","article-title":"Multitask prompted training enables zero-shot task generalization","volume-title":"International Conference on Learning Representations","author":"Sanh","year":"2022"},{"key":"2024052819541901700_bib41","article-title":"REPLUG: Retrieval-augmented black-box language models","author":"Shi","year":"2023"},{"key":"2024052819541901700_bib42","doi-asserted-by":"publisher","first-page":"3784","DOI":"10.18653\/v1\/2021.findings-emnlp.320","article-title":"Retrieval augmentation reduces hallucination in conversation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Shuster","year":"2021"},{"key":"2024052819541901700_bib43","doi-asserted-by":"publisher","first-page":"58","DOI":"10.18653\/v1\/2021.sustainlp-1.7","article-title":"Combining lexical and dense retrieval for computationally efficient multi-hop question answering","volume-title":"Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing","author":"Sidiropoulos","year":"2021"},{"key":"2024052819541901700_bib44","article-title":"Stanford Alpaca: An instruction-following LLaMA model","author":"Taori","year":"2023"},{"key":"2024052819541901700_bib45","article-title":"LLaMA: Open and efficient foundation language models","author":"Touvron","year":"2023"},{"key":"2024052819541901700_bib46","article-title":"Llama 2: Open foundation and fine-tuned chat 
models","author":"Touvron","year":"2023"},{"key":"2024052819541901700_bib47","article-title":"Large language models are not fair evaluators","author":"Wang","year":"2023","journal-title":"arXiv preprint arXiv:2305.17926"},{"key":"2024052819541901700_bib48","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.754","article-title":"Self-instruct: Aligning language model with self generated instructions","author":"Wang","year":"2022","journal-title":"arXiv preprint arXiv:2212.10560"},{"key":"2024052819541901700_bib49","doi-asserted-by":"publisher","first-page":"5085","DOI":"10.18653\/v1\/2022.emnlp-main.340","article-title":"Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Wang","year":"2022"},{"key":"2024052819541901700_bib50","article-title":"Finetuned language models are zero-shot learners","volume-title":"International Conference on Learning Representations","author":"Wei","year":"2022"},{"key":"2024052819541901700_bib51","article-title":"Answering complex open-domain questions with multi-hop dense retrieval","volume-title":"International Conference on Learning Representations","author":"Xiong","year":"2021"},{"key":"2024052819541901700_bib52","doi-asserted-by":"publisher","first-page":"3225","DOI":"10.18653\/v1\/2023.acl-long.181","article-title":"A critical evaluation of evaluations for long-form question answering","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Fangyuan","year":"2023"},{"key":"2024052819541901700_bib53","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1259","article-title":"HotpotQA: A dataset for diverse, explainable multi-hop question answering","volume-title":"Conference on Empirical Methods in Natural Language Processing 
(EMNLP)","author":"Yang","year":"2018"},{"key":"2024052819541901700_bib54","article-title":"OPT: Open pre-trained transformer language models","author":"Zhang","year":"2022"},{"key":"2024052819541901700_bib55","article-title":"BERTScore: Evaluating text generation with BERT","volume-title":"International Conference on Learning Representations","author":"Zhang","year":"2020"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00667\/2374800\/tacl_a_00667.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00667\/2374800\/tacl_a_00667.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,28]],"date-time":"2024-05-28T19:55:28Z","timestamp":1716926128000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00667\/121196\/Evaluating-Correctness-and-Faithfulness-of"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":55,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00667","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}