{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,8]],"date-time":"2026-06-08T02:55:24Z","timestamp":1780887324012,"version":"3.54.1"},"reference-count":39,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2025,1,27]],"date-time":"2025-01-27T00:00:00Z","timestamp":1737936000000},"content-version":"vor","delay-in-days":26,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,1,7]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Retrieval Augmented Generation (RAG) has become a popular application for large language models. It is preferable that successful RAG systems provide accurate answers that are supported by being grounded in a passage without any hallucinations. While considerable work is required for building a full RAG pipeline, being able to benchmark performance is also necessary. We present CLAPnq, a benchmark Long-form Question Answering dataset for the full RAG pipeline. CLAPnq includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus to perform either retrieval, generation, or the full RAG pipeline. The CLAPnq answers are concise, 3x smaller than the full passage, and cohesive, meaning that the answer is composed fluently, often by integrating multiple pieces of the passage that are not contiguous. RAG models must adapt to these properties to be successful at CLAPnq. We present baseline experiments and analysis for CLAPnq that highlight areas where there is still significant room for improvement in grounded RAG. 
CLAPnq is publicly available at https:\/\/github.com\/primeqa\/clapnq.<\/jats:p>","DOI":"10.1162\/tacl_a_00729","type":"journal-article","created":{"date-parts":[[2025,1,10]],"date-time":"2025-01-10T19:08:43Z","timestamp":1736536123000},"page":"53-72","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":13,"title":["<scp>CLAPnq<\/scp>: <u>C<\/u>ohesive <u>L<\/u>ong-form <u>A<\/u>nswers from <u>P<\/u>assages in Natural Questions for RAG systems"],"prefix":"10.1162","volume":"13","author":[{"given":"Sara","family":"Rosenthal","sequence":"first","affiliation":[{"name":"IBM Research AI, USA. sjrosenthal@us.ibm.com"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Avirup","family":"Sil","sequence":"additional","affiliation":[{"name":"IBM Research AI, USA. avi@us.ibm.com"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Radu","family":"Florian","sequence":"additional","affiliation":[{"name":"IBM Research AI, USA. raduf@us.ibm.com"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Salim","family":"Roukos","sequence":"additional","affiliation":[{"name":"IBM Research AI, USA. 
roukos@us.ibm.com"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"281","published-online":{"date-parts":[[2025,1,7]]},"reference":[{"key":"2025012714411232600_bib1","doi-asserted-by":"publisher","first-page":"468","DOI":"10.1162\/tacl_a_00471","article-title":"TopiOCQA: Open-domain conversational question answering with topic switching","volume":"10","author":"Adlakha","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2025012714411232600_bib2","first-page":"97","article-title":"QAMPARI: A benchmark for open-domain questions with many answers","volume-title":"Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)","author":"Amouyal","year":"2023"},{"key":"2025012714411232600_bib3","article-title":"Attributed question answering: Evaluation and modeling for attributed large language models","author":"Bohnet","year":"2023","journal-title":"ArXiv preprint:2212.08037 v2"},{"key":"2025012714411232600_bib4","first-page":"1877","article-title":"Language models are few-shot learners","volume-title":"Advances in Neural Information Processing Systems","author":"Brown","year":"2020"},{"key":"2025012714411232600_bib5","article-title":"BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation","author":"Chen","year":"2024","journal-title":"ArXiv preprint:2402.03216 v4"},{"key":"2025012714411232600_bib6","doi-asserted-by":"publisher","first-page":"17754","DOI":"10.1609\/aaai.v38i16.29728","article-title":"Benchmarking large language models in retrieval-augmented generation","volume-title":"Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20\u201327, 2024, Vancouver, 
Canada","author":"Chen","year":"2024"},{"key":"2025012714411232600_bib7","article-title":"Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality","author":"Chiang","year":"2023"},{"issue":"70","key":"2025012714411232600_bib8","first-page":"1","article-title":"Scaling instruction-finetuned language models","volume":"25","author":"Chung","year":"2024","journal-title":"Journal of Machine Learning Research"},{"key":"2025012714411232600_bib9","doi-asserted-by":"publisher","first-page":"7651","DOI":"10.1609\/aaai.v34i05.6266","article-title":"Joint learning of answer selection and answer summary generation in community question answering","volume-title":"The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7\u201312, 2020","author":"Deng","year":"2020"},{"key":"2025012714411232600_bib10","first-page":"150","article-title":"RAGAs: Automated evaluation of retrieval augmented generation","volume-title":"Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations","author":"Es","year":"2024"},{"key":"2025012714411232600_bib11","doi-asserted-by":"publisher","first-page":"3558","DOI":"10.18653\/v1\/P19-1346","article-title":"ELI5: Long form question answering","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Fan","year":"2019"},{"key":"2025012714411232600_bib12","article-title":"Proceedings of the 3rd Workshop on Machine Reading for Question Answering","author":"Fisch","year":"2021"},{"key":"2025012714411232600_bib13","doi-asserted-by":"publisher","first-page":"6465","DOI":"10.18653\/v1\/2023.emnlp-main.398","article-title":"Enabling large language models to generate text with 
citations","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Gao","year":"2023"},{"key":"2025012714411232600_bib14","article-title":"Retrieval-augmented generation for large language models: A survey","author":"Gao","year":"2024","journal-title":"arXiv preprint:2312.10997 v5"},{"key":"2025012714411232600_bib15","article-title":"REALM: Retrieval-augmented language model pre-training","author":"Guu","year":"2020","journal-title":"arXiv preprint:2002.08909 v1"},{"key":"2025012714411232600_bib16","article-title":"Mistral 7b","author":"Jiang","year":"2023","journal-title":"arXiv preprint: 2310.06825 v1"},{"key":"2025012714411232600_bib17","article-title":"AQuaMuSe: Automatically generating datasets for query-based multi-document summarization","author":"Kulkarni","year":"2020","journal-title":"arXiv preprint: 2010.12694 v1"},{"key":"2025012714411232600_bib18","doi-asserted-by":"publisher","first-page":"452","DOI":"10.1162\/tacl_a_00276","article-title":"Natural Questions: A benchmark for question answering research","volume":"7","author":"Kwiatkowski","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2025012714411232600_bib19","doi-asserted-by":"publisher","first-page":"6086","DOI":"10.18653\/v1\/P19-1612","article-title":"Latent retrieval for weakly supervised open domain question answering","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Lee","year":"2019"},{"key":"2025012714411232600_bib20","article-title":"Retrieval-augmented generation for knowledge-intensive nlp tasks","volume-title":"Proceedings of the 34th International Conference on Neural Information Processing Systems","author":"Lewis","year":"2020"},{"key":"2025012714411232600_bib21","doi-asserted-by":"publisher","first-page":"3214","DOI":"10.18653\/v1\/2022.acl-long.229","article-title":"TruthfulQA: Measuring how models mimic 
human falsehoods","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Lin","year":"2022"},{"key":"2025012714411232600_bib22","article-title":"RECALL: A benchmark for llms robustness against external counterfactual knowledge","author":"Yi","year":"2023","journal-title":"arXiv preprint: 2311.08147 v1"},{"key":"2025012714411232600_bib23","doi-asserted-by":"publisher","first-page":"3025","DOI":"10.18653\/v1\/2024.naacl-long.167","article-title":"ExpertQA: Expert-curated questions and attributed answers","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)","author":"Malaviya","year":"2024"},{"key":"2025012714411232600_bib24","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511809071","volume-title":"Introduction to Information Retrieval","author":"Manning","year":"2008"},{"key":"2025012714411232600_bib25","doi-asserted-by":"publisher","first-page":"5783","DOI":"10.18653\/v1\/2020.emnlp-main.466","article-title":"AmbigQA: Answering ambiguous open-domain questions","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Min","year":"2020"},{"issue":"2","key":"2025012714411232600_bib26","doi-asserted-by":"publisher","DOI":"10.1145\/3597307","article-title":"Biases in large language models: Origins, inventory, and discussion","volume":"15","author":"Navigli","year":"2023","journal-title":"Journal of Data and Information Quality"},{"key":"2025012714411232600_bib27","doi-asserted-by":"publisher","first-page":"2523","DOI":"10.18653\/v1\/2021.naacl-main.200","article-title":"KILT: A benchmark for knowledge intensive language tasks","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language 
Technologies","author":"Petroni","year":"2021"},{"key":"2025012714411232600_bib28","doi-asserted-by":"publisher","first-page":"784","DOI":"10.18653\/v1\/P18-2124","article-title":"Know what you don\u2019t know: Unanswerable questions for SQuAD","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Rajpurkar","year":"2018"},{"key":"2025012714411232600_bib29","doi-asserted-by":"publisher","first-page":"2383","DOI":"10.18653\/v1\/D16-1264","article-title":"SQuAD: 100,000+ questions for machine comprehension of text","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing","author":"Rajpurkar","year":"2016"},{"issue":"4","key":"2025012714411232600_bib30","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1561\/1500000019","article-title":"The probabilistic relevance framework: BM25 and beyond","volume":"3","author":"Robertson","year":"2009","journal-title":"Foundations and Trends\u00ae in Information Retrieval"},{"issue":"10","key":"2025012714411232600_bib31","doi-asserted-by":"publisher","DOI":"10.1145\/3560260","article-title":"QA dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension","volume":"55","author":"Rogers","year":"2023","journal-title":"ACM Computing Surveys"},{"key":"2025012714411232600_bib32","doi-asserted-by":"publisher","first-page":"8273","DOI":"10.18653\/v1\/2022.emnlp-main.566","article-title":"ASQA: Factoid questions meet long-form answers","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Stelmakh","year":"2022"},{"key":"2025012714411232600_bib33","article-title":"Stanford Alpaca: An instruction-following LLaMA model","author":"Taori","year":"2023"},{"key":"2025012714411232600_bib34","article-title":"BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval 
models","volume-title":"Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks","author":"Thakur","year":"2021"},{"key":"2025012714411232600_bib35","article-title":"LLaMA 2: Open foundation and fine-tuned chat models","author":"Touvron","year":"2023","journal-title":"arXiv preprint: 2307.09288 v1"},{"key":"2025012714411232600_bib36","volume-title":"TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing)","author":"Voorhees","year":"2005"},{"key":"2025012714411232600_bib37","article-title":"Text embeddings by weakly-supervised contrastive pre-training","author":"Wang","year":"2024","journal-title":"ArXiv preprint:2212.03533 v2"},{"key":"2025012714411232600_bib38","doi-asserted-by":"publisher","first-page":"8","DOI":"10.18653\/v1\/2023.dialdoc-1.2","article-title":"MoQA: Benchmarking multi-type open-domain question answering","volume-title":"Proceedings of the Third DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering","author":"Yen","year":"2023"},{"key":"2025012714411232600_bib39","first-page":"55734","article-title":"Large language model as attributed training data generator: A tale of diversity and bias","volume-title":"Advances in Neural Information Processing Systems","author":"Yue","year":"2023"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00729\/2499744\/tacl_a_00729.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00729\/2499744\/tacl_a_00729.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,27]],"date-time":"2025-01-27T14:41:27Z","timestamp":1737988887000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00729\/127456\/CLAPnq-Cohesive-Long-form-Answers-from-Passages-in"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":39,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00729","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]}}}