{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,6]],"date-time":"2026-04-06T10:06:22Z","timestamp":1775469982509,"version":"3.50.1"},"reference-count":63,"publisher":"MIT Press - Journals","license":[{"start":{"date-parts":[[2021,10,12]],"date-time":"2021-10-12T00:00:00Z","timestamp":1633996800000},"content-version":"vor","delay-in-days":284,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,7]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Open-domain Question Answering models that directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared with conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models fall short of the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) while retaining high accuracy. 
Lastly, we demonstrate RePAQ\u2019s strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to \u201cback-off\u201d to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.<\/jats:p>","DOI":"10.1162\/tacl_a_00415","type":"journal-article","created":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T23:02:36Z","timestamp":1634511756000},"page":"1098-1115","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":50,"title":["PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them"],"prefix":"10.1162","volume":"9","author":[{"given":"Patrick","family":"Lewis","sequence":"first","affiliation":[{"name":"Facebook AI Research"},{"name":"University College London, UK plewis@fb.com"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuxiang","family":"Wu","sequence":"additional","affiliation":[{"name":"University College London, UK. yuxiang.wu@cs.ucl.ac.uk"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Linqing","family":"Liu","sequence":"additional","affiliation":[{"name":"University College London, UK. linqing.liu@cs.ucl.ac.uk"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pasquale","family":"Minervini","sequence":"additional","affiliation":[{"name":"University College London, UK. p.minervini@cs.ucl.ac.uk"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Heinrich","family":"K\u00fcttler","sequence":"additional","affiliation":[{"name":"Facebook AI Research. hnr@fb.com"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Aleksandra","family":"Piktus","sequence":"additional","affiliation":[{"name":"Facebook AI Research. 
piktus@fb.com"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pontus","family":"Stenetorp","sequence":"additional","affiliation":[{"name":"University College London, UK. p.stenetorp@cs.ucl.ac.uk"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sebastian","family":"Riedel","sequence":"additional","affiliation":[{"name":"Facebook AI Research"},{"name":"University College London, UK. sriedel@fb.com"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","published-online":{"date-parts":[[2021,10,7]]},"reference":[{"key":"2021101221391872200_bib1","doi-asserted-by":"crossref","first-page":"6168","DOI":"10.18653\/v1\/P19-1620","article-title":"Synthetic QA corpora generation with roundtrip consistency","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Alberti","year":"2019"},{"key":"2021101221391872200_bib2","first-page":"344","article-title":"Leveraging linguistic structure for open domain information extraction","volume-title":"Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Angeli","year":"2015"},{"key":"2021101221391872200_bib3","article-title":"Challenges in information seeking QA: Unanswerable questions and paragraph retrieval","author":"Asai","year":"2020","journal-title":"arXiv:2010.11915 [cs]"},{"key":"2021101221391872200_bib4","first-page":"1533","article-title":"Semantic parsing on freebase from question-answer pairs","volume-title":"Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing","author":"Berant","year":"2013"},{"key":"2021101221391872200_bib5","doi-asserted-by":"crossref","first-page":"1247","DOI":"10.1145\/1376616.1376746","article-title":"Freebase: A collaboratively created graph database for structuring human knowledge","volume-title":"Proceedings of the 
2008 ACM SIGMOD international conference on Management of data","author":"Bollacker","year":"2008"},{"key":"2021101221391872200_bib6","article-title":"Autoregressive entity retrieval","volume-title":"International Conference on Learning Representations","author":"Cao","year":"2021"},{"key":"2021101221391872200_bib7","doi-asserted-by":"crossref","first-page":"1870","DOI":"10.18653\/v1\/P17-1171","article-title":"Reading Wikipedia to answer open-domain questions","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Chen","year":"2017"},{"key":"2021101221391872200_bib8","doi-asserted-by":"crossref","first-page":"34","DOI":"10.18653\/v1\/2020.acl-tutorials.8","article-title":"Open-domain question answering","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts","author":"Chen","year":"2020"},{"key":"2021101221391872200_bib9","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2021101221391872200_bib10","article-title":"Every Model Learned by Gradient Descent Is Approximately a Kernel Machine","author":"Domingos","year":"2020","journal-title":"arXiv:2012.00152 [cs, stat]"},{"key":"2021101221391872200_bib11","first-page":"1342","article-title":"Learning to ask: Neural question generation for reading comprehension","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Xinya","year":"2017"},{"key":"2021101221391872200_bib12","article-title":"Accelerating real-time question answering via question 
generation","author":"Fang","year":"2020","journal-title":"arXiv:2009.05167 [cs]"},{"key":"2021101221391872200_bib13","doi-asserted-by":"crossref","first-page":"2051","DOI":"10.18653\/v1\/P18-1191","article-title":"Large-scale QA-SRL parsing","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"FitzGerald","year":"2018"},{"issue":"5","key":"2021101221391872200_bib14","doi-asserted-by":"crossref","first-page":"1189","DOI":"10.1214\/aos\/1013203451","article-title":"Greedy function approximation: A gradient boosting machine.","volume":"29","author":"Friedman","year":"2001","journal-title":"Annals of Statistics"},{"key":"2021101221391872200_bib15","doi-asserted-by":"crossref","first-page":"4937","DOI":"10.18653\/v1\/2020.emnlp-main.400","article-title":"Entities as experts: Sparse memory access with entity supervision","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"F\u00e9vry","year":"2020"},{"key":"2021101221391872200_bib16","first-page":"3929","article-title":"Retrieval augmented language model pre-training","volume-title":"Proceedings of the 37th International Conference on Machine Learning","author":"Guu","year":"2020"},{"key":"2021101221391872200_bib17","article-title":"spaCy: Industrial-strength natural language processing in python","author":"Honnibal","year":"2020"},{"key":"2021101221391872200_bib18","first-page":"2278","article-title":"Evaluating rewards for question generation models","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Hosking","year":"2019"},{"key":"2021101221391872200_bib19","first-page":"57","article-title":"OntoNotes: The 90% solution","volume-title":"Proceedings of the Human Language Technology Conference of the NAACL, Companion 
Volume: Short Papers","author":"Hovy","year":"2006"},{"key":"2021101221391872200_bib20","first-page":"874","article-title":"Leveraging passage retrieval with generative models for open domain question answering","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume","author":"Izacard","year":"2021"},{"key":"2021101221391872200_bib21","article-title":"A memory efficient baseline for open domain question answering","author":"Izacard","year":"2020","journal-title":"arXiv:2012.15156 [cs]"},{"key":"2021101221391872200_bib22","article-title":"How can we know when language models know?","author":"Jiang","year":"2020","journal-title":"arXiv:2012.00955 [cs]"},{"key":"2021101221391872200_bib23","doi-asserted-by":"crossref","first-page":"2120","DOI":"10.18653\/v1\/2020.acl-main.192","article-title":"Generalizing natural language analysis through span-relation representations","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Jiang","year":"2020"},{"key":"2021101221391872200_bib24","first-page":"1","article-title":"Billion-scale similarity search with GPUs","author":"Johnson","year":"2019","journal-title":"IEEE Transactions on Big Data"},{"key":"2021101221391872200_bib25","doi-asserted-by":"crossref","first-page":"1601","DOI":"10.18653\/v1\/P17-1147","article-title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Joshi","year":"2017"},{"issue":"1","key":"2021101221391872200_bib26","doi-asserted-by":"publisher","first-page":"117","DOI":"10.1109\/TPAMI.2010.57","article-title":"Product quantization for nearest neighbor search","volume":"33","author":"J\u00e9gou","year":"2011","journal-title":"IEEE Transactions on Pattern Analysis and Machine 
Intelligence"},{"key":"2021101221391872200_bib27","doi-asserted-by":"crossref","first-page":"6769","DOI":"10.18653\/v1\/2020.emnlp-main.550","article-title":"Dense Passage retrieval for open-domain question answering","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Karpukhin","year":"2020"},{"key":"2021101221391872200_bib28","doi-asserted-by":"publisher","first-page":"452","DOI":"10.1162\/tacl_a_00276","article-title":"Natural questions: A benchmark for question answering research","volume":"7","author":"Kwiatkowski","year":"2019","journal-title":"Transactions of the Association of Computational Linguistics"},{"key":"2021101221391872200_bib29","article-title":"Albert: A lite bert for self-supervised learning of language representations","volume-title":"International Conference on Learning Representations","author":"Lan","year":"2020"},{"key":"2021101221391872200_bib30","article-title":"Learning dense representations of phrases at scale","author":"Lee","year":"2021","journal-title":"arXiv:2012.12624 [cs]"},{"key":"2021101221391872200_bib31","doi-asserted-by":"crossref","first-page":"6086","DOI":"10.18653\/v1\/P19-1612","article-title":"Latent retrieval for weakly supervised open domain question answering","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Lee","year":"2019"},{"key":"2021101221391872200_bib32","article-title":"Generative question answering: Learning to answer the whole question","volume-title":"International Conference on Learning Representations","author":"Lewis","year":"2018"},{"key":"2021101221391872200_bib33","doi-asserted-by":"crossref","first-page":"7871","DOI":"10.18653\/v1\/2020.acl-main.703","article-title":"BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension","volume-title":"Proceedings of the 58th Annual Meeting of the Association for 
Computational Linguistics","author":"Lewis","year":"2020"},{"key":"2021101221391872200_bib34","doi-asserted-by":"crossref","first-page":"4896","DOI":"10.18653\/v1\/P19-1484","article-title":"Unsupervised question answering by cloze translation","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Lewis","year":"2019"},{"key":"2021101221391872200_bib35","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive NLP tasks","volume-title":"Advances in Neural Information Processing Systems","author":"Lewis","year":"2020"},{"key":"2021101221391872200_bib36","first-page":"1000","article-title":"Question and answer test-train overlap in open-domain question answering datasets","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume","author":"Lewis","year":"2021"},{"key":"2021101221391872200_bib37","article-title":"RoBERTa: A robustly optimized BERT pretraining approach","author":"Liu","year":"2019","journal-title":"arXiv:1907.11692 [cs]"},{"issue":"4","key":"2021101221391872200_bib38","doi-asserted-by":"publisher","first-page":"824","DOI":"10.1109\/TPAMI.2018.2889473","article-title":"Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs","volume":"42","author":"Yu","year":"2020","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"2021101221391872200_bib39","article-title":"NeurIPS 2020 EfficientQA Competition: Systems, analyses and lessons learned","author":"Min","year":"2020","journal-title":"arXiv:2101.00133 [cs]"},{"key":"2021101221391872200_bib40","doi-asserted-by":"crossref","first-page":"5783","DOI":"10.18653\/v1\/2020.emnlp-main.466","article-title":"AmbigQA: Answering ambiguous open-domain questions","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing 
(EMNLP)","author":"Min","year":"2020"},{"key":"2021101221391872200_bib41","article-title":"Document expansion by query prediction","author":"Nogueira","year":"2019","journal-title":"arXiv:1904.08375 [cs]"},{"key":"2021101221391872200_bib42","first-page":"48","article-title":"fairseq: A fast, extensible toolkit for sequence modeling","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)","author":"Ott","year":"2019"},{"key":"2021101221391872200_bib43","first-page":"8024","article-title":"PyTorch: An imperative style, high","volume-title":"Advances in Neural Information Processing Systems 32","author":"Paszke","year":"2019"},{"key":"2021101221391872200_bib44","article-title":"How context affects language models\u2019 factual predictions","volume-title":"Automated Knowledge Base Construction","author":"Petroni","year":"2020"},{"key":"2021101221391872200_bib45","first-page":"2523","article-title":"KILT: A benchmark for knowledge intensive language tasks","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Petroni","year":"2021"},{"issue":"140","key":"2021101221391872200_bib46","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"2021101221391872200_bib47","doi-asserted-by":"crossref","first-page":"784","DOI":"10.18653\/v1\/P18-2124","article-title":"Know what you don\u2019t know: Unanswerable questions for SQuAD","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short 
Papers)","author":"Rajpurkar","year":"2018"},{"key":"2021101221391872200_bib48","doi-asserted-by":"crossref","first-page":"5418","DOI":"10.18653\/v1\/2020.emnlp-main.437","article-title":"How much knowledge can you pack into the parameters of a language model?","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Roberts","year":"2020"},{"key":"2021101221391872200_bib49","article-title":"Quizbowl: The case for incremental question answering","author":"Rodriguez","year":"2019","journal-title":"arXiv:1904.04792 [cs]"},{"key":"2021101221391872200_bib50","doi-asserted-by":"crossref","first-page":"559","DOI":"10.18653\/v1\/D18-1052","article-title":"Phrase-indexed question answering: A new challenge for scalable document comprehension","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Seo","year":"2018"},{"key":"2021101221391872200_bib51","doi-asserted-by":"crossref","first-page":"4430","DOI":"10.18653\/v1\/P19-1436","article-title":"Real-time open-domain question answering with dense-sparse phrase index","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Seo","year":"2019"},{"key":"2021101221391872200_bib52","doi-asserted-by":"crossref","first-page":"588","DOI":"10.18653\/v1\/P16-1056","article-title":"Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Serban","year":"2016"},{"issue":"1","key":"2021101221391872200_bib53","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1145\/363707.363732","article-title":"Answering English questions by computer: A survey","volume":"8","author":"Simmons","year":"1965","journal-title":"Communications of the Association for Computing 
Machinery"},{"key":"2021101221391872200_bib54","article-title":"Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling and Temporal Slot Filling","author":"Surdeanu","year":"2013","journal-title":"TAC"},{"key":"2021101221391872200_bib55","article-title":"Facts as experts: Adaptable and interpretable neural memory over symbolic knowledge","author":"Verga","year":"2020","journal-title":"arXiv:2007.00849 [cs]"},{"key":"2021101221391872200_bib56","first-page":"77","article-title":"The TREC-8 Question Answering Track Report","volume-title":"Proceedings of TREC-8","author":"Voorhees","year":"1999"},{"key":"2021101221391872200_bib57","article-title":"Overview of the TREC 2002 question answering track","volume-title":"Proceedings of The Eleventh Text REtrieval Conference, TREC 2002, Gaithersburg, Maryland, USA, November 19\u201322, 2002","author":"Voorhees","year":"2002"},{"key":"2021101221391872200_bib58","volume-title":"Proceedings of The Eighth Text REtrieval Conference, TREC 1999, Gaithersburg, Maryland, USA, November 17\u201319, 1999","author":"Voorhees","year":"1999"},{"key":"2021101221391872200_bib59","article-title":"Safer classification by synthesis","author":"Wang","year":"2018","journal-title":"arXiv:1711.08534 [cs, stat]"},{"key":"2021101221391872200_bib60","doi-asserted-by":"crossref","first-page":"38","DOI":"10.18653\/v1\/2020.emnlp-demos.6","article-title":"Transformers: State-of-the-art natural language processing","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Wolf","year":"2020"},{"key":"2021101221391872200_bib61","first-page":"61","article-title":"Open-domain question answering with pre-constructed question spaces","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research 
Workshop","author":"Xiao","year":"2021"},{"key":"2021101221391872200_bib62","article-title":"Data augmentation for BERT fine-tuning in open-domain question answering","author":"Yang","year":"2019","journal-title":"arXiv:1904.06652 [cs]"},{"key":"2021101221391872200_bib63","article-title":"Studying strategically: Learning to mask for closed-book QA","author":"Ye","year":"2021","journal-title":"arXiv:2012.15856 [cs]"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00415\/1966205\/tacl_a_00415.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00415\/1966205\/tacl_a_00415.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T23:03:50Z","timestamp":1634511830000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00415\/107615\/PAQ-65-Million-Probably-Asked-Questions-and-What"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021]]},"references-count":63,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00415","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021]]},"published":{"date-parts":[[2021]]}}}