{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T14:55:59Z","timestamp":1777733759131,"version":"3.51.4"},
"reference-count":27,"publisher":"MIT Press",
"license":[{"start":{"date-parts":[[2021,4,28]],"date-time":"2021-04-28T00:00:00Z","timestamp":1619568000000},"content-version":"vor","delay-in-days":117,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],
"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},
"short-container-title":[],
"published-print":{"date-parts":[[2021,4,26]]},
"abstract":"<jats:title>Abstract<\/jats:title><jats:p>A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce StrategyQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in StrategyQA are short, topic-diverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of \u223c66%.<\/jats:p>",
"DOI":"10.1162\/tacl_a_00370","type":"journal-article",
"created":{"date-parts":[[2021,4,28]],"date-time":"2021-04-28T23:50:50Z","timestamp":1619653850000},
"page":"346-361","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":152,
"title":["<i>Did Aristotle Use a Laptop?<\/i> A Question Answering Benchmark with Implicit Reasoning Strategies"],
"prefix":"10.1162","volume":"9",
"author":[{"given":"Mor","family":"Geva","sequence":"first","affiliation":[{"name":"Tel Aviv University, Israel"},{"name":"Allen Institute for AI, United States. morgeva@mail.tau.ac.il"}]},
{"given":"Daniel","family":"Khashabi","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, United States. danielk@allenai.org"}]},
{"given":"Elad","family":"Segal","sequence":"additional","affiliation":[{"name":"Tel Aviv University, Israel. elad.segal@gmail.com"}]},
{"given":"Tushar","family":"Khot","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, United States. tushark@allenai.org"}]},
{"given":"Dan","family":"Roth","sequence":"additional","affiliation":[{"name":"University of Pennsylvania, United States. danroth@seas.upenn.edu"}]},
{"given":"Jonathan","family":"Berant","sequence":"additional","affiliation":[{"name":"Tel Aviv University, Israel"},{"name":"Allen Institute for AI, United States. joberant@cs.tau.ac.il"}]}],
"member":"281","published-online":{"date-parts":[[2021,4,26]]},
"reference":[
{"key":"2021060823401751300_bib1","doi-asserted-by":"crossref","first-page":"662","DOI":"10.1162\/tacl_a_00338","article-title":"Beat the AI: Investigating adversarial human annotation for reading comprehension","volume":"8","author":"Bartolo","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},
{"key":"2021060823401751300_bib2","first-page":"2924","article-title":"BoolQ: Exploring the surprising difficulty of natural yes\/no questions","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Clark","year":"2019"},
{"key":"2021060823401751300_bib3","doi-asserted-by":"crossref","first-page":"454","DOI":"10.1162\/tacl_a_00317","article-title":"TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages","volume":"8","author":"Clark","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics (TACL)"},
{"key":"2021060823401751300_bib4","doi-asserted-by":"crossref","first-page":"4443","DOI":"10.18653\/v1\/2020.acl-main.408","article-title":"ERASER: A benchmark to evaluate rationalized NLP models","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"DeYoung","year":"2020"},
{"key":"2021060823401751300_bib5","article-title":"DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs","volume-title":"North American Chapter of the Association for Computational Linguistics (NAACL)","author":"Dua","year":"2019"},
{"key":"2021060823401751300_bib6","doi-asserted-by":"crossref","first-page":"1161","DOI":"10.18653\/v1\/D19-1107","article-title":"Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Geva","year":"2019"},
{"key":"2021060823401751300_bib7","first-page":"107","article-title":"Annotation artifacts in natural language inference data","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Gururangan","year":"2018"},
{"key":"2021060823401751300_bib8","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/P19-1262","article-title":"Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop QA","volume-title":"Association for Computational Linguistics (ACL)","author":"Jiang","year":"2019"},
{"key":"2021060823401751300_bib9","first-page":"252","article-title":"Looking beyond the surface: A challenge set for reading comprehension over multiple sentences","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Khashabi","year":"2018"},
{"key":"2021060823401751300_bib10","doi-asserted-by":"crossref","DOI":"10.1609\/aaai.v34i05.6319","article-title":"QASC: A dataset for question answering via sentence composition","volume-title":"AAAI","author":"Khot","year":"2020"},
{"key":"2021060823401751300_bib11","article-title":"Text modular networks: Learning to decompose tasks in the language of existing models","author":"Khot","year":"2020","journal-title":"arXiv preprint arXiv:2009.00751"},
{"key":"2021060823401751300_bib12","doi-asserted-by":"crossref","first-page":"453","DOI":"10.1162\/tacl_a_00276","article-title":"Natural questions: A benchmark for question answering research","volume":"7","author":"Kwiatkowski","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics (TACL)"},
{"key":"2021060823401751300_bib13","doi-asserted-by":"crossref","first-page":"7871","DOI":"10.18653\/v1\/2020.acl-main.703","article-title":"BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Lewis","year":"2020"},
{"key":"2021060823401751300_bib14","article-title":"RoBERTa: A robustly optimized BERT pretraining approach","author":"Liu","year":"2019","journal-title":"arXiv preprint arXiv:1907.11692"},
{"key":"2021060823401751300_bib15","doi-asserted-by":"crossref","first-page":"2381","DOI":"10.18653\/v1\/D18-1260","article-title":"Can a suit of armor conduct electricity? A new dataset for open book question answering","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Mihaylov","year":"2018"},
{"key":"2021060823401751300_bib16","doi-asserted-by":"crossref","first-page":"6097","DOI":"10.18653\/v1\/P19-1613","article-title":"Multi-hop reading comprehension through question decomposition and rescoring","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Min","year":"2019"},
{"key":"2021060823401751300_bib17","article-title":"MS MARCO: A human generated machine reading comprehension dataset","volume-title":"Workshop on Cognitive Computing at NIPS","author":"Nguyen","year":"2016"},
{"key":"2021060823401751300_bib18","doi-asserted-by":"crossref","first-page":"8864","DOI":"10.18653\/v1\/2020.emnlp-main.713","article-title":"Unsupervised question decomposition for question answering","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Perez","year":"2020"},
{"key":"2021060823401751300_bib19","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/D16-1264","article-title":"SQuAD: 100,000+ questions for machine comprehension of text","volume-title":"Empirical Methods in Natural Language Processing (EMNLP)","author":"Rajpurkar","year":"2016"},
{"key":"2021060823401751300_bib20","first-page":"109","article-title":"Okapi at TREC-3","volume-title":"Overview of the Third Text REtrieval Conference (TREC-3)","author":"Robertson","year":"1995"},
{"key":"2021060823401751300_bib21","doi-asserted-by":"crossref","first-page":"3074","DOI":"10.18653\/v1\/2020.emnlp-main.248","article-title":"A simple and effective model for answering multi-span questions","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Segal","year":"2020"},
{"key":"2021060823401751300_bib22","doi-asserted-by":"crossref","first-page":"6418","DOI":"10.18653\/v1\/P19-1644","article-title":"A corpus for reasoning about natural language grounded in photographs","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Suhr","year":"2019"},
{"key":"2021060823401751300_bib23","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/N18-1059","article-title":"The web as a knowledge-base for answering complex questions","volume-title":"North American Chapter of the Association for Computational Linguistics (NAACL)","author":"Talmor","year":"2018"},
{"key":"2021060823401751300_bib24","doi-asserted-by":"crossref","first-page":"287","DOI":"10.1162\/tacl_a_00021","article-title":"Constructing datasets for multi-hop reading comprehension across documents","volume":"6","author":"Welbl","year":"2018","journal-title":"Transactions of the Association for Computational Linguistics (TACL)"},
{"key":"2021060823401751300_bib25","first-page":"1112","article-title":"A broad-coverage challenge corpus for sentence understanding through inference","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Williams","year":"2018"},
{"key":"2021060823401751300_bib26","doi-asserted-by":"crossref","DOI":"10.1162\/tacl_a_00309","article-title":"Break it down: A question understanding benchmark","author":"Wolfson","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics (TACL)"},
{"key":"2021060823401751300_bib27","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/D18-1259","article-title":"HotpotQA: A dataset for diverse, explainable multi-hop question answering","volume-title":"Empirical Methods in Natural Language Processing (EMNLP)","author":"Yang","year":"2018"}],
"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en",
"link":[{"URL":"http:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00370\/1924104\/tacl_a_00370.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00370\/1924104\/tacl_a_00370.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],
"deposited":{"date-parts":[[2022,12,25]],"date-time":"2022-12-25T19:35:03Z","timestamp":1671996903000},
"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00370\/100680\/Did-Aristotle-Use-a-Laptop-A-Question-Answering"}},
"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021]]},"references-count":27,
"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00370","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021]]},"published":{"date-parts":[[2021]]}}}