{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T07:56:25Z","timestamp":1781596585595,"version":"3.54.5"},"reference-count":43,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,8,1]],"date-time":"2024-08-01T00:00:00Z","timestamp":1722470400000},"content-version":"vor","delay-in-days":213,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,8,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Recent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittleness introduced by single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that different instruction templates lead to very different performance, both absolute and relative. Instead, we propose a set of diverse metrics on multiple instruction paraphrases, specifically tailored for different use cases (e.g., LLM vs. downstream development), ensuring a more reliable and meaningful assessment of LLM capabilities. We show that our metrics provide new insights into the strengths and limitations of current LLMs.<\/jats:p>","DOI":"10.1162\/tacl_a_00681","type":"journal-article","created":{"date-parts":[[2024,8,1]],"date-time":"2024-08-01T20:13:41Z","timestamp":1722543221000},"page":"933-949","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":87,"title":["State of What Art? 
A Call for Multi-Prompt LLM Evaluation"],"prefix":"10.1162","volume":"12","author":[{"given":"Moran","family":"Mizrahi","sequence":"first","affiliation":[{"name":"School of Computer Science, The Hebrew University of Jerusalem, Israel. moran.mizrahi@mail.huji.ac.il"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Guy","family":"Kaplan","sequence":"additional","affiliation":[{"name":"School of Computer Science, The Hebrew University of Jerusalem, Israel. guy.kaplan2@mail.huji.ac.il"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dan","family":"Malkin","sequence":"additional","affiliation":[{"name":"School of Computer Science, The Hebrew University of Jerusalem, Israel. dan.malkinhueb@mail.huji.ac.il"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Rotem","family":"Dror","sequence":"additional","affiliation":[{"name":"Department of Information Systems, University of Haifa, Israel. rdror@is.haifa.ac.il"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dafna","family":"Shahaf","sequence":"additional","affiliation":[{"name":"School of Computer Science, The Hebrew University of Jerusalem, Israel. dshahaf@cs.huji.ac.il"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Gabriel","family":"Stanovsky","sequence":"additional","affiliation":[{"name":"School of Computer Science, The Hebrew University of Jerusalem, Israel. gabriel.stanovsky@mail.huji.ac.il"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"281","published-online":{"date-parts":[[2024,8,1]]},"reference":[{"key":"2024080120131438000_bib1","article-title":"Gpt-4 technical report","author":"Achiam","year":"2023","journal-title":"arXiv preprint arXiv:2303.08774"},{"key":"2024080120131438000_bib2","unstructured":"Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, 2023. 
Falcon-40b: An open large language model with state-of-the-art performance. Technical report, Technology Innovation Institute."},{"issue":"240","key":"2024080120131438000_bib3","first-page":"1","article-title":"Palm: Scaling language modeling with pathways","volume":"24","author":"Chowdhery","year":"2023","journal-title":"Journal of Machine Learning Research"},{"issue":"70","key":"2024080120131438000_bib4","first-page":"1","article-title":"Scaling instruction-finetuned language models","volume":"25","author":"Chung","year":"2024","journal-title":"Journal of Machine Learning Research"},{"key":"2024080120131438000_bib5","unstructured":"OpenAccess AI Collective. 2023. Minotaur. https:\/\/huggingface.co\/openaccess-ai-collective\/minotaur-15b. Last Accessed: 2024-04-30."},{"key":"2024080120131438000_bib6","volume-title":"Nonparametric Statistics for Non-Statisticians","author":"Corder","year":"2011"},{"key":"2024080120131438000_bib7","doi-asserted-by":"publisher","first-page":"3029","DOI":"10.18653\/v1\/2023.emnlp-main.183","article-title":"Enhancing chat language models by scaling high-quality instructional conversations","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Ding","year":"2023"},{"key":"2024080120131438000_bib8","unstructured":"Jon Durbin. 2023. Airoboros. https:\/\/github.com\/jondurbin\/airoboros. 
Last Accessed: 2024-04-30."},{"key":"2024080120131438000_bib9","doi-asserted-by":"publisher","first-page":"10476","DOI":"10.18653\/v1\/2023.findings-acl.666","article-title":"Lmentry: A language model benchmark of elementary language tasks","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Efrat","year":"2023"},{"key":"2024080120131438000_bib10","article-title":"Gemini: A family of highly capable multimodal models","author":"Google","year":"2023","journal-title":"arXiv preprint arXiv:2312.11805"},{"key":"2024080120131438000_bib11","doi-asserted-by":"publisher","first-page":"10136","DOI":"10.18653\/v1\/2023.findings-emnlp.679","article-title":"Demystifying prompts in language models via perplexity estimation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2023","author":"Gonen","year":"2023"},{"key":"2024080120131438000_bib12","doi-asserted-by":"publisher","first-page":"13935","DOI":"10.1016\/j.learninstruc.2022.101692","article-title":"Robustness of learning from task instructions","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Jiasheng","year":"2023"},{"key":"2024080120131438000_bib13","article-title":"Measuring massive multitask language understanding","volume-title":"International Conference on Learning Representations","author":"Hendrycks","year":"2020"},{"key":"2024080120131438000_bib14","doi-asserted-by":"publisher","first-page":"14409","DOI":"10.18653\/v1\/2023.acl-long.806","article-title":"Unnatural instructions: Tuning language models with (almost) no human labor","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Or","year":"2023"},{"key":"2024080120131438000_bib15","doi-asserted-by":"publisher","first-page":"1935","DOI":"10.18653\/v1\/2023.acl-long.108","article-title":"Instruction induction: From few examples to natural language task 
descriptions","volume-title":"61st Annual Meeting of the Association for Computational Linguistics, ACL 2023","author":"Or","year":"2023"},{"key":"2024080120131438000_bib16","unstructured":"Leonard J. Kazmier, Michael K. Staton, and Daniel L. Fulks. 2003. Business statistics: Based on schaums outline of theory and problems of business statistics, by Leonard J. Kazmier, McGraw-Hill."},{"issue":"3","key":"2024080120131438000_bib17","doi-asserted-by":"publisher","first-page":"239","DOI":"10.1093\/biomet\/33.3.239","article-title":"The treatment of ties in ranking problems","volume":"33","author":"Kendall","year":"1945","journal-title":"Biometrika"},{"issue":"3","key":"2024080120131438000_bib18","doi-asserted-by":"publisher","first-page":"275","DOI":"10.1214\/aoms\/1177732186","article-title":"The problem of m rankings","volume":"10","author":"Kendall","year":"1939","journal-title":"The Annals of Mathematical Statistics"},{"key":"2024080120131438000_bib19","first-page":"arXiv","article-title":"Bloom: A 176b-parameter open-access multilingual language model","author":"Scao","year":"2022","journal-title":"arXiv e-prints"},{"key":"2024080120131438000_bib20","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.243","article-title":"The power of scale for parameter-efficient prompt tuning","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Lester","year":"2021"},{"key":"2024080120131438000_bib21","article-title":"Holistic evaluation of language models","author":"Liang","year":"2023","journal-title":"Transactions on Machine Learning Research"},{"key":"2024080120131438000_bib22","article-title":"Is prompt all you need? No. 
A comprehensive and broader view of instruction learning","author":"Lou","year":"2023","journal-title":"arXiv preprint arXiv:2303.10475"},{"key":"2024080120131438000_bib23","doi-asserted-by":"publisher","first-page":"3470","DOI":"10.18653\/v1\/2022.acl-long.244","article-title":"Cross-task generalization via natural language crowdsourcing instructions","volume-title":"60th Annual Meeting of the Association for Computational Linguistics, ACL 2022","author":"Mishra","year":"2022"},{"key":"2024080120131438000_bib24","unstructured":"NousResearch. 2023. Nous-hermes. https:\/\/huggingface.co\/NousResearch\/Nous-Hermes-13b. Last Accessed: 2024-04-30."},{"key":"2024080120131438000_bib25","article-title":"Efficient benchmarking (of language models)","author":"Perlitz","year":"2023","journal-title":"arXiv preprint arXiv:2308.11696"},{"key":"2024080120131438000_bib26","article-title":"Tricking llms into disobedience: Understanding, analyzing, and preventing jailbreaks","author":"Rao","year":"2023","journal-title":"arXiv preprint arXiv:2305.14965"},{"key":"2024080120131438000_bib27","article-title":"Multitask prompted training enables zero-shot task generalization","volume-title":"International Conference on Learning Representations","author":"Sanh","year":"2021"},{"key":"2024080120131438000_bib28","article-title":"Quantifying language models\u2019 sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting","volume-title":"The Twelfth International Conference on Learning Representations","author":"Sclar","year":"2023"},{"key":"2024080120131438000_bib29","article-title":"Beyond the imitation game: Quantifying and extrapolating the capabilities of language models","author":"Srivastava","year":"2023","journal-title":"Transactions on Machine Learning Research"},{"key":"2024080120131438000_bib30","article-title":"Evaluating the zero-shot robustness of instruction-tuned language models","volume-title":"The Twelfth International 
Conference on Learning Representations","author":"Sun","year":"2023"},{"key":"2024080120131438000_bib31","doi-asserted-by":"publisher","first-page":"13003","DOI":"10.18653\/v1\/2023.findings-acl.824","article-title":"Challenging big-bench tasks and whether chain-of-thought can solve them","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Suzgun","year":"2023"},{"key":"2024080120131438000_bib32","article-title":"Alpaca: A strong, replicable instruction-following model","author":"Taori","year":"2023","journal-title":"Stanford Center for Research on Foundation Models"},{"key":"2024080120131438000_bib33","article-title":"Introducing mpt-7b: A new standard for open-source, commercially usable llms","author":"Team","year":"2023"},{"key":"2024080120131438000_bib34","article-title":"Llama: Open and efficient foundation language models","author":"Touvron","year":"2023","journal-title":"arXiv preprint arXiv:2302.13971"},{"key":"2024080120131438000_bib35","article-title":"Mind your format: Towards consistent evaluation of in-context learning improvements","author":"Voronov","year":"2024","journal-title":"arXiv preprint arXiv:2401.06766"},{"key":"2024080120131438000_bib36","article-title":"Adversarial glue: A multi-task benchmark for robustness evaluation of language models","volume-title":"Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)","author":"Wang","year":"2021"},{"key":"2024080120131438000_bib37","article-title":"On the robustness of chatgpt: An adversarial and out-of-distribution perspective","volume-title":"ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models","author":"Wang","year":"2023"},{"key":"2024080120131438000_bib38","doi-asserted-by":"publisher","first-page":"4569","DOI":"10.18653\/v1\/2022.naacl-main.339","article-title":"Measure and improve robustness in nlp models: A survey","volume-title":"2022 Conference of the North 
American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022","author":"Wang","year":"2022"},{"key":"2024080120131438000_bib39","doi-asserted-by":"publisher","first-page":"294","DOI":"10.18653\/v1\/2023.conll-1.20","article-title":"Mind the instructions: A holistic evaluation of consistency and interactions in prompt-based learning","volume-title":"Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)","author":"Weber","year":"2023"},{"key":"2024080120131438000_bib40","article-title":"Finetuned language models are zero-shot learners","volume-title":"International Conference on Learning Representations","author":"Wei","year":"2021"},{"key":"2024080120131438000_bib41","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2024080120131438000_bib42","article-title":"Judging llm-as-a-judge with mt-bench and chatbot arena","volume":"36","author":"Zheng","year":"2024","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2024080120131438000_bib43","article-title":"Promptbench: Towards evaluating the robustness of large language models on adversarial prompts","author":"Zhu","year":"2023","journal-title":"arXiv preprint arXiv:2306.04528"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00681\/2464098\/tacl_a_00681.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00681\/2464098\/tacl_a_00681.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,1]],"date-time":"2024-08-01T20:14:03Z","timestamp":1722543243000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00681\/123885\/State-of-What-Art-A-Call-for-Multi-Prompt-LLM"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":43,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00681","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}