{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,27]],"date-time":"2025-12-27T10:09:36Z","timestamp":1766830176968,"version":"3.30.2"},"reference-count":140,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,12,9]],"date-time":"2024-12-09T00:00:00Z","timestamp":1733702400000},"content-version":"vor","delay-in-days":343,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,12,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>We introduce Holmes, a new benchmark designed to assess language models\u2019 (LMs\u2019) linguistic competence\u2014their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs\u2019 internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs\u2019 linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. Composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. 
Finally, we propose FlashHolmes, a streamlined version that reduces the computation load while maintaining high-ranking precision.<\/jats:p>","DOI":"10.1162\/tacl_a_00718","type":"journal-article","created":{"date-parts":[[2024,12,9]],"date-time":"2024-12-09T18:37:56Z","timestamp":1733769476000},"page":"1616-1647","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":4,"title":["<tt>Holmes<\/tt> \u2315 A Benchmark to Assess the Linguistic Competence of Language Models"],"prefix":"10.1162","volume":"12","author":[{"given":"Andreas","family":"Waldis","sequence":"first","affiliation":[{"name":"Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt, Germany"}]},{"given":"Andreas","family":"Waldis","sequence":"additional","affiliation":[{"name":"Information Systems Research Lab, Lucerne University of Applied Sciences and Arts, Switzerland"}]},{"given":"Yotam","family":"Perlitz","sequence":"additional","affiliation":[{"name":"IBM Research AI, Israel"}]},{"given":"Leshem","family":"Choshen","sequence":"additional","affiliation":[{"name":"MIT CSAIL, USA"}]},{"given":"Leshem","family":"Choshen","sequence":"additional","affiliation":[{"name":"MIT-IBM Watson AI Lab, USA"}]},{"given":"Yufang","family":"Hou","sequence":"additional","affiliation":[{"name":"IBM Research Europe, Ireland"}]},{"given":"Iryna","family":"Gurevych","sequence":"additional","affiliation":[{"name":"Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt, Germany"}]}],"member":"281","published-online":{"date-parts":[[2024,12,4]]},"reference":[{"key":"2024120918375040000_bib1","article-title":"Fine-grained analysis of sentence embeddings using auxiliary prediction tasks","volume-title":"5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24\u201326, 2017, Conference Track 
Proceedings","author":"Adi","year":"2017"},{"key":"2024120918375040000_bib2","doi-asserted-by":"publisher","first-page":"2037","DOI":"10.18653\/v1\/2022.acl-long.144","article-title":"Metaphors in pre-trained language models: Probing and generalization across datasets and languages","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Aghazadeh","year":"2022"},{"key":"2024120918375040000_bib3","first-page":"166","article-title":"Large language models for psycholinguistic plausibility pretesting","volume-title":"Findings of the Association for Computational Linguistics: EACL 2024","author":"Amouyal","year":"2024"},{"key":"2024120918375040000_bib4","article-title":"Instruction-tuning aligns llms to the human brain","author":"Aw","year":"2023","journal-title":"CoRR"},{"key":"2024120918375040000_bib5","first-page":"67","article-title":"Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs","volume-title":"Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Balloccu","year":"2024"},{"key":"2024120918375040000_bib6","article-title":"Identifying and controlling important neurons in neural machine translation","volume-title":"7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\u20139, 2019","author":"Bau","year":"2019"},{"article-title":"Open llm leaderboard","year":"2023","author":"Beeching","key":"2024120918375040000_bib7"},{"issue":"1","key":"2024120918375040000_bib8","doi-asserted-by":"publisher","first-page":"207","DOI":"10.1162\/coli_a_00422","article-title":"Probing classifiers: Promises, shortcomings, and advances","volume":"48","author":"Belinkov","year":"2022","journal-title":"Computational 
Linguistics"},{"key":"2024120918375040000_bib9","doi-asserted-by":"publisher","first-page":"861","DOI":"10.18653\/v1\/P17-1080","article-title":"What do neural machine translation models learn about morphology?","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Belinkov","year":"2017"},{"key":"2024120918375040000_bib10","first-page":"2397","article-title":"Pythia: A suite for analyzing large language models across training and scaling","volume-title":"International Conference on Machine Learning, ICML 2023, 23\u201329 July 2023, Honolulu, Hawaii, USA","author":"Biderman","year":"2023"},{"key":"2024120918375040000_bib11","first-page":"329","article-title":"A clustering approach for nearly unsupervised recognition of nonliteral language","volume-title":"11th Conference of the European Chapter of the Association for Computational Linguistics","author":"Birke","year":"2006"},{"key":"2024120918375040000_bib12","doi-asserted-by":"publisher","first-page":"6649","DOI":"10.18653\/v1\/2023.acl-long.367","article-title":"Prompting language models for linguistic structure","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Blevins","year":"2023"},{"key":"2024120918375040000_bib14","doi-asserted-by":"publisher","first-page":"1944","DOI":"10.18653\/v1\/P16-1183","article-title":"N-gram language models for massively parallel devices","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Bogoychev","year":"2016"},{"key":"2024120918375040000_bib15","article-title":"Language models are few-shot learners","volume-title":"Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6\u201312, 2020, 
virtual","author":"Brown","year":"2020"},{"key":"2024120918375040000_bib16","doi-asserted-by":"publisher","DOI":"10.3115\/1118078.1118083","article-title":"Building a discourse-tagged corpus in the framework of rhetorical structure theory","volume-title":"Proceedings of the SIGDIAL 2001 Workshop, The 2nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, Saturday, September 1, 2001 to Sunday, September 2, 2001, Aalborg, Denmark","author":"Carlson","year":"2001"},{"key":"2024120918375040000_bib17","doi-asserted-by":"publisher","DOI":"10.21236\/AD0616323","volume-title":"Aspects of the Theory of Syntax","author":"Chomsky","year":"1965"},{"key":"2024120918375040000_bib18","article-title":"Scaling instruction-finetuned language models","author":"Chung","year":"2022","journal-title":"CoRR"},{"key":"2024120918375040000_bib19","article-title":"ELECTRA: pre-training text encoders as discriminators rather than generators","volume-title":"8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26\u201330, 2020","author":"Clark","year":"2020"},{"key":"2024120918375040000_bib20","article-title":"Training verifiers to solve math word problems","author":"Cobbe","year":"2021","journal-title":"CoRR"},{"key":"2024120918375040000_bib21","doi-asserted-by":"publisher","first-page":"2126","DOI":"10.18653\/v1\/P18-1198","article-title":"What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Conneau","year":"2018"},{"article-title":"Free dolly: Introducing the world\u2019s first truly open instruction-tuned LLM","year":"2023","author":"Conover","key":"2024120918375040000_bib22"},{"key":"2024120918375040000_bib23","doi-asserted-by":"publisher","first-page":"248","DOI":"10.1109\/CVPR.2009.5206848","article-title":"ImageNet: A large-scale 
hierarchical image database","author":"Deng","year":"2009","journal-title":"2009 IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"2024120918375040000_bib24","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2024120918375040000_bib25","doi-asserted-by":"publisher","first-page":"160","DOI":"10.1162\/tacl_a_00359","article-title":"Amnesic probing: Behavioral explanation with amnesic counterfactuals","volume":"9","author":"Elazar","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics"},{"issue":"3","key":"2024120918375040000_bib26","doi-asserted-by":"publisher","first-page":"221","DOI":"10.1037\/h0057532","article-title":"A new readability yardstick","volume":"32","author":"Flesch","year":"1948","journal-title":"The Journal of Applied Psychology"},{"key":"2024120918375040000_bib27","doi-asserted-by":"publisher","first-page":"17","DOI":"10.1162\/tacl_a_00445","article-title":"Decomposing and recomposing event structure","volume":"10","author":"Gantt","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024120918375040000_bib28","article-title":"Robust pronoun use fidelity with english llms: Are they reasoning, repeating, or just biased?","author":"Gautam","year":"2024","journal-title":"CoRR"},{"key":"2024120918375040000_bib29","doi-asserted-by":"publisher","first-page":"240","DOI":"10.18653\/v1\/W18-5426","article-title":"Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information","volume-title":"Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for 
NLP","author":"Giulianelli","year":"2018"},{"key":"2024120918375040000_bib30","doi-asserted-by":"publisher","first-page":"501","DOI":"10.1162\/tacl_a_00285","article-title":"Decomposing generalization: Models of generic, habitual, and episodic statements","volume":"7","author":"Govindarajan","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024120918375040000_bib31","doi-asserted-by":"publisher","first-page":"12","DOI":"10.18653\/v1\/D15-1002","article-title":"Distributional vectors encode referential attributes","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing","author":"Gupta","year":"2015"},{"issue":"2\u20133","key":"2024120918375040000_bib32","doi-asserted-by":"publisher","first-page":"146","DOI":"10.1080\/00437956.1954.11659520","article-title":"Distributional structure","volume":"10","author":"Harris","year":"1954","journal-title":"Word"},{"key":"2024120918375040000_bib33","article-title":"Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing","volume-title":"The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\u20135, 2023","author":"He","year":"2023"},{"key":"2024120918375040000_bib34","article-title":"Deberta: Decoding-enhanced bert with disentangled attention","volume-title":"9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\u20137, 2021","author":"He","year":"2021"},{"key":"2024120918375040000_bib35","doi-asserted-by":"publisher","first-page":"33","DOI":"10.3115\/1621969.1621986","article-title":"SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals","volume-title":"Proceedings of the 5th International Workshop on Semantic Evaluation","author":"Hendrickx","year":"2010"},{"key":"2024120918375040000_bib36","article-title":"Measuring massive multitask language 
understanding","volume-title":"9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\u20137, 2021","author":"Hendrycks","year":"2021"},{"key":"2024120918375040000_bib37","article-title":"Lossless and near-lossless compression for foundation models","author":"Hershcovitch","year":"2024","journal-title":"CoRR"},{"key":"2024120918375040000_bib38","doi-asserted-by":"publisher","first-page":"2733","DOI":"10.18653\/v1\/D19-1275","article-title":"Designing and interpreting probes with control tasks","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Hewitt","year":"2019"},{"key":"2024120918375040000_bib39","first-page":"4129","article-title":"A structural probe for finding syntax in word representations","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Hewitt","year":"2019"},{"key":"2024120918375040000_bib40","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/N18-2001","article-title":"Enhanced word representations for bridging anaphora resolution","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Hou","year":"2018"},{"key":"2024120918375040000_bib41","doi-asserted-by":"publisher","first-page":"1428","DOI":"10.18653\/v1\/2020.acl-main.132","article-title":"Bridging anaphora resolution as question answering","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Hou","year":"2020"},{"key":"2024120918375040000_bib42","article-title":"Wikicontradict: A benchmark for evaluating llms on real-world knowledge 
conflicts from wikipedia","author":"Hou","year":"2024","journal-title":"arXiv preprint arXiv:2406.13805"},{"key":"2024120918375040000_bib43","doi-asserted-by":"publisher","first-page":"5040","DOI":"10.18653\/v1\/2023.emnlp-main.306","article-title":"Prompting is not a substitute for probability measurements in large language models","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Jennifer","year":"2023"},{"key":"2024120918375040000_bib44","doi-asserted-by":"publisher","first-page":"624","DOI":"10.18653\/v1\/2021.conll-1.49","article-title":"BabyBERTa: Learning more grammar with small-scale child-directed language","volume-title":"Proceedings of the 25th Conference on Computational Natural Language Learning","author":"Huebner","year":"2021"},{"key":"2024120918375040000_bib45","doi-asserted-by":"publisher","first-page":"5617","DOI":"10.24963\/ijcai.2018\/796","article-title":"Visualisation and \u2018diagnostic classifiers\u2019 reveal how recurrent and recursive neural networks process hierarchical structure (extended abstract)","volume-title":"Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18","author":"Hupkes","year":"2018"},{"key":"2024120918375040000_bib46","doi-asserted-by":"publisher","first-page":"1839","DOI":"10.18653\/v1\/2022.acl-long.129","article-title":"Probing as quantifying inductive bias","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Immer","year":"2022"},{"key":"2024120918375040000_bib47","article-title":"Mistral 7b","author":"Jiang","year":"2023","journal-title":"CoRR"},{"key":"2024120918375040000_bib48","article-title":"Mixtral of 
experts","author":"Jiang","year":"2024","journal-title":"CoRR"},{"key":"2024120918375040000_bib49","doi-asserted-by":"publisher","first-page":"3250","DOI":"10.18653\/v1\/2021.eacl-main.284","article-title":"Multilingual LAMA: Investigating knowledge in multilingual pretrained language models","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume","author":"Kassner","year":"2021"},{"key":"2024120918375040000_bib50","doi-asserted-by":"publisher","first-page":"4801","DOI":"10.18653\/v1\/2020.acl-main.434","article-title":"Spying on your neighbors: Fine-grained probing of contextual embeddings for information about surrounding words","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Klafka","year":"2020"},{"key":"2024120918375040000_bib51","doi-asserted-by":"publisher","first-page":"2067","DOI":"10.18653\/v1\/D15-1246","article-title":"What\u2019s in an embedding? 
Analyzing word embeddings through multilingual evaluation","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing","author":"K\u00f6hn","year":"2015"},{"key":"2024120918375040000_bib52","first-page":"3190","article-title":"A review corpus annotated for negation, speculation and their scope","volume-title":"Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC\u201912)","author":"Konstantinova","year":"2012"},{"key":"2024120918375040000_bib53","doi-asserted-by":"publisher","first-page":"3849","DOI":"10.18653\/v1\/2021.naacl-main.301","article-title":"Discourse probing of pretrained language models","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Koto","year":"2021"},{"key":"2024120918375040000_bib54","doi-asserted-by":"publisher","first-page":"5729","DOI":"10.18653\/v1\/P19-1573","article-title":"Empirical linguistic study of sentence embeddings","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Krasnowska-Kiera\u015b","year":"2019"},{"key":"2024120918375040000_bib55","doi-asserted-by":"publisher","first-page":"8","DOI":"10.18653\/v1\/2021.repl4nlp-1.2","article-title":"Probing multilingual language models for discourse","volume-title":"Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)","author":"Kurfal\u0131","year":"2021"},{"key":"2024120918375040000_bib56","article-title":"ALBERT: A lite BERT for self-supervised learning of language representations","volume-title":"8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26\u201330, 2020","author":"Lan","year":"2020"},{"key":"2024120918375040000_bib57","doi-asserted-by":"publisher","first-page":"377","DOI":"10.18653\/v1\/2023.acl-long.23","article-title":"What about 
\u201cem\u201d? How commercial machine translation fails to handle (neo-)pronouns","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Lauscher","year":"2023"},{"key":"2024120918375040000_bib58","doi-asserted-by":"publisher","first-page":"7871","DOI":"10.18653\/v1\/2020.acl-main.703","article-title":"BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Lewis","year":"2020"},{"key":"2024120918375040000_bib59","article-title":"Holistic evaluation of language models","author":"Liang","year":"2023","journal-title":"Transactions on Machine Learning Research"},{"key":"2024120918375040000_bib60","doi-asserted-by":"publisher","first-page":"521","DOI":"10.1162\/tacl_a_00115","article-title":"Assessing the ability of LSTMs to learn syntax-sensitive dependencies","volume":"4","author":"Linzen","year":"2016","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024120918375040000_bib61","article-title":"Roberta: A robustly optimized BERT pretraining approach","author":"Liu","year":"2019","journal-title":"CoRR"},{"key":"2024120918375040000_bib62","article-title":"Decoupled weight decay regularization","volume-title":"7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\u20139, 2019","author":"Loshchilov","year":"2019"},{"key":"2024120918375040000_bib63","article-title":"Are emergent abilities in large language models just in-context learning?","author":"Sheng","year":"2023","journal-title":"CoRR"},{"key":"2024120918375040000_bib64","doi-asserted-by":"publisher","DOI":"10.1016\/j.tics.2024.01.011","article-title":"Dissociating language and thought in large language models","author":"Mahowald","year":"2024","journal-title":"Trends in 
Cognitive Sciences"},{"volume-title":"The Concise Oxford Dictionary of Linguistics","year":"2014","author":"Matthews","key":"2024120918375040000_bib65"},{"issue":"11","key":"2024120918375040000_bib66","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1145\/219717.219748","article-title":"Wordnet: A lexical database for English","volume":"38","author":"Miller","year":"1995","journal-title":"Communications of the ACM"},{"key":"2024120918375040000_bib67","doi-asserted-by":"publisher","first-page":"11048","DOI":"10.18653\/v1\/2022.emnlp-main.759","article-title":"Rethinking the role of demonstrations: What makes in-context learning work?","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Min","year":"2022"},{"key":"2024120918375040000_bib68","article-title":"Orca 2: Teaching small language models how to reason","author":"Mitra","year":"2023","journal-title":"CoRR"},{"key":"2024120918375040000_bib69","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00681","article-title":"State of what art? 
A call for multi-prompt LLM evaluation","author":"Mizrahi","year":"2024","journal-title":"CoRR"},{"key":"2024120918375040000_bib70","first-page":"4221","article-title":"Introducing the LCC metaphor datasets","volume-title":"Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC\u201916)","author":"Mohler","year":"2016"},{"key":"2024120918375040000_bib71","first-page":"265","article-title":"*SEM 2012 shared task: Resolving the scope and focus of negation","volume-title":"*SEM 2012: The First Joint Conference on Lexical and Computational Semantics \u2013 Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)","author":"Morante","year":"2012"},{"key":"2024120918375040000_bib72","doi-asserted-by":"publisher","first-page":"68","DOI":"10.18653\/v1\/2020.blackboxnlp-1.7","article-title":"On the interplay between fine-tuning and sentence-level probing for linguistic knowledge in pre-trained transformers","volume-title":"Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP","author":"Mosbach","year":"2020"},{"key":"2024120918375040000_bib73","doi-asserted-by":"publisher","first-page":"2502","DOI":"10.18653\/v1\/2020.findings-emnlp.227","article-title":"On the interplay between fine-tuning and sentence-level probing for linguistic knowledge in pre-trained transformers","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Mosbach","year":"2020"},{"key":"2024120918375040000_bib74","doi-asserted-by":"publisher","first-page":"2014","DOI":"10.18653\/v1\/2023.eacl-main.148","article-title":"MTEB: Massive text embedding benchmark","volume-title":"Proceedings of the 17th Conference of the European Chapter of the Association for Computational 
Linguistics","author":"Muennighoff","year":"2023"},{"key":"2024120918375040000_bib75","doi-asserted-by":"publisher","first-page":"1797","DOI":"10.18653\/v1\/D18-1206","article-title":"Don\u2019t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Narayan","year":"2018"},{"key":"2024120918375040000_bib76","doi-asserted-by":"publisher","first-page":"4497","DOI":"10.18653\/v1\/P19-1442","article-title":"DisSent: Learning sentence representations from explicit discourse relations","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Nie","year":"2019"},{"key":"2024120918375040000_bib77","doi-asserted-by":"publisher","first-page":"4885","DOI":"10.18653\/v1\/2020.acl-main.441","article-title":"Adversarial NLI: A new benchmark for natural language understanding","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Nie","year":"2020"},{"key":"2024120918375040000_bib78","article-title":"Training language models to follow instructions with human feedback","volume-title":"Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \u2013 December 9, 2022","author":"Ouyang","year":"2022"},{"key":"2024120918375040000_bib79","first-page":"560","article-title":"SemEval 2016 task 11: Complex word identification","volume-title":"Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)","author":"Paetzold","year":"2016"},{"key":"2024120918375040000_bib80","doi-asserted-by":"publisher","first-page":"4153","DOI":"10.18653\/v1\/2021.naacl-main.327","article-title":"Probing for bridging inference in transformer language models","volume-title":"Proceedings of the 
2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Pandit","year":"2021"},{"key":"2024120918375040000_bib81","doi-asserted-by":"publisher","first-page":"5015","DOI":"10.18653\/v1\/2022.emnlp-main.335","article-title":"COPEN: Probing conceptual knowledge in pre-trained language models","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Peng","year":"2022"},{"key":"2024120918375040000_bib82","doi-asserted-by":"publisher","first-page":"1532","DOI":"10.3115\/v1\/D14-1162","article-title":"GloVe: Global vectors for word representation","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Pennington","year":"2014"},{"key":"2024120918375040000_bib83","article-title":"Efficient benchmarking (of language models)","author":"Perlitz","year":"2023","journal-title":"CoRR"},{"key":"2024120918375040000_bib84","article-title":"Benchmark agreement testing done right: A guide for llm benchmark evaluation","author":"Perlitz","year":"2024","journal-title":"CoRR"},{"key":"2024120918375040000_bib85","article-title":"How context affects language models\u2019 factual predictions","volume-title":"Conference on Automated Knowledge Base Construction, AKBC 2020, Virtual, June 22\u201324, 2020","author":"Petroni","year":"2020"},{"key":"2024120918375040000_bib86","doi-asserted-by":"publisher","first-page":"2463","DOI":"10.18653\/v1\/D19-1250","article-title":"Language models as knowledge bases?","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Petroni","year":"2019"},{"key":"2024120918375040000_bib87","doi-asserted-by":"publisher","first-page":"2463","DOI":"10.18653\/v1\/D19-1250","article-title":"Language models as knowledge 
bases?","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3\u20137, 2019","author":"Petroni","year":"2019"},{"key":"2024120918375040000_bib88","doi-asserted-by":"publisher","first-page":"4609","DOI":"10.18653\/v1\/2020.acl-main.420","article-title":"Information-theoretic probing for linguistic structure","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Pimentel","year":"2020"},{"issue":"8","key":"2024120918375040000_bib89","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI blog"},{"issue":"140","key":"2024120918375040000_bib90","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"2024120918375040000_bib91","doi-asserted-by":"publisher","DOI":"10.4324\/9781315843186","volume-title":"A Short History of Linguistics","author":"Robins","year":"2013"},{"key":"2024120918375040000_bib92","doi-asserted-by":"publisher","first-page":"4486","DOI":"10.18653\/v1\/2021.acl-long.346","article-title":"Evaluation examples are not equally informative: How should that change NLP leaderboards?","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Rodriguez","year":"2021"},{"key":"2024120918375040000_bib93","doi-asserted-by":"publisher","first-page":"842","DOI":"10.1162\/tacl_a_00349","article-title":"A primer in BERTology: What we know about how BERT 
works","volume":"8","author":"Rogers","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024120918375040000_bib94","doi-asserted-by":"publisher","first-page":"944","DOI":"10.18653\/v1\/D18-1114","article-title":"Neural-Davidsonian semantic proto-role labeling","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Rudinger","year":"2018"},{"key":"2024120918375040000_bib95","doi-asserted-by":"publisher","first-page":"731","DOI":"10.18653\/v1\/N18-1067","article-title":"Neural models of factuality","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Rudinger","year":"2018"},{"volume-title":"Cours de linguistique g\u00e9n\u00e9rale","year":"1916","author":"de Saussure","key":"2024120918375040000_bib96"},{"key":"2024120918375040000_bib97","article-title":"Are emergent abilities of large language models a mirage?","volume-title":"Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \u2013 16, 2023","author":"Schaeffer","year":"2023"},{"key":"2024120918375040000_bib98","doi-asserted-by":"publisher","first-page":"4486","DOI":"10.18653\/v1\/2021.findings-emnlp.382","article-title":"A multilabel approach to morphosyntactic probing","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Shapiro","year":"2021"},{"key":"2024120918375040000_bib99","article-title":"The truth is in there: Improving reasoning in language models with layer-selective rank reduction","author":"Sharma","year":"2023","journal-title":"CoRR"},{"key":"2024120918375040000_bib100","doi-asserted-by":"publisher","first-page":"1526","DOI":"10.18653\/v1\/D16-1159","article-title":"Does string-based neural MT 
learn source syntax?","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing","author":"Shi","year":"2016"},{"key":"2024120918375040000_bib101","first-page":"2897","article-title":"A gold standard dependency corpus for English","volume-title":"Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC\u201914)","author":"Silveira","year":"2014"},{"key":"2024120918375040000_bib102","doi-asserted-by":"crossref","first-page":"1631","DOI":"10.18653\/v1\/D13-1170","article-title":"Recursive deep models for semantic compositionality over a sentiment treebank","volume-title":"Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing","author":"Socher","year":"2013"},{"key":"2024120918375040000_bib103","article-title":"Beyond the imitation game: Quantifying and extrapolating the capabilities of language models","author":"Srivastava","year":"2022","journal-title":"CoRR"},{"key":"2024120918375040000_bib104","doi-asserted-by":"publisher","DOI":"10.1075\/celcr.14","volume-title":"A method for linguistic metaphor identification","author":"Steen","year":"2010"},{"key":"2024120918375040000_bib105","article-title":"LAB: large-scale alignment for chatbots","author":"Sudalairaj","year":"2024","journal-title":"CoRR"},{"key":"2024120918375040000_bib106","doi-asserted-by":"publisher","first-page":"38","DOI":"10.3115\/1572306.1572314","article-title":"The BioScope corpus: Annotation for negation, uncertainty and their scope in biomedical texts","volume-title":"Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing","author":"Szarvas","year":"2008"},{"key":"2024120918375040000_bib107","doi-asserted-by":"publisher","first-page":"743","DOI":"10.1162\/tacl_a_00342","article-title":"oLMpics-on what language model pre-training captures","volume":"8","author":"Talmor","year":"2020","journal-title":"Transactions of the Association for Computational 
Linguistics"},{"key":"2024120918375040000_bib108","doi-asserted-by":"publisher","first-page":"743","DOI":"10.1162\/tacl_a_00342","article-title":"oLMpics-On what language model pre-training captures","volume":"8","author":"Talmor","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024120918375040000_bib109","article-title":"UL2: Unifying language learning paradigms","volume-title":"The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\u20135, 2023","author":"Tay","year":"2023"},{"key":"2024120918375040000_bib110","doi-asserted-by":"publisher","first-page":"4593","DOI":"10.18653\/v1\/P19-1452","article-title":"BERT rediscovers the classical NLP pipeline","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Tenney","year":"2019"},{"key":"2024120918375040000_bib111","article-title":"What do you learn from context? Probing for sentence structure in contextualized word representations","volume-title":"7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\u20139, 2019","author":"Tenney","year":"2019"},{"key":"2024120918375040000_bib112","article-title":"BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models","volume-title":"Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual","author":"Thakur","year":"2021"},{"key":"2024120918375040000_bib113","doi-asserted-by":"publisher","first-page":"197","DOI":"10.18653\/v1\/2020.emnlp-main.15","article-title":"Intrinsic probing through dimension selection","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Hennigen","year":"2020"},{"key":"2024120918375040000_bib114","article-title":"Llama 2: Open foundation and fine-tuned chat 
models","author":"Touvron","year":"2023","journal-title":"CoRR"},{"key":"2024120918375040000_bib115","doi-asserted-by":"publisher","first-page":"249","DOI":"10.18653\/v1\/2022.blackboxnlp-1.20","article-title":"It is not easy to detect paraphrases: Analysing semantic similarity with antonyms and negation using the new SemAntoNeg benchmark","volume-title":"Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP","author":"Vahtola","year":"2022"},{"key":"2024120918375040000_bib116","doi-asserted-by":"publisher","first-page":"2906","DOI":"10.18653\/v1\/P19-1280","article-title":"Fine-grained temporal relation extraction","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Vashishtha","year":"2019"},{"key":"2024120918375040000_bib117","article-title":"Diagnostic classifiers revealing how neural networks process hierarchical structure","volume-title":"Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016","author":"Veldhoen","year":"2016"},{"key":"2024120918375040000_bib118","doi-asserted-by":"publisher","first-page":"183","DOI":"10.18653\/v1\/2020.emnlp-main.14","article-title":"Information-theoretic probing with minimum description length","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Voita","year":"2020"},{"key":"2024120918375040000_bib119","first-page":"2197","article-title":"Dive into the chasm: Probing the gap between in- and cross-topic generalization","volume-title":"Findings of the Association for Computational Linguistics: EACL 2024","author":"Waldis","year":"2024"},{"key":"2024120918375040000_bib120","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.795","article-title":"How to handle 
different types of out-of-distribution scenarios in computational argumentation? A comprehensive and fine-grained field study","author":"Waldis","year":"2024","journal-title":"CoRR"},{"key":"2024120918375040000_bib121","article-title":"SuperGLUE: A stickier benchmark for general-purpose language understanding systems","volume-title":"Advances in Neural Information Processing Systems","author":"Wang","year":"2019"},{"key":"2024120918375040000_bib122","article-title":"GLUE: A multi-task benchmark and analysis platform for natural language understanding","volume-title":"7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\u20139, 2019","author":"Wang","year":"2019"},{"key":"2024120918375040000_bib123","article-title":"Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models","volume-title":"Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual","author":"Wang","year":"2021"},{"key":"2024120918375040000_bib124","article-title":"How far can camels go? 
Exploring the state of instruction tuning on open resources","volume-title":"Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \u2013 16, 2023","author":"Wang","year":"2023"},{"key":"2024120918375040000_bib125","doi-asserted-by":"publisher","first-page":"5085","DOI":"10.18653\/v1\/2022.emnlp-main.340","article-title":"Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Wang","year":"2022"},{"key":"2024120918375040000_bib126","doi-asserted-by":"publisher","first-page":"377","DOI":"10.1162\/tacl_a_00321","article-title":"BLiMP: The benchmark of linguistic minimal pairs for English","volume":"8","author":"Warstadt","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024120918375040000_bib127","article-title":"The Penn discourse treebank 3.0 annotation manual","author":"Webber","year":"2019","journal-title":"Philadelphia, University of Pennsylvania"},{"key":"2024120918375040000_bib128","first-page":"170","article-title":"Ontonotes release 5.0","volume":"23","author":"Weischedel","year":"2013","journal-title":"Linguistic Data Consortium, Philadelphia, PA"},{"key":"2024120918375040000_bib129","doi-asserted-by":"publisher","first-page":"1713","DOI":"10.18653\/v1\/D16-1177","article-title":"Universal decompositional semantics on Universal Dependencies","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing","author":"White","year":"2016"},{"key":"2024120918375040000_bib130","doi-asserted-by":"publisher","first-page":"4166","DOI":"10.18653\/v1\/2020.acl-main.383","article-title":"Perturbed masking: Parameter-free probing for analyzing and interpreting BERT","volume-title":"Proceedings of the 58th Annual Meeting 
of the Association for Computational Linguistics","author":"Wu","year":"2020"},{"key":"2024120918375040000_bib131","article-title":"WizardLM: Empowering large language models to follow complex instructions","author":"Xu","year":"2023","journal-title":"CoRR"},{"key":"2024120918375040000_bib132","article-title":"ComPEFT: Compression for communicating parameter efficient updates via sparsification and quantization","author":"Yadav","year":"2023","journal-title":"CoRR"},{"key":"2024120918375040000_bib133","doi-asserted-by":"publisher","first-page":"12731","DOI":"10.18653\/v1\/2023.findings-acl.806","article-title":"GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Yang","year":"2023"},{"key":"2024120918375040000_bib134","article-title":"Probelm: Plausibility ranking evaluation for language models","author":"Yuan","year":"2024","journal-title":"CoRR"},{"issue":"3","key":"2024120918375040000_bib135","doi-asserted-by":"publisher","first-page":"581","DOI":"10.1007\/s10579-016-9343-x","article-title":"The GUM corpus: Creating multilayer resources in the classroom","volume":"51","author":"Zeldes","year":"2017","journal-title":"Language Resources and Evaluation"},{"key":"2024120918375040000_bib136","doi-asserted-by":"publisher","first-page":"292","DOI":"10.18653\/v1\/2020.blackboxnlp-1.27","article-title":"Do language embeddings capture scales?","volume-title":"Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP","author":"Zhang","year":"2020"},{"key":"2024120918375040000_bib137","doi-asserted-by":"publisher","first-page":"1112","DOI":"10.18653\/v1\/2021.acl-long.90","article-title":"When do you need billions of words of pretraining data?","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International 
Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Zhang","year":"2021"},{"key":"2024120918375040000_bib138","article-title":"Judging LLM-as-a-judge with MT-Bench and Chatbot Arena","volume-title":"Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \u2013 16, 2023","author":"Zheng","year":"2023"},{"key":"2024120918375040000_bib139","article-title":"LIMA: Less is more for alignment","volume-title":"Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \u2013 16, 2023","author":"Zhou","year":"2023"},{"key":"2024120918375040000_bib140","doi-asserted-by":"publisher","first-page":"11534","DOI":"10.18653\/v1\/2022.emnlp-main.793","article-title":"Predicting fine-tuning performance with probing","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Zhu","year":"2022"},{"key":"2024120918375040000_bib141","doi-asserted-by":"publisher","first-page":"4132","DOI":"10.18653\/v1\/2022.findings-acl.326","article-title":"On the data requirements of probing","volume-title":"Findings of the Association for Computational Linguistics: ACL 2022","author":"Zhu","year":"2022"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00718\/2482684\/tacl_a_00718.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00718\/2482684\/tacl_a_00718.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,9]],"date-time":"2024-12-09T18:38:20Z","timestamp":1733769500000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00718\/125534\/Holmes-A-Benchmark-to-Assess-the-Linguistic"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":140,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00718","relation":{},"ISSN":["2307-387X"],"issn-type":[{"type":"electronic","value":"2307-387X"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}