{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,29]],"date-time":"2026-03-29T10:08:33Z","timestamp":1774778913795,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":115,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,8,12]],"date-time":"2024-08-12T00:00:00Z","timestamp":1723420800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Stanford Institute for Human-Centered Artificial Intelligence"},{"name":"OpenAI"},{"name":"Stanford Accelerator for Learning"},{"name":"McCoy Family Center for Ethics in Society"},{"name":"Center for Research on Foundation Models"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,8,12]]},"DOI":"10.1145\/3632620.3671097","type":"proceedings-article","created":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T12:47:29Z","timestamp":1722948449000},"page":"452-468","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["Using Benchmarking Infrastructure to Evaluate LLM Performance on CS Concept Inventories: Challenges, Opportunities, and Critiques"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2998-3165","authenticated-orcid":false,"given":"Murtaza","family":"Ali","sequence":"first","affiliation":[{"name":"University of Washington, United States of America"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-0229-7073","authenticated-orcid":false,"given":"Prerna","family":"Rao","sequence":"additional","affiliation":[{"name":"University of Washington, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-7270-2607","authenticated-orcid":false,"given":"Yifan","family":"Mai","sequence":"additional","affiliation":[{"name":"Stanford University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3275-992X","authenticated-orcid":false,"given":"Benjamin","family":"Xie","sequence":"additional","affiliation":[{"name":"Stanford University, United States of America"}]}],"member":"320","published-online":{"date-parts":[[2024,8,12]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Vibhor Agarwal Nakul Thureja Madhav\u00a0Krishan Garg Sahiti Dharmavaram Meghna and Dhruv Kumar. 2024. \u201cWhich LLM should I use?\u201d: Evaluating LLMs for tasks performed by Undergraduate Computer Science Students in India. (Jan. 2024). arxiv:2402.01687\u00a0[cs.CY]"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i21.30362"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3568813.3600120"},{"key":"e_1_3_2_1_4_1","volume-title":"Introduction to Measurement Theory","author":"Allen J","unstructured":"Mary\u00a0J Allen and Wendy\u00a0M Yen. 2001. Introduction to Measurement Theory. Waveland Press."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/s40593-016-0105-0"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3587102.3588852"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.3847\/AER2006020"},{"key":"e_1_3_2_1_8_1","volume-title":"A neural probabilistic language model. Advances in neural information processing systems 13","author":"Bengio Yoshua","year":"2000","unstructured":"Yoshua Bengio, R\u00e9jean Ducharme, and Pascal Vincent. 2000. A neural probabilistic language model. Advances in neural information processing systems 13 (2000)."},{"key":"e_1_3_2_1_9_1","volume-title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research","year":"2023","unstructured":"BIG-bench authors. 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research (2023)."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1017\/9781108654555.004"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3593013.3593996"},{"key":"e_1_3_2_1_12_1","volume-title":"Risks of AI Foundation Models in Education. (Oct","author":"Blodgett Su\u00a0Lin","year":"2021","unstructured":"Su\u00a0Lin Blodgett and Michael Madaio. 2021. Risks of AI Foundation Models in Education. (Oct. 2021). arxiv:2110.10024\u00a0[cs.CY]"},{"key":"e_1_3_2_1_13_1","volume-title":"On the Opportunities and Risks of Foundation Models. (Aug","author":"Bommasani Rishi","year":"2021","unstructured":"Rishi Bommasani, Drew\u00a0A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael\u00a0S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared\u00a0Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel\u00a0E Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang\u00a0Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang\u00a0Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher\u00a0D Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan\u00a0Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon\u00a0Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher R\u00e9, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin\u00a0W Thomas, Florian Tram\u00e8r, Rose\u00a0E Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang\u00a0Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2021. On the Opportunities and Risks of Foundation Models. (Aug. 2021). arxiv:2108.07258\u00a0[cs.LG]"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3501385.3543971"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3304221.3319771"},{"key":"e_1_3_2_1_16_1","volume-title":"A Survey on Evaluation of Large Language Models. (July","author":"Chang Yupeng","year":"2023","unstructured":"Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip\u00a0S Yu, Qiang Yang, and Xing Xie. 2023. A Survey on Evaluation of Large Language Models. (July 2023). arxiv:2307.03109\u00a0[cs.CL]"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3622841"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1119\/1.1374249"},{"key":"e_1_3_2_1_19_1","volume-title":"The theory and practice of item response theory","author":"De\u00a0Ayala R\u00a0J","unstructured":"R\u00a0J De\u00a0Ayala. 2009. The theory and practice of item response theory. Guilford Press, New York."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3545945.3569822"},{"key":"e_1_3_2_1_21_1","volume-title":"The Benchmark Lottery. (July","author":"Dehghani Mostafa","year":"2021","unstructured":"Mostafa Dehghani, Yi Tay, Alexey\u00a0A Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, and Oriol Vinyals. 2021. The Benchmark Lottery. (July 2021). arxiv:2107.07002\u00a0[cs.LG]"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626252.3630863"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3442188.3445924"},{"key":"e_1_3_2_1_24_1","unstructured":"Qingxiu Dong Lei Li Damai Dai Ce Zheng Zhiyong Wu Baobao Chang Xu Sun Jingjing Xu Lei Li and Zhifang Sui. 2023. A Survey on In-context Learning. arxiv:2301.00234\u00a0[cs.CL]"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.2044-8317.1985.tb00817.x"},{"key":"e_1_3_2_1_26_1","volume-title":"Proceedings of the ninth international conference on mathematics education in a global community, Vol.\u00a09. Citeseer, 165\u2013170","author":"Epstein Jerome","year":"2007","unstructured":"Jerome Epstein. 2007. Development and validation of the Calculus Concept Inventory. In Proceedings of the ninth international conference on mathematics education in a global community, Vol.\u00a09. Citeseer, 165\u2013170."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/5657.001.0001"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1080\/1369183X.1979.9975576"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3576123.3576134"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3576123.3576134"},{"key":"e_1_3_2_1_31_1","volume-title":"Introduction to Women\u2019s and Gender Studies: An Interdisciplinary Approach","author":"Gillis J","unstructured":"Melissa\u00a0J Gillis and Andrew\u00a0T Jacobs. 2019. Introduction to Women\u2019s and Gender Studies: An Interdisciplinary Approach. Oxford University Press."},{"key":"e_1_3_2_1_32_1","unstructured":"Global Future Council on Artificial Intelligence for Humanity. 2022. A Blueprint for Equity and Inclusion in Artificial Intelligence. Technical Report. World Economic Forum."},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1352322.1352226"},{"key":"e_1_3_2_1_34_1","volume-title":"Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks. (March","author":"Gong Linyuan","year":"2024","unstructured":"Linyuan Gong, Sida Wang, Mostafa Elhoushi, and Alvin Cheung. 2024. Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks. (March 2024). arxiv:2403.04814\u00a0[cs.CL]"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2306.12424"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1119\/1.14030"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1080\/08993408.2017.1414728"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3470652"},{"key":"e_1_3_2_1_39_1","volume-title":"Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938","author":"Hendrycks Dan","year":"2021","unstructured":"Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, 2021. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938 (2021)."},{"key":"e_1_3_2_1_40_1","volume-title":"Measuring Mathematical Problem Solving With the MATH Dataset. (March","author":"Hendrycks Dan","year":"2021","unstructured":"Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. (March 2021). arxiv:2103.03874\u00a0[cs.LG]"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3545945.3569762"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/1734263.1734298"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/1734263.1734335"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1119\/1.2343497"},{"key":"e_1_3_2_1_45_1","unstructured":"Charles\u00a0L Hulin Fritz Drasgow and Charles\u00a0K Parsons. 1983. Item Response Theory: Application to Psychological Measurement. Dow Jones-Irwin."},{"key":"e_1_3_2_1_46_1","volume-title":"Statistical methods for speech recognition","author":"Jelinek Frederick","unstructured":"Frederick Jelinek. 1998. Statistical methods for speech recognition. MIT press."},{"key":"e_1_3_2_1_47_1","unstructured":"Hong Jiao and Robert\u00a0W Lissitz. 2020. Application of Artificial Intelligence to Assessment. IAP."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.3390\/app11146421"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626252.3630803"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/1734263.1734299"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1111\/jedm.12000"},{"key":"e_1_3_2_1_52_1","volume-title":"Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting. (Jan","author":"Kaneko Masahiro","year":"2024","unstructured":"Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki, and Timothy Baldwin. 2024. Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting. (Jan. 2024). arxiv:2401.15585\u00a0[cs.CL]"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2538862.2538902"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.4236\/psych.2018.911145"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1080\/10401334.2016.1146608"},{"key":"e_1_3_2_1_56_1","volume-title":"Comparing Code Explanations Created by Students and Large Language Models. (April","author":"Leinonen Juho","year":"2023","unstructured":"Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. Comparing Code Explanations Created by Students and Large Language Models. (April 2023). arxiv:2304.03938\u00a0[cs.CY]"},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3545945.3569770"},{"key":"e_1_3_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/2493394.2493415"},{"key":"e_1_3_2_1_59_1","volume-title":"Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461","author":"Lewis Mike","year":"2019","unstructured":"Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)."},{"key":"e_1_3_2_1_60_1","volume-title":"Task Contamination: Language Models May Not Be Few-Shot Anymore. (Dec.","author":"Li Changmao","year":"2023","unstructured":"Changmao Li and Jeffrey Flanigan. 2023. Task Contamination: Language Models May Not Be Few-Shot Anymore. (Dec. 2023). arxiv:2312.16337\u00a0[cs.CL]"},{"key":"e_1_3_2_1_61_1","volume-title":"Holistic Evaluation of Language Models. (Nov","author":"Liang Percy","year":"2022","unstructured":"Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher\u00a0D Manning, Christopher R\u00e9, Diana Acosta-Navas, Drew\u00a0A Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang\u00a0Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2022. Holistic Evaluation of Language Models. (Nov. 2022). arxiv:2211.09110\u00a0[cs.CL]"},{"key":"e_1_3_2_1_62_1","unstructured":"Percy Liang Yifan Mai Josselin Somerville Farzaan Kaiyom Tony Lee and Rishi Bommasani. 2023. HELM Lite: Lightweight and Broad Capabilities Evaluation. https:\/\/crfm.stanford.edu\/2023\/12\/19\/helm-lite.html. Accessed: 2024-3-2."},{"key":"e_1_3_2_1_63_1","volume-title":"Concept Inventories in Higher Education Science. In National Research Council Promising Practices in Undergraduate STEM Education Workshop, Vol.\u00a013","author":"Libarkin Julie","year":"2008","unstructured":"Julie Libarkin. 2008. Concept Inventories in Higher Education Science. In National Research Council Promising Practices in Undergraduate STEM Education Workshop, Vol.\u00a013. 14."},{"key":"e_1_3_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.5408\/1089-9995-53.4.394"},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/3545945.3569785"},{"key":"e_1_3_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/3501709.3544280"},{"key":"e_1_3_2_1_67_1","volume-title":"fairness","author":"Madaio Michael","year":"2021","unstructured":"Michael Madaio, Su\u00a0Lin Blodgett, Elijah Mayfield, and Ezekiel Dixon-Rom\u00e1n. 2021. Beyond \u201cfairness:\u201d structural (in)justice lenses on AI for education. (May 2021). arxiv:2105.08847\u00a0[cs.CY]"},{"key":"e_1_3_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3610969.3610982"},{"key":"e_1_3_2_1_69_1","doi-asserted-by":"publisher","unstructured":"Wojciech Malec. 2024. Investigating the quality of AI-generated distractors for a multiple-choice vocabulary test. https:\/\/doi.org\/10.5220\/0012762400003693","DOI":"10.5220\/0012762400003693"},{"key":"e_1_3_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/3291279.3339409"},{"key":"e_1_3_2_1_71_1","unstructured":"Nestor Maslej Loredana Fattorini Erik Brynjolfsson John Etchemendy Katrina Ligett Terah Lyons James Manyika Helen Ngo Juan\u00a0Carlos Niebles Vanessa Parli Yoav Shoham Russell Wald Jack Clark and Raymond Perrault. 2023. The AI Index 2023 Annual Report. Technical Report. Stanford University."},{"key":"e_1_3_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/2858036.2858349"},{"key":"e_1_3_2_1_73_1","doi-asserted-by":"crossref","unstructured":"Daniel McCaffrey Jodi Casabianca Kathryn Ricker-Pedley Ren\u00e9 Lawless and Cathy Wendler. 2021. Best Practices for Constructed-Response Scoring. Technical Report. Educational Testing Services.","DOI":"10.1002\/ets2.12358"},{"key":"e_1_3_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3291279.3339401"},{"key":"e_1_3_2_1_75_1","volume-title":"Educational Measurement","author":"Messick Samuel","unstructured":"Samuel Messick. 1993. Validity. In Educational Measurement. Third Edition. American Council on Education Series on Higher Education. Oryx Press, 4041 North Central at Indian School Road, Phoenix, AZ 85012-3397., 13\u2013103."},{"key":"e_1_3_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1037\/0003-066X.50.9.741"},{"key":"e_1_3_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/2960310.2960330"},{"key":"e_1_3_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1021\/ed079p739"},{"key":"e_1_3_2_1_79_1","volume-title":"StereoSet: Measuring stereotypical bias in pretrained language models. (April","author":"Nadeem Moin","year":"2020","unstructured":"Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring stereotypical bias in pretrained language models. (April 2020). arxiv:2004.09456\u00a0[cs.CL]"},{"key":"e_1_3_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1145\/3230977.3230992"},{"key":"e_1_3_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1145\/3587102.3588794"},{"key":"e_1_3_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1145\/2960310.2960316"},{"key":"e_1_3_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.findings-acl.165"},{"key":"e_1_3_2_1_84_1","doi-asserted-by":"publisher","unstructured":"Baolin Peng Xiujun Li Lihong Li Jianfeng Gao Asli Celikyilmaz Sungjin Lee and Kam-Fai Wong. 2017. Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning. (2017). https:\/\/doi.org\/10.18653\/v1\/d17-1237","DOI":"10.18653\/v1\/d17-1237"},{"key":"e_1_3_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1145\/3613372.3614197"},{"key":"e_1_3_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.1145\/3291279.3339404"},{"key":"e_1_3_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.1145\/3623762.3633499"},{"key":"e_1_3_2_1_88_1","article-title":"Development and validation of the middle grades computer science concept inventory (MG-CSCI) assessment","volume":"16","author":"Rachmatullah Arif","year":"2020","unstructured":"Arif Rachmatullah, Bita Akram, Danielle Boulden, Bradford Mott, Kristy Boyer, James Lester, and Eric Wiebe. 2020. Development and validation of the middle grades computer science concept inventory (MG-CSCI) assessment. EURASIA Journal of Mathematics, Science and Technology Education 16, 5 (2020), em1841.","journal-title":"EURASIA Journal of Mathematics, Science and Technology Education"},{"key":"e_1_3_2_1_89_1","volume-title":"AI and the Everything in the Whole Wide World Benchmark. (Nov","author":"Raji Inioluwa\u00a0Deborah","year":"2021","unstructured":"Inioluwa\u00a0Deborah Raji, Emily\u00a0M Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. 2021. AI and the Everything in the Whole Wide World Benchmark. (Nov. 2021). arxiv:2111.15366\u00a0[cs.LG]"},{"key":"e_1_3_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1145\/3287324.3287504"},{"key":"e_1_3_2_1_91_1","volume-title":"NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. (Oct","author":"Sainz Oscar","year":"2023","unstructured":"Oscar Sainz, Jon\u00a0Ander Campos, Iker Garc\u00eda-Ferrero, Julen Etxaniz, Oier\u00a0Lopez de Lacalle, and Eneko Agirre. 2023. NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. (Oct. 2023). arxiv:2310.18018\u00a0[cs.CL]"},{"key":"e_1_3_2_1_92_1","doi-asserted-by":"publisher","DOI":"10.1145\/3576882.3617909"},{"key":"e_1_3_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1145\/3501385.3543957"},{"key":"e_1_3_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.1145\/3501385.3543957"},{"key":"e_1_3_2_1_95_1","volume-title":"Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models. (April","author":"S\u00e4uberli Andreas","year":"2024","unstructured":"Andreas S\u00e4uberli and Simon Clematide. 2024. Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models. (April 2024). arxiv:2404.07720\u00a0[cs.CL]"},{"key":"e_1_3_2_1_96_1","doi-asserted-by":"publisher","DOI":"10.1145\/3568813.3600142"},{"key":"e_1_3_2_1_97_1","volume-title":"Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions about Code. (March","author":"Savelka Jaromir","year":"2023","unstructured":"Jaromir Savelka, Arav Agarwal, Christopher Bogart, and Majd Sakr. 2023. Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions about Code. (March 2023). arxiv:2303.08033\u00a0[cs.CL]"},{"key":"e_1_3_2_1_98_1","volume-title":"Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses? (March","author":"Savelka Jaromir","year":"2023","unstructured":"Jaromir Savelka, Arav Agarwal, Christopher Bogart, Yifan Song, and Majd Sakr. 2023. Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses? (March 2023). arxiv:2303.09325\u00a0[cs.AI]"},{"key":"e_1_3_2_1_99_1","unstructured":"Stanford Center for Research on Foundation Models. 2022. Ecosystem Graphs for Foundation Models. https:\/\/crfm.stanford.edu\/ecosystem-graphs\/index.html?mode=table. Accessed: 2024-3-12."},{"key":"e_1_3_2_1_100_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626252.3630822"},{"key":"e_1_3_2_1_101_1","doi-asserted-by":"publisher","DOI":"10.1080\/08993408.2014.970779"},{"key":"e_1_3_2_1_102_1","doi-asserted-by":"publisher","DOI":"10.1145\/1953163.1953200"},{"key":"e_1_3_2_1_103_1","unstructured":"The National Science Foundation and The Institute of Education Sciences. 2018. Companion Guidelines on Replication & Reproducibility in Education Research. Technical Report. NSF and IES."},{"key":"e_1_3_2_1_104_1","doi-asserted-by":"publisher","DOI":"10.1080\/08993408.2014.970782"},{"key":"e_1_3_2_1_105_1","doi-asserted-by":"publisher","DOI":"10.1145\/3386252"},{"key":"e_1_3_2_1_106_1","unstructured":"Jules White Quchen Fu Sam Hays Michael Sandborn Carlos Olea Henry Gilbert Ashraf Elnashar Jesse Spencer-Smith and Douglas\u00a0C. Schmidt. 2023. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arxiv:2302.11382\u00a0[cs.SE]"},{"key":"e_1_3_2_1_107_1","doi-asserted-by":"publisher","DOI":"10.1145\/2839509.2844629"},{"key":"e_1_3_2_1_108_1","unstructured":"Ben Williamson. 2024. AI in education is a public problem. https:\/\/codeactsineducation.wordpress.com\/2024\/02\/22\/ai-in-education-is-a-public-problem\/. Accessed: 2024-5-30."},{"key":"e_1_3_2_1_109_1","doi-asserted-by":"publisher","DOI":"10.1109\/JAS.2023.123618"},{"key":"e_1_3_2_1_110_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.bea-1.52"},{"key":"e_1_3_2_1_111_1","first-page":"1","volume-title":"SIGCSE 2019","author":"Xie Benjamin","year":"2019","unstructured":"Benjamin Xie. 2019. Supplementary Info for \u201dAn Item Response Theory Evaluation of a Language-Independent CS1 Knowledge Assessment\u201d (Xie et al. SIGCSE 2019). https:\/\/github.com\/codeandcognition\/archive-2019sigcse-xie. Accessed: 2024-1-15."},{"key":"e_1_3_2_1_112_1","doi-asserted-by":"publisher","DOI":"10.1145\/3287324.3287370"},{"key":"e_1_3_2_1_113_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.hcc.2024.100211"},{"key":"e_1_3_2_1_114_1","unstructured":"Wayne\u00a0Xin Zhao Kun Zhou Junyi Li Tianyi Tang Xiaolei Wang Yupeng Hou Yingqian Min Beichen Zhang Junjie Zhang Zican Dong Yifan Du Chen Yang Yushuo Chen Zhipeng Chen Jinhao Jiang Ruiyang Ren Yifan Li Xinyu Tang Zikang Liu Peiyu Liu Jian-Yun Nie and Ji-Rong Wen. 2023. A Survey of Large Language Models. (2023). arxiv:2303.18223\u00a0[cs.CL]"},{"key":"e_1_3_2_1_115_1","volume-title":"Don\u2019t Make Your LLM an Evaluation Benchmark Cheater. (Nov","author":"Zhou Kun","year":"2023","unstructured":"Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne\u00a0Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. 2023. Don\u2019t Make Your LLM an Evaluation Benchmark Cheater. (Nov. 2023). arxiv:2311.01964\u00a0[cs.CL]"}],"event":{"name":"ICER 2024: ACM Conference on International Computing Education Research","location":"Melbourne VIC Australia","acronym":"ICER 2024","sponsor":["SIGCSE ACM Special Interest Group on Computer Science Education"]},"container-title":["Proceedings of the 2024 ACM Conference on International Computing Education Research - Volume 1"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3632620.3671097","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3632620.3671097","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T00:34:22Z","timestamp":1755909262000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3632620.3671097"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,12]]},"references-count":115,"alternative-id":["10.1145\/3632620.3671097","10.1145\/3632620"],"URL":"https:\/\/doi.org\/10.1145\/3632620.3671097","relation":{},"subject":[],"published":{"date-parts":[[2024,8,12]]},"assertion":[{"value":"2024-08-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}