{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T15:21:20Z","timestamp":1770996080446,"version":"3.50.1"},"reference-count":104,"publisher":"Association for Computing Machinery (ACM)","issue":"3","funder":[{"name":"JST CRONOS","award":["JPMJCS24K8"],"award-info":[{"award-number":["JPMJCS24K8"]}]},{"name":"JSPS KAKENHI","award":["JP21H04877, No. JP23H03372, and No. JP24K02920"],"award-info":[{"award-number":["JP21H04877, No. JP23H03372, and No. JP24K02920"]}]},{"name":"Autoware Foundation"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Softw. Eng. Methodol."],"published-print":{"date-parts":[[2026,3,31]]},"abstract":"<jats:p>Performance evaluation plays a crucial role in the development lifecycle of large language models (LLMs). It estimates the model\u2019s capability, elucidates behavior characteristics, and facilitates the identification of potential issues and limitations, thereby guiding further improvement. Given that LLMs\u2019 diverse task-handling abilities stem from large volumes of training data, a comprehensive evaluation also necessitates abundant, well-annotated, and representative test data to assess LLM performance across various downstream tasks. However, the demand for high-quality test data often entails substantial time, computational resources, and manual efforts, sometimes causing the evaluation to be inefficient or impractical. To address these challenges, researchers propose active testing, which estimates the overall performance by selecting a subset of test data. Nevertheless, the existing active testing methods tend to be inefficient, even inapplicable, given the unique new challenges of LLMs (e.g., diverse task types, increased model complexity, and unavailability of training data). 
To mitigate such limitations and expedite the development cycle of LLMs, in this work, we introduce AcTracer, an active testing framework tailored for LLMs that strategically selects a small subset of test data to achieve a more accurate performance estimation for LLMs. AcTracer utilizes both internal and external information from LLMs to guide the test sampling process, reducing variance through a multi-stage pool-based active selection. Our experiment results demonstrate that AcTracer achieves state-of-the-art performance compared to existing methods across various tasks.<\/jats:p>","DOI":"10.1145\/3744340","type":"journal-article","created":{"date-parts":[[2025,8,1]],"date-time":"2025-08-01T15:18:43Z","timestamp":1754061523000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling"],"prefix":"10.1145","volume":"35","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3666-4020","authenticated-orcid":false,"given":"Yuheng","family":"Huang","sequence":"first","affiliation":[{"name":"The University of Tokyo, Tokyo, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-7093-9781","authenticated-orcid":false,"given":"Jiayang","family":"Song","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8251-1669","authenticated-orcid":false,"given":"Qiang","family":"Hu","sequence":"additional","affiliation":[{"name":"Tianjin University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0857-8611","authenticated-orcid":false,"given":"Felix","family":"Juefei-Xu","sequence":"additional","affiliation":[{"name":"New York University, New York, New York, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8621-2420","authenticated-orcid":false,"given":"Lei","family":"Ma","sequence":"additional","affiliation":[{"name":"University of Alberta, Edmonton, Alberta, Canada and The University of Tokyo, Tokyo, Japan"}]}],"member":"320","published-online":{"date-parts":[[2026,2,13]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Marah Abdin Jyoti Aneja Harkirat Behl S\u00e9bastien Bubeck Ronen Eldan Suriya Gunasekar Michael Harrison Russell J. Hewett Mojan Javaheripi Piero Kauffmann et al. 2024. Phi-4 technical report. arXiv:2412.08905. Retrieved from https:\/\/arxiv.org\/abs\/2412.08905"},{"key":"e_1_3_2_3_2","unstructured":"Martin Arjovsky Soumith Chintala and L\u00e9on Bottou. 2017. Wasserstein GAN. arXiv:1701.07875. Retrieved from https:\/\/arxiv.org\/abs\/1701.07875"},{"key":"e_1_3_2_4_2","doi-asserted-by":"crossref","unstructured":"Shir Ashury-Tahan Benjamin Sznajder Leshem Choshen Liat Ein-Dor Eyal Shnarch and Ariel Gera. 2024. Label-efficient model selection for text generation. arXiv:2402.07891. Retrieved from https:\/\/arxiv.org\/abs\/2402.07891","DOI":"10.18653\/v1\/2024.acl-long.456"},{"key":"e_1_3_2_5_2","unstructured":"Jacob Austin Augustus Odena Maxwell Nye Maarten Bosma Henryk Michalewski David Dohan Ellen Jiang Carrie Cai Michael Terry Quoc Le et al. 2021. Program synthesis with large language models. arXiv:2108.07732. Retrieved from https:\/\/arxiv.org\/abs\/2108.07732"},{"key":"e_1_3_2_6_2","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Azaria Amos","year":"2023","unstructured":"Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it\u2019s lying. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 
Retrieved from https:\/\/openreview.net\/forum?id=y2V6YgLaW7"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.ijcnlp-main.45"},{"key":"e_1_3_2_8_2","first-page":"552","volume-title":"Proceedings of International Conference on Machine Learning","author":"Bengio Yoshua","year":"2013","unstructured":"Yoshua Bengio, Gr\u00e9goire Mesnil, Yann Dauphin, and Salah Rifai. 2013. Better mixing via deep representations. In Proceedings of International Conference on Machine Learning. PMLR, 552\u2013560."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF00247653"},{"key":"e_1_3_2_10_2","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 33, 1877\u20131901.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems, Vol"},{"key":"e_1_3_2_11_2","unstructured":"Alexandra Carpentier Remi Munos and Andr\u00e1s Antos. 2015. Adaptive strategy for stratified Monte Carlo sampling. Journal of Machine Learning Research 16 (2015) 2231\u20132271."},{"key":"e_1_3_2_12_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Chen Chao","unstructured":"Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. INSIDE: LLMs\u2019 internal states retain the power of hallucination detection. In Proceedings of the 12th International Conference on Learning Representations. 
Retrieved from https:\/\/openreview.net\/forum?id=Zj12nzlQbz"},{"key":"e_1_3_2_13_2","first-page":"8301","volume-title":"Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing","author":"Hardy Chen Guiming","year":"2024","unstructured":"Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024. Humans or LLMs as the Judge? A study on Judgement Bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.), Association for Computational Linguistics, 8301\u20138327. Retrieved from https:\/\/aclanthology.org\/2024.emnlp-main.474"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394112"},{"key":"e_1_3_2_15_2","unstructured":"Mark Chen Jerry Tworek Heewoo Jun Qiming Yuan Henrique Ponde de Oliveira Pinto Jared Kaplan Harri Edwards Yuri Burda Nicholas Joseph Greg Brockman et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374. Retrieved from https:\/\/arxiv.org\/abs\/2107.03374"},{"key":"e_1_3_2_16_2","first-page":"17817","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"38","author":"Chen Yuheng","year":"2024","unstructured":"Yuheng Chen, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. 2024. Journey to the center of the knowledge neurons: Discoveries of language-independent knowledge neurons and degenerate knowledge neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 17817\u201317825."},{"key":"e_1_3_2_17_2","unstructured":"Karl Cobbe Vineet Kosaraju Mohammad Bavarian Mark Chen Heewoo Jun Lukasz Kaiser Matthias Plappert Jerry Tworek Jacob Hilton Reiichiro Nakano et al. 2021. Training verifiers to solve math word problems. arXiv:2110.14168. Retrieved from https:\/\/arxiv.org\/abs\/2110.14168"},{"key":"e_1_3_2_18_2","unstructured":"NVIDIA Corporation. 2024. NVIDIA cuVS. 
Retrieved from https:\/\/developer.nvidia.com\/cuvs"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.3934\/aci.2023008"},{"key":"e_1_3_2_20_2","first-page":"7658","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Deng Weijian","year":"2023","unstructured":"Weijian Deng, Yumin Suh, Stephen Gould, and Liang Zheng. 2023. Confidence and dispersity speak: Characterizing prediction matrix for unsupervised accuracy estimation. In Proceedings of the International Conference on Machine Learning. PMLR, 7658\u20137674."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01482"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.2006.871582"},{"key":"e_1_3_2_23_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=YicbFdNTTy"},{"key":"e_1_3_2_24_2","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The Llama 3 herd of models. arXiv:2407.21783. Retrieved from https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_2_25_2","unstructured":"Reuben Feinman Ryan R. Curtin Saurabh Shintre and Andrew B. Gardner. 2017. Detecting adversarial samples from artifacts. arXiv:1703.00410. 
Retrieved from https:\/\/arxiv.org\/abs\/1703.00410"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3395363.3397357"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00440-014-0583-7"},{"key":"e_1_3_2_28_2","unstructured":"Simon Frieder Luca Pinchetti Ryan-Rhys Griffiths Tommaso Salvatori Thomas Lukasiewicz Philipp Christian Petersen Alexis Chevalier and Julius Berner. 2023. Mathematical capabilities of ChatGPT. arXiv:2301.13867. Retrieved from https:\/\/arxiv.org\/abs\/2301.13867"},{"key":"e_1_3_2_29_2","unstructured":"Harvey Yiyun Fu Qinyuan Ye Albert Xu Xiang Ren and Robin Jia. 2023. Estimating large language model capabilities without labeled test data. arXiv:2305.14802. Retrieved from https:\/\/arxiv.org\/abs\/2305.14802"},{"key":"e_1_3_2_30_2","unstructured":"Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon and Mor Geva. 2024. Patchscope: A unifying framework for inspecting hidden representations of language models. arXiv:2401.06102. Retrieved from https:\/\/arxiv.org\/abs\/2401.06102"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1016\/B0-12-369398-5\/00066-9"},{"key":"e_1_3_2_32_2","first-page":"1","volume-title":"Proceedings of the IEEE\/ACM 46th International Conference on Software Engineering","author":"Guerriero Antonio","year":"2024","unstructured":"Antonio Guerriero, Roberto Pietrantuono, and Stefano Russo. 2024. DeepSample: DNN sampling-based testing for operational accuracy assessment. In Proceedings of the IEEE\/ACM 46th International Conference on Software Engineering, 1\u201312."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00117"},{"key":"e_1_3_2_34_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Gurnee Wes","year":"2024","unstructured":"Wes Gurnee and Max Tegmark. 2024. Language models represent space and time. 
In Proceedings of the 12th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=jE8xbmvFin"},{"key":"e_1_3_2_35_2","unstructured":"Suchin Gururangan Margaret Li Mike Lewis Weijia Shi Tim Althoff Noah A. Smith and Luke Zettlemoyer. 2023. Scaling expert language models with unsupervised domain discovery. arXiv:2303.14177. Retrieved from https:\/\/arxiv.org\/abs\/2303.14177"},{"key":"e_1_3_2_36_2","first-page":"70","article-title":"The dip test of unimodality","author":"Hartigan John A.","year":"1985","unstructured":"John A. Hartigan and Pamela M. Hartigan. 1985. The dip test of unimodality. The Annals of Statistics (1985), 70\u201384.","journal-title":"The Annals of Statistics"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.eacl-main.199"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.624"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.619"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3695988"},{"key":"e_1_3_2_41_2","first-page":"1776","volume-title":"2023 IEEE\/ACM 45th International Conference on Software Engineering (ICSE \u201923)","author":"Hu Qiang","year":"2023","unstructured":"Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Mike Papadakis, Lei Ma, and Yves Le Traon. 2023. Aries: Efficient testing of deep neural networks via labeling-free accuracy estimation. In 2023 IEEE\/ACM 45th International Conference on Software Engineering (ICSE \u201923). IEEE, 1776\u20131787."},{"key":"e_1_3_2_42_2","unstructured":"Yuheng Huang Jiayang Song Zhijie Wang Huaming Chen and Lei Ma. 2023. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv:2307.10236. 
Retrieved from https:\/\/arxiv.org\/abs\/2307.10236"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/276698.276876"},{"key":"e_1_3_2_44_2","unstructured":"Neel Jain Khalid Saifullah Yuxin Wen John Kirchenbauer Manli Shu Aniruddha Saha Micah Goldblum Jonas Geiping and Tom Goldstein. 2023. Bring your own data! Self-supervised evaluation for large language models. arXiv:2306.13651. Retrieved from https:\/\/arxiv.org\/abs\/2306.13651"},{"key":"e_1_3_2_45_2","unstructured":"Mingyu Jin Qinkai Yu Jingyuan Huang Qingcheng Zeng Zhenting Wang Wenyue Hua Haiyan Zhao Kai Mei Yanda Meng Kaize Ding et al. 2024. Exploring concept depth: How large language models acquire knowledge at different layers? arXiv:2404.07066. Retrieved from https:\/\/arxiv.org\/abs\/2404.07066"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1147"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE.2019.00108"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.29"},{"key":"e_1_3_2_49_2","first-page":"5753","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Kossen Jannik","year":"2021","unstructured":"Jannik Kossen, Sebastian Farquhar, Yarin Gal, and Tom Rainforth. 2021. Active testing: Sample-efficient model evaluation. In Proceedings of the International Conference on Machine Learning. PMLR, 5753\u20135763."},{"key":"e_1_3_2_50_2","first-page":"24557","article-title":"Active surrogate estimators: An active learning approach to label-efficient model evaluation","volume":"35","author":"Kossen Jannik","year":"2022","unstructured":"Jannik Kossen, Sebastian Farquhar, Yarin Gal, and Thomas Rainforth. 2022. Active surrogate estimators: An active learning approach to label-efficient model evaluation. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 
35, 24557\u201324570.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_51_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Kuhn Lorenz","year":"2023","unstructured":"Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In Proceedings of the 11th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=VD-AYtP0dve"},{"key":"e_1_3_2_52_2","unstructured":"John Lee Max Dabagia Eva Dyer and Christopher Rozell. 2019. Hierarchical optimal transport for multimodal distribution alignment. In Proceedings of the 33rd International Conference on Neural Information Processing Systems 13475\u201313485."},{"key":"e_1_3_2_53_2","first-page":"16443","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Lee JoonHo","year":"2023","unstructured":"JoonHo Lee, Jae Oh Woo, Hankyu Moon, and Kwonho Lee. 2023. Unsupervised accuracy estimation of deep visual models using domain-adaptive adversarial perturbation without source samples. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 16443\u201316452."},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1612"},{"key":"e_1_3_2_55_2","doi-asserted-by":"crossref","first-page":"3197","DOI":"10.1145\/3534678.3539147","volume-title":"Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","author":"Lees Alyssa","year":"2022","unstructured":"Alyssa Lees, Vinh Q. Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. 2022. A new generation of perspective API: Efficient multilingual character-level transformers. 
In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3197\u20133207."},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3338930"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.229"},{"key":"e_1_3_2_58_2","unstructured":"Yen-Ting Lin and Yun-Nung Chen. 2023. LLM-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. arXiv:2305.13711. Retrieved from https:\/\/arxiv.org\/abs\/2305.13711"},{"key":"e_1_3_2_59_2","volume-title":"Proceedings of the 37th Conference on Neural Information Processing Systems","author":"Liu Jiawei","year":"2023","unstructured":"Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Proceedings of the 37th Conference on Neural Information Processing Systems. Retrieved from https:\/\/openreview.net\/forum?id=1qvx610Cu7"},{"key":"e_1_3_2_60_2","volume-title":"Proceedings of the 1st Conference on Language Modeling","author":"Liu Jiawei","year":"2024","unstructured":"Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. 2024. Evaluating language models for efficient code generation. In Proceedings of the 1st Conference on Language Modeling. Retrieved from https:\/\/openreview.net\/forum?id=IBCBMeAhmC"},{"key":"e_1_3_2_61_2","unstructured":"Xiao Liu Hao Yu Hanchen Zhang Yifan Xu Xuanyu Lei Hanyu Lai Yu Gu Hangliang Ding Kaiwen Men Kejuan Yang et al. 2023. Agentbench: Evaluating LLMs as agents. arXiv:2308.03688. Retrieved from https:\/\/arxiv.org\/abs\/2308.03688"},{"key":"e_1_3_2_62_2","unstructured":"Yuzhe Lu Yilong Qin Runtian Zhai Andrew Shen Ketong Chen Zhenlin Wang Soheil Kolouri Simon Stepputtis Joseph Campbell and Katia Sycara. 2024. Characterizing out-of-distribution error via optimal transport. 
In Proceedings of the 37th International Conference on Neural Information Processing Systems 17602\u201317622."},{"key":"e_1_3_2_63_2","first-page":"1","volume-title":"Proceedings of the ACM on Programming Languages","volume":"2","author":"Majumdar Rupak","year":"2017","unstructured":"Rupak Majumdar and Filip Niksic. 2017. Why is random testing effective for partition tolerance bugs? Proceedings of the ACM on Programming Languages 2, POPL (2017), 1\u201324."},{"key":"e_1_3_2_64_2","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1007\/978-3-662-44415-3_4","volume-title":"Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshop (S+ SSPR \u201914)","author":"Malinen Mikko I.","year":"2014","unstructured":"Mikko I. Malinen and Pasi Fr\u00e4nti. 2014. Balanced k-means for clustering. In Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshop (S+ SSPR \u201914). Springer, 32\u201341."},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i12.26752"},{"key":"e_1_3_2_66_2","unstructured":"Samuel Marks and Max Tegmark. 2024. The geometry of truth: Emergent linear structure in large language model representations of true\/false datasets. arXiv:2310.06824. Retrieved from https:\/\/arxiv.org\/abs\/2310.06824"},{"key":"e_1_3_2_67_2","unstructured":"Aman Mehra Rahul Saxena Taeyoun Kim Christina Baek Zico Kolter and Aditi Raghunathan. 2024. Predicting the performance of foundation models via agreement-on-the-line. arXiv:2404.01542. Retrieved from https:\/\/arxiv.org\/abs\/2404.01542"},{"key":"e_1_3_2_68_2","first-page":"21395","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"38","author":"Miao Shuyu","year":"2024","unstructured":"Shuyu Miao, Jian Liu, Lin Zheng, and Hong Jin. 2024. Divide-and-aggregate learning for evaluating performance on unlabeled data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 
38, 21395\u201321402."},{"key":"e_1_3_2_69_2","first-page":"3298","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Miao Shuyu","year":"2023","unstructured":"Shuyu Miao, Lin Zheng, Jingjing Liu, and Hong Jin. 2023. K-means clustering based feature consistency alignment for label-free model evaluation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 3298\u20133306."},{"key":"e_1_3_2_70_2","unstructured":"Shiyu Ni Keping Bi Jiafeng Guo Lulu Yu Baolong Bi and Xueqi Cheng. 2025. Towards fully exploiting LLM internal states to enhance knowledge boundary perception. arXiv:2502.11677. Retrieved from https:\/\/arxiv.org\/abs\/2502.11677"},{"key":"e_1_3_2_71_2","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo1182437"},{"key":"e_1_3_2_72_2","first-page":"1","volume-title":"Proceedings of the ACM on Programming Languages","volume":"2","author":"Ozkan Burcu Kulahcioglu","year":"2018","unstructured":"Burcu Kulahcioglu Ozkan, Rupak Majumdar, Filip Niksic, Mitra Tabaei Befrouei, and Georg Weissenbacher. 2018. Randomized testing of distributed systems with probabilistic guarantees. Proceedings of the ACM on Programming Languages 2, OOPSLA (2018), 1\u201328."},{"key":"e_1_3_2_73_2","unstructured":"Arjun Panickssery Samuel R Bowman and Shi Feng. 2024. LLM evaluators recognize and favor their own generations. arXiv:2404.13076. Retrieved from https:\/\/arxiv.org\/abs\/2404.13076"},{"key":"e_1_3_2_74_2","doi-asserted-by":"publisher","DOI":"10.1145\/3132747.3132785"},{"key":"e_1_3_2_75_2","doi-asserted-by":"crossref","first-page":"318","DOI":"10.1007\/978-1-4612-5931-2_7","volume-title":"Concepts of Nonparametric Theory","author":"Pratt John W.","year":"1981","unstructured":"John W. Pratt and Jean D. Gibbons. 1981. Kolmogorov-Smirnov two-sample tests. 
In Concepts of Nonparametric Theory, Springer Series in Statistics, Springer, 318\u2013344."},{"key":"e_1_3_2_76_2","unstructured":"Chengwei Qin Aston Zhang Zhuosheng Zhang Jiaao Chen Michihiro Yasunaga and Diyi Yang. 2023. Is ChatGPT a general-purpose natural language processing task solver? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Retrieved from https:\/\/openreview.net\/forum?id=u03xn1COsO"},{"issue":"8","key":"e_1_3_2_77_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_2_78_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_2_79_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-9473(96)00077-1"},{"key":"e_1_3_2_80_2","unstructured":"Baptiste Rozi\u00e8re Jonas Gehring Fabian Gloeckle Sten Sootla Itai Gat Xiaoqing Ellen Tan Yossi Adi Jingyu Liu Tal Remez J\u00e9r\u00e9my Rapin et al. 2023. Code Llama: Open foundation models for code. arXiv:2308.12950. Retrieved from https:\/\/arxiv.org\/abs\/2308.12950"},{"key":"e_1_3_2_81_2","first-page":"2205","volume-title":"Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23)","author":"Sandoval Gustavo","year":"2023","unstructured":"Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Siddharth Garg, and Brendan Dolan-Gavitt. 2023. Lost at C: A user study on the security implications of large language model code assistants. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), 2205\u20132222."},{"key":"e_1_3_2_82_2","unstructured":"Sashank Santhanam Behnam Hedayatnia Spandana Gella Aishwarya Padmakumar Seokhwan Kim Yang Liu and Dilek Hakkani-Tur. 2021. 
Rome was built in 1776: A case study on factual correctness in knowledge-grounded response generation. arXiv:2110.05456. Retrieved from https:\/\/arxiv.org\/abs\/2110.05456"},{"key":"e_1_3_2_83_2","first-page":"166","volume-title":"Proceedings of the 2011 31st International Conference on Distributed Computing Systems Workshops","author":"Satopaa Ville","year":"2011","unstructured":"Ville Satopaa, Jeannie Albrecht, David Irwin, and Barath Raghavan. 2011. Finding a \u201cKneedle\u201d in a haystack: Detecting knee points in system behavior. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems Workshops. IEEE, 166\u2013171."},{"key":"e_1_3_2_84_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Sener Ozan","year":"2018","unstructured":"Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=H1aIuk-RW"},{"key":"e_1_3_2_85_2","unstructured":"Jasper Snoek Hugo Larochelle and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 26th International Conference on Neural Information Processing Systems Vol. 2 2951\u20132959."},{"key":"e_1_3_2_86_2","unstructured":"Lichao Sun Yue Huang Haoran Wang Siyuan Wu Qihui Zhang Chujie Gao Yixin Huang Wenhan Lyu Yixuan Zhang Xiner Li et al. 2024. Trustllm: Trustworthiness in large language models. arXiv:2401.05561. Retrieved from https:\/\/arxiv.org\/abs\/2401.05561"},{"key":"e_1_3_2_87_2","volume-title":"Proceedings of the 13th International Conference on Learning Representations","author":"Tan Sijun","year":"2025","unstructured":"Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Yuan Tang, Alejandro Cuadron, Chenguang Wang, Raluca Popa, and Ion Stoica. 2025. JudgeBench: A benchmark for evaluating LLM-based Judges. 
In Proceedings of the 13th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=G0dksFayVq"},{"key":"e_1_3_2_88_2","unstructured":"Tianyi Tang Wenyang Luo Haoyang Huang Dongdong Zhang Xiaolei Wang Xin Zhao Furu Wei and Ji-Rong Wen. 2024. Language-specific neurons: The key to multilingual capabilities in large language models. arXiv:2402.16438. Retrieved from https:\/\/arxiv.org\/abs\/2402.16438"},{"key":"e_1_3_2_89_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF02289263"},{"key":"e_1_3_2_90_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. Retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_2_91_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems 6000\u20136010."},{"key":"e_1_3_2_92_2","unstructured":"Boxin Wang Weixin Chen Hengzhi Pei Chulin Xie Mintong Kang Chenhui Zhang Chejian Xu Zidi Xiong Ritik Dutta Rylan Schaeffer et al. 2023. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS \u201923). Curran Associates Inc. Red Hook NY USA Article 1361 31232\u201331339."},{"key":"e_1_3_2_93_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Wang Yidong","year":"2024","unstructured":"Yidong Wang, Zhuohao Yu, Wenjin Yao, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, et al. 2024. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. 
In Proceedings of the 12th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=5Nn2BLV7SB"},{"key":"e_1_3_2_94_2","unstructured":"Zhijie Wang Yuheng Huang Lei Ma Haruki Yokoyama Susumu Tokumoto and Kazuki Munakata. 2022. An exploratory study of AI system risk assessment from the lens of data distribution and uncertainty. arXiv:2212.06828. Retrieved from https:\/\/arxiv.org\/abs\/2212.06828"},{"key":"e_1_3_2_95_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00046"},{"key":"e_1_3_2_96_2","unstructured":"Zhijie Wang Zijie Zhou Da Song Yuheng Huang Shengmai Chen Lei Ma and Tianyi Zhang. 2024. Where do large language models fail when generating code? arXiv:2406.08731. Retrieved from https:\/\/arxiv.org\/abs\/2406.08731"},{"key":"e_1_3_2_97_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Xiong Miao","year":"2024","unstructured":"Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2024. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In Proceedings of the 12th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=gjeQKFxFpZ"},{"key":"e_1_3_2_98_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.337"},{"key":"e_1_3_2_99_2","unstructured":"An Yang Baosong Yang Beichen Zhang Binyuan Hui Bo Zheng Bowen Yu Chengyuan Li Dayiheng Liu Fei Huang Haoran Wei et al. 2024. Qwen2.5 technical report. arXiv:2412.15115. Retrieved from https:\/\/arxiv.org\/abs\/2412.15115"},{"issue":"1","key":"e_1_3_2_100_2","doi-asserted-by":"crossref","first-page":"1485","DOI":"10.1038\/s41598-023-28834-3","article-title":"Foundations of human spatial problem solving","volume":"13","author":"Zarr Noah","year":"2023","unstructured":"Noah Zarr and Joshua W. Brown. 2023. Foundations of human spatial problem solving. 
Scientific Reports 13, 1 (2023), 1485.","journal-title":"Scientific Reports"},{"key":"e_1_3_2_101_2","unstructured":"Oussama Zekri Ambroise Odonnat Abdelhakim Benechehab Linus Bleistein Nicolas Boull\u00e9 and Ievgen Redko. 2024. Large language models as Markov chains. arXiv:2410.02724. Retrieved from https:\/\/arxiv.org\/abs\/2410.02724"},{"key":"e_1_3_2_102_2","doi-asserted-by":"crossref","unstructured":"Lin Zhao Tianchen Zhao Zinan Lin Xuefei Ning Guohao Dai Huazhong Yang and Yu Wang. 2024. FlashEval: Towards fast and accurate evaluation of text-to-image diffusion generative models. arXiv:2403.16379. Retrieved from https:\/\/arxiv.org\/abs\/2403.16379","DOI":"10.1109\/CVPR52733.2024.01526"},{"key":"e_1_3_2_103_2","first-page":"46595","article-title":"Judging LLM-as-a-judge with mt-bench and chatbot arena","volume":"36","author":"Zheng Lianmin","year":"2023","unstructured":"Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with mt-bench and chatbot arena. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 36, 46595\u201346623.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_104_2","unstructured":"Wanjun Zhong Ruixiang Cui Yiduo Guo Yaobo Liang Shuai Lu Yanlin Wang Amin Saied Weizhu Chen and Nan Duan. 2023. Agieval: A human-centric benchmark for evaluating foundation models. arXiv:2304.06364. Retrieved from https:\/\/arxiv.org\/abs\/2304.06364"},{"key":"e_1_3_2_105_2","unstructured":"Andy Zou Long Phan Sarah Chen James Campbell Phillip Guo Richard Ren Alexander Pan Xuwang Yin Mantas Mazeika Ann-Kathrin Dombrowski et al. 2023. Representation engineering: A top-down approach to AI transparency. arXiv:2310.01405. 
Retrieved from https:\/\/arxiv.org\/abs\/2310.01405"}],"container-title":["ACM Transactions on Software Engineering and Methodology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3744340","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T14:35:28Z","timestamp":1770993328000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3744340"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,13]]},"references-count":104,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,3,31]]}},"alternative-id":["10.1145\/3744340"],"URL":"https:\/\/doi.org\/10.1145\/3744340","relation":{},"ISSN":["1049-331X","1557-7392"],"issn-type":[{"value":"1049-331X","type":"print"},{"value":"1557-7392","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,13]]},"assertion":[{"value":"2024-12-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-05","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}