{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T16:48:42Z","timestamp":1777049322880,"version":"3.51.4"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T00:00:00Z","timestamp":1776988800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Commun. ACM"],"published-print":{"date-parts":[[2026,5,1]]},"abstract":"<jats:p>Rigorous evaluation of general-purpose AI systems such as large language models should allow for deepened understanding of their capabilities and effective mitigation of their risks. The current evaluation paradigm, mostly reliant on benchmarks aggregating scores on one or more tasks, lacks the scientific machinery for predicting performance on unforeseen tasks and explaining the variability of results. Moreover, existing benchmarks raise growing concerns about their reliability and validity. To tackle these challenges, we vindicate psychometrics, the science of psychological measurement, as a methodology for identifying and measuring constructs that underlie AI performance across multiple tasks. To raise awareness, we first identify the key advantages of adapting psychometric principles to AI evaluation through concrete examples; second, we distinguish sound applications of psychometric techniques from oversimplified ones and warn against common pitfalls; and third, to encourage general use, we introduce a systematic psychometric framework and an operational evaluation pipeline, which provide practical implementation guidance. 
In the end, we discuss underexplored avenues and societal implications that open new research directions for the use of psychometrics in broader AI research.<\/jats:p>","DOI":"10.1145\/3769688","type":"journal-article","created":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T15:37:54Z","timestamp":1776181074000},"page":"92-102","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Evaluating General-Purpose AI with Psychometrics"],"prefix":"10.1145","volume":"69","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5768-1095","authenticated-orcid":false,"given":"Xiting","family":"Wang","sequence":"first","affiliation":[{"name":"Renmin University of China, Gaoling School of Artificial Intelligence, Beijing, Beijing, China"},{"name":"Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Beijing, Beijing, China"},{"name":"Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE, Beijing, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6464-2326","authenticated-orcid":false,"given":"Liming","family":"Jiang","sequence":"additional","affiliation":[{"name":"Beijing Normal University, Beijing, Beijing, China"},{"name":"Microsoft Research Asia, Beijing, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9746-7632","authenticated-orcid":false,"given":"Jos\u00e9","family":"Hern\u00e1ndez-Orallo","sequence":"additional","affiliation":[{"name":"University of Cambridge, Leverhulme Centre for the Future of Intelligence, Cambridge, Cambridgeshire, United Kingdom"},{"name":"Universitat Polit\u00e8cnica de Val\u00e8ncia, Valencia, Comunitat Valenciana, Spain"},{"name":"ValgrAI, Valencia, Comunitat Valenciana, Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0174-3212","authenticated-orcid":false,"given":"David","family":"Stillwell","sequence":"additional","affiliation":[{"name":"University of Cambridge, Cambridge, Cambridgeshire, 
United Kingdom of Great Britain and Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-2991-8447","authenticated-orcid":false,"given":"Shiqiang","family":"Chen","sequence":"additional","affiliation":[{"name":"Renmin University of China, Gaoling School of Artificial Intelligence, Beijing, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2470-4278","authenticated-orcid":false,"given":"Luning","family":"Sun","sequence":"additional","affiliation":[{"name":"University of Cambridge, Cambridge, Cambridgeshire, United Kingdom of Great Britain and Northern Ireland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3281-9574","authenticated-orcid":false,"given":"Fang","family":"Luo","sequence":"additional","affiliation":[{"name":"Beijing Normal University, Faculty of Psychology, Beijing, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-3257-3077","authenticated-orcid":false,"given":"Xing","family":"Xie","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, Beijing, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2026,4,24]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"crossref","unstructured":"Abdulhai M. et al. Moral foundations of large language models. In Proceedings of the 2024 Conf. on Empirical Methods in Natural Language Processing. (2024) 17737\u201317752.","DOI":"10.18653\/v1\/2024.emnlp-main.982"},{"key":"e_1_3_1_3_2","doi-asserted-by":"crossref","unstructured":"Amershi S. et al. Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conf. on Human Factors in Computing Systems. ACM (2019) 1\u201313.","DOI":"10.1145\/3290605.3300233"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.2307\/258555"},{"key":"e_1_3_1_5_2","doi-asserted-by":"crossref","unstructured":"Bender E.M. et al. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conf. on Fairness Accountability and Transparency. 
ACM (2021) 610\u2013623.","DOI":"10.1145\/3442188.3445922"},{"key":"e_1_3_1_6_2","unstructured":"Burden J. et al. Inferring capabilities from task performance with Bayesian triangulation. arXiv preprint arXiv:2309.11975 (2023)."},{"key":"e_1_3_1_7_2","unstructured":"Burnell R. et al. Revealing the structure of language model capabilities. \u00a0arXiv preprint arXiv:2306.10062 (2023)."},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.adf6369"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1111\/j.1745-3984.1978.tb00065.x"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/S1574-6526(07)03013-1"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1111\/j.1745-3984.2007.00039.x"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.intell.2011.12.001"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.4324\/9781410605269"},{"key":"e_1_3_1_14_2","unstructured":"European Parliament. Artificial Intelligence Act (June 2023); https:\/\/www.europarl.europa.eu\/doceo\/document\/TA-9-2023-0236_EN.pdf"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1037\/1040-3590.4.1.26"},{"key":"e_1_3_1_16_2","volume-title":"Multilevel Statistical Models","author":"Goldstein H.","year":"2011","unstructured":"Goldstein, H. Multilevel Statistical Models. John Wiley & Sons\u00a0(2011)."},{"key":"e_1_3_1_17_2","unstructured":"Greenblatt R. et al. Alignment faking in large language models. \u00a0arXiv preprint arXiv:2412.14093 (2024)."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1017\/9781316594179"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.intell.2024.101858"},{"key":"e_1_3_1_20_2","first-page":"423438","article-title":"How can we know what language models know?","volume":"8","author":"Jiang Z.","year":"2020","unstructured":"Jiang, Z. et al. How can we know what language models know? Trans. of the Assoc. 
for Computational Linguistics 8\u00a0(2020), 423\u2013438.","journal-title":"Trans. of the Assoc. for Computational Linguistics"},{"key":"e_1_3_1_21_2","volume-title":"Using Assessment to Improve the Quality of Education","author":"Kellaghan T.","year":"2001","unstructured":"Kellaghan, T. and Greaney, V. Using Assessment to Improve the Quality of Education. United Nations Educational, Scientific and Cultural Organisation\u00a0(2001)."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1207\/S15327965PLI1401_01"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1080\/026432997381411"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.paid.2008.07.006"},{"key":"e_1_3_1_25_2","doi-asserted-by":"crossref","unstructured":"Li J. et al. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conf. on Empirical Methods in Natural Language Processing (2023) 6449\u20136464.","DOI":"10.18653\/v1\/2023.emnlp-main.397"},{"key":"e_1_3_1_26_2","doi-asserted-by":"crossref","unstructured":"Li M. et al. Evaluating readability and faithfulness of concept-based explanations. In Proceedings of the 2024 Conf. on Empirical Methods in Natural Language Processing. (2024) 607\u2013625.","DOI":"10.18653\/v1\/2024.emnlp-main.36"},{"key":"e_1_3_1_27_2","unstructured":"Li X. et al. Does GPT-3 demonstrate psychopathy? Evaluating large language models from a psychological perspective. arXiv preprint arXiv:2212.10529 (2022)."},{"key":"e_1_3_1_28_2","volume-title":"Rebooting AI: Building Artificial Intelligence We Can Trust","author":"Marcus G.","year":"2019","unstructured":"Marcus, G. and Davis, E. Rebooting AI: Building Artificial Intelligence We Can Trust. Vintage\u00a0(2019)."},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1017\/S0954579400005812"},{"key":"e_1_3_1_30_2","unstructured":"McIntosh T.R. et al. 
Inadequacies of large language model benchmarks in the era of generative artificial intelligence. \u00a0arXiv preprint arXiv:2402.09880 (2024)."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.7205\/MILMED-D-13-00213"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1146\/annurev.psych.57.102904.190127"},{"key":"e_1_3_1_33_2","unstructured":"Pellert M. et al. AI psychometrics: Using psychometric inventories to obtain psychological profiles of large language models. PsyArXiv (2023)."},{"key":"e_1_3_1_34_2","first-page":"223","volume-title":"Handbook of Nonverbal Assessment.","author":"Raven J.","year":"2003","unstructured":"Raven, J. Raven progressive matrices. In Handbook of Nonverbal Assessment. Springer\u00a0(2003), 223\u2013237."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1111\/j.1745-3984.2007.00040.x"},{"key":"e_1_3_1_36_2","unstructured":"Ruan Y. Maddison C.J. and Hashimoto T. Observational scaling laws and the predictability of language model performance. In Proceedings of the 2024 Conf. on Neural Information Processing Systems. (2024)."},{"key":"e_1_3_1_37_2","volume-title":"Modern Psychometrics: The Science of Psychological Assessment","author":"Rust J.","year":"2021","unstructured":"Rust, J., Kosinski, M., and Stillwell, D. Modern Psychometrics: The Science of Psychological Assessment. Taylor & Francis (2021)."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-023-06647-8"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.compedu.2019.103672"},{"key":"e_1_3_1_40_2","volume-title":"Multiple Factor Analysis","author":"Thurstone L.L.","year":"1947","unstructured":"Thurstone, L.L. Multiple Factor Analysis. University of Chicago Press (1947)."},{"key":"e_1_3_1_41_2","volume-title":"The Delphi Method-Techniques and Applications","author":"Turoff M.","year":"2002","unstructured":"Turoff, M. and Linstone, H.A. The Delphi Method-Techniques and Applications. 
Addison Wesley (2002)."},{"key":"e_1_3_1_42_2","doi-asserted-by":"crossref","unstructured":"Whiteson S. et al. Protecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symp. on Adaptive Dynamic Programming and Reinforcement Learning. (2011) 120\u2013127.","DOI":"10.1109\/ADPRL.2011.5967363"},{"key":"e_1_3_1_43_2","doi-asserted-by":"crossref","unstructured":"Xiao Z. et al. Evaluating evaluation metrics: A framework for analyzing NLG evaluation metrics using measurement theory. In Proceedings of the 2023 Conf. on Empirical Methods in Natural Language Processing (2023) 10967\u201310982.","DOI":"10.18653\/v1\/2023.emnlp-main.676"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TETCI.2021.3100641"},{"key":"e_1_3_1_45_2","unstructured":"Zhuang Y. et al. Efficiently measuring the cognitive ability of LLMs: an adaptive testing perspective. arXiv preprint arXiv:2306.10512 (2023)."},{"key":"e_1_3_1_46_2","unstructured":"Zou H. et al. Can LLM \u201cself-report\u201d? Evaluating the validity of self-report scales in measuring personality design in LLM-based chatbots. 
arXiv preprint arXiv:2412.00207 (2024)."}],"container-title":["Communications of the ACM"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3769688","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T16:14:45Z","timestamp":1777047285000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3769688"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,24]]},"references-count":45,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2026,5,1]]}},"alternative-id":["10.1145\/3769688"],"URL":"https:\/\/doi.org\/10.1145\/3769688","relation":{},"ISSN":["0001-0782","1557-7317"],"issn-type":[{"value":"0001-0782","type":"print"},{"value":"1557-7317","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,4,24]]},"assertion":[{"value":"2024-08-31","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-04-24","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}