{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,3]],"date-time":"2026-07-03T12:37:33Z","timestamp":1783082253400,"version":"3.54.6"},"reference-count":53,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2025,1,21]],"date-time":"2025-01-21T00:00:00Z","timestamp":1737417600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,1,21]],"date-time":"2025-01-21T00:00:00Z","timestamp":1737417600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["1900644"],"award-info":[{"award-number":["1900644"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Hasso Plattner Institute (HPI) Research Center in Machine Learning and Data Science at the University of California, Irvine"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Nat Mach Intell"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>As artificial intelligence systems, particularly large language models (LLMs), become increasingly integrated into decision-making processes, the ability to trust their outputs is crucial. To earn human trust, LLMs must be well calibrated such that they can accurately assess and communicate the likelihood of their predictions being correct. Whereas recent work has focused on LLMs\u2019 internal confidence, less is understood about how effectively they convey uncertainty to users. 
Here we explore the calibration gap, which refers to the difference between human confidence in LLM-generated answers and the models\u2019 actual confidence, and the discrimination gap, which reflects how well humans and models can distinguish between correct and incorrect answers. Our experiments with multiple-choice and short-answer questions reveal that users tend to overestimate the accuracy of LLM responses when provided with default explanations. Moreover, longer explanations increased user confidence, even when the extra length did not improve answer accuracy. By adjusting LLM explanations to better reflect the models\u2019 internal confidence, both the calibration gap and the discrimination gap narrowed, significantly improving user perception of LLM accuracy. These findings underscore the importance of accurate uncertainty communication and highlight the effect of explanation length in influencing user trust in artificial-intelligence-assisted decision-making environments.<\/jats:p>","DOI":"10.1038\/s42256-024-00976-7","type":"journal-article","created":{"date-parts":[[2025,1,21]],"date-time":"2025-01-21T10:04:14Z","timestamp":1737453854000},"page":"221-231","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":138,"title":["What large language models know and what people think they 
know"],"prefix":"10.1038","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1466-5647","authenticated-orcid":false,"given":"Mark","family":"Steyvers","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Heliodoro","family":"Tejeda","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Aakriti","family":"Kumar","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Catarina","family":"Belem","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Sheer","family":"Karny","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xinyue","family":"Hu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lukas W.","family":"Mayer","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Padhraic","family":"Smyth","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,1,21]]},"reference":[{"key":"976_CR1","doi-asserted-by":"publisher","first-page":"508","DOI":"10.1038\/nclimate2194","volume":"4","author":"DV Budescu","year":"2014","unstructured":"Budescu, D. V., Por, H.-H., Broomell, S. B. & Smithson, M. The interpretation of IPCC probabilistic statements around the world. Nat. Clim. Change 4, 508\u2013512 (2014).","journal-title":"Nat. Clim. Change"},{"key":"976_CR2","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1177\/237946151500100206","volume":"1","author":"EH Ho","year":"2015","unstructured":"Ho, E. H., Budescu, D. V., Dhami, M. K. & Mandel, D. R. Improving the communication of uncertainty in climate science and intelligence analysis. Behav. Sci. Policy 1, 43\u201355 (2015).","journal-title":"Behav. Sci. 
Policy"},{"key":"976_CR3","unstructured":"Karelitz, T. M., Dhami, M. K., Budescu, D. V. & Wallsten, T. S. Toward a universal translator of verbal probabilities. In Proc. 15th International Florida Artificial Intelligence Research Society Conference (eds Haller, M. S. & Simmons, G.) 498\u2013502 (AAAI Press, 2002)."},{"key":"976_CR4","unstructured":"Wallsten, T. S., Shlomi, Y. & Ting, H. Final Report for Research Contract \u2018Expressing Probability in Intelligence Analysis\u2019 (2008)."},{"key":"976_CR5","first-page":"98","volume":"39","author":"BJ O\u2019Brien","year":"1989","unstructured":"O\u2019Brien, B. J. Words or numbers? The evaluation of probability expressions in general practice. J. R. Coll. Gen. Pract. 39, 98\u2013100 (1989).","journal-title":"J. R. Coll. Gen. Pract."},{"key":"976_CR6","doi-asserted-by":"publisher","first-page":"179","DOI":"10.1016\/S2589-7500(23)00048-1","volume":"5","author":"SR Ali","year":"2023","unstructured":"Ali, S. R., Dobbs, T. D., Hutchings, H. A. & Whitaker, I. S. Using ChatGPT to write patient clinic letters. Lancet Digit. Health 5, 179\u2013181 (2023).","journal-title":"Lancet Digit. Health"},{"key":"976_CR7","doi-asserted-by":"crossref","unstructured":"Zambrano, A. F. et al. From nCoder to ChatGPT: from automated coding to refining human coding. In Proc. International Conference on Quantitative Ethnography (eds Arastoopour Irgens, G. & Knight, S.) 470\u2013485 (Springer, 2023).","DOI":"10.1007\/978-3-031-47014-1_32"},{"key":"976_CR8","first-page":"1","volume":"23","author":"J Whalen","year":"2023","unstructured":"Whalen, J. et al. ChatGPT: challenges, opportunities, and implications for teacher education. Contemp. Iss. Technol. Teach. Educ. 23, 1\u201323 (2023).","journal-title":"Contemp. Iss. Technol. Teach. Educ."},{"key":"976_CR9","doi-asserted-by":"crossref","first-page":"214","DOI":"10.1038\/d41586-023-00340-6","volume":"614","author":"A Jo","year":"2023","unstructured":"Jo, A. 
The promise and peril of generative AI. Nature 614, 214\u2013216 (2023).","journal-title":"Nature"},{"key":"976_CR10","unstructured":"Huang, L. et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. Preprint at https:\/\/arxiv.org\/abs\/2311.05232 (2024)."},{"key":"976_CR11","unstructured":"Introducing ChatGPT (OpenAI, 2022)."},{"key":"976_CR12","unstructured":"Achiam, J. et al. GPT-4 technical report. Preprint at https:\/\/arxiv.org\/abs\/2303.08774 (2023)."},{"key":"976_CR13","unstructured":"Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https:\/\/arxiv.org\/abs\/2207.05221 (2022)."},{"key":"976_CR14","unstructured":"Srivastava, A. et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Trans. Mach. Learn. Res. (2023)."},{"key":"976_CR15","doi-asserted-by":"crossref","unstructured":"Yin, Z. et al. Do large language models know what they don\u2019t know? In Proc. Findings of the Association for Computational Linguistics (eds Rogers, A. et al.) 8653\u20138665 (ACL, 2023).","DOI":"10.18653\/v1\/2023.findings-acl.551"},{"key":"976_CR16","unstructured":"Azaria, A. & Mitchell, T. in Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H. et al.) 967\u2013976 (ACL, 2023)."},{"key":"976_CR17","doi-asserted-by":"publisher","first-page":"625","DOI":"10.1038\/s41586-024-07421-0","volume":"630","author":"S Farquhar","year":"2024","unstructured":"Farquhar, S., Kossen, J., Kuhn, L. & Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625\u2013630 (2024).","journal-title":"Nature"},{"key":"976_CR18","doi-asserted-by":"publisher","first-page":"962","DOI":"10.1162\/tacl_a_00407","volume":"9","author":"Z Jiang","year":"2021","unstructured":"Jiang, Z., Araki, J., Ding, H. & Neubig, G. How can we know when language models know? 
On the calibration of language models for question answering. Trans. Assoc. Comput. Linguist. 9, 962\u2013977 (2021).","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"976_CR19","unstructured":"Hendrycks, D. et al. Measuring massive multitask language understanding. In Proc. International Conference on Learning Representations (2021)."},{"key":"976_CR20","unstructured":"GPT-3.5 (OpenAI, 2022)."},{"key":"976_CR21","unstructured":"Anil, R. et al. Palm 2 Technical Report (2023)."},{"key":"976_CR22","doi-asserted-by":"crossref","unstructured":"Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension. In Proc. 55th Annual Meeting of the Association for Computational Linguistics Vol. 1 (eds Barzilay, R. & Kan, M.-Y.) 1601\u20131611 (ACL, 2017).","DOI":"10.18653\/v1\/P17-1147"},{"key":"976_CR23","unstructured":"Xiao, Y. et al. in Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y. et al.) 7273\u20137284 (ACL, 2022)."},{"key":"976_CR24","unstructured":"Xiong, M. et al. Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations (2024)."},{"key":"976_CR25","unstructured":"Tanneru, S. H., Agarwal, C. & Lakkaraju, H. Quantifying uncertainty in natural language explanations of large language models. In International Conference on Artificial Intelligence and Statistics 1072\u20131080 (PMLR, 2024)."},{"key":"976_CR26","doi-asserted-by":"crossref","unstructured":"Zhou, K., Hwang, J., Ren, X. & Sap, M. 
in Relying on the Unreliable: The Impact of Language Models\u2019 Reluctance to Express Uncertainty 3623\u20133643 (Association for Computational Linguistics, 2024).","DOI":"10.18653\/v1\/2024.acl-long.198"},{"key":"976_CR27","doi-asserted-by":"publisher","first-page":"69","DOI":"10.1037\/0022-3514.46.1.69","volume":"46","author":"RE Petty","year":"1984","unstructured":"Petty, R. E. & Cacioppo, J. T. The effects of involvement on responses to argument quantity and quality: central and peripheral routes to persuasion. J. Person. Soc. Psychol. 46, 69 (1984).","journal-title":"J. Person. Soc. Psychol."},{"key":"976_CR28","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1002\/acp.1178","volume":"20","author":"DM Oppenheimer","year":"2006","unstructured":"Oppenheimer, D. M. Consequences of erudite vernacular utilized irrespective of necessity: problems with using long words needlessly. Appl. Cogn. Psychol. 20, 139\u2013156 (2006).","journal-title":"Appl. Cogn. Psychol."},{"key":"976_CR29","unstructured":"Goldberg, A. et al. Peer reviews of peer reviews: a randomized controlled trial and other experiments. Preprint at https:\/\/arxiv.org\/abs\/2311.09497 (2023)."},{"key":"976_CR30","first-page":"27730","volume":"35","author":"L Ouyang","year":"2022","unstructured":"Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730\u201327744 (2022).","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"976_CR31","doi-asserted-by":"crossref","unstructured":"Bower, A. H., Han, N., Soni, A., Eckstein, M. P. & Steyvers, M. How experts and novices judge other people\u2019s knowledgeability from language use. Psychonom. Bull. Rev. 1\u201311 (2024).","DOI":"10.3758\/s13423-023-02433-9"},{"key":"976_CR32","unstructured":"Saito, K., Wachi, A., Wataoka, K. & Akimoto, Y. Verbosity bias in preference labeling by large language models. 
Preprint at https:\/\/arxiv.org\/abs\/2310.10076 (2023)."},{"key":"976_CR33","doi-asserted-by":"publisher","first-page":"132","DOI":"10.1111\/1467-9280.00228","volume":"11","author":"M Mather","year":"2000","unstructured":"Mather, M., Shafir, E. & Johnson, M. K. Misremembrance of options past: source monitoring and choice. Psychol. Sci. 11, 132\u2013138 (2000).","journal-title":"Psychol. Sci."},{"key":"976_CR34","doi-asserted-by":"crossref","unstructured":"Rong, Y. et al. Towards human-centered explainable AI: a survey of user studies for model explanations. In Proc. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 46 1\u201320 (IEEE, 2023).","DOI":"10.1109\/TPAMI.2023.3331846"},{"key":"976_CR35","doi-asserted-by":"crossref","unstructured":"Smith-Renner, A. et al. No explainability without accountability: an empirical study of explanations and feedback in interactive ML. In Proc. 2020 CHI Conference on Human Factors in Computing Systems 1\u201313 (Association for Computing Machinery, 2020).","DOI":"10.1145\/3313831.3376624"},{"key":"976_CR36","doi-asserted-by":"crossref","unstructured":"Feng, S. & Boyd-Graber, J. What can AI do for me? Evaluating machine learning interpretations in cooperative play. In Proc. 24th International Conference on Intelligent User Interfaces (eds Fu, W.-T. & Pan, S.) 229\u2013239 (ACL, 2019).","DOI":"10.1145\/3301275.3302265"},{"key":"976_CR37","first-page":"722","volume":"19","author":"M Steyvers","year":"2023","unstructured":"Steyvers, M. & Kumar, A. Three challenges for AI-assisted decision-making. Perspect. Psychol. Sci. 19, 722\u2013734 (2023).","journal-title":"Perspect. Psychol. Sci."},{"key":"976_CR38","doi-asserted-by":"crossref","unstructured":"Bansal, G. et al. Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In Proc. 2021 CHI Conference on Human Factors in Computing Systems (eds Kitamura, Y. & Quigley, A.) 
1\u201316 (ACL, 2021).","DOI":"10.1145\/3411764.3445717"},{"key":"976_CR39","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3449287","volume":"5","author":"Z Bu\u00e7inca","year":"2021","unstructured":"Bu\u00e7inca, Z., Malaya, M. B. & Gajos, K. Z. To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proc. ACM Hum. Comput. Interact. 5, 1\u201321 (2021).","journal-title":"Proc. ACM Hum. Comput. Interact."},{"key":"976_CR40","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3519266","volume":"12","author":"X Wang","year":"2022","unstructured":"Wang, X. & Yin, M. Effects of explanations in AI-assisted decision making: principles and comparisons. ACM Trans. Interact. Intell. Syst. 12, 1\u201336 (2022).","journal-title":"ACM Trans. Interact. Intell. Syst."},{"key":"976_CR41","unstructured":"Hoffmann, J. et al. Training compute-optimal large language models. Preprint at https:\/\/arxiv.org\/abs\/2203.15556 (2022)."},{"key":"976_CR42","unstructured":"Rae, J. W. et al. Scaling language models: methods, analysis & insights from training gopher. Preprint at https:\/\/arxiv.org\/abs\/2112.11446 (2021)."},{"key":"976_CR43","unstructured":"Geng, J. et al. A survey of language model confidence estimation and calibration. Preprint at https:\/\/arxiv.org\/abs\/2311.08298 (2023)."},{"key":"976_CR44","doi-asserted-by":"crossref","unstructured":"Zhou, K., Jurafsky, D. & Hashimoto, T. Navigating the grey area: how expressions of uncertainty and overconfidence affect language models. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 5506\u20135524 (Association for Computational Linguistics, 2023).","DOI":"10.18653\/v1\/2023.emnlp-main.335"},{"key":"976_CR45","unstructured":"Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Trans. Mach. Learn. Res. (2022)."},{"key":"976_CR46","doi-asserted-by":"crossref","unstructured":"Tian, K. et al. 
Just ask for calibration: strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 5433\u20135442 (Association for Computational Linguistics, 2023).","DOI":"10.18653\/v1\/2023.emnlp-main.330"},{"key":"976_CR47","doi-asserted-by":"publisher","first-page":"443","DOI":"10.3389\/fnhum.2014.00443","volume":"8","author":"SM Fleming","year":"2014","unstructured":"Fleming, S. M. & Lau, H. C. How to measure metacognition. Front. Hum. Neurosci. 8, 443 (2014).","journal-title":"Front. Hum. Neurosci."},{"key":"976_CR48","unstructured":"Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In Proc. 34th International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) vol. 70 of Proceedings of Machine Learning Research 1321\u20131330 (PMLR, 2017)."},{"key":"976_CR49","doi-asserted-by":"crossref","unstructured":"Naeini, M. P., Cooper, G. & Hauskrecht, M. Obtaining well calibrated probabilities using Bayesian binning. In Proc. AAAI Conference on Artificial Intelligence Vol. 29 2901\u20132907 (AAAI, 2015).","DOI":"10.1609\/aaai.v29i1.9602"},{"key":"976_CR50","unstructured":"Kumar, A., Liang, P. S. & Ma, T. Verified uncertainty calibration. Adv. Neural Inf. Process. Syst. 32, (2019)."},{"key":"976_CR51","first-page":"8618","volume":"35","author":"S Gruber","year":"2022","unstructured":"Gruber, S. & Buettner, F. Better uncertainty calibration via proper scores for classification and beyond. Adv. Neural Inf. Process. Syst. 35, 8618\u20138632 (2022).","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"976_CR52","doi-asserted-by":"publisher","first-page":"356","DOI":"10.1016\/j.jmp.2012.08.001","volume":"56","author":"JN Rouder","year":"2012","unstructured":"Rouder, J. N., Morey, R. D., Speckman, P. L. & Province, J. M. Default Bayes factors for ANOVA designs. J. Math. Psychol. 
56, 356\u2013374 (2012).","journal-title":"J. Math. Psychol."},{"key":"976_CR53","doi-asserted-by":"publisher","unstructured":"Steyvers, M., Tejeda, H. & Belem, C. What large language models know and what people think they know. OSF https:\/\/doi.org\/10.17605\/OSF.IO\/Y7PR6 (2024).","DOI":"10.17605\/OSF.IO\/Y7PR6"}],"container-title":["Nature Machine Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00976-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00976-7","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00976-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,25]],"date-time":"2025-02-25T23:24:04Z","timestamp":1740525844000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00976-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,21]]},"references-count":53,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2025,2]]}},"alternative-id":["976"],"URL":"https:\/\/doi.org\/10.1038\/s42256-024-00976-7","relation":{},"ISSN":["2522-5839"],"issn-type":[{"value":"2522-5839","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,21]]},"assertion":[{"value":"2 May 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 December 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 January 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing 
interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}