{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,22]],"date-time":"2026-06-22T18:29:04Z","timestamp":1782152944373,"version":"3.54.5"},"reference-count":18,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T00:00:00Z","timestamp":1721692800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T00:00:00Z","timestamp":1721692800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000002","name":"U.S. Department of Health & Human Services | National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V\u2019s rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving <jats:italic>New England Journal of Medicine<\/jats:italic> (NEJM) Image Challenges\u2014an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparatively to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians incorrectly answer, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominent in image comprehension (27.2%). Regardless of GPT-4V\u2019s high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.<\/jats:p>","DOI":"10.1038\/s41746-024-01185-7","type":"journal-article","created":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T10:10:03Z","timestamp":1721729403000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":112,"title":["Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine"],"prefix":"10.1038","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1268-7239","authenticated-orcid":false,"given":"Qiao","family":"Jin","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9891-4753","authenticated-orcid":false,"given":"Fangyuan","family":"Chen","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-7457-7075","authenticated-orcid":false,"given":"Yiliang","family":"Zhou","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ziyang","family":"Xu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Justin M.","family":"Cheung","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Robert","family":"Chen","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ronald M.","family":"Summers","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2817-9124","authenticated-orcid":false,"given":"Justin F.","family":"Rousseau","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Peiyun","family":"Ni","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8641-0338","authenticated-orcid":false,"given":"Marc J.","family":"Landsman","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5271-7690","authenticated-orcid":false,"given":"Sally L.","family":"Baxter","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2258-2035","authenticated-orcid":false,"given":"Subhi J.","family":"Al\u2019Aref","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5300-0571","authenticated-orcid":false,"given":"Yijia","family":"Li","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3183-0088","authenticated-orcid":false,"given":"Alexander","family":"Chen","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Josef A.","family":"Brejt","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8172-7636","authenticated-orcid":false,"given":"Michael F.","family":"Chiang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9309-8331","authenticated-orcid":false,"given":"Yifan","family":"Peng","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9998-916X","authenticated-orcid":false,"given":"Zhiyong","family":"Lu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2024,7,23]]},"reference":[{"key":"1185_CR1","doi-asserted-by":"publisher","unstructured":"OpenAI. GPT-4 Technical Report. Preprint at arXiv https:\/\/doi.org\/10.48550\/arXiv.2303.08774 (2023).","DOI":"10.48550\/arXiv.2303.08774"},{"key":"1185_CR2","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbad493","volume":"25","author":"S Tian","year":"2024","unstructured":"Tian, S. et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief. Bioinforma. 25, bbad493 (2024).","journal-title":"Brief. Bioinforma."},{"key":"1185_CR3","doi-asserted-by":"publisher","first-page":"158","DOI":"10.1038\/s41746-023-00896-7","volume":"6","author":"L Tang","year":"2023","unstructured":"Tang, L. et al. Evaluating large language models on medical evidence summarization. NPJ Digit. Med. 6, 158 (2023).","journal-title":"NPJ Digit. Med."},{"key":"1185_CR4","doi-asserted-by":"crossref","unstructured":"Jin, Q., Leaman, R. & Lu, Z. Retrieve, summarize, and verify: how will ChatGPT affect information seeking from the medical literature? J. Am. Soc. Nephrol. 34, 1302-1304 (2023).","DOI":"10.1681\/ASN.0000000000000166"},{"key":"1185_CR5","doi-asserted-by":"publisher","first-page":"104988","DOI":"10.1016\/j.ebiom.2024.104988","volume":"100","author":"Q Jin","year":"2024","unstructured":"Jin, Q., Leaman, R. & Lu, Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine 100, 104988 (2024).","journal-title":"EBioMedicine"},{"key":"1185_CR6","doi-asserted-by":"publisher","first-page":"172","DOI":"10.1038\/s41586-023-06291-2","volume":"620","author":"K Singhal","year":"2023","unstructured":"Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172\u2013180 (2023).","journal-title":"Nature"},{"key":"1185_CR7","doi-asserted-by":"publisher","unstructured":"Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of gpt-4 on medical challenge problems. Preprint at arXiv https:\/\/doi.org\/10.48550\/arXiv.2303.13375 (2023).","DOI":"10.48550\/arXiv.2303.13375"},{"key":"1185_CR8","doi-asserted-by":"crossref","unstructured":"Li\u00e9vin, V., Hother, C. E., Motzfeldt, A. G. & Winther, O. Can large language models reason about medical questions? Patterns 5, 100943 (2023).","DOI":"10.1016\/j.patter.2024.100943"},{"key":"1185_CR9","doi-asserted-by":"publisher","unstructured":"Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at arXiv https:\/\/doi.org\/10.48550\/arXiv.2311.16452 (2023).","DOI":"10.48550\/arXiv.2311.16452"},{"key":"1185_CR10","doi-asserted-by":"publisher","unstructured":"Jin, Q., Wang, Z., Floudas, C., Sun, J. & Lu, Z. Matching patients to clinical trials with large language models. Preprint at arXiv https:\/\/doi.org\/10.48550\/arXiv.2307.15051 (2023).","DOI":"10.48550\/arXiv.2307.15051"},{"key":"1185_CR11","doi-asserted-by":"publisher","first-page":"1773","DOI":"10.1038\/s41591-022-01981-2","volume":"28","author":"JN Acosta","year":"2022","unstructured":"Acosta, J. N., Falcone, G. J., Rajpurkar, P. & Topol, E. J. Multimodal biomedical AI. Nat. Med. 28, 1773\u20131784 (2022).","journal-title":"Nat. Med."},{"key":"1185_CR12","doi-asserted-by":"publisher","first-page":"adk6139","DOI":"10.1126\/science.adk6139","volume":"381","author":"EJ Topol","year":"2023","unstructured":"Topol, E. J. As artificial intelligence goes multimodal, medical applications multiply. Science 381, adk6139 (2023).","journal-title":"Science"},{"key":"1185_CR13","unstructured":"Wu, C. et al. Can gpt-4v (ision) serve medical applications? Case studies on gpt-4v for multimodal medical diagnosis. Preprint at arXiv https:\/\/arxiv.org\/abs\/2310.09909 (2023)."},{"key":"1185_CR14","unstructured":"Yan, Z. et al. Multimodal ChatGPT for medical applications: an experimental study of GPT-4V. Preprint at arXiv https:\/\/arxiv.org\/abs\/2310.19061 (2023)."},{"key":"1185_CR15","doi-asserted-by":"publisher","unstructured":"Yang, Z. et al. Performance of multimodal GPT-4V on USMLE with Image: potential for imaging diagnostic support with explanations. Preprint at https:\/\/doi.org\/10.1101\/2023.10.26.23297629 (2023).","DOI":"10.1101\/2023.10.26.23297629"},{"key":"1185_CR16","doi-asserted-by":"publisher","unstructured":"Buckley, T., Diao, J. A., Rodman, A. & Manrai, A. K. Accuracy of a vision-language model on challenging medical cases. Preprint at arXiv https:\/\/doi.org\/10.48550\/arXiv.2311.05591 (2023).","DOI":"10.48550\/arXiv.2311.05591"},{"key":"1185_CR17","doi-asserted-by":"publisher","unstructured":"Zhang, S. et al. Large-scale domain-specific pretraining for biomedical vision-language processing. Preprint at arXiv https:\/\/doi.org\/10.48550\/arXiv.2303.00915 (2023).","DOI":"10.48550\/arXiv.2303.00915"},{"key":"1185_CR18","doi-asserted-by":"publisher","first-page":"833","DOI":"10.1056\/NEJMicm2206513","volume":"388","author":"X Tang","year":"2023","unstructured":"Tang, X. & Sun, L. Encapsulating peritoneal sclerosis. N. Engl. J. Med. 388, 833 (2023).","journal-title":"N. Engl. J. Med."}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01185-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01185-7","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01185-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T10:25:43Z","timestamp":1721730343000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01185-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,23]]},"references-count":18,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["1185"],"URL":"https:\/\/doi.org\/10.1038\/s41746-024-01185-7","relation":{},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,23]]},"assertion":[{"value":"22 January 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 July 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 July 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests but the following competing financial interests: R.S. receives royalties for patents or software licenses from iCAD, Philips, ScanMed, PingAn, Translation Holdings, and MGB. R.S. received research support from PingAn.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"190"}}