{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T04:57:07Z","timestamp":1780635427781,"version":"3.54.1"},"reference-count":37,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T00:00:00Z","timestamp":1732147200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T00:00:00Z","timestamp":1732147200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000038","name":"U.S. Department of Health & Human Services | U.S. Food and Drug Administration","doi-asserted-by":"publisher","award":["5U01FD005942-05"],"award-info":[{"award-number":["5U01FD005942-05"]}],"id":[{"id":"10.13039\/100000038","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000038","name":"U.S. Department of Health & Human Services | U.S. Food and Drug Administration","doi-asserted-by":"publisher","award":["5U01FD005942-05"],"award-info":[{"award-number":["5U01FD005942-05"]}],"id":[{"id":"10.13039\/100000038","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>A fundamental goal of evaluating the performance of a clinical model is to ensure it performs well across a diverse intended patient population. A primary challenge is that the data used in model development and testing often consist of many overlapping, heterogeneous patient subgroups that may not be explicitly defined or labeled. While a model\u2019s average performance on a dataset may be high, the model can have significantly lower performance for certain subgroups, which may be hard to detect. 
We describe an algorithmic framework for identifying subgroups with potential performance disparities (AFISP), which produces a set of interpretable phenotypes corresponding to subgroups for which the model\u2019s performance may be relatively lower. This could allow model evaluators, including developers and users, to identify possible failure modes prior to wide-scale deployment. We illustrate the application of AFISP by applying it to a patient deterioration model to detect significant subgroup performance disparities, and show that AFISP is significantly more scalable than existing algorithmic approaches.<\/jats:p>","DOI":"10.1038\/s41746-024-01275-6","type":"journal-article","created":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T19:50:14Z","timestamp":1732218614000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":16,"title":["A data-driven framework for identifying patient subgroups on which an AI\/machine learning model may underperform"],"prefix":"10.1038","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2020-1806","authenticated-orcid":false,"given":"Adarsh","family":"Subbaswamy","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Berkman","family":"Sahiner","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5167-8899","authenticated-orcid":false,"given":"Nicholas","family":"Petrick","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Vinay","family":"Pai","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6859-3007","authenticated-orcid":false,"given":"Roy","family":"Adams","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Matthew 
C.","family":"Diamond","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7667-5210","authenticated-orcid":false,"given":"Suchi","family":"Saria","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2024,11,21]]},"reference":[{"key":"1275_CR1","unstructured":"US Food and Drug Administration. Artificial intelligence and machine learning (AI\/ML)-enabled medical devices. https:\/\/www.fda.gov\/medical-devices\/software-medical-device-samd\/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices (2022)."},{"key":"1275_CR2","doi-asserted-by":"crossref","unstructured":"Adams, R. et al. Prospective, multi-site study of patient outcomes after implementation of the TREWS machine learning-based early warning system for sepsis. Nat. Med. 28, 1455\u20131460 (2022).","DOI":"10.1038\/s41591-022-01894-0"},{"key":"1275_CR3","doi-asserted-by":"publisher","first-page":"1951","DOI":"10.1056\/NEJMsa2001090","volume":"383","author":"GJ Escobar","year":"2020","unstructured":"Escobar, G. J. et al. Automated identification of adults at risk for in-hospital clinical deterioration. N. Engl. J. Med. 383, 1951\u20131960 (2020).","journal-title":"N. Engl. J. Med."},{"key":"1275_CR4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41746-018-0040-6","volume":"1","author":"MD Abr\u00e0moff","year":"2018","unstructured":"Abr\u00e0moff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. npj Digital Med. 1, 1\u20138 (2018).","journal-title":"npj Digital Med."},{"key":"1275_CR5","doi-asserted-by":"crossref","unstructured":"Oakden-Rayner, L., Dunnmon, J., Carneiro, G. & R\u00e9, C. 
Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proc. ACM Conference on Health, Inference, and Learning 151\u2013159 (ACM, 2020).","DOI":"10.1145\/3368555.3384468"},{"key":"1275_CR6","doi-asserted-by":"publisher","first-page":"e1002683","DOI":"10.1371\/journal.pmed.1002683","volume":"15","author":"JR Zech","year":"2018","unstructured":"Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 15, e1002683 (2018).","journal-title":"PLoS Med."},{"key":"1275_CR7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41746-019-0105-1","volume":"2","author":"MA Badgeley","year":"2019","unstructured":"Badgeley, M. A. et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. npj Digital Med. 2, 1\u201310 (2019).","journal-title":"npj Digital Med."},{"key":"1275_CR8","doi-asserted-by":"publisher","first-page":"1135","DOI":"10.1001\/jamadermatol.2019.1735","volume":"155","author":"JK Winkler","year":"2019","unstructured":"Winkler, J. K. et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 155, 1135\u20131141 (2019).","journal-title":"JAMA Dermatol."},{"key":"1275_CR9","doi-asserted-by":"publisher","first-page":"283","DOI":"10.1056\/NEJMc2104626","volume":"385","author":"SG Finlayson","year":"2021","unstructured":"Finlayson, S. G. et al. The clinician and dataset shift in artificial intelligence. N. Engl. J. Med. 385, 283 (2021).","journal-title":"N. Engl. J. Med."},{"key":"1275_CR10","doi-asserted-by":"publisher","first-page":"418","DOI":"10.1097\/CCM.0000000000005267","volume":"50","author":"Y Tarabichi","year":"2021","unstructured":"Tarabichi, Y. et al. 
Improving timeliness of antibiotic administration using a provider and pharmacist facing sepsis early warning system in the emergency department setting: a randomized controlled quality improvement initiative. Crit. Care Med. 50, 418\u2013427 (2021).","journal-title":"Crit. Care Med."},{"key":"1275_CR11","doi-asserted-by":"publisher","first-page":"1065","DOI":"10.1001\/jamainternmed.2021.2626","volume":"181","author":"A Wong","year":"2021","unstructured":"Wong, A. et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern. Med. 181, 1065\u20131070 (2021).","journal-title":"JAMA Intern. Med."},{"key":"1275_CR12","doi-asserted-by":"crossref","unstructured":"Lyons, P. G. et al. Factors associated with variability in the performance of a proprietary sepsis prediction model across 9 networked hospitals in the US. JAMA Intern. Med. 183, 611\u2013612 (2023).","DOI":"10.1001\/jamainternmed.2022.7182"},{"key":"1275_CR13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41746-022-00611-y","volume":"5","author":"J Feng","year":"2022","unstructured":"Feng, J. et al. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. npj Digital Med. 5, 1\u20139 (2022).","journal-title":"npj Digital Med."},{"key":"1275_CR14","doi-asserted-by":"crossref","unstructured":"Liu, X. et al. The medical algorithmic audit. Lancet Digit. Health 4, e384-e397 (2022).","DOI":"10.1016\/S2589-7500(22)00003-6"},{"key":"1275_CR15","doi-asserted-by":"crossref","unstructured":"Chung, Y., Kraska, T., Polyzotis, N., Tae, K. H. & Whang, S. E. Slice finder: automated data slicing for model validation. In 2019 IEEE 35th International Conference on Data Engineering (ICDE) 1550\u20131553 (IEEE, 2019).","DOI":"10.1109\/ICDE.2019.00139"},{"key":"1275_CR16","doi-asserted-by":"crossref","unstructured":"Sagadeeva, S. & Boehm, M. 
Sliceline: fast, linear-algebra-based slice finding for ML model debugging. In Proc. 2021 International Conference on Management of Data 2290\u20132299 (ACM, 2021).","DOI":"10.1145\/3448016.3457323"},{"key":"1275_CR17","first-page":"842","volume":"29","author":"X Zhang","year":"2022","unstructured":"Zhang, X. et al. Sliceteller: a data slice-driven approach for machine learning model validation. IEEE Trans. Vis. Comput. Graph. 29, 842\u2013852 (2022).","journal-title":"IEEE Trans. Vis. Comput. Graph."},{"key":"1275_CR18","unstructured":"Eyuboglu, S. et al. Domino: discovering systematic errors with cross-modal embeddings. In The Tenth International Conference on Learning Representations (OpenReview.net, 2022)."},{"key":"1275_CR19","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1016\/j.jbi.2016.09.013","volume":"64","author":"P Kipnis","year":"2016","unstructured":"Kipnis, P. et al. Development and validation of an electronic medical record-based alert score for detection of inpatient deterioration outside the ICU. J. Biomed. Inform. 64, 10\u201319 (2016).","journal-title":"J. Biomed. Inform."},{"key":"1275_CR20","doi-asserted-by":"publisher","first-page":"866","DOI":"10.7326\/M18-1990","volume":"169","author":"A Rajkomar","year":"2018","unstructured":"Rajkomar, A., Hardt, M., Howell, M. D., Corrado, G. & Chin, M. H. Ensuring fairness in machine learning to advance health equity. Ann. Intern. Med. 169, 866\u2013872 (2018).","journal-title":"Ann. Intern. Med."},{"key":"1275_CR21","doi-asserted-by":"publisher","first-page":"2176","DOI":"10.1038\/s41591-021-01595-0","volume":"27","author":"L Seyyed-Kalantari","year":"2021","unstructured":"Seyyed-Kalantari, L., Zhang, H., McDermott, M., Chen, I. Y. & Ghassemi, M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med. 27, 2176\u20132182 (2021).","journal-title":"Nat. 
Med."},{"key":"1275_CR22","doi-asserted-by":"publisher","first-page":"427","DOI":"10.1214\/20-EJS1792","volume":"15","author":"C B\u00e9nard","year":"2021","unstructured":"B\u00e9nard, C., Biau, G., Da Veiga, S. & Scornet, E. Sirus: stable and interpretable rule set for classification. Electron. J. Stat. 15, 427\u2013505 (2021).","journal-title":"Electron. J. Stat."},{"key":"1275_CR23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/0003-2670(86)80028-9","volume":"185","author":"P Geladi","year":"1986","unstructured":"Geladi, P. & Kowalski, B. R. Partial least-squares regression: a tutorial. Anal. Chim. Acta 185, 1\u201317 (1986).","journal-title":"Anal. Chim. Acta"},{"key":"1275_CR24","doi-asserted-by":"publisher","first-page":"42","DOI":"10.1016\/S2213-2600(14)70239-5","volume":"3","author":"R Pirracchio","year":"2015","unstructured":"Pirracchio, R. et al. Mortality prediction in intensive care units with the super icu learner algorithm (sicula): a population-based study. Lancet Respir. Med. 3, 42\u201352 (2015).","journal-title":"Lancet Respir. Med."},{"key":"1275_CR25","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/sdata.2016.35","volume":"3","author":"AE Johnson","year":"2016","unstructured":"Johnson, A. E. et al. Mimic-iii, a freely accessible critical care database. Sci. Data 3, 1\u20139 (2016).","journal-title":"Sci. Data"},{"key":"1275_CR26","doi-asserted-by":"publisher","first-page":"1337","DOI":"10.1038\/s41591-019-0548-6","volume":"25","author":"J Wiens","year":"2019","unstructured":"Wiens, J. et al. Do no harm: a roadmap for responsible machine learning for health care. Nat. Med. 25, 1337\u20131340 (2019).","journal-title":"Nat. Med."},{"key":"1275_CR27","unstructured":"Schulam, P. & Saria, S. Can you trust this prediction? auditing pointwise reliability after learning. 
In The 22nd International Conference on Artificial Intelligence and Statistics 1022\u20131031 (PMLR, 2019)."},{"key":"1275_CR28","first-page":"35907","volume":"35","author":"D Prinster","year":"2022","unstructured":"Prinster, D., Liu, A. & Saria, S. JAWS: auditing predictive uncertainty under covariate shift. Adv. Neural Inf. Process. Syst. 35, 35907\u201335920 (2022).","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"1275_CR29","first-page":"3539","volume":"31","author":"I Chen","year":"2018","unstructured":"Chen, I., Johansson, F. D. & Sontag, D. Why is my classifier discriminatory? Adv. Neural Inform. Process. Syst. 31, 3539\u20133550 (2018).","journal-title":"Adv. Neural Inform. Process. Syst."},{"key":"1275_CR30","doi-asserted-by":"crossref","unstructured":"Bansal, G. et al. Updates in human-ai teams: Understanding and addressing the performance\/compatibility tradeoff. In Proc. 33rd AAAI Conference on Artificial Intelligence 2429\u20132437 (AAAI, 2019).","DOI":"10.1609\/aaai.v33i01.33012429"},{"key":"1275_CR31","doi-asserted-by":"crossref","unstructured":"Srivastava, M., Nushi, B., Kamar, E., Shah, S. & Horvitz, E. An empirical analysis of backward compatibility in machine learning systems. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 3272\u20133280 (ACM, 2020).","DOI":"10.1145\/3394486.3403379"},{"key":"1275_CR32","doi-asserted-by":"publisher","first-page":"1345","DOI":"10.1109\/TKDE.2009.191","volume":"22","author":"SJ Pan","year":"2009","unstructured":"Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345\u20131359 (2009).","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"1275_CR33","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-016-0043-6","volume":"3","author":"K Weiss","year":"2016","unstructured":"Weiss, K., Khoshgoftaar, T. M. & Wang, D. A survey of transfer learning. J. Big Data 3, 1\u201340 (2016).","journal-title":"J. 
Big Data"},{"key":"1275_CR34","unstructured":"Subbaswamy, A., Adams, R. & Saria, S. Evaluating model robustness and stability to dataset shift. In International Conference on Artificial Intelligence and Statistics 2611\u20132619 (PMLR, 2021)."},{"key":"1275_CR35","unstructured":"Molnar, C. Interpretable Machine Learning (Lulu. com, 2020)."},{"key":"1275_CR36","doi-asserted-by":"crossref","unstructured":"Chakraborty, D. P. Observer Performance Methods for Diagnostic Imaging: Foundations, Modeling, and Applications with R-based Examples (CRC Press, 2017).","DOI":"10.1201\/9781351228190"},{"key":"1275_CR37","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825\u20132830 (2011).","journal-title":"J. Mach. Learn. Res."}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01275-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01275-6","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01275-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,21]],"date-time":"2024-11-21T20:05:09Z","timestamp":1732219509000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01275-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,21]]},"references-count":37,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["1275"],"URL":"https:\/\/doi.org\/10.1038\/s41746-024-01275-6","relation":{},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"da
te-parts":[[2024,11,21]]},"assertion":[{"value":"6 December 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 September 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 November 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"334"}}