{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,21]],"date-time":"2026-03-21T04:44:55Z","timestamp":1774068295885,"version":"3.50.1"},"reference-count":37,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,10,23]],"date-time":"2024-10-23T00:00:00Z","timestamp":1729641600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,10,23]],"date-time":"2024-10-23T00:00:00Z","timestamp":1729641600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BioData Mining"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>In medical device validation and verification studies, the area under the receiver operating characteristic curve (AUROC) is often used as a primary endpoint despite multiple reports showing its limitations. Hence, researchers are encouraged to consider alternative metrics as primary endpoints. A new metric called G4 is presented, which is the geometric mean of sensitivity, specificity, the positive predictive value, and the negative predictive value. G4 is part of a balanced metric family which includes the Unified Performance Measure (also known as P4) and the Matthews\u2019 Correlation Coefficient (MCC). 
The purpose of this manuscript is to unveil the benefits of using G4 together with the balanced metric family when analyzing the overall performance of binary classifiers.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>Simulated datasets encompassing different prevalence rates of the minority class were analyzed under a multi-reader-multi-case study design. In addition, data from an independently published study that tested the performance of a unique ultrasound artificial intelligence algorithm in the context of breast cancer detection was also considered. Within each dataset, AUROC was reported alongside the balanced metric family for comparison. When the dataset prevalence and bias of the minority class approached 50%, all three balanced metrics provided equivalent interpretations of an AI\u2019s performance. As the prevalence rate increased \/ decreased and the data became more imbalanced, AUROC tended to overvalue \/ undervalue the true classifier performance, while the balanced metric family was resistant to such imbalance. Under certain circumstances where data imbalance was strong (minority-class prevalence\u2009&lt;\u200910%), MCC was preferred for standalone assessments while P4 provided a stronger effect size when evaluating between-groups analyses. G4 acted as a middle ground for maximizing both standalone assessments and between-groups analyses.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>Use of AUROC as the primary endpoint in binary classification problems provides misleading results as the dataset becomes more imbalanced. This is explicitly noticed when incorporating AUROC in medical device validation and verification studies. G4, P4, and MCC do not share this limitation and paint a more complete picture of a medical device\u2019s performance in a clinical setting. 
Therefore, researchers are encouraged to explore the balanced metric family when evaluating binary classification problems.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s13040-024-00402-z","type":"journal-article","created":{"date-parts":[[2024,10,23]],"date-time":"2024-10-23T16:02:47Z","timestamp":1729699367000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["G4 &amp; the balanced metric family \u2013 a novel approach to solving binary classification problems in medical device validation &amp; verification studies"],"prefix":"10.1186","volume":"17","author":[{"given":"Andrew","family":"Marra","sequence":"first","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,10,23]]},"reference":[{"issue":"6","key":"402_CR1","doi-asserted-by":"publisher","first-page":"624","DOI":"10.1001\/jamacardio.2021.0185","volume":"6","author":"A Narang","year":"2021","unstructured":"Narang A, Bae R, Hong H, et al. Utility of a Deep-Learning Algorithm to Guide Novices to Acquire Echocardiograms for Limited Diagnostic Use. JAMA Cardiol. 2021;6(6):624\u201332. https:\/\/doi.org\/10.1001\/jamacardio.2021.0185.","journal-title":"JAMA Cardiol"},{"key":"402_CR2","unstructured":"Food & Drug Administration, Center for Devices and Radiological Health. September 2022. Clinical Performance Assessment: Considerations for Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data in Premarket Notification (510(k)) Submissions. Guidance for Industry and FDA Staff. https:\/\/www.fda.gov\/media\/77642\/download."},{"issue":"4","key":"402_CR3","doi-asserted-by":"publisher","first-page":"463","DOI":"10.1016\/j.acra.2011.12.016","volume":"19","author":"BD Gallas","year":"2012","unstructured":"Gallas BD, Chan HP, D\u2019Orsi CJ, Dodd LE, Giger ML, Gur D, Krupinski EA, Metz CE, Myers KJ, Obuchowski NA, Sahiner B, Toledano AY, Zuley ML. 
Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Acad Radiol. 2012;19(4):463\u201377. https:\/\/doi.org\/10.1016\/j.acra.2011.12.016. Epub 2012 Feb 3. PMID: 22306064; PMCID: PMC5557046.","journal-title":"Acad Radiol"},{"issue":"1","key":"402_CR4","doi-asserted-by":"publisher","first-page":"26","DOI":"10.1148\/radiol.211593","volume":"303","author":"NA Obuchowski","year":"2022","unstructured":"Obuchowski NA, Bullen J. Multireader Diagnostic Accuracy Imaging Studies: Fundamentals of Design and Analysis. Radiology. 2022;303(1):26\u201334. https:\/\/doi.org\/10.1148\/radiol.211593. Epub 2022 Feb 15. PMID: 35166584.","journal-title":"Radiology"},{"issue":"12","key":"402_CR5","doi-asserted-by":"publisher","first-page":"e116018","DOI":"10.1371\/journal.pone.0116018","volume":"9","author":"T Dendumrongsup","year":"2014","unstructured":"Dendumrongsup T, Plumb AA, Halligan S, Fanshawe TR, Altman DG, et al. Multi-Reader Multi-Case Studies Using the Area under the Receiver Operator Characteristic Curve as a Measure of Diagnostic Accuracy: Systematic Review with a Focus on Quality of Data Reporting. PLoS One. 2014;9(12):e116018. https:\/\/doi.org\/10.1371\/journal.pone.0116018.","journal-title":"PLoS One"},{"issue":"1","key":"402_CR6","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1186\/s13040-023-00322-4","volume":"16","author":"D Chicco","year":"2023","unstructured":"Chicco D, Jurman G. The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Min. 2023;16(1):4. https:\/\/doi.org\/10.1186\/s13040-023-00322-4. PMID: 36800973; PMCID: PMC9938573.","journal-title":"BioData Min"},{"key":"402_CR7","doi-asserted-by":"publisher","first-page":"3762651","DOI":"10.1155\/2017\/3762651","volume":"2017","author":"I Unal","year":"2017","unstructured":"Unal I. Defining an Optimal Cut-Point Value in ROC Analysis: An Alternative Approach. 
Comput Math Methods Med. 2017;2017:3762651. https:\/\/doi.org\/10.1155\/2017\/3762651. Epub 2017 May 31. PMID: 28642804; PMCID: PMC5470053.","journal-title":"Comput Math Methods Med"},{"key":"402_CR8","unstructured":"Cruz-Uribe D, Neugebauer CJ. Sharp error bounds for the trapezoidal rule and Simpson's rule. JIPAM. J Inequalities Pure Appl\u00a0Math\u00a0[electronic only] 3.4. 2002;49:22. http:\/\/eudml.org\/doc\/123201."},{"issue":"1","key":"402_CR9","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1186\/s12864-019-6413-7","volume":"21","author":"D Chicco","year":"2020","unstructured":"Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. https:\/\/doi.org\/10.1186\/s12864-019-6413-7. PMID: 31898477; PMCID: PMC6941312.","journal-title":"BMC Genomics"},{"key":"402_CR10","doi-asserted-by":"publisher","first-page":"145","DOI":"10.1111\/j.1466-8238.2007.00358.x","volume":"17","author":"J Lobo","year":"2008","unstructured":"Lobo J, Jim\u00e9nez-Valverde A, Real R. AUC: A misleading measure of the performance of predictive distribution models. Journal of Global Ecology and Biogeography. 2008;17:145\u201351.","journal-title":"Journal of Global Ecology and Biogeography"},{"key":"402_CR11","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1186\/s13040-017-0155-3","volume":"10","author":"D Chicco","year":"2017","unstructured":"Chicco D. Ten quick tips for machine learning in computational biology. BioData Mining. 2017;10:35. https:\/\/doi.org\/10.1186\/s13040-017-0155-3.","journal-title":"BioData Mining"},{"key":"402_CR12","doi-asserted-by":"publisher","unstructured":"Julius Sim, Chris C Wright. The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements. Phys Ther. 2005;85(3):257\u2013268. 
https:\/\/doi.org\/10.1093\/ptj\/85.3.257.","DOI":"10.1093\/ptj\/85.3.257"},{"key":"402_CR13","doi-asserted-by":"publisher","first-page":"5645","DOI":"10.1038\/s41467-021-26023-2","volume":"12","author":"Y Shen","year":"2021","unstructured":"Shen Y, Shamout FE, Oliver JR, et al. Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams. Nat Commun. 2021;12:5645. https:\/\/doi.org\/10.1038\/s41467-021-26023-2.","journal-title":"Nat Commun"},{"key":"402_CR14","doi-asserted-by":"publisher","unstructured":"Ana R. Redondo, Jorge Navarro, Rub\u00e9n R. Fern\u00e1ndez, Isaac Mart\u00edn de Diego, Javier M. Moguerza, and Juan Jos\u00e9 Fern\u00e1ndez-Mu\u00f1oz. Unified Performance Measure for Binary Classification Problems. In: Intelligent Data Engineering and Automated Learning \u2013 IDEAL 2020. Berlin, Heidelberg: 21st International Conference, Guimaraes, Portugal, November 4\u20136, 2020, Proceedings, Part II. Springer-Verlag; 2020. p. 104\u201312. https:\/\/doi.org\/10.1007\/978-3-030-62365-4_10.","DOI":"10.1007\/978-3-030-62365-4_10"},{"key":"402_CR15","doi-asserted-by":"publisher","first-page":"12049","DOI":"10.1007\/s10489-021-03041-7","volume":"52","author":"IM De Diego","year":"2022","unstructured":"De Diego IM, Redondo AR, Fern\u00e1ndez RR, et al. General Performance Score for classification problems. Appl Intell. 2022;52:12049\u201363. https:\/\/doi.org\/10.1007\/s10489-021-03041-7.","journal-title":"Appl Intell"},{"issue":"383","key":"402_CR16","doi-asserted-by":"publisher","first-page":"553","DOI":"10.2307\/2288117","volume":"78","author":"EB Fowlkes","year":"1983","unstructured":"Fowlkes EB, Mallows CL. A Method for Comparing Two Hierarchical Clusterings. J Am Stat Assoc. 1983;78(383):553\u201369. 
https:\/\/doi.org\/10.2307\/2288117.","journal-title":"J Am Stat Assoc"},{"issue":"2","key":"402_CR17","doi-asserted-by":"publisher","first-page":"442","DOI":"10.1016\/0005-2795(75)90109-9","volume":"405","author":"BW Matthews","year":"1975","unstructured":"Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975;405(2):442\u201351. https:\/\/doi.org\/10.1016\/0005-2795(75)90109-9. PMID: 1180967.","journal-title":"Biochim Biophys Acta"},{"key":"402_CR18","doi-asserted-by":"publisher","unstructured":"Sitarz M. Extending F1 metric, probabilistic approach.\u00a0Adv Artif Intell Mach Learn; Res. 2023;3(2):1025\u201338. https:\/\/doi.org\/10.48550\/arXiv.2210.11997.","DOI":"10.48550\/arXiv.2210.11997"},{"issue":"3","key":"402_CR19","doi-asserted-by":"publisher","first-page":"696","DOI":"10.1007\/s00357-019-09345-1","volume":"37","author":"J Muschelli","year":"2020","unstructured":"Muschelli J. ROC and AUC with a Binary Predictor: a Potentially Misleading Metric. J Classif. 2020;37(3):696\u2013708. https:\/\/doi.org\/10.1007\/s00357-019-09345-1. Epub 2019 Dec 23. PMID: 33250548; PMCID: PMC7695228.","journal-title":"J Classif"},{"key":"402_CR20","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1186\/s13040-021-00244-z","volume":"14","author":"D Chicco","year":"2021","unstructured":"Chicco D, T\u00f6tsch N, Jurman G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Mining. 2021;14:13. https:\/\/doi.org\/10.1186\/s13040-021-00244-z.","journal-title":"BioData Mining"},{"key":"402_CR21","doi-asserted-by":"publisher","first-page":"78368","DOI":"10.1109\/ACCESS.2021.3084050","volume":"9","author":"D Chicco","year":"2021","unstructured":"Chicco D, Warrens MJ, Jurman G. 
The Matthews Correlation Coefficient (MCC) is More Informative Than Cohen\u2019s Kappa and Brier Score in Binary Classification Assessment. IEEE Access. 2021;9:78368\u201381. https:\/\/doi.org\/10.1109\/ACCESS.2021.3084050.","journal-title":"IEEE Access"},{"key":"402_CR22","unstructured":"R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2018. URL https:\/\/www.R-project.org\/."},{"key":"402_CR23","unstructured":"Peter A. Flach and Meelis Kull. Precision-Recall-Gain curves: PR analysis done right. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'15). MIT Press, Cambridge, MA, USA. 2015. p 838\u2013846."},{"key":"402_CR24","doi-asserted-by":"publisher","first-page":"35","DOI":"10.5093\/ejpalc2018a5","volume":"10","author":"J Salgado","year":"2018","unstructured":"Salgado J. Transforming the Area under the Normal Curve (AUC) into Cohen\u2019s d, Pearson\u2019s r pb, Odds-Ratio, and Natural Log Odds-Ratio: Two Conversion Tables. The European Journal of Psychology Applied to Legal Context. 2018;10:35\u201347. https:\/\/doi.org\/10.5093\/ejpalc2018a5.","journal-title":"The European Journal of Psychology Applied to Legal Context"},{"key":"402_CR25","doi-asserted-by":"publisher","unstructured":"David W. Hosmer Jr., Stanley Lemeshow, Rodney X. Sturdivant. First published. Applied Logistic Regression. Book Series: Wiley Series in Probability and Statistics. John Wiley & Sons, Inc. Print ISBN:9780470582473. Online ISBN:9781118548387.\u00a02013.\u00a0https:\/\/doi.org\/10.1002\/9781118548387.","DOI":"10.1002\/9781118548387"},{"issue":"3","key":"402_CR26","doi-asserted-by":"publisher","first-page":"369","DOI":"10.1111\/j.1467-9868.2007.00593.x","volume":"69","author":"CA Field","year":"2007","unstructured":"Field CA, Welsh AH. Bootstrapping Clustered Data. J R Stat Soc Series B Stat Method. 2007;69(3):369\u201390. 
https:\/\/doi.org\/10.1111\/j.1467-9868.2007.00593.x.","journal-title":"J R Stat Soc Series B Stat Method"},{"key":"402_CR27","doi-asserted-by":"publisher","first-page":"572","DOI":"10.3758\/s13428-019-01252-y","volume":"52","author":"M Deen","year":"2020","unstructured":"Deen M, de Rooij M. ClusterBootstrap: An R package for the analysis of hierarchical data using generalized linear models with the cluster bootstrap. Behav Res. 2020;52:572\u201390. https:\/\/doi.org\/10.3758\/s13428-019-01252-y.","journal-title":"Behav Res"},{"key":"402_CR28","doi-asserted-by":"publisher","unstructured":"Efron B, Tibshirani RJ. An Introduction to the Bootstrap. 1st ed. Chapman and Hall\/CRC; 1994. https:\/\/doi.org\/10.1201\/9780429246593.","DOI":"10.1201\/9780429246593"},{"key":"402_CR29","doi-asserted-by":"crossref","unstructured":"Efron B. The Jackknife, the Bootstrap, and Other Resampling Plans. Society for Industrial and Applied Mathematics. CBMS-NSF Regional Conference Series in Applied Mathematics. 1982. SN: 9780898711790. https:\/\/books.google.com\/books?id=JukZvUd4CAcC.","DOI":"10.1137\/1.9781611970319"},{"issue":"1","key":"402_CR30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1214\/aos\/1176344552","volume":"7","author":"B Efron","year":"1979","unstructured":"Efron B. Bootstrap Methods: Another Look at the Jackknife. Ann Stat. 1979;7(1):1\u201326 http:\/\/www.jstor.org\/stable\/2958830.","journal-title":"Ann Stat"},{"issue":"3","key":"402_CR31","doi-asserted-by":"publisher","first-page":"204","DOI":"10.1370\/afm.141","volume":"2","author":"S Killip","year":"2004","unstructured":"Killip S, Mahfoud Z, Pearce K. What is an intracluster correlation coefficient? Crucial concepts for primary care researchers. Ann Fam Med. 2004;2(3):204\u20138. https:\/\/doi.org\/10.1370\/afm.141. 
PMID: 15209195; PMCID: PMC1466680.","journal-title":"Ann Fam Med"},{"issue":"2","key":"402_CR32","doi-asserted-by":"publisher","first-page":"567","DOI":"10.2307\/2533958","volume":"53","author":"NA Obuchowski","year":"1997","unstructured":"Obuchowski NA. Nonparametric Analysis of Clustered ROC Curve Data. Biometrics. 1997;53(2):567\u201378. https:\/\/doi.org\/10.2307\/2533958.","journal-title":"Biometrics"},{"issue":"3","key":"402_CR33","doi-asserted-by":"publisher","first-page":"1051","DOI":"10.1093\/ije\/dyv113","volume":"44","author":"C Rutterford","year":"2015","unstructured":"Rutterford C, Copas A, Eldridge S. Methods for sample size determination in cluster randomized trials. Int J Epidemiol. 2015;44(3):1051\u201367. https:\/\/doi.org\/10.1093\/ije\/dyv113. Epub 2015 Jul 13. PMID: 26174515; PMCID: PMC4521133.","journal-title":"Int J Epidemiol"},{"issue":"3","key":"402_CR34","doi-asserted-by":"publisher","first-page":"357","DOI":"10.1080\/10543400600609478","volume":"16","author":"M Chen","year":"2006","unstructured":"Chen M, Kianifard F, Dhar SK. A bootstrap-based test for establishing noninferiority in clinical trials. J Biopharm Stat. 2006;16(3):357\u201363. https:\/\/doi.org\/10.1080\/10543400600609478. PMID: 16724490.","journal-title":"J Biopharm Stat"},{"key":"402_CR35","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3310986.3311023","volume-title":"Proceedings of the 3rd International Conference on Machine Learning and Soft Computing (ICMLSC '19)","author":"SH Shah Newaz","year":"2019","unstructured":"Chongomweru Halimu, Asem Kasem, Shah Newaz SH. Empirical Comparison of Area under ROC curve (AUC) and Matthew Correlation Coefficient (MCC) for Evaluating Machine Learning Algorithms on Imbalanced Datasets for Binary Classification. In: Proceedings of the 3rd International Conference on Machine Learning and Soft Computing (ICMLSC \u201919). New York: Association for Computing Machinery; 2019. p. 1\u20136. 
https:\/\/doi.org\/10.1145\/3310986.3311023."},{"key":"402_CR36","unstructured":"Cao C, Chicco D, Hoffman MM. The MCC-F1 curve: a performance evaluation technique for binary classification. ArXiv:2006.11278.\u00a02020.\u00a0http:\/\/dx.doi.org\/10.48550\/arXiv.2006.11278."},{"key":"402_CR37","doi-asserted-by":"publisher","first-page":"17","DOI":"10.1186\/s41512-017-0017-y","volume":"1","author":"G Thomas","year":"2017","unstructured":"Thomas G, Kenny LC, Baker PN, et al. A novel method for interrogating receiver operating characteristic curves for assessing prognostic tests. Diagn Progn Res. 2017;1:17. https:\/\/doi.org\/10.1186\/s41512-017-0017-y.","journal-title":"Diagn Progn Res"}],"container-title":["BioData Mining"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-024-00402-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13040-024-00402-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-024-00402-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,23]],"date-time":"2024-10-23T16:02:54Z","timestamp":1729699374000},"score":1,"resource":{"primary":{"URL":"https:\/\/biodatamining.biomedcentral.com\/articles\/10.1186\/s13040-024-00402-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,23]]},"references-count":37,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["402"],"URL":"https:\/\/doi.org\/10.1186\/s13040-024-00402-z","relation":{},"ISSN":["1756-0381"],"issn-type":[{"value":"1756-0381","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,10,23]]},"assertion":[{"value":"23 February 
2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 October 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 October 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"43"}}