{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T14:31:26Z","timestamp":1773066686428,"version":"3.50.1"},"reference-count":25,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2026,3,1]],"date-time":"2026-03-01T00:00:00Z","timestamp":1772323200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T00:00:00Z","timestamp":1773014400000},"content-version":"vor","delay-in-days":8,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100007797","name":"University of Helsinki","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100007797","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2026,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Model monitoring is a critical component of the machine learning lifecycle, safeguarding against undetected drops in the model\u2019s performance after deployment. Traditionally, performance monitoring has required access to ground truth labels, which are not always readily available. This can result in unacceptable latency or render performance monitoring altogether impossible. Recently, methods designed to estimate the accuracy of classifier models without access to labels have shown promising results. However, there are various other metrics that might be more suitable for assessing model performance in many cases. Until now, none of these important metrics has received similar interest from the scientific community. 
In this work, we address this gap by presenting <jats:italic>Confidence-based Performance estimation<\/jats:italic> (CBPE), a novel method that can estimate any binary classification metric defined using the confusion matrix. In particular, we choose four metrics from this large family: accuracy, precision, recall, and <jats:inline-formula><jats:tex-math>$$\\hbox {F}_1$$<\/jats:tex-math><\/jats:inline-formula>, to demonstrate our method. CBPE treats the elements of the confusion matrix as random variables and leverages calibrated confidence scores of the model to estimate their distributions. The desired metric is then also treated as a random variable, whose full probability distribution can be derived from the estimated confusion matrix. CBPE is shown to produce estimates that come with strong theoretical guarantees and valid confidence intervals.<\/jats:p>","DOI":"10.1007\/s10994-025-06970-3","type":"journal-article","created":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T10:13:30Z","timestamp":1773051210000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Performance Estimation in Binary Classification Using Calibrated Confidence"],"prefix":"10.1007","volume":"115","author":[{"given":"Juhani","family":"Kivim\u00e4ki","sequence":"first","affiliation":[]},{"given":"Jakub","family":"Bia\u0142ek","sequence":"additional","affiliation":[]},{"given":"Wojtek","family":"Kuberski","sequence":"additional","affiliation":[]},{"given":"Jukka K.","family":"Nurminen","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,3,9]]},"reference":[{"key":"6970_CR1","doi-asserted-by":"crossref","unstructured":"Baek, C., Jiang, Y., Raghunathan, A., & Kolter, J. Z. (2022). 
Agreement-on-the-line: Predicting the performance of neural networks under distribution shift. In: Advances in neural information processing systems, vol. 35, pp. 19274\u201319289. Curran Associates, Inc., Red Hook, NY, USA.","DOI":"10.52202\/068431-1401"},{"issue":"10","key":"6970_CR2","first-page":"23","volume":"3","author":"M Bekkar","year":"2013","unstructured":"Bekkar, M., Djemaa, H. K., & Alitouche, T. A. (2013). Evaluation measures for models assessment over imbalanced data sets. Journal of Information Engineering and Applications,3(10), 23\u201338.","journal-title":"Journal of Information Engineering and Applications"},{"key":"6970_CR4","first-page":"14980","volume":"34","author":"J Chen","year":"2021","unstructured":"Chen, J., Liu, F., Avci, B., Wu, X., Liang, Y., & Jha, S. (2021). Detecting errors and estimating accuracy on unlabeled data with self-training ensembles. Advances in Neural Information Processing Systems,34, 14980\u201314992.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6970_CR3","unstructured":"Chen, M., Goel, K., Sohoni, N. S., Poms, F., Fatahalian, K., & R\u00e9, C. (2021). Mandoline: Model evaluation under distribution shift. In: International conference on machine learning, pp. 1617\u20131629. PMLR"},{"key":"6970_CR5","unstructured":"Deng, W., Suh, Y., Gould, S., & Zheng, L. (2023). Confidence and dispersity speak: Characterizing prediction matrix for unsupervised accuracy estimation. In: International conference on machine learning, pp. 7658\u20137674. PMLR"},{"key":"6970_CR6","doi-asserted-by":"crossref","unstructured":"Gardner, J., Popovic, Z., & Schmidt, L. (2024). Benchmarking distribution shift in tabular data with tableshift. Advances in Neural Information Processing Systems, 36.","DOI":"10.52202\/075280-2324"},{"key":"6970_CR7","unstructured":"Garg, S., & Balakrishnan, S. (2022). Leveraging unlabeled data to predict out-of-distribution performance. 
ICLR."},{"key":"6970_CR8","doi-asserted-by":"crossref","unstructured":"Guillory, D., Shankar, V., Ebrahimi, S., Darrell, T., & Schmidt, L. (2021). Predicting with confidence on unseen distributions. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp. 1134\u20131144","DOI":"10.1109\/ICCV48922.2021.00117"},{"key":"6970_CR9","unstructured":"Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In: Proceedings of the 34th international conference on machine learning - Volume 70. ICML\u201917, pp. 1321\u20131330."},{"key":"6970_CR10","unstructured":"Hendrycks, D., & Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: International conference on learning representations."},{"key":"6970_CR11","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1016\/j.csda.2012.10.006","volume":"59","author":"Y Hong","year":"2013","unstructured":"Hong, Y. (2013). On computing the distribution function for the Poisson binomial distribution. Computational Statistics & Data Analysis,59, 41\u201351.","journal-title":"Computational Statistics & Data Analysis"},{"key":"6970_CR12","unstructured":"Jiang, Y., Nagarajan, V., Baek, C., & Kolter, J. Z. (2022). Assessing generalization of SGD via disagreement. In: International conference on learning representations."},{"key":"6970_CR14","doi-asserted-by":"publisher","first-page":"209","DOI":"10.1613\/jair.1.16709","volume":"82","author":"J Kivim\u00e4ki","year":"2025","unstructured":"Kivim\u00e4ki, J., Bia\u0142ek, J., Nurminen, J. K., & Kuberski, W. (2025). Confidence-based estimators for predictive performance in model monitoring. Journal of Artificial Intelligence Research,82, 209\u2013240.","journal-title":"Journal of Artificial Intelligence Research"},{"key":"6970_CR13","doi-asserted-by":"crossref","unstructured":"Kivim\u00e4ki, J., Lebedev, A., & Nurminen, J. K. (2023). 
Failure prediction in 2d document information extraction with calibrated confidence scores. In: 2023 IEEE 47th annual computers, software, and applications conference (COMPSAC), pp. 193\u2013202","DOI":"10.1109\/COMPSAC57700.2023.00033"},{"key":"6970_CR15","doi-asserted-by":"publisher","first-page":"132","DOI":"10.1016\/j.inffus.2017.02.004","volume":"37","author":"B Krawczyk","year":"2017","unstructured":"Krawczyk, B., Minku, L. L., Gama, J., Stefanowski, J., & Wo\u017aniak, M. (2017). Ensemble learning for data stream analysis: A survey. Information Fusion,37, 132\u2013156.","journal-title":"Information Fusion"},{"key":"6970_CR16","doi-asserted-by":"crossref","unstructured":"Lu, N., Zhang, T., Fang, T., Teshima, T., & Sugiyama, M. (2022). Rethinking importance weighting for transfer learning. In Federated and transfer learning, pp. 185\u2013231. Springer, Cham, Switzerland.","DOI":"10.1007\/978-3-031-11748-0_9"},{"key":"6970_CR17","first-page":"17602","volume":"36","author":"Y Lu","year":"2023","unstructured":"Lu, Y., Qin, Y., Zhai, R., Shen, A., Chen, K., Wang, Z., Kolouri, S., Stepputtis, S., Campbell, J., & Sycara, K. (2023). Characterizing out-of-distribution error via optimal transport. Advances in Neural Information Processing Systems,36, 17602\u201317622.","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"1","key":"6970_CR18","doi-asserted-by":"publisher","first-page":"521","DOI":"10.1016\/j.patcog.2011.06.019","volume":"45","author":"JG Moreno-Torres","year":"2012","unstructured":"Moreno-Torres, J. G., Raeder, T., Alaiz-Rodr\u00edguez, R., Chawla, N. V., & Herrera, F. (2012). A unifying view on dataset shift in classification. Pattern Recognition,45(1), 521\u2013530.","journal-title":"Pattern Recognition"},{"key":"6970_CR19","unstructured":"Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., & Tran, D. (2019). Measuring calibration in deep learning. 
In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR) workshops."},{"key":"6970_CR20","unstructured":"Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., & Snoek, J. (2019). Can you trust your model\u2019s uncertainty? evaluating predictive uncertainty under dataset shift. In: Advances in neural information processing systems, Vol. 32. Curran Associates, Inc., Red Hook, NY, USA."},{"key":"6970_CR21","unstructured":"Popordanoska, T., Radevski, G., Tuytelaars, T., & Blaschko, M. B. (2023). Estimating calibration error under label shift without labels. arXiv preprint arXiv:2312.08586"},{"key":"6970_CR22","doi-asserted-by":"crossref","unstructured":"Sun, X., Wu, Z., Zhan, Z., & Ji, Y. (2025). Contrastive conditional alignment based on label shift calibration for imbalanced domain adaptation. In: International conference on pattern recognition, Cham, Switzerland, pp. 13\u201328. Springer.","DOI":"10.1007\/978-3-031-78195-7_2"},{"key":"6970_CR23","unstructured":"Wei, J., Liu, H., Liu, T., Niu, G., Sugiyama, M., & Liu, Y. (2022). To smooth or not? when label smoothing meets noisy labels. In: International conference on machine learning."},{"key":"6970_CR24","unstructured":"Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of the eighteenth international conference on machine learning. ICML \u201901, pp. 609\u2013616."},{"key":"6970_CR25","doi-asserted-by":"crossref","unstructured":"Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. KDD \u201902, pp. 
694\u2013699.","DOI":"10.1145\/775047.775151"}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-025-06970-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-025-06970-3","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-025-06970-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T10:13:38Z","timestamp":1773051218000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-025-06970-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3]]},"references-count":25,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,3]]}},"alternative-id":["6970"],"URL":"https:\/\/doi.org\/10.1007\/s10994-025-06970-3","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"value":"0885-6125","type":"print"},{"value":"1573-0565","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3]]},"assertion":[{"value":"31 May 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 September 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 December 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 March 2026","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Two of the authors (J.B. and W.K.) 
work in a commercial company that already utilizes the techniques described in the paper. It is conceivable that having a peer-reviewed paper published in a high-quality journal could increase the commercial success of said company. However, neither of these authors took part in any of the experiments or in reporting their results.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"67"}}