{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,21]],"date-time":"2026-05-21T19:49:10Z","timestamp":1779392950022,"version":"3.53.1"},"reference-count":34,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2025,2,24]],"date-time":"2025-02-24T00:00:00Z","timestamp":1740355200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,2,24]],"date-time":"2025-02-24T00:00:00Z","timestamp":1740355200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100004895","name":"European Social Fund","doi-asserted-by":"publisher","award":["IT Academy programme"],"award-info":[{"award-number":["IT Academy programme"]}],"id":[{"id":"10.13039\/501100004895","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002301","name":"Eesti Teadusagentuur","doi-asserted-by":"publisher","award":["PRG1604"],"award-info":[{"award-number":["PRG1604"]}],"id":[{"id":"10.13039\/501100002301","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2025,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Calibrated uncertainty estimates are essential for classifiers used in safety-critical applications. If a classifier is uncalibrated, then there is a unique way to calibrate its uncertainty using the idealistic true calibration map corresponding to this classifier. Although the true calibration map is typically unknown in practice, it can be estimated with many post-hoc calibration methods which fit some family of potential calibration functions on a validation dataset. This paper examines the connection between such post-hoc calibration methods and calibration evaluation. Despite the negative connotations of fitting on test data in machine learning, we claim that fitting calibration maps on test data as part of the calibration evaluation process is a method worth considering, and we refer to this view as fit-on-test. This view enables the usage of any post-hoc calibration method as an evaluation measure, unlocking missed opportunities in development of evaluation methods. We prove that even ECE, which is the most common calibration evaluation method, is actually a fit-on-test measure. This observation leads us to a new method of tuning the number of bins in ECE with cross-validation. Fitting on test data can lead to test-time overfitting, and therefore, we discuss the limitations and concerns with the fit-on-test view. Our contributions also include: (1) enhancement of reliability diagrams with diagonal filling; (2) development of new calibration map families PL and PL3; and (3) an experimental study of which families perform strongly both as post-hoc calibrators and calibration evaluators.<\/jats:p>","DOI":"10.1007\/s10994-024-06652-6","type":"journal-article","created":{"date-parts":[[2025,2,24]],"date-time":"2025-02-24T22:53:02Z","timestamp":1740437582000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["On the usefulness of the fit-on-test view on evaluating calibration of classifiers"],"prefix":"10.1007","volume":"114","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1916-4350","authenticated-orcid":false,"given":"Markus","family":"K\u00e4ngsepp","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kaspar","family":"Valk","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Meelis","family":"Kull","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,2,24]]},"reference":[{"key":"6652_CR1","doi-asserted-by":"crossref","unstructured":"Allikivi, M.L., & Kull, M. (2019). Non-parametric Bayesian isotonic calibration: Fighting over-confidence in binary classification. In: Machine Learning and Knowledge Discovery in Databases (ECML-PKDD\u201919). Springer, 68\u201385","DOI":"10.1007\/978-3-030-46147-8_7"},{"issue":"1","key":"6652_CR2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1175\/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2","volume":"78","author":"GW Brier","year":"1950","unstructured":"Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1\u20133.","journal-title":"Monthly Weather Review"},{"key":"6652_CR3","doi-asserted-by":"crossref","first-page":"655","DOI":"10.1007\/s00382-011-1191-1","volume":"39","author":"J Broecker","year":"2011","unstructured":"Broecker, J. (2011). Estimating reliability and resolution of probability forecasts through decomposition of the empirical score. Climate Dynamics, 39, 655\u2013667.","journal-title":"Climate Dynamics"},{"key":"6652_CR4","doi-asserted-by":"crossref","first-page":"1954","DOI":"10.1002\/qj.1924","volume":"138","author":"C Ferro","year":"2012","unstructured":"Ferro, C., & Fricker, T. E. (2012). A bias-corrected decomposition of the brier score. Quarterly Journal of the Royal Meteorological Society, 138, 1954\u20131960.","journal-title":"Quarterly Journal of the Royal Meteorological Society"},{"key":"6652_CR5","unstructured":"Guo, C., Pleiss, G., Sun, Y., & Weinberger, K.Q. (2017). On Calibration of Modern Neural Networks. In: Thirty-fourth International Conference on Machine Learning, Sydney, Australia, arXiv:1706.04599."},{"key":"6652_CR6","unstructured":"Gupta, K., Rahimi, A., Ajanthan, T., Mensink, T., Sminchisescu, C., & Hartley, R. (2021). Calibration of neural networks using splines. In: International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=eQe8DEWNN2W."},{"key":"6652_CR7","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. CoRR abs\/1512.03385. arXiv:1512.03385.","DOI":"10.1109\/CVPR.2016.90"},{"key":"6652_CR8","unstructured":"Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. arXiv preprint arxiv:2006.11239."},{"key":"6652_CR9","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., & Weinberger, K.Q. (2016). Densely connected convolutional networks. CoRR abs\/1608.06993. arXiv:1608.06993,","DOI":"10.1109\/CVPR.2017.243"},{"key":"6652_CR10","unstructured":"Jekel, C.F., & Venter, G. (2019). pwlf: A Python Library for Fitting 1D Continuous Piecewise Linear Functions. https:\/\/github.com\/cjekel\/piecewise_linear_fit_py"},{"key":"6652_CR11","unstructured":"Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980"},{"key":"6652_CR12","unstructured":"Krizhevsky, A., Hinton, G., et\u00a0al. (2009). Learning multiple layers of features from tiny images"},{"key":"6652_CR13","unstructured":"Kull, M., Silva Filho, T., & Flach, P. (2017). Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In: Singh A, Zhu J (eds) Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol\u00a054. PMLR, Fort Lauderdale, FL, USA, 623\u2013631"},{"key":"6652_CR14","unstructured":"Kull, M., Perell\u00f3-Nieto, M., K\u00e4ngsepp, M., de Menezes e Silva Filho, T., Song, H., & Flach, Peter A. (2019). Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with dirichlet calibration. In: Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"6652_CR15","unstructured":"Kumar, A., Liang, P., & Ma, T. (2019). Verified uncertainty calibration. In: Advances in Neural Information Processing Systems (NeurIPS\u201919)."},{"issue":"1","key":"6652_CR16","first-page":"41","volume":"26","author":"AH Murphy","year":"1977","unstructured":"Murphy, A. H., & Winkler, R. L. (1977). Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society Series C (Applied Statistics), 26(1), 41\u201347.","journal-title":"Journal of the Royal Statistical Society Series C (Applied Statistics)"},{"key":"6652_CR17","unstructured":"Naeini, M.P., Cooper, G., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using bayesian binning. In: AAAI Conference on Artificial Intelligence"},{"key":"6652_CR18","unstructured":"Nakkiran, P., Neyshabur, B., & Sedghi, H. (2021). The deep bootstrap framework: Good online learners are good offline generalizers. arXiv:2010.08127."},{"key":"6652_CR19","doi-asserted-by":"crossref","unstructured":"Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. In: Proceedings of the 22nd international conference on Machine learning, 625\u2013632.","DOI":"10.1145\/1102351.1102430"},{"key":"6652_CR20","unstructured":"Nieto, M.P., Song, H., Filho, T.S., & K\u00e4ngsepp, M. (2019). PyCalib: Python library for classifier calibration. https:\/\/github.com\/classifier-calibration\/PyCalib"},{"key":"6652_CR21","unstructured":"Nixon, J., Dusenberry, M., & Zhang, L., et\u00a0al. (2019). Measuring calibration in deep learning. ArXiv arXiv:1904.01685"},{"key":"6652_CR22","doi-asserted-by":"crossref","unstructured":"Platt, J., et al. (2000). Probabilities for SV machines. In A. Smola, P. Bartlett, & B. Sch\u00f6lkopf (Eds.), Advances in Large Margin Classifiers, 61\u201374. MIT Press.","DOI":"10.7551\/mitpress\/1113.003.0008"},{"key":"6652_CR23","first-page":"7933","volume":"35","author":"T Popordanoska","year":"2022","unstructured":"Popordanoska, T., Sayer, R., & Blaschko, M. (2022). A consistent and differentiable lp canonical calibration error estimator. Advances in Neural Information Processing Systems, 35, 7933\u20137946.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6652_CR24","unstructured":"Popordanoska, T., Gruber, S.G., & Tiulpin, A., et\u00a0al. (2023). Consistent and asymptotically unbiased estimation of proper calibration errors. arXiv preprint arXiv:2312.08589"},{"key":"6652_CR25","unstructured":"Rahimi, A., Shaban, A., Cheng, C.A., Hartley, R., & Boots, B. (2020). Intra order-preserving functions for calibration of multi-class neural networks. In: Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"6652_CR26","unstructured":"Roelofs, R., Cain, N., Shlens, J., & Mozer, M. (2020). Mitigating bias in calibration error estimation. ArXiv abs\/2012.08668."},{"key":"6652_CR27","unstructured":"Song, H., Perello-Nieto, M., Santos-Rodriguez, R., Kull, M., & Flach, P., and others (2021). Classifier calibration: How to assess and improve predicted class probabilities: A survey. arXiv preprint arXiv:2112.10327."},{"key":"6652_CR28","doi-asserted-by":"publisher","unstructured":"Tikka, J., & Hollm\u00e9n, J. (2008). Sequential input selection algorithm for long-term prediction of time series. Neurocomputing 71(13):2604\u20132615. https:\/\/doi.org\/10.1016\/j.neucom.2007.11.037, https:\/\/www.sciencedirect.com\/science\/article\/pii\/S0925231208002233, artificial Neural Networks (ICANN 2006) \/ Engineering of Intelligent Systems (ICEIS 2006).","DOI":"10.1016\/j.neucom.2007.11.037"},{"key":"6652_CR29","unstructured":"Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J., & Sch\u00f6n, T. (2019). Evaluating model calibration in classification. In: Chaudhuri K, Sugiyama M (eds) Proceedings of Machine Learning Research, Proceedings of Machine Learning Research, 89. PMLR, 3459\u20133467."},{"key":"6652_CR30","unstructured":"Widmann, D., Lindsten, F., & Zachariah, D. (2019). Calibration tests in multi-class classification: A unifying framework. In: NeurIPS."},{"key":"6652_CR31","unstructured":"Xiong, M., Deng, A., Koh, P.W., Wu, J., Li, S., Xu, J., & Hooi B. (2023). Proximity-informed calibration for deep neural networks. In: Thirty-seventh Conference on Neural Information Processing Systems, https:\/\/openreview.net\/forum?id=xOJUmwwlJc"},{"key":"6652_CR32","doi-asserted-by":"crossref","unstructured":"Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In: Proc. 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD\u201902). ACM, 694\u2013699.","DOI":"10.1145\/775047.775151"},{"key":"6652_CR33","unstructured":"Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. CoRR abs\/1605.07146. arXiv:1605.07146."},{"key":"6652_CR34","unstructured":"Zhang, J., Kailkhura, B., & Han, T.Y. (2020). Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In: ICML."}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-024-06652-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-024-06652-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-024-06652-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,30]],"date-time":"2025-03-30T15:12:08Z","timestamp":1743347528000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-024-06652-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,24]]},"references-count":34,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,4]]}},"alternative-id":["6652"],"URL":"https:\/\/doi.org\/10.1007\/s10994-024-06652-6","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"value":"0885-6125","type":"print"},{"value":"1573-0565","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,24]]},"assertion":[{"value":"4 April 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 January 2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 September 2024","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 February 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no conflict of interest to declare that are relevant to the content of this article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval"}},{"value":"The authors agree to participate in the conference.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to participate"}},{"value":"The authors permit the publication of the article and its materials.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}}],"article-number":"105"}}