{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,4]],"date-time":"2026-07-04T01:24:57Z","timestamp":1783128297328,"version":"3.54.6"},"reference-count":38,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2024,1,8]],"date-time":"2024-01-08T00:00:00Z","timestamp":1704672000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,1,8]],"date-time":"2024-01-08T00:00:00Z","timestamp":1704672000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100005722","name":"Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005722","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2024,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Regression trees and forests are widely used due to their flexibility and predictive accuracy. Whereas typical tree induction assumes independently identically distributed (i.i.d.) data, in many applications the training sample follows a complex sampling structure. This includes unequal probability sampling, which is often found in survey data. Then, a \u2018naive estimation\u2019 that simply ignores the sampling weights may be substantially biased. This article analyzes the bias arising from a naive estimation of regression trees or forests under complex sample designs and proposes ways of de-biasing. This is achieved by bridging tree learning to survey statistics, due to the correspondence of the mean-squared-error criterion in regression trees and variance estimation. 
Transferring population variance estimation approaches from survey statistics to tree induction, indeed considerably reduces the bias in the resulting trees, both in predictions and the tree structure. The latter is particularly crucial if the trees are to be interpreted. Our methodology is extended to random forests, where we show on simulated data and a housing dataset that correcting for complex sample designs leads to overall much better predictive accuracy and more trustworthy interpretation. Interestingly, corrected forests can surpass forests learned on i.i.d. samples in terms of accuracy, which also has important implications for adaptive data collection approaches.<\/jats:p>","DOI":"10.1007\/s10994-023-06439-1","type":"journal-article","created":{"date-parts":[[2024,1,8]],"date-time":"2024-01-08T18:02:35Z","timestamp":1704736955000},"page":"3379-3398","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":16,"title":["Learning de-biased regression trees and forests from complex samples"],"prefix":"10.1007","volume":"113","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3439-4469","authenticated-orcid":false,"given":"Malte","family":"Nalenz","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Julian","family":"Rodemann","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Thomas","family":"Augustin","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2024,1,8]]},"reference":[{"issue":"2","key":"6439_CR1","doi-asserted-by":"publisher","first-page":"190","DOI":"10.1214\/16-STS589","volume":"32","author":"FJ Breidt","year":"2017","unstructured":"Breidt, F. J., & Opsomer, J. D. (2017). Model-assisted survey estimation with modern prediction techniques. 
Statistical Science, 32(2), 190\u2013205.","journal-title":"Statistical Science"},{"key":"6439_CR2","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman, L. (2001). Random forests. Machine Learning, 45, 5\u201332.","journal-title":"Machine Learning"},{"issue":"1","key":"6439_CR3","doi-asserted-by":"publisher","first-page":"65","DOI":"10.1007\/BF02204352","volume":"25","author":"A Chaudhuri","year":"1978","unstructured":"Chaudhuri, A. (1978). On estimating the variance of a finite population. Metrika, 25(1), 65\u201376.","journal-title":"Metrika"},{"issue":"2","key":"6439_CR4","doi-asserted-by":"publisher","first-page":"236","DOI":"10.1198\/1085711043596","volume":"9","author":"J-YP Courbois","year":"2004","unstructured":"Courbois, J.-Y.P., & Urquhart, N. S. (2004). Comparison of survey estimates of the finite population variance. Journal of Agricultural, Biological, and Environmental Statistics, 9(2), 236\u2013251.","journal-title":"Journal of Agricultural, Biological, and Environmental Statistics"},{"key":"6439_CR5","first-page":"1","volume":"118","author":"M Dagdoug","year":"2021","unstructured":"Dagdoug, M., Goga, C., & Haziza, D. (2021). Model-assisted estimation through random forests in finite population sampling. Journal of the American Statistical Association, 118, 1\u201318.","journal-title":"Journal of the American Statistical Association"},{"issue":"1","key":"6439_CR6","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1093\/biomet\/85.1.89","volume":"85","author":"J-C Deville","year":"1998","unstructured":"Deville, J.-C., & Tille, Y. (1998). Unequal probability sampling without replacement through a splitting method. 
Biometrika, 85(1), 89\u2013101.","journal-title":"Biometrika"},{"key":"6439_CR7","first-page":"3133","volume":"15","author":"M Fernandez-Delgado","year":"2014","unstructured":"Fernandez-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15, 3133\u20133181.","journal-title":"The Journal of Machine Learning Research"},{"issue":"177","key":"6439_CR8","first-page":"1","volume":"20","author":"A Fisher","year":"2019","unstructured":"Fisher, A., Rudin, C., & Dominici, F. (2019). All models are wrong, but many are useful: Learning a variable\u2019s importance by studying an entire class of prediction models simultaneously. The Journal of Machine Learning Research, 20(177), 1\u201381.","journal-title":"The Journal of Machine Learning Research"},{"issue":"5","key":"6439_CR9","doi-asserted-by":"publisher","first-page":"1189","DOI":"10.1214\/aos\/1013203451","volume":"29","author":"JH Friedman","year":"2001","unstructured":"Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189\u20131232.","journal-title":"Annals of Statistics"},{"issue":"1","key":"6439_CR10","doi-asserted-by":"publisher","first-page":"44","DOI":"10.1080\/10618600.2014.907095","volume":"24","author":"A Goldstein","year":"2015","unstructured":"Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1), 44\u201365.","journal-title":"Journal of Computational and Graphical Statistics"},{"issue":"1","key":"6439_CR11","first-page":"723","volume":"13","author":"A Gretton","year":"2012","unstructured":"Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch\u00f6lkopf, B., & Smola, A. (2012). A kernel two-sample test. 
The Journal of Machine Learning Research, 13(1), 723\u2013773.","journal-title":"The Journal of Machine Learning Research"},{"key":"6439_CR12","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-84858-7","volume-title":"The elements of statistical learning: data mining, inference, and prediction","author":"T Hastie","year":"2009","unstructured":"Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer."},{"issue":"2","key":"6439_CR13","doi-asserted-by":"publisher","first-page":"206","DOI":"10.1214\/16-STS608","volume":"32","author":"D Haziza","year":"2017","unstructured":"Haziza, D., & Beaumont, J.-F. (2017). Construction of weights in surveys: A review. Statistical Science, 32(2), 206\u2013226.","journal-title":"Statistical Science"},{"issue":"260","key":"6439_CR14","doi-asserted-by":"publisher","first-page":"663","DOI":"10.1080\/01621459.1952.10483446","volume":"47","author":"DG Horvitz","year":"1952","unstructured":"Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260), 663\u2013685.","journal-title":"Journal of the American statistical Association"},{"issue":"1","key":"6439_CR15","doi-asserted-by":"publisher","first-page":"275","DOI":"10.1214\/aos\/1176346078","volume":"11","author":"T Liu","year":"1983","unstructured":"Liu, T., & Thompson, M. (1983). Properties of estimators of quadratic finite population functions: The batch approach. The Annals of Statistics, 11(1), 275\u2013285.","journal-title":"The Annals of Statistics"},{"key":"6439_CR16","doi-asserted-by":"publisher","DOI":"10.1201\/9780429298899","volume-title":"Sampling: Design and analysis","author":"SL Lohr","year":"2021","unstructured":"Lohr, S. L. (2021). Sampling: Design and analysis (3rd ed.). 
CRC Press.","edition":"3"},{"key":"6439_CR17","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v009.i08","volume":"9","author":"T Lumley","year":"2004","unstructured":"Lumley, T. (2004). Analysis of complex survey samples. Journal of Statistical Software, 9, 1\u201319.","journal-title":"Journal of Statistical Software"},{"key":"6439_CR18","unstructured":"Lundberg, S.\u00a0M., Erion, G.\u00a0G., & Lee, S.-I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888"},{"issue":"1","key":"6439_CR19","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0280387","volume":"18","author":"N MacNell","year":"2023","unstructured":"MacNell, N., Feinstein, L., Wilkerson, J., Salo, P. M., Molsberry, S. A., Fessler, M. B., Thorne, P. S., Motsinger-Reif, A. A., & Zeldin, D. C. (2023). Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting. PLoS ONE, 18(1), e0280387.","journal-title":"PLoS ONE"},{"issue":"2","key":"6439_CR20","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1111\/sjos.12356","volume":"46","author":"KS McConville","year":"2019","unstructured":"McConville, K. S., & Toth, D. (2019). Automated selection of post-strata using a model-assisted regression tree estimator. Scandinavian Journal of Statistics, 46(2), 389\u2013413.","journal-title":"Scandinavian Journal of Statistics"},{"issue":"1","key":"6439_CR21","first-page":"67","volume":"12","author":"F Mecatti","year":"2000","unstructured":"Mecatti, F. (2000). Bootstrapping unequal probability samples. Statistica Applicata, 12(1), 67\u201377.","journal-title":"Statistica Applicata"},{"issue":"6","key":"6439_CR22","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0131765","volume":"10","author":"M Nahorniak","year":"2015","unstructured":"Nahorniak, M., Larsen, D. P., Volk, C., & Jordan, C. E. (2015). 
Using inverse probability bootstrap sampling to eliminate sample induced bias in model based analysis of unequal probability samples. PLoS ONE, 10(6), e0131765.","journal-title":"PLoS ONE"},{"key":"6439_CR23","volume-title":"Robust generalizations of stochastic derivative-free optimization","author":"J Rodemann","year":"2021","unstructured":"Rodemann, J. (2021). Robust generalizations of stochastic derivative-free optimization. LMU Munich."},{"key":"6439_CR24","unstructured":"Rodemann, J., Fischer, S., Schneider, L., Nalenz, M., & Augustin, T. (2022). Not all data are created equal: Lessons from sampling theory for adaptive machine learning. In Poster presented at international conference on statistics and data science (ICSDS), Institute of Mathematical Statistics (IMS)."},{"issue":"4","key":"6439_CR25","doi-asserted-by":"publisher","first-page":"476","DOI":"10.1109\/TSMCC.2004.843247","volume":"35","author":"L Rokach","year":"2005","unstructured":"Rokach, L., & Maimon, O. (2005). Top-down induction of decision trees classifiers-a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(4), 476\u2013487.","journal-title":"IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)"},{"key":"6439_CR26","volume-title":"Model assisted survey sampling","author":"C-E S\u00e4rndal","year":"2003","unstructured":"S\u00e4rndal, C.-E., Swensson, B., & Wretman, J. (2003). Model assisted survey sampling. Springer."},{"key":"6439_CR27","doi-asserted-by":"publisher","first-page":"281","DOI":"10.1023\/A:1006316418865","volume":"66","author":"HT Schreuder","year":"2001","unstructured":"Schreuder, H. T., Gregoire, T. G., & Weyer, J. P. (2001). For what applications can probability and non-probability sampling be used? 
Environmental Monitoring and Assessment, 66, 281\u2013291.","journal-title":"Environmental Monitoring and Assessment"},{"issue":"3","key":"6439_CR28","doi-asserted-by":"publisher","first-page":"278","DOI":"10.1177\/0962280210395740","volume":"22","author":"SR Seaman","year":"2013","unstructured":"Seaman, S. R., & White, I. R. (2013). Review of inverse probability weighting for dealing with missing data. Statistical methods in medical research, 22(3), 278\u2013295.","journal-title":"Statistical methods in medical research"},{"issue":"28","key":"6439_CR29","first-page":"307","volume":"2","author":"LS Shapley","year":"1953","unstructured":"Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28), 307\u2013317.","journal-title":"Contributions to the Theory of Games"},{"issue":"2","key":"6439_CR30","doi-asserted-by":"publisher","first-page":"165","DOI":"10.1214\/17-STS614","volume":"32","author":"C Skinner","year":"2017","unstructured":"Skinner, C., & Wakefield, J. (2017). Introduction to the design and analysis of complex survey data. Statistical Science, 32(2), 165\u2013175.","journal-title":"Statistical Science"},{"key":"6439_CR31","unstructured":"Snoek, J., Larochelle, H., & Adams, R.\u00a0P. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems (vol. 25)."},{"issue":"3","key":"6439_CR32","first-page":"374","volume":"56","author":"A Swain","year":"1994","unstructured":"Swain, A., & Mishra, G. (1994). Estimation of finite population variance under unequal probability sampling. Sankhy\u0101: The Indian Journal of Statistics, Series B, 56(3), 374\u2013388.","journal-title":"Sankhy\u0101: The Indian Journal of Statistics, Series B"},{"issue":"1","key":"6439_CR33","first-page":"19","volume":"4","author":"T Therneau","year":"2022","unstructured":"Therneau, T., & Atkinson, B. (2022). rpart: Recursive partitioning and regression trees. 
R Package Version, 4(1), 19.","journal-title":"R Package Version"},{"issue":"496","key":"6439_CR34","doi-asserted-by":"publisher","first-page":"1626","DOI":"10.1198\/jasa.2011.tm10383","volume":"106","author":"D Toth","year":"2011","unstructured":"Toth, D., & Eltinge, J. L. (2011). Building consistent regression trees from complex sample data. Journal of the American Statistical Association, 106(496), 1626\u20131636.","journal-title":"Journal of the American Statistical Association"},{"key":"6439_CR35","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-93632-1","volume-title":"Practical tools for designing and weighting survey samples","author":"R Valliant","year":"2018","unstructured":"Valliant, R., Dever, J. A., & Kreuter, F. (2018). Practical tools for designing and weighting survey samples. Springer."},{"key":"6439_CR36","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v077.i01","volume":"77","author":"MN Wright","year":"2017","unstructured":"Wright, M. N., & Ziegler, A. (2017). Ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77, 1\u201317.","journal-title":"Journal of Statistical Software"},{"issue":"1","key":"6439_CR37","doi-asserted-by":"publisher","first-page":"291","DOI":"10.3233\/SJI-210875","volume":"38","author":"W Yung","year":"2022","unstructured":"Yung, W., Tam, S.-M., Buelens, B., Chipman, H., Dumpert, F., Ascari, G., Rocci, F., Burger, J., & Choi, I. K. (2022). A quality framework for statistical algorithms. Statistical Journal of the IAOS, 38(1), 291\u2013308.","journal-title":"Statistical Journal of the IAOS"},{"key":"6439_CR38","first-page":"7460","volume":"34","author":"K Zhang","year":"2021","unstructured":"Zhang, K., Janson, L., & Murphy, S. (2021). Statistical inference with M-estimators on adaptively collected data. 
Advances in Neural Information Processing Systems, 34, 7460\u20137471.","journal-title":"Advances in Neural Information Processing Systems"}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-023-06439-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-023-06439-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-023-06439-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,10]],"date-time":"2024-05-10T15:09:29Z","timestamp":1715353769000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-023-06439-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,8]]},"references-count":38,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,6]]}},"alternative-id":["6439"],"URL":"https:\/\/doi.org\/10.1007\/s10994-023-06439-1","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"value":"0885-6125","type":"print"},{"value":"1573-0565","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,8]]},"assertion":[{"value":"15 June 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 August 2023","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 October 2023","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 January 2024","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"None.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"The empirical evaluation is based on data from the South Korean statistical office.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to participate"}},{"value":"Not applicable.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}}]}}