{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,25]],"date-time":"2026-04-25T02:55:48Z","timestamp":1777085748174,"version":"3.51.4"},"reference-count":33,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,11,28]],"date-time":"2022-11-28T00:00:00Z","timestamp":1669593600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,11,28]],"date-time":"2022-11-28T00:00:00Z","timestamp":1669593600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100002341","name":"Academy of Finland","doi-asserted-by":"crossref","award":["317085"],"award-info":[{"award-number":["317085"]}],"id":[{"id":"10.13039\/501100002341","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100002341","name":"Academy of Finland","doi-asserted-by":"crossref","award":["325117"],"award-info":[{"award-number":["325117"]}],"id":[{"id":"10.13039\/501100002341","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100000781","name":"European Research Council","doi-asserted-by":"publisher","award":["834862"],"award-info":[{"award-number":["834862"]}],"id":[{"id":"10.13039\/501100000781","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100007601","name":"Horizon 2020","doi-asserted-by":"publisher","award":["871042"],"award-info":[{"award-number":["871042"]}],"id":[{"id":"10.13039\/501100007601","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Wallenberg AI, Autonomous Systems and Software Program"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Data Min Knowl Disc"],"published-print":{"date-parts":[[2023,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Decision trees are popular classification models, providing high accuracy and intuitive explanations. However, as the tree size grows the model interpretability deteriorates. Traditional tree-induction algorithms, such as C4.5 and CART, rely on impurity-reduction functions that promote the discriminative power of each split. Thus, although these traditional methods are accurate in practice, there has been no theoretical guarantee that they will produce small trees. In this paper, we justify the use of a general family of impurity functions, including the popular functions of entropy and Gini-index, in scenarios where small trees are desirable, by showing that a simple enhancement can equip them with complexity guarantees. We consider a general setting, where objects to be classified are drawn from an arbitrary probability distribution, classification can be binary or multi-class, and splitting tests are associated with non-uniform costs. As a measure of tree complexity, we adopt the expected cost to classify an object drawn from the input distribution, which, in the uniform-cost case, is the expected number of tests. We propose a tree-induction algorithm that gives a logarithmic approximation guarantee on the tree complexity. This approximation factor is tight up to a constant factor under mild assumptions. The algorithm recursively selects a test that maximizes a greedy criterion defined as a weighted sum of three components. The first two components encourage the selection of tests that improve the balance and the cost-efficiency of the tree, respectively, while the third impurity-reduction component encourages the selection of more discriminative tests. As shown in our empirical evaluation, compared to the original heuristics, the enhanced algorithms strike an excellent balance between predictive accuracy and tree complexity.<\/jats:p>","DOI":"10.1007\/s10618-022-00884-7","type":"journal-article","created":{"date-parts":[[2022,11,28]],"date-time":"2022-11-28T11:04:06Z","timestamp":1669633446000},"page":"434-475","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":27,"title":["Regularized impurity reduction: accurate decision trees with complexity guarantees"],"prefix":"10.1007","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1252-7489","authenticated-orcid":false,"given":"Guangyi","family":"Zhang","sequence":"first","affiliation":[]},{"given":"Aristides","family":"Gionis","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,11,28]]},"reference":[{"key":"884_CR1","doi-asserted-by":"crossref","unstructured":"Adler M, Heeringa B (2008) Approximating optimal binary decision trees. In: Goel A, Jansen k, Rolim JDP, Rubinfeld R (eds) Approximation, randomization and combinatorial optimization. Algorithms and techniques. Springer, Berlin, pp 1\u20139","DOI":"10.1007\/978-3-540-85363-3_1"},{"issue":"1","key":"884_CR2","doi-asserted-by":"publisher","first-page":"459","DOI":"10.1109\/TIT.2011.2169296","volume":"58","author":"G Bellala","year":"2012","unstructured":"Bellala G, Bhavnani SK, Scott C (2012) Group-based active query selection for rapid diagnosis in time-critical situations. IEEE Trans Inf Theory 58(1):459\u2013478","journal-title":"IEEE Trans Inf Theory"},{"key":"884_CR3","unstructured":"Blanc G, Lange J, Tan LY (2020) Top-down induction of decision trees: rigorous guarantees and inherent limitations. In: 11th innovations in theoretical computer science conference (ITCS 2020), Schloss Dagstuhl-Leibniz-Zentrum f\u00fcr Informatik"},{"key":"884_CR4","volume-title":"Classification and regression trees","author":"L Breiman","year":"1984","unstructured":"Breiman L, Friedman J, Stone CJ et al (1984) Classification and regression trees. CRC Press, Boca Raton"},{"key":"884_CR5","unstructured":"Brutzkus A, Daniely A, Malach E (2019) On the optimality of trees generated by ID3. arXiv:1907.05444"},{"key":"884_CR6","doi-asserted-by":"crossref","unstructured":"Chakaravarthy VT, Pandit V, Roy S et\u00a0al (2007) Decision trees for entity identification: approximation algorithms and hardness results. In: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, pp 53\u201362","DOI":"10.1145\/1265530.1265538"},{"key":"884_CR7","unstructured":"Cicalese F, Laber E, Saettler AM (2014) Diagnosis determination: decision trees optimizing simultaneously worst and expected testing cost. In: International conference on machine learning, pp 414\u2013422"},{"key":"884_CR8","unstructured":"Dasgupta S (2005) Analysis of a greedy active learning strategy. In: Advances in neural information processing systems, pp 337\u2013344. https:\/\/proceedings.neurips.cc\/paper\/2004\/hash\/c61fbef63df5ff317aecdc3670094472-Abstract.html"},{"key":"884_CR9","first-page":"1","volume":"7","author":"J Dem\u0161ar","year":"2006","unstructured":"Dem\u0161ar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1\u201330","journal-title":"J Mach Learn Res"},{"issue":"42","key":"884_CR10","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2876506","volume":"12","author":"A Deshpande","year":"2016","unstructured":"Deshpande A, Hellerstein L, Kletenik D (2016) Approximation algorithms for stochastic Boolean function evaluation and stochastic submodular set cover with applications to boolean function evaluation and min-knapsack. ACM Trans Algorithms 12(42):1\u201342","journal-title":"ACM Trans Algorithms"},{"key":"884_CR11","unstructured":"Doshi-Velez F, Kim B (2017) Towards a rigorous science of interpretable machine learning. arXiv:1702.08608"},{"key":"884_CR12","unstructured":"Dua D, Graff C (2017) UCI machine learning repository. http:\/\/archive.ics.uci.edu\/ml"},{"issue":"4","key":"884_CR13","doi-asserted-by":"publisher","first-page":"634","DOI":"10.1145\/285055.285059","volume":"45","author":"U Feige","year":"1998","unstructured":"Feige U (1998) A threshold of ln n for approximating set cover. J ACM: JACM 45(4):634\u2013652","journal-title":"J ACM: JACM"},{"issue":"4","key":"884_CR14","doi-asserted-by":"publisher","first-page":"219","DOI":"10.1007\/s00453-004-1110-5","volume":"40","author":"U Feige","year":"2004","unstructured":"Feige U, Lov\u00e1sz L, Tetali P (2004) Approximating min sum set cover. Algorithmica 40(4):219\u2013234","journal-title":"Algorithmica"},{"issue":"1","key":"884_CR15","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2594473.2594475","volume":"15","author":"AA Freitas","year":"2014","unstructured":"Freitas AA (2014) Comprehensible classification models: a position paper. ACM SIGKDD Explor Newsl 15(1):1\u201310","journal-title":"ACM SIGKDD Explor Newsl"},{"issue":"2","key":"884_CR16","doi-asserted-by":"publisher","first-page":"173","DOI":"10.1137\/0123019","volume":"23","author":"MR Garey","year":"1972","unstructured":"Garey MR (1972) Optimal binary identification procedures. SIAM J Appl Math 23(2):173\u2013186","journal-title":"SIAM J Appl Math"},{"key":"884_CR17","first-page":"427","volume":"42","author":"D Golovin","year":"2011","unstructured":"Golovin D, Krause A (2011) Adaptive submodularity: theory and applications in active learning and stochastic optimization. J Artif Intell Res 42:427\u2013486","journal-title":"J Artif Intell Res"},{"key":"884_CR18","unstructured":"Golovin D, Krause A, Ray D (2010) Near-optimal Bayesian active learning with noisy observations. In: Lafferty JD, Williams CKI and Taylor JS and Richard S. Zemel and Aron Culotta Advances in neural information processing systems, pp 766\u2013774"},{"key":"884_CR19","doi-asserted-by":"crossref","unstructured":"Grammel N, Hellerstein L, Kletenik D et\u00a0al (2016) Scenario submodular cover. In: International workshop on approximation and online algorithms. Springer, pp 116\u2013128","DOI":"10.1007\/978-3-319-51741-4_10"},{"key":"884_CR20","doi-asserted-by":"crossref","unstructured":"Guillory A, Bilmes J (2009) Average-case active learning with costs. In: International conference on algorithmic learning theory. Springer, pp 141\u2013155","DOI":"10.1007\/978-3-642-04414-4_15"},{"key":"884_CR21","unstructured":"Guillory A, Bilmes JA (2011) Simultaneous learning and covering with adversarial noise. In: International conference on machine learning"},{"issue":"3","key":"884_CR22","doi-asserted-by":"publisher","first-page":"876","DOI":"10.1287\/moor.2016.0831","volume":"42","author":"A Gupta","year":"2017","unstructured":"Gupta A, Nagarajan V, Ravi R (2017) Approximation algorithms for optimal decision trees and adaptive TSP problems. Math Oper Res 42(3):876\u2013896","journal-title":"Math Oper Res"},{"issue":"2","key":"884_CR23","doi-asserted-by":"publisher","first-page":"114","DOI":"10.1006\/inco.1996.0040","volume":"126","author":"T Hancock","year":"1996","unstructured":"Hancock T, Jiang T, Li M et al (1996) Lower bounds on learning decision lists and trees. Inf Comput 126(2):114\u2013122","journal-title":"Inf Comput"},{"key":"884_CR24","doi-asserted-by":"crossref","unstructured":"Im S, Nagarajan V, van\u00a0der Zwaan R (2012) Minimum latency submodular cover. In: Czumaj A, Mehlhorn K, Pitts AM, Wattenhofer R (eds) International colloquium on automata, languages, and programming, Vol 7391. Springer, pp 485\u2013497. https:\/\/doi.org\/10.1007\/978-3-642-31594-7_41\/","DOI":"10.1007\/978-3-642-31594-7_41"},{"issue":"1","key":"884_CR25","doi-asserted-by":"publisher","first-page":"109","DOI":"10.1006\/jcss.1997.1543","volume":"58","author":"M Kearns","year":"1999","unstructured":"Kearns M, Mansour Y (1999) On the boosting ability of top-down decision tree learning algorithms. J Comput Syst Sci 58(1):109\u2013128","journal-title":"J Comput Syst Sci"},{"key":"884_CR26","doi-asserted-by":"crossref","unstructured":"Kosaraju SR, Przytycka TM, Borgstrom R (1999) On an optimal split tree problem. In: Workshop on algorithms and data structures. Springer, pp 157\u2013168","DOI":"10.1007\/3-540-48447-7_17"},{"issue":"1","key":"884_CR27","doi-asserted-by":"publisher","first-page":"15","DOI":"10.1016\/0020-0190(76)90095-8","volume":"5","author":"H Laurent","year":"1976","unstructured":"Laurent H, Rivest RL (1976) Constructing optimal binary decision trees is np-complete. Inf Process Lett 5(1):15\u201317","journal-title":"Inf Process Lett"},{"issue":"3","key":"884_CR28","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1145\/3236386.3241340","volume":"16","author":"ZC Lipton","year":"2018","unstructured":"Lipton ZC (2018) The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16(3):31\u201357","journal-title":"Queue"},{"issue":"4","key":"884_CR29","doi-asserted-by":"publisher","first-page":"345","DOI":"10.1023\/A:1009744630224","volume":"2","author":"SK Murthy","year":"1998","unstructured":"Murthy SK (1998) Automatic construction of decision trees from data: a multi-disciplinary survey. Data Min Knowl Disc 2(4):345\u2013389","journal-title":"Data Min Knowl Disc"},{"issue":"3","key":"884_CR30","doi-asserted-by":"publisher","first-page":"856","DOI":"10.1287\/opre.2019.1889","volume":"68","author":"F Navidi","year":"2020","unstructured":"Navidi F, Kambadur P, Nagarajan V (2020) Adaptive submodular ranking and routing. Oper Res 68(3):856\u2013877","journal-title":"Oper Res"},{"key":"884_CR31","volume-title":"C4.5: programs for machine learning","author":"JR Quinlan","year":"1993","unstructured":"Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers, Burlington"},{"issue":"1","key":"884_CR32","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1471-2105-8-25","volume":"8","author":"C Strobl","year":"2007","unstructured":"Strobl C, Boulesteix AL, Zeileis A et al (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform 8(1):1\u201321","journal-title":"BMC Bioinform"},{"issue":"2","key":"884_CR33","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1145\/2641190.2641198","volume":"15","author":"J Vanschoren","year":"2013","unstructured":"Vanschoren J, van Rijn JN, Bischl B et al (2013) OpenML: networked science in machine learning. SIGKDD Explor 15(2):49\u201360. https:\/\/doi.org\/10.1145\/2641190.2641198","journal-title":"SIGKDD Explor"}],"container-title":["Data Mining and Knowledge Discovery"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10618-022-00884-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10618-022-00884-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10618-022-00884-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,4]],"date-time":"2023-01-04T17:46:25Z","timestamp":1672854385000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10618-022-00884-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,28]]},"references-count":33,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1]]}},"alternative-id":["884"],"URL":"https:\/\/doi.org\/10.1007\/s10618-022-00884-7","relation":{},"ISSN":["1384-5810","1573-756X"],"issn-type":[{"value":"1384-5810","type":"print"},{"value":"1573-756X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,11,28]]},"assertion":[{"value":"9 February 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 October 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 November 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no conflicts of interest to declare that are relevant to the content of this article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to participate"}},{"value":"Not applicable.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}}]}}