{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,19]],"date-time":"2026-04-19T12:45:39Z","timestamp":1776602739354,"version":"3.51.2"},"reference-count":21,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2026,4,19]],"date-time":"2026-04-19T00:00:00Z","timestamp":1776556800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,4,19]],"date-time":"2026-04-19T00:00:00Z","timestamp":1776556800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Portuguese Foundation for Science and Technology"},{"DOI":"10.13039\/501100006752","name":"Universidade do Porto","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100006752","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Supercomput"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>In this work, we present a novel methodology for decision trees that use oversampling, not before tree construction (in the entire dataset), but inside each internal node (and corresponding input space region) of the tree. This strategy proves to be successful in fighting the greedy nature of decision trees. We take also into consideration the nature of the input variables, not just quantitative or binary, and also introduce the use of novel distances between instances that can also be used in other contexts. The application of our methodology to a significant number of datasets, thirteen, both balanced and imbalanced problems, shows the relevance of our approach when compared to CART and C5.0. 
Although our experiments were conducted on a standard computing platform, the proposed approach is well suited for high-performance computing environments, since node-level oversampling and distance computations can be efficiently parallelized, enabling the method to scale to large and high-dimensional datasets.<\/jats:p>","DOI":"10.1007\/s11227-026-08503-8","type":"journal-article","created":{"date-parts":[[2026,4,19]],"date-time":"2026-04-19T12:17:11Z","timestamp":1776601031000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Sodet\u2014synthetic oversampling decision trees"],"prefix":"10.1007","volume":"82","author":[{"given":"Joaquim Fernando Pinto","family":"da Costa","sequence":"first","affiliation":[]},{"given":"Hugo","family":"Alonso","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,4,19]]},"reference":[{"key":"8503_CR1","doi-asserted-by":"crossref","unstructured":"Alonso H, Costa J (2025) Over-sampling methods for mixed data in imbalanced problems. Commun Stat\u2014Simul Comput 1\u201323","DOI":"10.1080\/03610918.2024.2447451"},{"key":"8503_CR2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2907070","volume":"49","author":"P Branco","year":"2016","unstructured":"Branco P, Torgo L, Ribeiro R (2016) A survey of predictive modeling on imbalanced domains. ACM Comput Surv 49:1\u201350","journal-title":"ACM Comput Surv"},{"key":"8503_CR3","unstructured":"Breiman L, Friedman J, Stone C, Olshen R (1984) Classification and regression trees. Taylor & Francis"},{"key":"8503_CR4","doi-asserted-by":"publisher","first-page":"664","DOI":"10.1007\/s10489-011-0287-y","volume":"36","author":"C Bunkhumpornpat","year":"2012","unstructured":"Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2012) DBSMOTE: density-based synthetic minority over-sampling technique. 
Appl Intell 36:664\u2013684","journal-title":"Appl Intell"},{"key":"8503_CR5","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","volume":"16","author":"N Chawla","year":"2002","unstructured":"Chawla N, Bowyer K, Hall L, Kegelmeyer W (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321\u2013357","journal-title":"J Artif Intell Res"},{"key":"8503_CR6","doi-asserted-by":"crossref","unstructured":"Chawla N (2005) Data mining for imbalanced datasets: an overview. Data Min Knowl Discov Handb 853\u2013867","DOI":"10.1007\/0-387-25465-X_40"},{"key":"8503_CR7","doi-asserted-by":"publisher","first-page":"57","DOI":"10.1023\/A:1022664626993","volume":"10","author":"S Cost","year":"1993","unstructured":"Cost S, Salzberg S (1993) A weighted nearest neighbor algorithm for learning with symbolic features. Mach Learn 10:57\u201378","journal-title":"Mach Learn"},{"key":"8503_CR8","doi-asserted-by":"crossref","unstructured":"Everitt B, Dunn G (2001) Applied multivariate data analysis, 2nd edn. Wiley","DOI":"10.1002\/9781118887486"},{"key":"8503_CR9","doi-asserted-by":"publisher","first-page":"857","DOI":"10.2307\/2528823","volume":"27","author":"J Gower","year":"1971","unstructured":"Gower J (1971) A general coefficient of similarity and some of its properties. Biometrics 27:857\u2013871","journal-title":"Biometrics"},{"key":"8503_CR10","doi-asserted-by":"crossref","unstructured":"Han H, Wang W, Mao B (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: International Conference on Intelligent Computing, pp. 878\u2013887","DOI":"10.1007\/11538059_91"},{"key":"8503_CR11","doi-asserted-by":"publisher","first-page":"1263","DOI":"10.1109\/TKDE.2008.239","volume":"21","author":"H He","year":"2009","unstructured":"He H, Garcia E (2009) Learning from imbalanced data. 
IEEE Trans Knowl Data Eng 21:1263\u20131284","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"8503_CR12","doi-asserted-by":"crossref","unstructured":"He H, Ma Y (2013) Imbalanced learning: foundations, algorithms, and applications. Wiley","DOI":"10.1002\/9781118646106"},{"key":"8503_CR13","doi-asserted-by":"crossref","unstructured":"He H, Bai Y, Garcia E, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the International Joint Conference on Neural Networks, pp 1322\u20131328","DOI":"10.1109\/IJCNN.2008.4633969"},{"key":"8503_CR14","doi-asserted-by":"crossref","unstructured":"Jajuga K, Walesiak M, Bak A (2003) On the general distance measure. In: Exploratory Data Analysis in Empirical Research. Studies In Classification, Data Analysis, and Knowledge Organization, pp 104\u2013109","DOI":"10.1007\/978-3-642-55721-7_12"},{"key":"8503_CR15","unstructured":"Jolliffe I (2002) Principal component analysis, 2nd edn. Springer"},{"key":"8503_CR16","unstructured":"Kaufman J, Rousseeuw P (2005) Finding groups in data: an introduction to cluster analysis. Wiley"},{"key":"8503_CR17","first-page":"1","volume":"4","author":"M Mukherjee","year":"2021","unstructured":"Mukherjee M, Khushi M (2021) SMOTE-ENC: a novel smote-based method to generate synthetic data for nominal and continuous features. Appl Syst Innov 4:1\u201312","journal-title":"Appl Syst Innov"},{"key":"8503_CR18","doi-asserted-by":"publisher","first-page":"331","DOI":"10.2307\/1224438","volume":"48","author":"J Podani","year":"1999","unstructured":"Podani J (1999) Extending Gower\u2019s general coefficient of similarity to ordinal characters. Taxon 48:331\u2013340","journal-title":"Taxon"},{"key":"8503_CR19","unstructured":"Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann"},{"key":"8503_CR20","doi-asserted-by":"crossref","unstructured":"Ross S (2017) Introductory statistics, 4th edn. 
Academic Press","DOI":"10.1016\/B978-0-12-804317-2.00031-X"},{"key":"8503_CR21","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.ins.2024.121570","volume":"692","author":"F Wang","year":"2025","unstructured":"Wang F, Zheng M, Ma K, Hu X (2025) Resampling approach for imbalanced data classification based on class instance density per feature value intervals. Inf Sci 692:1\u201344","journal-title":"Inf Sci"}],"container-title":["The Journal of Supercomputing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-026-08503-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11227-026-08503-8","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-026-08503-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,19]],"date-time":"2026-04-19T12:17:14Z","timestamp":1776601034000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11227-026-08503-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,19]]},"references-count":21,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2026,4]]}},"alternative-id":["8503"],"URL":"https:\/\/doi.org\/10.1007\/s11227-026-08503-8","relation":{},"ISSN":["1573-0484"],"issn-type":[{"value":"1573-0484","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,4,19]]},"assertion":[{"value":"3 October 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 April 2026","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 April 
2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"359"}}