{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T19:13:16Z","timestamp":1773947596153,"version":"3.50.1"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,11]]},"abstract":"<jats:p>\n            In the era of data-driven decision-making, efficient machine learning model training is crucial. We present a novel algorithm for constructing\n            <jats:italic>tabular data<\/jats:italic>\n            coresets using datamaps created for Gradient Boosting Decision Trees models. The resulting coresets, computed within minutes, consistently outperform other baselines and match or exceed the performance of models trained on the entire dataset. Additionally, a training enhancement method leveraging datamap insights during the inference phase improves performance with mathematical guarantees, given a defined property holds. An explainability layer and tools for coreset size optimization further enhance the efficiency of training tabular machine learning models.\n          <\/jats:p>","DOI":"10.14778\/3712221.3712249","type":"journal-article","created":{"date-parts":[[2025,4,7]],"date-time":"2025-04-07T18:03:04Z","timestamp":1744048984000},"page":"876-888","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Datamap-Driven Tabular Coreset Selection for Classifier Training"],"prefix":"10.14778","volume":"18","author":[{"given":"Aviv","family":"Hadar","sequence":"first","affiliation":[{"name":"Tel Aviv University"}]},{"given":"Tova","family":"Milo","sequence":"additional","affiliation":[{"name":"Tel Aviv University"}]},{"given":"Kathy","family":"Razmadze","sequence":"additional","affiliation":[{"name":"Tel Aviv University"}]}],"member":"320","published-online":{"date-parts":[[2025,4,7]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n.d.]. Bank Fraud Dataset. https:\/\/www.kaggle.com\/datasets\/sgpjesus\/bank-account-fraud-dataset-neurips-2022."},{"key":"e_1_2_1_2_1","unstructured":"[n.d.]. Cover Type Dataset. https:\/\/archive.ics.uci.edu\/dataset\/31\/covertype."},{"key":"e_1_2_1_3_1","unstructured":"[n.d.]. Credit Card Dataset. https:\/\/www.kaggle.com\/datasets\/mlg-ulb\/creditcardfraud."},{"key":"e_1_2_1_4_1","unstructured":"[n.d.]. Diabetes Dataset. https:\/\/archive.ics.uci.edu\/dataset\/34\/diabetes."},{"key":"e_1_2_1_5_1","unstructured":"[n.d.]. Hepmass Dataset. https:\/\/archive.ics.uci.edu\/dataset\/347\/hepmass."},{"key":"e_1_2_1_6_1","unstructured":"[n.d.]. Loan Dataset. https:\/\/www.kaggle.com\/deepanshu08\/prediction-of-lendingclub-loan-defaulters."},{"key":"e_1_2_1_7_1","unstructured":"[n.d.]. Scikit-Learn Decision Tree Classifier. https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html."},{"key":"e_1_2_1_8_1","unstructured":"[n.d.]. Xgboost Library. https:\/\/xgboost.readthedocs.io\/en\/stable\/."},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the 8th ACM European Conference on Computer Systems.","author":"Agarwal Sameer","year":"2013","unstructured":"Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/360825.360855"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i8.16826"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872822"},{"key":"e_1_2_1_13_1","volume-title":"Coresets via bilevel optimization for continual learning and streaming. Advances in neural information processing systems 33","author":"Borsos Zal\u00e1n","year":"2020","unstructured":"Zal\u00e1n Borsos, Mojmir Mutny, and Andreas Krause. 2020. Coresets via bilevel optimization for continual learning and streaming. Advances in neural information processing systems 33 (2020), 14879--14890."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS54457.2022.00051"},{"key":"e_1_2_1_15_1","first-page":"434","article-title":"Coresets for Relational Data and The Applications","volume":"35","author":"Chen Jiaxiang","year":"2022","unstructured":"Jiaxiang Chen, Qingyuan Yang, Ruomin Huang, and Hu Ding. 2022. Coresets for Relational Data and The Applications. Advances in Neural Information Processing Systems 35 (2022), 434--448.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_16_1","unstructured":"Tianqi Chen Tong He Michael Benesty Vadim Khotilovich Yuan Tang Hyunsu Cho Kailong Chen Rory Mitchell Ignacio Cano Tianyi Zhou et al. 2015. Xg-boost: extreme gradient boosting. R package version 0.4-2 1 4 (2015) 1--4."},{"key":"e_1_2_1_17_1","first-page":"2679","article-title":"Improved Coresets for Euclidean k-Means","volume":"35","author":"Cohen-Addad Vincent","year":"2022","unstructured":"Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Omar Ali Sheikh-Omar. 2022. Improved Coresets for Euclidean k-Means. Advances in Neural Information Processing Systems 35 (2022), 2679--2694.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_18_1","volume-title":"Support-vector networks. Machine learning 20","author":"Cortes Corinna","year":"1995","unstructured":"Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning 20 (1995), 273--297."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.2517-6161.1958.tb00292.x"},{"key":"e_1_2_1_20_1","volume-title":"Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383","author":"Fang Meng","year":"2017","unstructured":"Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383 (2017)."},{"key":"e_1_2_1_21_1","volume-title":"Scalable training of mixture models via coresets. Advances in neural information processing systems 24","author":"Feldman Dan","year":"2011","unstructured":"Dan Feldman, Matthew Faulkner, and Andreas Krause. 2011. Scalable training of mixture models via coresets. Advances in neural information processing systems 24 (2011)."},{"key":"e_1_2_1_22_1","volume-title":"Stochastic gradient boosting. Computational statistics & data analysis 38, 4","author":"Friedman Jerome H","year":"2002","unstructured":"Jerome H Friedman. 2002. Stochastic gradient boosting. Computational statistics & data analysis 38, 4 (2002), 367--378."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3422622"},{"key":"e_1_2_1_24_1","unstructured":"Aviv Hadar Tova Milo and Kathy Razmadze. 2024. CoreTab git repository. https:\/\/github.com\/avivhadar33\/coretab\/."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007352.1007400"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2020.106622"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of 3rd international conference on document analysis and recognition","volume":"1","author":"Ho Tin Kam","year":"1995","unstructured":"Tin Kam Ho. 1995. Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition, Vol. 1. IEEE, 278--282."},{"key":"e_1_2_1_29_1","volume-title":"Jianing Lou, and Xuan Wu.","author":"Huang Lingxiao","year":"2022","unstructured":"Lingxiao Huang, Shaofeng H-C Jiang, Jianing Lou, and Xuan Wu. 2022. Near-optimal coresets for robust clustering. arXiv preprint arXiv:2210.10394 (2022)."},{"key":"e_1_2_1_30_1","volume-title":"International Conference on Discovery Science. Springer, 79--93","author":"Ienco Dino","year":"2013","unstructured":"Dino Ienco, Albert Bifet, Indr\u0117 \u017dliobait\u0117, and Bernhard Pfahringer. 2013. Clustering based active learning for evolving data streams. In International Conference on Discovery Science. Springer, 79--93."},{"key":"e_1_2_1_31_1","first-page":"30352","article-title":"Coresets for decision trees of signals","volume":"34","author":"Jubran Ibrahim","year":"2021","unstructured":"Ibrahim Jubran, Ernesto Evgeniy Sanches Shayda, Ilan I Newman, and Dan Feldman. 2021. Coresets for decision trees of signals. Advances in Neural Information Processing Systems 34 (2021), 30352--30364.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_32_1","volume-title":"Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30","author":"Ke Guolin","year":"2017","unstructured":"Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"crossref","first-page":"772","DOI":"10.14778\/3574245.3574261","article-title":"SubStrat: A Subset-Based Optimization Strategy for Faster AutoML","volume":"16","author":"Lazebnik Teddy","year":"2022","unstructured":"Teddy Lazebnik, Amit Somech, and Abraham Itzhak Weinberg. 2022. SubStrat: A Subset-Based Optimization Strategy for Faster AutoML. Proceedings of the VLDB Endowment 16, 4 (2022), 772--780.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2877362"},{"key":"e_1_2_1_35_1","volume-title":"Instance selection and construction for data mining","author":"Liu Huan","unstructured":"Huan Liu and Hiroshi Motoda. 2013. Instance selection and construction for data mining. Vol. 608. Springer Science & Business Media."},{"key":"e_1_2_1_36_1","volume-title":"Structured search result differentiation. PVLDB 2, 1","author":"Liu Ziyang","year":"2009","unstructured":"Ziyang Liu, Peng Sun, and Yi Chen. 2009. Structured search result differentiation. PVLDB 2, 1 (2009)."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324958"},{"key":"e_1_2_1_38_1","first-page":"11643","article-title":"Coresets for classification-simplified and strengthened","volume":"34","author":"Mai Tung","year":"2021","unstructured":"Tung Mai, Cameron Musco, and Anup Rao. 2021. Coresets for classification-simplified and strengthened. Advances in Neural Information Processing Systems 34 (2021), 11643--11654.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_39_1","volume-title":"International Conference on Machine Learning. PMLR, 6950--6960","author":"Mirzasoleiman Baharan","year":"2020","unstructured":"Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. 2020. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning. PMLR, 6950--6960."},{"key":"e_1_2_1_40_1","first-page":"11465","article-title":"Coresets for robust training of deep neural networks against noisy labels","volume":"33","author":"Mirzasoleiman Baharan","year":"2020","unstructured":"Baharan Mirzasoleiman, Kaidi Cao, and Jure Leskovec. 2020. Coresets for robust training of deep neural networks against noisy labels. Advances in Neural Information Processing Systems 33 (2020), 11465--11477.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_42_1","doi-asserted-by":"crossref","unstructured":"Y. Park M. Cafarella and B. Mozafari. 2016. Visualization-aware sampling for very large databases. In ICDE.","DOI":"10.1109\/ICDE.2016.7498287"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196905"},{"key":"e_1_2_1_44_1","volume-title":"International Conference on Machine Learning. PMLR, 17848--17869","author":"Pooladzandi Omead","year":"2022","unstructured":"Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman. 2022. Adaptive second order coresets for data-efficient machine learning. In International Conference on Machine Learning. PMLR, 17848--17869."},{"key":"e_1_2_1_45_1","volume-title":"Anna Veronika Dorogush, and Andrey Gulin","author":"Prokhorenkova Liudmila","year":"2018","unstructured":"Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. Advances in neural information processing systems 31 (2018)."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-016-0460-3"},{"key":"e_1_2_1_47_1","volume-title":"International Conference on Artificial Intelligence and Statistics. PMLR, 7331--7348","author":"Smith Freddie Bickford","year":"2023","unstructured":"Freddie Bickford Smith, Andreas Kirsch, Sebastian Farquhar, Yarin Gal, Adam Foster, and Tom Rainforth. 2023. Prediction-oriented bayesian active learning. In International Conference on Artificial Intelligence and Statistics. PMLR, 7331--7348."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.746"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.3390\/math11040820"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00117"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2021.09.008"},{"key":"e_1_2_1_52_1","volume-title":"International Conference on Artificial Intelligence and Statistics. PMLR, 5391--5415","author":"Tukan Murad","year":"2022","unstructured":"Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. 2022. New coresets for projective clustering and applications. In International Conference on Artificial Intelligence and Statistics. PMLR, 5391--5415."},{"key":"e_1_2_1_53_1","volume-title":"Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina, and Vassilis J Tsotras.","author":"Vieira Marcos R","year":"2011","unstructured":"Marcos R Vieira, Humberto L Razente, Maria CN Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina, and Vassilis J Tsotras. 2011. On query result diversification. In ICDE."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.14778\/3561261.3561267"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1927.10502953"},{"key":"e_1_2_1_56_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"38","author":"Xiao Weiwei","year":"2024","unstructured":"Weiwei Xiao, Yongyong Chen, Qiben Shan, Yaowei Wang, and Jingyong Su. 2024. Feature Distribution Matching by Optimal Transport for Effective and Robust Coreset Selection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 9196--9204."},{"key":"e_1_2_1_57_1","volume-title":"Hyper-parameter optimization: A review of algorithms and applications. arXiv preprint arXiv:2003.05689","author":"Yu Tong","year":"2020","unstructured":"Tong Yu and Hong Zhu. 2020. Hyper-parameter optimization: A review of algorithms and applications. arXiv preprint arXiv:2003.05689 (2020)."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3712221.3712249","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,7]],"date-time":"2025-04-07T18:26:51Z","timestamp":1744050411000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3712221.3712249"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11]]},"references-count":55,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,11]]}},"alternative-id":["10.14778\/3712221.3712249"],"URL":"https:\/\/doi.org\/10.14778\/3712221.3712249","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,11]]},"assertion":[{"value":"2025-04-07","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}