{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,2]],"date-time":"2026-06-02T22:39:45Z","timestamp":1780439985722,"version":"3.54.1"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,7]]},"abstract":"<jats:p>Automated machine learning (AutoML) promises to democratize machine learning by automatically generating machine learning pipelines with little to no user intervention. Typically, a search procedure is used to repeatedly generate and validate candidate pipelines, maximizing a predictive performance metric, subject to a limited execution time budget. While this approach to generating candidates works well for small tabular datasets, the same procedure does not directly scale to larger tabular datasets with 100,000s of observations, often producing fewer candidate pipelines and yielding lower performance, given the same execution time budget. 
We carry out an extensive empirical evaluation of the impact that downsampling - reducing the number of rows in the input tabular dataset - has on the pipelines produced by a genetic-programming-based AutoML search for classification tasks.<\/jats:p>","DOI":"10.14778\/3476249.3476262","type":"journal-article","created":{"date-parts":[[2021,10,27]],"date-time":"2021-10-27T16:46:23Z","timestamp":1635353183000},"page":"2059-2072","source":"Crossref","is-referenced-by-count":18,"title":["Doing more with less"],"prefix":"10.14778","volume":"14","author":[{"given":"Fatjon","family":"Zogaj","sequence":"first","affiliation":[{"name":"ETH Zurich, Z\u00fcrich, Switzerland"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jos\u00e9 Pablo","family":"Cambronero","sequence":"additional","affiliation":[{"name":"Massachusetts Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Martin C.","family":"Rinard","sequence":"additional","affiliation":[{"name":"Massachusetts Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"J\u00fcrgen","family":"Cito","sequence":"additional","affiliation":[{"name":"TU Wien, Vienna, Austria and Massachusetts Institute of Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,10,27]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n.d.]. 1.10. Decision Trees scikit-learn 0.23.2 documentation. Retrieved 2020-10-12 from https:\/\/scikit-learn.org\/stable\/modules\/tree.html#complexity"},{"key":"e_1_2_1_2_1","unstructured":"[n.d.]. 1.4. 
Support Vector Machines scikit-learn 0.23.2 documentation. Retrieved 2020-10-12 from https:\/\/scikit-learn.org\/stable\/modules\/svm.html#complexity"},{"key":"e_1_2_1_3_1","unstructured":"[n.d.]. Manual AutoSklearn 0.10.0 documentation. Retrieved 2020-10-12 from https:\/\/automl.github.io\/auto-sklearn\/master\/manual.html#time-and-memory-limits"},{"key":"e_1_2_1_4_1","unstructured":"2016. Possible speed up at large data sets Issue #87 EpistasisLab\/tpot. Retrieved 2020-10-12 from https:\/\/github.com\/EpistasisLab\/tpot\/issues\/87"},{"key":"e_1_2_1_5_1","unstructured":"2016. Speed and time budgets Issue #57 automl\/auto-sklearn. Retrieved 2020-10-12 from https:\/\/github.com\/automl\/auto-sklearn\/issues\/57"},{"key":"e_1_2_1_6_1","volume-title":"Benchmarking Automatic Machine Learning Frameworks. CoRR abs\/1808.06492","author":"Balaji Adithya","year":"2018","unstructured":"Adithya Balaji and Alexander Allen. 2018. Benchmarking Automatic Machine Learning Frameworks. CoRR abs\/1808.06492 (2018). 
arXiv:1808.06492 http:\/\/arxiv.org\/abs\/1808.06492"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1214\/009053607000000631"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.21105\/joss.01075"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/2503308.2188395"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/3042817.3042832"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-018-5747-8"},{"key":"e_1_2_1_12_1","volume-title":"Jan N. van Rijn, and Joaquin Vanschoren.","author":"Bischl Bernd","year":"2017","unstructured":"Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael Gomes Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. 2017. OpenML Benchmarking Suites and the OpenML100. CoRR abs\/1708.03731 (2017). arXiv:1708.03731 http:\/\/arxiv.org\/abs\/1708.03731"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/11752790_3"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1504\/IJMEI.2014.060245"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3368089.3409700"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3360601"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939785"},{"key":"e_1_2_1_18_1","volume-title":"Website. Retrieved","author":"DARPA.","year":"2021","unstructured":"DARPA. 2021. Public D3M datasets. Website. 
Retrieved April 21, 2021 from https:\/\/datasets.datadrivendiscovery.org\/d3m\/datasets"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994515"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969547"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/956750.956812"},{"key":"e_1_2_1_22_1","volume-title":"An Open Source AutoML Benchmark. CoRR abs\/1907.00909","author":"Gijsbers Pieter","year":"2019","unstructured":"Pieter Gijsbers, Erin LeDell, Janek Thomas, S\u00e9bastien Poirier, Bernd Bischl, and Joaquin Vanschoren. 2019. An Open Source AutoML Benchmark. CoRR abs\/1907.00909 (2019). arXiv:1907.00909 http:\/\/arxiv.org\/abs\/1907.00909"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.21105\/joss.01132"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2008.4633969"},{"key":"e_1_2_1_25_1","volume-title":"AutoML: A Survey of the State-of-the-Art. CoRR abs\/1908.00709","author":"He Xin","year":"2019","unstructured":"Xin He, Kaiyong Zhao, and Xiaowen Chu. 2019. AutoML: A Survey of the State-of-the-Art. CoRR abs\/1908.00709 (2019). 
arXiv:1908.00709 http:\/\/arxiv.org\/abs\/1908.00709"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-25566-3_40"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.5555\/3360092"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.5555\/3042817.3043075"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2017.2754374"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/1643031.1643047"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/138936"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3342263.3342633"},{"key":"e_1_2_1_34_1","volume-title":"Website. Retrieved","author":"Labs Criteo","year":"2014","unstructured":"Criteo Labs. 2014. Kaggle Display Advertising Challenge Dataset. Website. Retrieved April 21, 2021 from https:\/\/labs.criteo.com\/2014\/02\/kaggle-display-advertising-challenge-dataset\/"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btz470"},{"key":"e_1_2_1_36_1","volume-title":"7th ICML Workshop on Automated Machine Learning (AutoML) (July 2020","author":"LeDell Erin","year":"2020","unstructured":"Erin LeDell and Sebastien Poirier. 2020. H2O AutoML: Scalable Automatic Machine Learning. 7th ICML Workshop on Automated Machine Learning (AutoML) (July 2020). 
https:\/\/www.automl.org\/wp-content\/uploads\/2020\/07\/AutoML_2020_paper_61.pdf"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/3122009.3242042"},{"key":"e_1_2_1_38_1","volume-title":"Time, Data and Risk in Unsupervised Learning. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS","author":"Lucic Mario","year":"2015","unstructured":"Mario Lucic, Mesrob I. Ohannessian, Amin Karbasi, and Andreas Krause. 2015. Tradeoffs for Space, Time, Data and Risk in Unsupervised Learning. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9--12, 2015 (JMLR Workshop and Conference Proceedings), Guy Lebanon and S. V. N. Vishwanathan (Eds.), Vol. 38. JMLR.org. http:\/\/proceedings.mlr.press\/v38\/lucic15.html"},{"key":"e_1_2_1_39_1","volume-title":"Proceedings of the ICML Workshop on Automatic Machine Learning.","author":"Milutinovic Mitar","year":"2020","unstructured":"Mitar Milutinovic, Brandon Schoenfeld, Diego Martinez-Garcia, Saswati Ray, Sujen Shah, and David Yan. 2020. On evaluation of automl systems. 
In Proceedings of the ICML Workshop on Automatic Machine Learning."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969830.2969907"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407816"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352116"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2908812.2908918"},{"key":"e_1_2_1_44_1","volume-title":"Conference on Innovative Data Systems Research (CIDR). 45","author":"Palkar Shoumik","year":"2017","unstructured":"Shoumik Palkar, James J Thomas, Anil Shanbhag, Deepak Narayanan, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, Matei Zaharia, and Stanford InfoLab. 2017. Weld: A common runtime for high performance data analytics. In Conference on Innovative Data Systems Research (CIDR). 45."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICTAI.2019.00072"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.5555\/3327757.3327789"},{"key":"e_1_2_1_48_1","volume-title":"Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019","author":"Perrone Valerio","year":"2019","unstructured":"Valerio Perrone and Huibin Shen. 2019. Learning search spaces for Bayesian optimization: Another view of hyperparameter transfer learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8--14, 2019, Vancouver, BC, Canada, Hanna M. 
Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alch\u00e9-Buc, Emily B. Fox, and Roman Garnett (Eds.). 12751--12761. https:\/\/proceedings.neurips.cc\/paper\/2019\/hash\/6ea3f1874b188558fafbab78e8c3a968-Abstract.html"},{"key":"e_1_2_1_49_1","volume-title":"Carlo Curino, and Markus Weimer","author":"Psallidas Fotis","year":"2019","unstructured":"Fotis Psallidas, Yiwen Zhu, Bojan Karlas, Matteo Interlandi, Avrilia Floratou, Konstantinos Karanasos, Wentao Wu, Ce Zhang, Subru Krishnan, Carlo Curino, and Markus Weimer. 2019. Data Science through the looking glass and what we found there. CoRR abs\/1912.09536 (2019). arXiv:1912.09536 http:\/\/arxiv.org\/abs\/1912.09536"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.5555\/2627435.2627446"},{"key":"e_1_2_1_51_1","volume-title":"AutoML-Zero: Evolving Machine Learning Algorithms From Scratch. arXiv preprint arXiv:2003.03384","author":"Real Esteban","year":"2020","unstructured":"Esteban Real, Chen Liang, David R So, and Quoc V Le. 2020. AutoML-Zero: Evolving Machine Learning Algorithms From Scratch. 
arXiv preprint arXiv:2003.03384 (2020)."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319863"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3386146"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999325.2999464"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2641190.2641198"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939502.2939516"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/3297753.3297763"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330909"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403197"},{"key":"e_1_2_1_60_1","volume-title":"Isabelle Guyon, Yi-Qi Hu, Yu-Feng Li, Wei-Wei Tu, Qiang Yang, and Yang Yu.","author":"Yao Quanming","year":"2018","unstructured":"Quanming Yao, Mengshuo Wang, Hugo Jair Escalante, Isabelle Guyon, Yi-Qi Hu, Yu-Feng Li, Wei-Wei Tu, Qiang Yang, and Yang Yu. 2018. Taking Human out of Learning Applications: A Survey on Automated Machine Learning. CoRR abs\/1810.13306 (2018). 
arXiv:1810.13306 http:\/\/arxiv.org\/abs\/1810.13306"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2010.05.007"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1007\/s13755-017-0023-z"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.4292739"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1613\/jair.1.11854"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/2739480.2754694"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3476249.3476262","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:57:46Z","timestamp":1672221466000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3476249.3476262"}},"subtitle":["characterizing dataset downsampling for AutoML"],"short-title":[],"issued":{"date-parts":[[2021,7]]},"references-count":65,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2021,7]]}},"alternative-id":["10.14778\/3476249.3476262"],"URL":"https:\/\/doi.org\/10.14778\/3476249.3476262","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,7]]}}}