{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T02:10:48Z","timestamp":1755828648782,"version":"3.44.0"},"reference-count":58,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,5,29]],"date-time":"2024-05-29T00:00:00Z","timestamp":1716940800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"European Union's Horizon 2020 research and innovation programme","award":["Marie Sklodowska-Curie grant No 101034413"],"award-info":[{"award-number":["Marie Sklodowska-Curie grant No 101034413"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,5,29]]},"abstract":"<jats:p>We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of points -- while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former does not enforce accuracy while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the datase size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time -- within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available at https:\/\/github.com\/Andrew-Draganov\/Fast-Coreset-Generation and has scripts to recreate the experiments.<\/jats:p>","DOI":"10.1145\/3654976","type":"journal-article","created":{"date-parts":[[2024,5,30]],"date-time":"2024-05-30T09:44:53Z","timestamp":1717062293000},"page":"1-25","source":"Crossref","is-referenced-by-count":0,"title":["Settling Time vs. Accuracy Tradeoffs for Clustering Big Data"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1617-4166","authenticated-orcid":false,"given":"Andrew","family":"Draganov","sequence":"first","affiliation":[{"name":"Aarhus University, Aarhus, DK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4208-8541","authenticated-orcid":false,"given":"David","family":"Saulpic","sequence":"additional","affiliation":[{"name":"CNRS, CNRS &amp; Universit\u00e9 Paris Cit\u00e9, Paris, FR"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1202-0805","authenticated-orcid":false,"given":"Chris","family":"Schwiegelshohn","sequence":"additional","affiliation":[{"name":"Aarhus University, Aarhus, DK"}]}],"member":"320","published-online":{"date-parts":[[2024,5,30]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2133803.2184450"},{"key":"e_1_2_2_2_1","first-page":"1027","volume-title":"Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007","author":"Arthur David","year":"2007","unstructured":"David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7--9, 2007, pages 1027--1035, 2007. URL http:\/\/dl.acm.org\/citation.cfm?id=1283383.1283494."},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978--3--642--32512-0_4"},{"key":"e_1_2_2_4_1","volume-title":"Fast and provably good seedings for k-means. Advances in neural information processing systems, 29","author":"Bachem Olivier","year":"2016","unstructured":"Olivier Bachem, Mario Lucic, Hamed Hassani, and Andreas Krause. Fast and provably good seedings for k-means. Advances in neural information processing systems, 29, 2016."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1609\/AAAI.V30I1.10259"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219973"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/509907.509947"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.4230\/LIPIcs.ICALP.2021.23"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3313276.3316318"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-006-0587--3"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1016\/0196--6774(80)90015--2"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.7916\/D8NZ8J07"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0168--1699(99)00046-0"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611976465.159"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS54457.2022.00051"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1137\/070699007"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2623330.2623743"},{"key":"e_1_2_2_18_1","volume-title":"Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017","author":"Chierichetti Flavio","year":"2017","unstructured":"Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In IsabelleGuyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4--9, 2017, Long Beach, CA, USA, pages 5029--5037, 2017. URL https:\/\/proceedings.neurips.cc\/ paper\/2017\/hash\/978fce5bcc4eccc88ad48ce3914124a2-Abstract.html."},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2746539.2746569"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897518.2897647"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.4230\/LIPIcs.ICALP.2019.41"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS.2017.14"},{"key":"e_1_2_2_23_1","first-page":"16235","article-title":"Fast and accurate-means++ via rejection sampling","volume":"33","author":"Cohen-Addad Vincent","year":"2020","unstructured":"Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, and Ola Svensson. Fast and accurate-means++ via rejection sampling. Advances in Neural Information Processing Systems, 33:16235--16245, 2020. URL https:\/\/proceedings.neurips.cc\/paper\/2020\/hash\/babcff88f8be8c4795bd6f0f8cccca61-Abstract.html.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_24_1","first-page":"20333","volume-title":"Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021","author":"Cohen-Addad Vincent","year":"2021","unstructured":"Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, and Ola Svensson. Parallel and efficient hierarchical k-median clustering. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6--14, 2021, virtual, pages 20333--20345, 2021. URL https:\/\/proceedings.neurips.cc\/paper\/2021\/hash\/aa495e18c7e3a21a4e48923b92048a61-Abstract.html."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3406325.3451022"},{"key":"e_1_2_2_26_1","first-page":"21085","volume-title":"Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021","author":"Cohen-Addad Vincent","year":"2021","unstructured":"Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. Improved coresets and sublinear algorithms for power means in euclidean spaces. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6--14, 2021, virtual, pages 21085--21098, 2021. URL https:\/\/proceedings.neurips.cc\/paper\/2021\/hash\/b035d6563a2adac9f822940c145263ce-Abstract.html."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3519935.3519946"},{"key":"e_1_2_2_28_1","volume-title":"NeurIPS, 2022","author":"Cohen-Addad Vincent","year":"2022","unstructured":"Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Omar Ali Sheikh-Omar. Improved coresets for euclidean k-means. In NeurIPS, 2022. URL http:\/\/papers.nips.cc\/paper_files\/paper\/2022\/hash\/ 120c9ab5c58ba0fa9dd3a22ace1de245-Abstract-Conference.html."},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2020408.2020516"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1002\/RSA.20157"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2210.08361"},{"key":"e_1_2_2_32_1","volume-title":"UCI machine learning repository","author":"Dua Dheeru","year":"2017","unstructured":"Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http:\/\/archive.ics.uci.edu\/ml."},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2020408.2020515"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01208"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/780542.780608"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1002\/widm"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1993636.1993712"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-40450-4_41"},{"key":"e_1_2_2_39_1","volume-title":"American Mathematical Soc.","author":"Har-Peled Sariel","year":"2011","unstructured":"Sariel Har-Peled. Geometric approximation algorithms. Number 173. American Mathematical Soc., 2011. doi: 10.1090\/ surv\/173."},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007352.1007400"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132847.3133091"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357713.3384296"},{"key":"e_1_2_2_43_1","first-page":"7587","volume-title":"NeurIPS","author":"Huang Lingxiao","year":"2019","unstructured":"Lingxiao Huang, Shaofeng H.-C. Jiang, and Nisheeth K. Vishnoi. Coresets for clustering with fairness constraints. In NeurIPS, pages 7587--7598, 2019. URL https:\/\/proceedings.neurips.cc\/paper\/2019\/hash\/ 810dfbbebb17302018ae903e9cb7a483-Abstract.html."},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2211.11923"},{"key":"e_1_2_2_45_1","series-title":"Proceedings of Machine Learning Research","first-page":"13933","volume-title":"International Conference on Machine Learning, ICML","author":"Huang Lingxiao","year":"2023","unstructured":"Lingxiao Huang, Shaofeng H.-C. Jiang, and Jianing Lou. The power of uniform sampling for k-median. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23--29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 13933--13956. PMLR, 2023. URL https:\/\/proceedings.mlr.press\/v202\/huang23j.html."},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS.2010.35"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611973075.50"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1982"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3313276.3316350"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:MACH.0000033115.78247.F0"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.24432\/C55W25"},{"key":"e_1_2_2_53_1","unstructured":"N\/A 1990. URL https:\/\/archive.ics.uci.edu\/ml\/datasets\/US+Census+Data+(1990)."},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2395116.2395117"},{"volume-title":"Shooting star sky night royalty-free vector graphic","year":"2023","key":"e_1_2_2_55_1","unstructured":"Pixabay. Shooting star sky night royalty-free vector graphic, 2023. URL https:\/\/pixabay.com\/vectors\/shooting-starsky- night-dark-stars-2024127\/. [Online; accessed Jan 8th 2024]."},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1007\/978--3-030--39479-0"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.4230\/LIPIcs.ESA.2022.84"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/233269.233324"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3654976","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3654976","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T14:37:37Z","timestamp":1755787057000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3654976"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,29]]},"references-count":58,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,5,29]]}},"alternative-id":["10.1145\/3654976"],"URL":"https:\/\/doi.org\/10.1145\/3654976","relation":{},"ISSN":["2836-6573"],"issn-type":[{"type":"electronic","value":"2836-6573"}],"subject":[],"published":{"date-parts":[[2024,5,29]]}}}