{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T21:12:53Z","timestamp":1762981973958,"version":"3.44.0"},"reference-count":57,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2024,11,4]],"date-time":"2024-11-04T00:00:00Z","timestamp":1730678400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF","award":["IIS-2348919"],"award-info":[{"award-number":["IIS-2348919"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,11,4]]},"abstract":"<jats:p>\n            Clustering plays a crucial role in computer science, facilitating data analysis and problem-solving across numerous fields. By partitioning large datasets into meaningful groups, clustering reveals hidden structures and relationships within the data, aiding tasks such as unsupervised learning, classification, anomaly detection, and recommendation systems. Particularly in relational databases, where data is distributed across multiple tables, efficient clustering is essential yet challenging due to the computational complexity of joining tables. This paper addresses this challenge by introducing efficient algorithms for k-median and k-means clustering on relational data without the need for pre-computing the join query results. For the relational k-median clustering, we propose the first efficient relative approximation algorithm. For the relational k-means clustering, our algorithm significantly improves both the approximation factor and the running time of the known relational k-means clustering algorithms, which suffer either from large constant approximation factors, or expensive running time. Given a join query q and a database instance D of O(N) tuples, for both k-median and k-means clustering on the results of q on D, we propose randomized (1+\u03b5)\u03b3-approximation algorithms that run in roughly O(k\n            <jats:sup>2<\/jats:sup>\n            N\n            <jats:sup>fhw<\/jats:sup>\n            )+T_\u03b3(k\n            <jats:sup>2<\/jats:sup>\n            ) time, where \u03b5 \u2208 (0,1) is a constant parameter decided by the user, \\fhw is the fractional hyper-tree width of Q, while \u03b3 and T_\u03b3(x) represent the approximation factor and the running time, respectively, of a traditional clustering algorithm in the standard computational setting over x points.\n          <\/jats:p>","DOI":"10.1145\/3695831","type":"journal-article","created":{"date-parts":[[2024,11,7]],"date-time":"2024-11-07T17:26:35Z","timestamp":1731000395000},"page":"1-27","source":"Crossref","is-referenced-by-count":1,"title":["Improved Approximation Algorithms for Relational Clustering"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-3798-9578","authenticated-orcid":false,"given":"Aryan","family":"Esmailpour","sequence":"first","affiliation":[{"name":"Department of Computer Science, University of Illinois Chicago, Chicago, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2114-8886","authenticated-orcid":false,"given":"Stavros","family":"Sintos","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Illinois Chicago, Chicago, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,11,7]]},"reference":[{"unstructured":"https:\/\/db-engines.com\/en\/ranking_categories.","key":"e_1_2_1_1_1"},{"key":"e_1_2_1_2_1","volume-title":"https:\/\/www.kaggle.com\/datasets\/kaggle\/kaggle-survey-2018","author":"Kaggle","year":"2018","unstructured":"Kaggle machine learning and data science survey. https:\/\/www.kaggle.com\/datasets\/kaggle\/kaggle-survey-2018, 2018."},{"unstructured":".","key":"e_1_2_1_3_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_4_1","DOI":"10.1137\/1.9781611976489.8"},{"doi-asserted-by":"publisher","key":"e_1_2_1_5_1","DOI":"10.1145\/3196959.3196960"},{"doi-asserted-by":"publisher","key":"e_1_2_1_6_1","DOI":"10.1090\/conm\/223\/03131"},{"doi-asserted-by":"publisher","key":"e_1_2_1_7_1","DOI":"10.1145\/3695835"},{"doi-asserted-by":"publisher","key":"e_1_2_1_8_1","DOI":"10.1016\/B978-044482537-7\/50003-6"},{"doi-asserted-by":"publisher","key":"e_1_2_1_9_1","DOI":"10.1007\/978-3-642-03685-9_2"},{"doi-asserted-by":"publisher","key":"e_1_2_1_10_1","DOI":"10.5555\/1793494.1793505"},{"key":"e_1_2_1_11_1","volume-title":"Reinhard Pichler, and Cristian Riveros. Towards tractability of the diversity of query answers: Ultrametrics to the rescue. arXiv preprint arXiv:2408.01657","author":"Arenas Marcelo","year":"2024","unstructured":"Marcelo Arenas, Timo Camillo Merkl, Reinhard Pichler, and Cristian Riveros. Towards tractability of the diversity of query answers: Ultrametrics to the rescue. arXiv preprint arXiv:2408.01657, 2024."},{"key":"e_1_2_1_12_1","first-page":"1027","volume-title":"Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms","author":"Arthur David","year":"2007","unstructured":"David Arthur and Sergei Vassilvitskii. k-means the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027--1035, 2007."},{"doi-asserted-by":"publisher","key":"e_1_2_1_13_1","DOI":"10.1145\/380752.380755"},{"doi-asserted-by":"publisher","key":"e_1_2_1_14_1","DOI":"10.1137\/110859440"},{"doi-asserted-by":"publisher","key":"e_1_2_1_15_1","DOI":"10.1145\/3219819.3219973"},{"doi-asserted-by":"publisher","key":"e_1_2_1_16_1","DOI":"10.14778\/2180912.2180915"},{"key":"e_1_2_1_17_1","first-page":"576","volume-title":"International Conference on Machine Learning","author":"Braverman Vladimir","year":"2017","unstructured":"Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F Yang. Clustering high dimensional dynamic data streams. In International Conference on Machine Learning, pages 576--585. PMLR, 2017."},{"key":"e_1_2_1_18_1","volume-title":"An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. Journal of mathematical psychology, 12(3):328--383","author":"Breiger Ronald L","year":"1975","unstructured":"Ronald L Breiger, Scott A Boorman, and Phipps Arabie. An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. Journal of mathematical psychology, 12(3):328--383, 1975."},{"doi-asserted-by":"publisher","key":"e_1_2_1_19_1","DOI":"10.1145\/3578517"},{"doi-asserted-by":"publisher","key":"e_1_2_1_20_1","DOI":"10.1145\/301250.301257"},{"key":"e_1_2_1_21_1","first-page":"434","article-title":"Coresets for relational data and the applications","volume":"35","author":"Chen Jiaxiang","year":"2022","unstructured":"Jiaxiang Chen, Qingyuan Yang, Ruomin Huang, and Hu Ding. Coresets for relational data and the applications. Advances in Neural Information Processing Systems, 35:434--448, 2022.","journal-title":"Advances in Neural Information Processing Systems"},{"doi-asserted-by":"publisher","key":"e_1_2_1_22_1","DOI":"10.1109\/ICDE.2019.00145"},{"doi-asserted-by":"publisher","key":"e_1_2_1_23_1","DOI":"10.1109\/ICDE51399.2021.00103"},{"key":"e_1_2_1_24_1","first-page":"2742","volume-title":"International Conference on Artificial Intelligence and Statistics","author":"Curtin Ryan","year":"2020","unstructured":"Ryan Curtin, Benjamin Moseley, Hung Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. Rk-means: Fast clustering for relational data. In International Conference on Artificial Intelligence and Statistics, pages 2742--2752. PMLR, 2020."},{"doi-asserted-by":"publisher","key":"e_1_2_1_25_1","DOI":"10.14778\/3510397.3510401"},{"key":"e_1_2_1_26_1","volume-title":"24th International Conference on Database Theory","author":"Deep Shaleen","year":"2021","unstructured":"Shaleen Deep and Paraschos Koutris. Ranked enumeration of conjunctive query results. In 24th International Conference on Database Theory, 2021."},{"doi-asserted-by":"publisher","key":"e_1_2_1_27_1","DOI":"10.1145\/2020408.2020515"},{"key":"e_1_2_1_28_1","volume-title":"Improved approximation algorithms for relational clustering. arXiv preprint arXiv:2409.18498","author":"Esmailpour Aryan","year":"2024","unstructured":"Aryan Esmailpour and Stavros Sintos. Improved approximation algorithms for relational clustering. arXiv preprint arXiv:2409.18498, 2024."},{"doi-asserted-by":"publisher","key":"e_1_2_1_29_1","DOI":"10.1016\/j.patcog.2007.02.019"},{"key":"e_1_2_1_30_1","volume-title":"Treewidth and hypertree width. Tractability: Practical Approaches to Hard Problems, 1","author":"Gottlob G.","year":"2014","unstructured":"G. Gottlob, G. Greco, and F. Scarcello. Treewidth and hypertree width. Tractability: Practical Approaches to Hard Problems, 1, 2014."},{"key":"e_1_2_1_31_1","volume-title":"Clustering data streams: Theory and practice","author":"Guha Sudipto","year":"2003","unstructured":"Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE transactions on knowledge and data engineering, 15(3):515--528, 2003."},{"key":"e_1_2_1_32_1","first-page":"723","volume-title":"Handbook of discrete and computational geometry","author":"Halperin Dan","year":"2017","unstructured":"Dan Halperin and Micha Sharir. Arrangements. In Handbook of discrete and computational geometry, pages 723--762. Chapman and Hall\/CRC, 2017."},{"doi-asserted-by":"publisher","key":"e_1_2_1_33_1","DOI":"10.1145\/1064092.1064114"},{"doi-asserted-by":"publisher","key":"e_1_2_1_34_1","DOI":"10.1145\/1007352.1007400"},{"doi-asserted-by":"publisher","key":"e_1_2_1_35_1","DOI":"10.32469\/10355\/8897"},{"doi-asserted-by":"publisher","key":"e_1_2_1_36_1","DOI":"10.1145\/513400.513402"},{"key":"e_1_2_1_37_1","volume-title":"Functional aggregate queries with additive inequalities. ACM Transactions on Database Systems (TODS), 45(4):1--41","author":"Khamis Mahmoud Abo","year":"2020","unstructured":"Mahmoud Abo Khamis, Ryan R Curtin, Benjamin Moseley, Hung Q Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. Functional aggregate queries with additive inequalities. ACM Transactions on Database Systems (TODS), 45(4):1--41, 2020."},{"doi-asserted-by":"publisher","key":"e_1_2_1_38_1","DOI":"10.1145\/3209889.3209896"},{"doi-asserted-by":"publisher","key":"e_1_2_1_39_1","DOI":"10.1007\/BFb0027330"},{"doi-asserted-by":"publisher","key":"e_1_2_1_40_1","DOI":"10.1145\/2723372.2723713"},{"doi-asserted-by":"publisher","key":"e_1_2_1_41_1","DOI":"10.1145\/2488608.2488723"},{"doi-asserted-by":"publisher","key":"e_1_2_1_42_1","DOI":"10.1109\/TIT.1982.1056489"},{"doi-asserted-by":"publisher","key":"e_1_2_1_43_1","DOI":"10.1201\/EBK1420072617"},{"key":"e_1_2_1_44_1","volume-title":"26th International Conference on Database Theory (ICDT 2023","author":"Merkl Timo Camillo","year":"2023","unstructured":"Timo Camillo Merkl, Reinhard Pichler, and Sebastian Skritek. Diversity of answers to conjunctive queries. In 26th International Conference on Database Theory (ICDT 2023). Schloss Dagstuhl-Leibniz-Zentrum f\u00fcr Informatik, 2023."},{"key":"e_1_2_1_45_1","first-page":"1","volume-title":"48th International Colloquium on Automata, Languages, and Programming, ICALP","volume":"198","author":"Moseley Benjamin","year":"2021","unstructured":"Benjamin Moseley, Kirk Pruhs, Alireza Samadian, and Yuyan Wang. Relational algorithms for k-means clustering. In 48th International Colloquium on Automata, Languages, and Programming, ICALP, volume 198, pages 97:1--97:21, 2021."},{"key":"e_1_2_1_46_1","first-page":"9","volume-title":"Proceedings of the text mining and link analysis workshop, 18th international joint conference on artificial intelligence","author":"Neville Jennifer","year":"2003","unstructured":"Jennifer Neville, Micah Adler, and David Jensen. Clustering relational data using attribute and link information. In Proceedings of the text mining and link analysis workshop, 18th international joint conference on artificial intelligence, pages 9--15. San Francisco, CA: Morgan Kaufmann Publishers, 2003."},{"key":"e_1_2_1_47_1","volume-title":"Worst-case optimal join algorithms. Journal of the ACM (JACM), 65(3):1--40","author":"Ngo Hung Q","year":"2018","unstructured":"Hung Q Ngo, Ely Porat, Christopher R\u00e9, and Atri Rudra. Worst-case optimal join algorithms. Journal of the ACM (JACM), 65(3):1--40, 2018."},{"doi-asserted-by":"publisher","key":"e_1_2_1_48_1","DOI":"10.1145\/2590989.2590991"},{"doi-asserted-by":"publisher","key":"e_1_2_1_49_1","DOI":"10.14778\/2535573.2488340"},{"volume-title":"Relational database management system market: Industry analysis and forecast 2021--2027: By type, deployment, end users, and region","year":"2022","unstructured":"Report. Relational database management system market: Industry analysis and forecast 2021--2027: By type, deployment, end users, and region, 2022.","key":"e_1_2_1_50_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_51_1","DOI":"10.1007\/978-3-030-35514-2_32"},{"doi-asserted-by":"publisher","key":"e_1_2_1_52_1","DOI":"10.1145\/2882903.2882939"},{"doi-asserted-by":"publisher","key":"e_1_2_1_53_1","DOI":"10.1137\/1.9781611975499.13"},{"key":"e_1_2_1_54_1","first-page":"1582","volume-title":"Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases","volume":"13","author":"Tziavelis Nikolaos","unstructured":"Nikolaos Tziavelis, Deepak Ajwani, Wolfgang Gatterbauer, Mirek Riedewald, and Xiaofeng Yang. Optimal algorithms for ranked enumeration of answers to full conjunctive queries. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 13, page 1582. NIH Public Access, 2020."},{"doi-asserted-by":"publisher","key":"e_1_2_1_55_1","DOI":"10.1145\/3584372.3588670"},{"doi-asserted-by":"publisher","key":"e_1_2_1_56_1","DOI":"10.1109\/ICDE48307.2020.00129"},{"key":"e_1_2_1_57_1","first-page":"82","volume-title":"VLDB","volume":"81","author":"Yannakakis Mihalis","year":"1981","unstructured":"Mihalis Yannakakis. Algorithms for acyclic database schemes. In VLDB, volume 81, pages 82--94, 1981."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3695831","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3695831","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T02:33:56Z","timestamp":1755916436000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3695831"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,4]]},"references-count":57,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2024,11,4]]}},"alternative-id":["10.1145\/3695831"],"URL":"https:\/\/doi.org\/10.1145\/3695831","relation":{},"ISSN":["2836-6573"],"issn-type":[{"type":"electronic","value":"2836-6573"}],"subject":[],"published":{"date-parts":[[2024,11,4]]}}}