{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,15]],"date-time":"2026-05-15T01:16:45Z","timestamp":1778807805253,"version":"3.51.4"},"reference-count":68,"publisher":"Association for Computing Machinery (ACM)","issue":"5","funder":[{"name":"NSF","award":["IIS-2348919"],"award-info":[{"award-number":["IIS-2348919"]}]},{"name":"UCSC Chancellor\u2019s Postdoctoral Fellowship"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,11,10]]},"abstract":"<jats:p>\n                    We introduce and study the\n                    <jats:italic toggle=\"yes\">k<\/jats:italic>\n                    -center clustering problem with\n                    <jats:italic toggle=\"yes\">set outliers,<\/jats:italic>\n                    a natural and practical generalization of the classical\n                    <jats:italic toggle=\"yes\">k<\/jats:italic>\n                    -center clustering with outliers. Instead of removing individual data points, our model allows discarding up to\n                    <jats:italic toggle=\"yes\">z<\/jats:italic>\n                    subsets from a given family of candidate outlier sets H. More formally, given a metric space (P,dist), where\n                    <jats:italic toggle=\"yes\">P<\/jats:italic>\n                    is a set of elements and dist a distance metric, a family of sets H \u2286 2\n                    <jats:sup>P<\/jats:sup>\n                    , and parameters k, z, the goal is to compute a set of\n                    <jats:italic toggle=\"yes\">k<\/jats:italic>\n                    centers C \u2286 P and a family of\n                    <jats:italic toggle=\"yes\">z<\/jats:italic>\n                    sets H \u2286 H such that C\u2229(\u22c3\n                    <jats:sub>h<\/jats:sub>\n                    \u2208 H h)=\u2205 to minimize max\n                    <jats:sub>p<\/jats:sub>\n                    \u2208 P\u2216(\u22c3\n                    <jats:sub>h<\/jats:sub>\n                    \u2208 H h) min\n                    <jats:sub>c<\/jats:sub>\n                    \u2208 C dist (p,c) (clustering cost). This abstraction captures structured noise common in database applications, such as faulty data sources or corrupted records in data integration and sensor systems.\n                  <\/jats:p>\n                  <jats:p>\n                    We present the first approximation algorithms for this problem in both general and geometric settings. Our methods provide tri-criteria approximations: selecting up to 2k centers and 2f z outlier sets (where\n                    <jats:italic toggle=\"yes\">f<\/jats:italic>\n                    is the maximum number of sets that a point belongs to), while achieving constant-factor approximation in clustering cost. In geometric settings, we leverage range and BBD trees to achieve near-linear time algorithms. In many real applications f=1. In this case we further improve the running time of our algorithms by constructing small\n                    <jats:italic toggle=\"yes\">coresets.<\/jats:italic>\n                    We also provide a hardness result for the general problem showing that it is unlikely to get any sublinear approximation on the clustering cost selecting less than f \u2022 z outlier sets.\n                  <\/jats:p>\n                  <jats:p>We demonstrate that this model naturally captures relational clustering with outliers. We define and study two new formulations: one where outliers are result tuples in a join, and another where outliers are input tuples whose removal affects the join output. We provide approximation algorithms for both, establishing a tight connection between robust clustering and relational query evaluation.<\/jats:p>","DOI":"10.1145\/3767712","type":"journal-article","created":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T15:06:02Z","timestamp":1762959962000},"page":"1-27","source":"Crossref","is-referenced-by-count":2,"title":["Clustering with Set Outliers and Applications in Relational Clustering"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3091-3823","authenticated-orcid":false,"given":"Vaishali","family":"Surianarayanan","sequence":"first","affiliation":[{"name":"University of California, Santa Cruz, Santa Cruz, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9356-526X","authenticated-orcid":false,"given":"Neeraj","family":"Kumar","sequence":"additional","affiliation":[{"name":"Meta Platforms, Menlo Park, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2114-8886","authenticated-orcid":false,"given":"Stavros","family":"Sintos","sequence":"additional","affiliation":[{"name":"University of Illinois Chicago, Chicago, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,11,12]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"https:\/\/db-engines.com\/en\/ranking_categories."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611976489.8"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3196959.3196960"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3695835"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/B978-044482537-7\/50003-6"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1613\/jair.1.14883"},{"key":"e_1_2_1_7_1","first-page":"10","volume-title":"International Conference on Artificial Intelligence and Statistics","author":"Amagata D.","year":"2024","unstructured":"D. Amagata. Fair k-center clustering with outliers. In International Conference on Artificial Intelligence and Statistics, pages 10-18. PMLR, 2024."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3695833"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.4086\/toc.2012.v008a006"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0925-7721(00)00022-5"},{"key":"e_1_2_1_11_1","volume-title":"An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891-923","author":"Arya S.","year":"1998","unstructured":"S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891-923, 1998."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1137\/110859440"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2402.322389"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/0020-0190(79)90117-0"},{"key":"e_1_2_1_15_1","volume-title":"A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM (JACM), 42(1):67-90","author":"Callahan P. B.","year":"1995","unstructured":"P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM (JACM), 42(1):67-90, 1995."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3578517"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/3317315.3317319"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-22105-7_14"},{"key":"e_1_2_1_19_1","volume-title":"36th International Symposium on Computational Geometry (SoCG 2020)","author":"Chan T. M.","year":"2020","unstructured":"T. M. Chan and Q. He. Faster approximation algorithms for geometric set cover. In 36th International Symposium on Computational Geometry (SoCG 2020), 2020."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/301250.301257"},{"key":"e_1_2_1_21_1","first-page":"642","volume-title":"SODA","volume":"1","author":"Charikar M.","year":"2001","unstructured":"M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for facility location problems with outliers. In SODA, volume 1, pages 642-651. Citeseer, 2001."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/780542.780548"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611975994.62"},{"key":"e_1_2_1_24_1","first-page":"434","article-title":"Coresets for relational data and the applications","volume":"35","author":"Chen J.","year":"2022","unstructured":"J. Chen, Q. Yang, R. Huang, and H. Ding. Coresets for relational data and the applications. Advances in Neural Information Processing Systems, 35:434-448, 2022.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/1347082.1347173"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00145"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00103"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2749431"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/1064092.1064115"},{"key":"e_1_2_1_30_1","first-page":"2742","volume-title":"International Conference on Artificial Intelligence and Statistics","author":"Curtin R.","year":"2020","unstructured":"R. Curtin, B. Moseley, H. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. Rk-means: Fast clustering for relational data. In International Conference on Artificial Intelligence and Statistics, pages 2742-2752. PMLR, 2020."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/261226"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS54959.2023.00090"},{"key":"e_1_2_1_33_1","first-page":"13","volume-title":"29th Annual European Symposium on Algorithms (ESA 2021","author":"de Berg M.","year":"2021","unstructured":"M. de Berg, M. Monemizadeh, and Y. Zhong. k-center clustering with outliers in the sliding-window model. In 29th Annual European Symposium on Algorithms (ESA 2021), pages 13-1, 2021."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/3510397.3510401"},{"key":"e_1_2_1_35_1","volume-title":"24th International Conference on Database Theory","author":"Deep S.","year":"2021","unstructured":"S. Deep and P. Koutris. Ranked enumeration of conjunctive query results. In 24th International Conference on Database Theory, 2021."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3690624.3709184"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/780542.780629"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3695831"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2402.322390"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/62212.62255"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF00288933"},{"key":"e_1_2_1_42_1","volume-title":"Theoretical computer science, 38:293-306","author":"Gonzalez T. F.","year":"1985","unstructured":"T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical computer science, 38:293-306, 1985."},{"key":"e_1_2_1_43_1","volume-title":"Treewidth and hypertree width. Tractability: Practical Approaches to Hard Problems, 1:20","author":"Gottlob G.","year":"2014","unstructured":"G. Gottlob, G. Greco, F. Scarcello, et al. Treewidth and hypertree width. Tractability: Practical Approaches to Hard Problems, 1:20, 2014."},{"key":"e_1_2_1_44_1","first-page":"723","volume-title":"Handbook of discrete and computational geometry","author":"Halperin D.","year":"2017","unstructured":"D. Halperin and M. Sharir. Arrangements. In Handbook of discrete and computational geometry, pages 723-762. Chapman and Hall\/CRC, 2017."},{"key":"e_1_2_1_45_1","volume-title":"STOC","author":"Har-Peled S.","year":"2003","unstructured":"S. Har-Peled and S. Mazumdar. Coresets for k-means andk-median clustering and their applications. STOC, 2003."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/1064092.1064117"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4613-0131-8"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3406325.3451058"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-023-00817-w"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209889.3209896"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/509907.510017"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcss.2007.06.019"},{"key":"e_1_2_1_53_1","volume-title":"Approximately hard: The unique games conjecture. Simons foundation","author":"Klarreich E.","year":"2011","unstructured":"E. Klarreich. Approximately hard: The unique games conjecture. Simons foundation, 2011."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3188745.3188882"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2723713"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/3654940"},{"key":"e_1_2_1_57_1","first-page":"28","article-title":"Fast distributed k-center clustering with outliers on massive data","author":"Malkomes G.","year":"2015","unstructured":"G. Malkomes, M. J. Kusner, W. Chen, K. Q. Weinberger, and B. Moseley. Fast distributed k-center clustering with outliers on massive data. Advances in Neural Information Processing Systems, 28, 2015.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_58_1","first-page":"21","article-title":"Diversity of answers to conjunctive queries","author":"Merkl T. C.","year":"2025","unstructured":"T. C. Merkl, R. Pichler, and S. Skritek. Diversity of answers to conjunctive queries. Logical Methods in Computer Science, 21, 2025.","journal-title":"Logical Methods in Computer Science"},{"key":"e_1_2_1_59_1","volume-title":"International Colloquium on Automata, Languages, and Programming","author":"Moseley B.","year":"2021","unstructured":"B. Moseley, K. Pruhs, A. Samadian, and Y. Wang. Relational algorithms for k-means clustering. In International Colloquium on Automata, Languages, and Programming, 2021."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.14778\/2535573.2488340"},{"key":"e_1_2_1_61_1","volume-title":"Relational database management system market: Industry analysis and forecast 2021-2027: By type, deployment, end users, and region","year":"2022","unstructured":"Report. Relational database management system market: Industry analysis and forecast 2021-2027: By type, deployment, end users, and region, 2022."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1007\/0-387-25465-X_15"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-35514-2_32"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2882939"},{"key":"e_1_2_1_65_1","volume-title":"Clustering with set outliers and applications in relational clustering. https:\/\/arxiv.org\/abs\/2509.16194","author":"Surianarayanan V.","year":"2025","unstructured":"V. Surianarayanan, N. Kumar, and S. Sintos. Clustering with set outliers and applications in relational clustering. https:\/\/arxiv.org\/abs\/2509.16194, 2025."},{"key":"e_1_2_1_66_1","first-page":"1582","volume-title":"Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases","volume":"13","author":"Tziavelis N.","unstructured":"N. Tziavelis, D. Ajwani, W. Gatterbauer, M. Riedewald, and X. Yang. Optimal algorithms for ranked enumeration of answers to full conjunctive queries. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 13, page 1582. NIH Public Access, 2020."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3584372.3588670"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00129"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3767712","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T16:08:57Z","timestamp":1762963737000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3767712"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,10]]},"references-count":68,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,11,10]]}},"alternative-id":["10.1145\/3767712"],"URL":"https:\/\/doi.org\/10.1145\/3767712","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,10]]}}}