{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T05:15:54Z","timestamp":1755839754703},"reference-count":74,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,7]]},"abstract":"<jats:p>Although clustering methods have shown promising performance in various applications, they cannot effectively handle incomplete data. Existing studies often impute missing values first before clustering analysis and conduct these two processes separately. However, inaccurate imputation does not necessarily contribute positively to the subsequent clustering. Intuitively, accurate imputation and clustering can serve and benefit from each other, where clustering-based imputation methods typically utilize cluster signals to impute incomplete data and accurate fillings are expected to bring more valuable data for clustering. Therefore, in this manuscript, rather than considering two tasks independently or conducting them respectively, we study simultaneous clustering and imputing over incomplete data. The immediate benefit is that such a strategy improves both clustering and imputation performance simultaneously, to get a win-win result. Our major technical highlights include (1) the problem formalization and NP-hardness analysis on computing simultaneous clustering and imputing results, (2) exact solutions by transforming the problem as the integer linear programming (ILP) formulation, and (3) efficient approximation algorithms based on the linear programming (LP) relaxation and local neighbors (LN) solution, with approximation guarantees. Experiments on various real-world datasets demonstrate the superiority of our work in clustering and imputing incomplete data.<\/jats:p>","DOI":"10.14778\/3681954.3681982","type":"journal-article","created":{"date-parts":[[2024,8,30]],"date-time":"2024-08-30T16:23:36Z","timestamp":1725035016000},"page":"3045-3057","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Win-Win: On Simultaneous Clustering and Imputing over Incomplete Data"],"prefix":"10.14778","volume":"17","author":[{"given":"Yu","family":"Sun","sequence":"first","affiliation":[{"name":"College of Computer, Science, Nankai University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jingyu","family":"Zhu","sequence":"additional","affiliation":[{"name":"Nankai University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiao","family":"Xu","sequence":"additional","affiliation":[{"name":"Ant Group"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xian","family":"Xu","sequence":"additional","affiliation":[{"name":"Ant Group"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuyao","family":"Sun","sequence":"additional","affiliation":[{"name":"Ant Group"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shaoxu","family":"Song","sequence":"additional","affiliation":[{"name":"Tsinghua University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiang","family":"Li","sequence":"additional","affiliation":[{"name":"Ping An Health Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaojie","family":"Yuan","sequence":"additional","affiliation":[{"name":"Nankai University"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,8,30]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2024. https:\/\/sci2s.ugr.es\/keel\/dataset.php?cod=59."},{"key":"e_1_2_1_2_1","unstructured":"2024. https:\/\/archive.ics.uci.edu\/dataset\/267\/banknote+authentication."},{"key":"e_1_2_1_3_1","unstructured":"2024. https:\/\/archive.ics.uci.edu\/dataset\/39\/ecoli."},{"key":"e_1_2_1_4_1","unstructured":"2024. https:\/\/archive.ics.uci.edu\/dataset\/89\/solar+flare."},{"key":"e_1_2_1_5_1","unstructured":"2024. https:\/\/archive.ics.uci.edu\/dataset\/90\/soybean+large."},{"key":"e_1_2_1_6_1","unstructured":"2024. https:\/\/github.com\/ClarkDinh\/k-CMM."},{"key":"e_1_2_1_7_1","unstructured":"2024. https:\/\/github.com\/ermongroup\/CSDI."},{"key":"e_1_2_1_8_1","unstructured":"2024. https:\/\/github.com\/erssss\/impute-SCI."},{"key":"e_1_2_1_9_1","unstructured":"2024. https:\/\/github.com\/ethan-yizhang\/GMM-with-Incomplete-Data."},{"key":"e_1_2_1_10_1","unstructured":"2024. https:\/\/github.com\/HoloClean\/holoclean."},{"key":"e_1_2_1_11_1","unstructured":"2024. https:\/\/github.com\/iiradia\/kPOD."},{"key":"e_1_2_1_12_1","unstructured":"2024. https:\/\/github.com\/jsyoon0823\/GAIN."},{"key":"e_1_2_1_13_1","unstructured":"2024. https:\/\/hackage.haskell.org\/package\/MissingPy."},{"key":"e_1_2_1_14_1","unstructured":"2024. https:\/\/sci2s.ugr.es\/keel\/dataset.php?cod=180."},{"key":"e_1_2_1_15_1","unstructured":"2024. https:\/\/sci2s.ugr.es\/keel\/dataset.php?cod=60."},{"key":"e_1_2_1_16_1","unstructured":"2024. https:\/\/sci2s.ugr.es\/keel\/dataset.php?cod=63."},{"key":"e_1_2_1_17_1","unstructured":"2024. https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.impute.IterativeImputer.html."},{"key":"e_1_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Sara Ahmadian Ashkan Norouzi-Fard Ola Svensson and Justin Ward. 2017. Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms. In FOCS. 61--72.","DOI":"10.1109\/FOCS.2017.15"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1080\/00031305.1992.10475879"},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","unstructured":"Arvind Arasu Raghav Kaushik and Jian Li. 2011. Data generation using declarative constraints. In SIGMOD. 685--696.","DOI":"10.1145\/1989323.1989395"},{"key":"e_1_2_1_21_1","first-page":"69","article-title":"Error measures for generalizing about forecasting methods: Empirical comparisons","volume":"8","author":"Scott Armstrong J","year":"1992","unstructured":"J Scott Armstrong and Fred Collopy. 1992. Error measures for generalizing about forecasting methods: Empirical comparisons. IJF 8, 1 (1992), 69--80.","journal-title":"IJF"},{"key":"e_1_2_1_22_1","first-page":"1145","article-title":"The use of the area under the ROC curve in the evaluation of machine learning algorithms","volume":"30","author":"Bradley Andrew P","year":"1997","unstructured":"Andrew P Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. PR 30, 7 (1997), 1145--1159.","journal-title":"PR"},{"key":"e_1_2_1_23_1","volume-title":"BRITS: Bidirectional Recurrent Imputation for Time Series. In NeurIPS. 6776--6786.","author":"Cao Wei","year":"2018","unstructured":"Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, and Yitan Li. 2018. BRITS: Bidirectional Recurrent Imputation for Time Series. In NeurIPS. 6776--6786."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1006\/jcss.2002.1882"},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"Sanjay Chawla and Aristides Gionis. 2013. k-means-: A unified approach to clustering and outlier detection. In SIAM. 189--197.","DOI":"10.1137\/1.9781611972832.21"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-018-24271-9"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1080\/00031305.2015.1086685"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1093\/imaiai\/iay008"},{"key":"e_1_2_1_29_1","volume-title":"Costa","author":"Pereira de Souto Marc\u00edlio Carlos","year":"2015","unstructured":"Marc\u00edlio Carlos Pereira de Souto, Pablo A. Jaskowiak, and Ivan G. Costa. 2015. Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinform. 16 (2015), 64:1--64:9."},{"key":"e_1_2_1_30_1","volume-title":"Clustering mixed numerical and categorical data with missing values. IS 571","author":"Dinh Duy-Tai","year":"2021","unstructured":"Duy-Tai Dinh, Van-Nam Huynh, and Songsak Sriboonchitta. 2021. Clustering mixed numerical and categorical data with missing values. IS 571 (2021), 418--442."},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Carlotta Domeniconi and Bojun Yan. 2004. Nearest Neighbor Ensemble. In ICPR. 228--231.","DOI":"10.1109\/ICPR.2004.1334065"},{"key":"e_1_2_1_32_1","unstructured":"Martin Ester Hans-Peter Kriegel J\u00f6rg Sander and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD. 226--231."},{"key":"e_1_2_1_33_1","first-page":"60","article-title":"Clustering with Missing Features","volume":"14","author":"Gao Kun","year":"2022","unstructured":"Kun Gao, Hassan Ali Khan, and Wenwen Qu. 2022. Clustering with Missing Features: A Density-Based Approach. Symmetry 14, 1 (2022), 60.","journal-title":"A Density-Based Approach. Symmetry"},{"volume-title":"A probabilistic interpretation of precision, recall and F-score, with implication for evaluation","author":"Goutte Cyril","key":"e_1_2_1_34_1","unstructured":"Cyril Goutte and Eric Gaussier. 2005. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In ECIR. Springer, 345--359."},{"key":"e_1_2_1_35_1","unstructured":"Gurobi Optimization LLC. 2021. Gurobi Optimizer Reference Manual."},{"key":"e_1_2_1_36_1","volume-title":"Data Mining: Concepts and Techniques","author":"Han Jiawei","year":"2011","unstructured":"Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann.","edition":"3"},{"key":"e_1_2_1_37_1","volume-title":"Ilyas and Xu Chu","author":"Ihab","year":"2019","unstructured":"Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM."},{"key":"e_1_2_1_38_1","volume-title":"Ilyas and Theodoros Rekatsinas","author":"Ihab","year":"2022","unstructured":"Ihab F. Ilyas and Theodoros Rekatsinas. 2022. Machine Learning and Data Cleaning: Which Serves the Other? JDIQ 14, 3 (2022), 13:1--13:11."},{"key":"e_1_2_1_39_1","volume-title":"0--1 Integer Linear Programming with a Linear Number of Constraints. CoRR abs\/1401.5512","author":"Impagliazzo Russell","year":"2014","unstructured":"Russell Impagliazzo, Shachar Lovett, Ramamohan Paturi, and Stefan Schneider. 2014. 0--1 Integer Linear Programming with a Linear Number of Constraints. CoRR abs\/1401.5512 (2014). arXiv:1401.5512"},{"key":"e_1_2_1_40_1","volume-title":"Dubes","author":"Jain Anil K.","year":"1988","unstructured":"Anil K. Jain and Richard C. Dubes. 1988. Algorithms for Clustering Data."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/331499.331504"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3430915.3430917"},{"key":"e_1_2_1_43_1","doi-asserted-by":"crossref","unstructured":"Narendra Karmarkar. 1984. A New Polynomial-Time Algorithm for Linear Programming. In STOC. 302--311.","DOI":"10.1145\/800057.808695"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01908075"},{"key":"e_1_2_1_46_1","first-page":"573","article-title":"Towards Missing Data Imputation: A Study of Fuzzy K-means Clustering Method","volume":"3066","author":"Li Dan","year":"2004","unstructured":"Dan Li, Jitender S. Deogun, William Spaulding, and Bill Shuart. 2004. Towards Missing Data Imputation: A Study of Fuzzy K-means Clustering Method. In RSCTC, Vol. 3066. 573--579.","journal-title":"RSCTC"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2007.1078"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1016\/0020-0190(92)90208-D"},{"volume-title":"Statistical analysis with missing data","author":"Little Roderick JA","key":"e_1_2_1_49_1","unstructured":"Roderick JA Little and Donald B Rubin. 2019. Statistical analysis with missing data. Vol. 793. John Wiley & Sons."},{"key":"e_1_2_1_50_1","volume-title":"Kai Ming Ting, and Zhi-Hua Zhou","author":"Liu Fei Tony","year":"2008","unstructured":"Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In ICDM. 413--422."},{"volume-title":"Clustering in bioinformatics and drug discovery","author":"MacCuish John David","key":"e_1_2_1_51_1","unstructured":"John David MacCuish and Norah E MacCuish. 2010. Clustering in bioinformatics and drug discovery. CRC Press."},{"key":"e_1_2_1_52_1","doi-asserted-by":"crossref","unstructured":"Chris Mayfield Jennifer Neville and Sunil Prabhakar. 2010. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD. 75--86.","DOI":"10.1145\/1807167.1807178"},{"volume-title":"A new iterative fuzzy clustering algorithm for multiple imputation of missing data","author":"Nikfalazar Sanaz","key":"e_1_2_1_53_1","unstructured":"Sanaz Nikfalazar, Chung-Hsing Yeh, Susan E. Bedingfield, and Hadi Akbarzade Khorshidi. 2017. A new iterative fuzzy clustering algorithm for multiple imputation of missing data. In FUZZ-IEEE. 1--6."},{"key":"e_1_2_1_54_1","unstructured":"OLIVER Pfaffel. 2020. ClustImpute: An R package for K-means clustering with build-in missing data imputation."},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536258.2536270"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0065773"},{"key":"e_1_2_1_58_1","doi-asserted-by":"crossref","unstructured":"Shaoxu Song Fei Gao Ruihong Huang and Yihan Wang. 2021. On Saving Outliers for Better Clustering over Noisy Data. In SIGMOD. 1692--1704.","DOI":"10.1145\/3448016.3457271"},{"key":"e_1_2_1_59_1","doi-asserted-by":"crossref","unstructured":"Shaoxu Song Chunping Li and Xiaoquan Zhang. 2015. Turn Waste into Wealth: On Simultaneous Clustering and Cleaning over Dirty Data. In SIGKDD. 1115--1124.","DOI":"10.1145\/2783258.2783317"},{"key":"e_1_2_1_60_1","doi-asserted-by":"crossref","unstructured":"Shaoxu Song and Yu Sun. 2020. Imputing Various Incomplete Attributes via Distance Likelihood Maximization. In SIGKDD. 535--545.","DOI":"10.1145\/3394486.3403096"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2883103"},{"key":"e_1_2_1_62_1","volume-title":"An Integer Programming Approach To Subspace Clustering With Missing Data. arXiv preprint arXiv:2309.15285","author":"Soni Akhilesh","year":"2023","unstructured":"Akhilesh Soni, Jeff Linderoth, Jim Luedtke, and Daniel Pimentel-Alarc\u00f3n. 2023. An Integer Programming Approach To Subspace Clustering With Missing Data. arXiv preprint arXiv:2309.15285 (2023)."},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btr597"},{"key":"e_1_2_1_64_1","volume-title":"CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. In NeurIPS. 24804--24816.","author":"Tashiro Yusuke","year":"2021","unstructured":"Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. 2021. CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. In NeurIPS. 24804--24816."},{"key":"e_1_2_1_65_1","first-page":"1","article-title":"mice: Multivariate imputation by chained equations in R","volume":"45","author":"Buuren Stef Van","year":"2011","unstructured":"Stef Van Buuren and Karin Groothuis-Oudshoorn. 2011. mice: Multivariate imputation by chained equations in R. JOSS 45 (2011), 1--67.","journal-title":"JOSS"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.3923\/itj.2011.478.484"},{"key":"e_1_2_1_67_1","unstructured":"Christopher Whelan Greg Harrell and Jin Wang. 2015. Understanding the k-medians problem. In CSC. 219."},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1971.10482356"},{"key":"e_1_2_1_69_1","unstructured":"Richard Wu Aoqian Zhang Ihab F. Ilyas and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. In MLSys."},{"key":"e_1_2_1_70_1","volume-title":"Elmagarmid","author":"Yakout Mohamed","year":"2013","unstructured":"Mohamed Yakout, Laure Berti-\u00c9quille, and Ahmed K. Elmagarmid. 2013. Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD. 553--564."},{"key":"e_1_2_1_71_1","volume-title":"Missing value imputation based on gaussian mixture model for the internet of things. MPE","author":"Yan Xiaobo","year":"2015","unstructured":"Xiaobo Yan, Weiqing Xiong, Liang Hu, Feng Wang, and Kuo Zhao. 2015. Missing value imputation based on gaussian mixture model for the internet of things. MPE (2015)."},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.3390\/s150819443"},{"key":"e_1_2_1_73_1","volume-title":"GAIN: Missing Data Imputation using Generative Adversarial Nets. In ICML. 5675--5684.","author":"Yoon Jinsung","year":"2018","unstructured":"Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2018. GAIN: Missing Data Imputation using Generative Adversarial Nets. In ICML. 5675--5684."},{"key":"e_1_2_1_74_1","first-page":"128","article-title":"Missing Value Imputation Based on Data Clustering","volume":"1","author":"Zhang Shichao","year":"2008","unstructured":"Shichao Zhang, Jilian Zhang, Xiaofeng Zhu, Yongsong Qin, and Chengqi Zhang. 2008. Missing Value Imputation Based on Data Clustering. TCS 1 (2008), 128--138.","journal-title":"TCS"},{"key":"e_1_2_1_75_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3408318","article-title":"Gaussian mixture model clustering with incomplete data","volume":"17","author":"Zhang Yi","year":"2021","unstructured":"Yi Zhang, Miaomiao Li, Siwei Wang, Sisi Dai, Lei Luo, En Zhu, Huiying Xu, Xinzhong Zhu, Chaoyun Yao, and Haoran Zhou. 2021. Gaussian mixture model clustering with incomplete data. TOMM 17, 1s (2021), 1--14.","journal-title":"TOMM"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3681954.3681982","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,4]],"date-time":"2024-09-04T18:43:44Z","timestamp":1725475424000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3681954.3681982"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7]]},"references-count":74,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["10.14778\/3681954.3681982"],"URL":"https:\/\/doi.org\/10.14778\/3681954.3681982","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2024,7]]},"assertion":[{"value":"2024-08-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}