{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T06:26:26Z","timestamp":1760077586602,"version":"3.41.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"3-4","license":[{"start":{"date-parts":[[2023,12,12]],"date-time":"2023-12-12T00:00:00Z","timestamp":1702339200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Norway RCN","award":["314528"],"award-info":[{"award-number":["314528"]}]},{"name":"DFG DFG Research","award":["DFG 411362735"],"award-info":[{"award-number":["DFG 411362735"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Comput. Theory"],"published-print":{"date-parts":[[2023,12,31]]},"abstract":"<jats:p>\n            We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers \u2113 (the number of irrelevant features) and\n            <jats:italic>k<\/jats:italic>\n            (the number of clusters), budget\n            <jats:italic>B<\/jats:italic>\n            , and a set of\n            <jats:italic>n<\/jats:italic>\n            categorical data points (represented by\n            <jats:italic>m<\/jats:italic>\n            -dimensional vectors whose elements belong to a finite set of values \u03a3), we want to select\n            <jats:italic>m<\/jats:italic>\n            -\u2113 relevant features such that the cost of any optimal\n            <jats:italic>k<\/jats:italic>\n            -clustering on these features does not exceed\n            <jats:italic>B<\/jats:italic>\n            . Here the cost of a cluster is the sum of Hamming distances (\u2113\n            <jats:sub>0<\/jats:sub>\n            -distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters.\n          <\/jats:p>\n          <jats:p>\n            We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters\n            <jats:italic>k<\/jats:italic>\n            ,\n            <jats:italic>B<\/jats:italic>\n            , and |\u03a3|. Our main result is an algorithm that solves the Feature Selection problem in time\n            <jats:italic>f<\/jats:italic>\n            (\n            <jats:italic>k, B<\/jats:italic>\n            , |\u03a3|)\u22c5\n            <jats:italic>m<\/jats:italic>\n            <jats:sup>\n              <jats:italic>g<\/jats:italic>\n              (\n              <jats:italic>k<\/jats:italic>\n              , |\u03a3|)\n            <\/jats:sup>\n            \u22c5\n            <jats:italic>n<\/jats:italic>\n            <jats:sup>2<\/jats:sup>\n            for some functions\n            <jats:italic>f<\/jats:italic>\n            and\n            <jats:italic>g<\/jats:italic>\n            . In other words, the problem is fixed-parameter tractable parameterized by\n            <jats:italic>B<\/jats:italic>\n            when |\u03a3| and\n            <jats:italic>k<\/jats:italic>\n            are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points could be clustered around centers satisfying specific constraints. One interesting fact about Constrained Clustering with Outliers is that besides Feature Selection, it encompasses many other fundamental problems regarding categorical data such as Robust Clustering and Binary and Boolean Low-rank Matrix Approximation with Outliers. Thus, as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.\n          <\/jats:p>","DOI":"10.1145\/3604797","type":"journal-article","created":{"date-parts":[[2023,9,1]],"date-time":"2023-09-01T12:22:38Z","timestamp":1693570958000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Parameterized Complexity of Feature Selection for Categorical Data Clustering"],"prefix":"10.1145","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8875-0102","authenticated-orcid":false,"given":"Sayan","family":"Bandyapadhyay","sequence":"first","affiliation":[{"name":"Department of Computer Science, Portland State University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1955-4612","authenticated-orcid":false,"given":"Fedor V.","family":"Fomin","sequence":"additional","affiliation":[{"name":"Department of Informatics, University of Bergen, Norway"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2619-2990","authenticated-orcid":false,"given":"Petr A.","family":"Golovach","sequence":"additional","affiliation":[{"name":"Department of Informatics, University of Bergen, Norway"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9436-7310","authenticated-orcid":false,"given":"Kirill","family":"Simonov","sequence":"additional","affiliation":[{"name":"Hasso Plattner Institute, University of Potsdam, Germany"}]}],"member":"320","published-online":{"date-parts":[[2023,12,12]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"30","volume-title":"Data Clustering: Algorithms and Applications","author":"Alelyani Salem","year":"2013","unstructured":"Salem Alelyani, Jiliang Tang, and Huan Liu. 2013. Feature selection for clustering: A review. In Data Clustering: Algorithms and Applications, Charu C. Aggarwal and Chandan K. Reddy (Eds.). CRC Press, 30\u2013373."},{"doi-asserted-by":"publisher","key":"e_1_3_2_3_2","DOI":"10.1145\/210332.210337"},{"doi-asserted-by":"publisher","key":"e_1_3_2_4_2","DOI":"10.1145\/800105.803393"},{"doi-asserted-by":"publisher","key":"e_1_3_2_5_2","DOI":"10.1137\/1.9781611975482.47"},{"doi-asserted-by":"publisher","key":"e_1_3_2_6_2","DOI":"10.4230\/LIPIcs.APPROX-RANDOM.2018.4"},{"doi-asserted-by":"publisher","key":"e_1_3_2_7_2","DOI":"10.1145\/3444942"},{"key":"e_1_3_2_8_2","first-page":"153","volume-title":"NIPS\u201909","author":"Boutsidis Christos","year":"2009","unstructured":"Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. 2009. Unsupervised feature selection for the k-means clustering problem. In NIPS\u201909. Curran Associates, Inc., 153\u2013161. http:\/\/papers.nips.cc\/paper\/3724-unsupervised-feature-selection-for-the-k-means-clustering-problem"},{"doi-asserted-by":"publisher","key":"e_1_3_2_9_2","DOI":"10.1109\/TIT.2014.2375327"},{"doi-asserted-by":"publisher","key":"e_1_3_2_10_2","DOI":"10.1201\/b20190"},{"doi-asserted-by":"publisher","key":"e_1_3_2_11_2","DOI":"10.1145\/1970392.1970395"},{"key":"e_1_3_2_12_2","first-page":"873","volume-title":"ICML\u201911","author":"Chen Yudong","year":"2011","unstructured":"Yudong Chen, Huan Xu, Constantine Caramanis, and Sujay Sanghavi. 2011. Robust matrix completion and corrupted columns. In ICML\u201911. 873\u2013880."},{"doi-asserted-by":"publisher","key":"e_1_3_2_13_2","DOI":"10.1145\/2746539.2746569"},{"doi-asserted-by":"publisher","key":"e_1_3_2_14_2","DOI":"10.1007\/978-3-319-21275-3"},{"key":"e_1_3_2_15_2","article-title":"On low rank approximation of binary matrices","volume":"1511","author":"Dan Chen","year":"2015","unstructured":"Chen Dan, Kristoffer Arnsfelt Hansen, He Jiang, Liwei Wang, and Yuchen Zhou. 2015. On low rank approximation of binary matrices. CoRR abs\/1511.01699 (2015). arxiv:1511.01699. http:\/\/arxiv.org\/abs\/1511.01699","journal-title":"CoRR"},{"doi-asserted-by":"publisher","key":"e_1_3_2_16_2","DOI":"10.1007\/978-1-4471-5559-1"},{"key":"e_1_3_2_17_2","article-title":"NP-hardness of hypercube 2-segmentation","volume":"1411","author":"Feige Uriel","year":"2014","unstructured":"Uriel Feige. 2014. NP-hardness of hypercube 2-segmentation. CoRR abs\/1411.0821 (2014). arxiv:1411.0821. http:\/\/arxiv.org\/abs\/1411.0821","journal-title":"CoRR"},{"doi-asserted-by":"publisher","key":"e_1_3_2_18_2","DOI":"10.1145\/3365653"},{"doi-asserted-by":"publisher","key":"e_1_3_2_19_2","DOI":"10.1007\/s10618-019-00669-5"},{"doi-asserted-by":"publisher","key":"e_1_3_2_20_2","DOI":"10.1016\/j.jcss.2020.10.005"},{"doi-asserted-by":"publisher","key":"e_1_3_2_21_2","DOI":"10.1137\/17M112258X"},{"key":"e_1_3_2_22_2","first-page":"3906","volume-title":"AAAI\u201920","author":"Ganian Robert","year":"2020","unstructured":"Robert Ganian, Iyad Kanj, Sebastian Ordyniak, and Stefan Szeider. 2020. On the parameterized complexity of clustering incomplete data into subspaces of small rank. In AAAI\u201920. AAAI Press, 3906\u20133913."},{"key":"e_1_3_2_23_2","article-title":"On the complexity of robust PCA and  \\(\\ell _1\\) -norm low-rank matrix approximation","volume":"1509","author":"Gillis Nicolas","year":"2015","unstructured":"Nicolas Gillis and Stephen A. Vavasis. 2015. On the complexity of robust PCA and \\(\\ell _1\\) -norm low-rank matrix approximation. CoRR abs\/1509.09236 (2015). http:\/\/arxiv.org\/abs\/1509.09236","journal-title":"CoRR"},{"doi-asserted-by":"publisher","key":"e_1_3_2_24_2","DOI":"10.1006\/jcss.2001.1774"},{"doi-asserted-by":"publisher","key":"e_1_3_2_25_2","DOI":"10.5555\/1293927.1293931"},{"key":"e_1_3_2_26_2","series-title":"ICML\u201919","first-page":"3551","volume":"97","author":"Kumar Ravi","year":"2019","unstructured":"Ravi Kumar, Rina Panigrahy, Ali Rahimi, and David P. Woodruff. 2019. Faster algorithms for binary matrix factorization. In ICML\u201919(Proceedings of Machine Learning Research, Vol. 97). PMLR, 3551\u20133559. http:\/\/proceedings.mlr.press\/v97\/kumar19a.html"},{"doi-asserted-by":"publisher","key":"e_1_3_2_27_2","DOI":"10.1145\/1081870.1081894"},{"doi-asserted-by":"publisher","key":"e_1_3_2_28_2","DOI":"10.1109\/TDSC.2012.21"},{"doi-asserted-by":"publisher","key":"e_1_3_2_29_2","DOI":"10.1137\/060673898"},{"doi-asserted-by":"publisher","key":"e_1_3_2_30_2","DOI":"10.1109\/TKDE.2008.53"},{"key":"e_1_3_2_31_2","first-page":"4922","volume-title":"IJCAI\u2019 20","author":"Miettinen Pauli","year":"2020","unstructured":"Pauli Miettinen and Stefan Neumann. 2020. Recent developments in Boolean matrix factorization. In IJCAI\u2019 20. ijcai.org, 4922\u20134928."},{"doi-asserted-by":"publisher","key":"e_1_3_2_32_2","DOI":"10.1145\/2020408.2020424"},{"doi-asserted-by":"publisher","key":"e_1_3_2_33_2","DOI":"10.1145\/506147.506149"},{"key":"e_1_3_2_34_2","first-page":"5818","volume-title":"ICML\u201919","author":"Simonov Kirill","year":"2019","unstructured":"Kirill Simonov, Fedor V. Fomin, Petr A. Golovach, and Fahad Panolan. 2019. Refined complexity of PCA with outliers. In ICML\u201919, Vol. 97. PMLR, 5818\u20135826. http:\/\/proceedings.mlr.press\/v97\/simonov19a.html"},{"doi-asserted-by":"publisher","key":"e_1_3_2_35_2","DOI":"10.1007\/978-0-387-87811-9"},{"key":"e_1_3_2_36_2","first-page":"2496","volume-title":"NIPS\u201910","author":"Xu Huan","year":"2010","unstructured":"Huan Xu, Constantine Caramanis, and Sujay Sanghavi. 2010. Robust PCA via outlier pursuit. In NIPS\u201910. Curran Associates, Inc., 2496\u20132504. http:\/\/papers.nips.cc\/paper\/4005-robust-pca-via-outlier-pursuit"},{"key":"e_1_3_2_37_2","first-page":"391","volume-title":"ICDM\u201907","author":"Zhang Zhongyuan","year":"2007","unstructured":"Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. 2007. Binary matrix factorization with applications. In ICDM\u201907. IEEE, 391\u2013400."}],"container-title":["ACM Transactions on Computation Theory"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3604797","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3604797","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:46:04Z","timestamp":1750178764000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3604797"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,12]]},"references-count":36,"journal-issue":{"issue":"3-4","published-print":{"date-parts":[[2023,12,31]]}},"alternative-id":["10.1145\/3604797"],"URL":"https:\/\/doi.org\/10.1145\/3604797","relation":{},"ISSN":["1942-3454","1942-3462"],"issn-type":[{"type":"print","value":"1942-3454"},{"type":"electronic","value":"1942-3462"}],"subject":[],"published":{"date-parts":[[2023,12,12]]},"assertion":[{"value":"2021-08-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-06-23","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}