{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,9]],"date-time":"2025-06-09T02:22:40Z","timestamp":1749435760124,"version":"3.37.3"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2023,12,27]],"date-time":"2023-12-27T00:00:00Z","timestamp":1703635200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,12,27]],"date-time":"2023-12-27T00:00:00Z","timestamp":1703635200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"A*STAR Centre for Frontier AI Research"},{"name":"Program for Guangdong Introducing Innovative and Entrepreneurial Teams","award":["2017ZT07X386"],"award-info":[{"award-number":["2017ZT07X386"]}]},{"name":"Program for Guangdong Provincial Key Laboratory","award":["2020B121201001"],"award-info":[{"award-number":["2020B121201001"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2024,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Real-world datasets inevitably contain biases that arise from different sources or conditions during data collection. Consequently, such inconsistency itself acts as a confounding factor that disturbs the cluster analysis. Existing methods eliminate the biases by projecting data onto the orthogonal complement of the subspace expanded by the confounding factor before clustering. Therein, the interested clustering factor and the confounding factor are coarsely considered in the raw feature space, where the correlation between the data and the confounding factor is ideally assumed to be linear for convenient solutions. 
These approaches are thus limited in scope as the data in real applications is usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias, which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. To be specific, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation delivered by variational auto-encoder. Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that our SCAB achieves a significant gain in clustering performance by removing the confounding bias.<\/jats:p>","DOI":"10.1007\/s10994-023-06451-5","type":"journal-article","created":{"date-parts":[[2023,12,27]],"date-time":"2023-12-27T22:02:36Z","timestamp":1703714556000},"page":"3711-3730","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Sanitized clustering against confounding bias"],"prefix":"10.1007","volume":"113","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3204-0739","authenticated-orcid":false,"given":"Yinghua","family":"Yao","sequence":"first","affiliation":[]},{"given":"Yuangang","family":"Pan","sequence":"additional","affiliation":[]},{"given":"Jing","family":"Li","sequence":"additional","affiliation":[]},{"given":"Ivor W.","family":"Tsang","sequence":"additional","affiliation":[]},{"given":"Xin","family":"Yao","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,12,27]]},"reference":[{"key":"6451_CR1","unstructured":"Alemi A. A., Fischer I., Dillon J. V., et\u00a0al. (2017). Deep variational information bottleneck. 
In: ICLR"},{"key":"6451_CR2","first-page":"437","volume-title":"21th European symposium on artificial neural networks","author":"D Anguita","year":"2013","unstructured":"Anguita, D., Ghio, A., Oneto, L., et al. (2013). A public domain dataset for human activity recognition using smartphones. 21th European symposium on artificial neural networks (pp. 437\u2013442). CIACO: Computational Intelligence and Machine Learning (ESANN)."},{"issue":"2","key":"6451_CR3","doi-asserted-by":"publisher","first-page":"81","DOI":"10.1145\/380995.381030","volume":"2","author":"SD Bay","year":"2000","unstructured":"Bay, S. D., Kibler, D. F., Pazzani, M. J., et al. (2000). The UCI KDD archive of large data sets for data mining research and experimentation. ACM SIGKDD Explorations Newsletter, 2(2), 81\u201385.","journal-title":"ACM SIGKDD Explorations Newsletter"},{"issue":"1","key":"6451_CR4","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1093\/bioinformatics\/btg385","volume":"20","author":"M Benito","year":"2004","unstructured":"Benito, M., Parker, J., Du, Q., et al. (2004). Adjustment of systematic microarray data biases. Bioinformatics, 20(1), 105\u2013114.","journal-title":"Bioinformatics"},{"key":"6451_CR5","volume-title":"Pattern recognition and machine learning,","author":"CM Bishop","year":"2006","unstructured":"Bishop, C. M. (2006). Pattern recognition and machine learning, (Vol. 4). Springer."},{"issue":"7","key":"6451_CR6","doi-asserted-by":"publisher","first-page":"1901","DOI":"10.1007\/s10994-021-06015-5","volume":"110","author":"A Boubekki","year":"2021","unstructured":"Boubekki, A., Kampffmeyer, M., Brefeld, U., et al. (2021). Joint optimization of an autoencoder for clustering and embedding. Machine Learning, 110(7), 1901\u20131937.","journal-title":"Machine Learning"},{"issue":"8","key":"6451_CR7","doi-asserted-by":"publisher","first-page":"790","DOI":"10.1109\/34.400568","volume":"17","author":"Y Cheng","year":"1995","unstructured":"Cheng, Y. (1995). 
Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8), 790\u2013799.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"6451_CR8","first-page":"5029","volume":"30","author":"F Chierichetti","year":"2017","unstructured":"Chierichetti, F., Kumar, R., Lattanzi, S., et al. (2017). Fair clustering through fairlets. In: NeurIPS, 30, 5029\u20135037.","journal-title":"In: NeurIPS"},{"key":"6451_CR9","first-page":"259","volume":"10","author":"M Feldman","year":"2015","unstructured":"Feldman, M., Friedler, S. A., Moeller, J., et al. (2015). Certifying and removing disparate impact. SIGKDD, 10, 259\u2013268.","journal-title":"SIGKDD"},{"issue":"3","key":"6451_CR10","doi-asserted-by":"publisher","first-page":"539","DOI":"10.1093\/biostatistics\/kxr034","volume":"13","author":"JA Gagnon-Bartsch","year":"2012","unstructured":"Gagnon-Bartsch, J. A., & Speed, T. P. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics, 13(3), 539\u2013552.","journal-title":"Biostatistics"},{"key":"6451_CR11","first-page":"1753","volume":"17","author":"X Guo","year":"2017","unstructured":"Guo, X., Gao, L., Liu, X., et al. (2017). Improved deep embedded clustering with local structure preservation. IJCAI, 17, 1753\u20131759.","journal-title":"IJCAI"},{"key":"6451_CR12","doi-asserted-by":"crossref","unstructured":"He K., Fan H., Wu Y., et\u00a0al. (2020) Momentum contrast for unsupervised visual representation learning. In: CVPR, pp 9729\u20139738","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"6451_CR13","unstructured":"Hendrycks, D., Dietterich, T. G. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In: ICLR"},{"key":"6451_CR14","doi-asserted-by":"crossref","unstructured":"Huang, J., Gong, S., Zhu, X. (2020). Deep semantic clustering by partition confidence maximisation. 
In: CVPR, pp 8846\u20138855","DOI":"10.1109\/CVPR42600.2020.00887"},{"issue":"5","key":"6451_CR15","doi-asserted-by":"publisher","first-page":"550","DOI":"10.1109\/34.291440","volume":"16","author":"JJ Hull","year":"1994","unstructured":"Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 550\u2013554.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"issue":"1","key":"6451_CR16","doi-asserted-by":"publisher","first-page":"16","DOI":"10.1093\/biostatistics\/kxv026","volume":"17","author":"L Jacob","year":"2016","unstructured":"Jacob, L., Gagnon-Bartsch, J. A., & Speed, T. P. (2016). Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics, 17(1), 16\u201328.","journal-title":"Biostatistics"},{"issue":"3","key":"6451_CR17","doi-asserted-by":"publisher","first-page":"264","DOI":"10.1145\/331499.331504","volume":"31","author":"AK Jain","year":"1999","unstructured":"Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys (CSUR), 31(3), 264\u2013323.","journal-title":"ACM Computing Surveys (CSUR)"},{"key":"6451_CR18","doi-asserted-by":"crossref","unstructured":"Jiang Z, Zheng Y, Tan H, et\u00a0al (2017) Variational deep embedding: an unsupervised and generative approach to clustering. In: IJCAI, pp 1965\u20131972","DOI":"10.24963\/ijcai.2017\/273"},{"issue":"1","key":"6451_CR19","doi-asserted-by":"publisher","first-page":"118","DOI":"10.1093\/biostatistics\/kxj037","volume":"8","author":"WE Johnson","year":"2007","unstructured":"Johnson, W. E., Li, C., & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical bayes methods. 
Biostatistics, 8(1), 118\u2013127.","journal-title":"Biostatistics"},{"key":"6451_CR20","unstructured":"Jung C, Kannan S, Lutz N (2019) A center in your neighborhood: Fairness in facility location. arXiv preprint arXiv:1908.09041"},{"key":"6451_CR21","unstructured":"Kingma, D. P., Welling M. (2014) Auto-encoding variational bayes. In: ICLR"},{"key":"6451_CR22","unstructured":"Kleindessner, M., Samadi, S., Awasthi, P., et\u00a0al. (2019). Guarantees for spectral clustering with fairness constraints. In: ICML, PMLR, pp 3458\u20133467"},{"key":"6451_CR23","unstructured":"Kulis, B., Jordan, M. I. (2012) Revisiting k-means: new algorithms via bayesian nonparametrics. In: ICML, pp 1131\u20131138"},{"issue":"11","key":"6451_CR24","doi-asserted-by":"publisher","first-page":"2278","DOI":"10.1109\/5.726791","volume":"86","author":"Y Lecun","year":"1998","unstructured":"Lecun, Y., Bottou, L., Bengio, Y., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278\u20132324.","journal-title":"Proceedings of the IEEE"},{"key":"6451_CR25","doi-asserted-by":"crossref","unstructured":"Li, P., Zhao, H., Liu, H. (2020). Deep fair clustering for visual learning. In: CVPR, pp 9070\u20139079","DOI":"10.1109\/CVPR42600.2020.00909"},{"issue":"38","key":"6451_CR26","doi-asserted-by":"publisher","first-page":"16465","DOI":"10.1073\/pnas.1002425107","volume":"107","author":"J Listgarten","year":"2010","unstructured":"Listgarten, J., Kadie, C., Schadt, E. E., et al. (2010). Correction for hidden confounders in the genetic analysis of gene expression. Proceedings of the National Academy of Sciences, 107(38), 16465\u201316470.","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"6451_CR27","unstructured":"Mahabadi, S., Vakilian, A. (2020). Individual fairness for k-clustering. 
In: ICML, PMLR, pp 6586\u20136596"},{"key":"6451_CR28","doi-asserted-by":"publisher","first-page":"217","DOI":"10.1023\/A:1024016609528","volume":"52","author":"DS Modha","year":"2003","unstructured":"Modha, D. S., & Spangler, W. S. (2003). Feature weighting in k-means clustering. Machine Learning, 52, 217\u2013237.","journal-title":"Machine Learning"},{"key":"6451_CR29","unstructured":"Moyer, D., Gao, S., Brekelmans, R., et\u00a0al. (2018). Invariant representations without adversarial training. In: NeurIPS, pp. 9102\u20139111"},{"issue":"7","key":"6451_CR30","doi-asserted-by":"publisher","first-page":"1340","DOI":"10.1109\/TPAMI.2013.180","volume":"36","author":"D Niu","year":"2013","unstructured":"Niu, D., Dy, J. G., & Jordan, M. I. (2013). Iterative discovery of multiple alternative clustering views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1340\u20131353.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"6451_CR31","doi-asserted-by":"publisher","first-page":"7264","DOI":"10.1109\/TIP.2022.3221290","volume":"31","author":"C Niu","year":"2022","unstructured":"Niu, C., Shan, H., & Wang, G. (2022). Spice: Semantic pseudo-labeling for image clustering. IEEE Transactions on Image Processing, 31, 7264\u20137278.","journal-title":"IEEE Transactions on Image Processing"},{"key":"6451_CR32","unstructured":"Pan, Y., Tsang, I. (2021). Streamlining em into auto-encoder networks. In: OpenReview"},{"key":"6451_CR33","doi-asserted-by":"crossref","unstructured":"Saenko, K., Kulis, B., Fritz, M., et\u00a0al. (2010). Adapting visual category models to new domains. In: ECCV, pp 213\u2013226","DOI":"10.1007\/978-3-642-15561-1_16"},{"key":"6451_CR34","doi-asserted-by":"crossref","unstructured":"Sharif, M., Bhagavatula, S., Bauer, L., et\u00a0al. (2016) Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In: ACM Conference on Computer and Communications Security, pp. 
1528\u20131540","DOI":"10.1145\/2976749.2978392"},{"key":"6451_CR35","unstructured":"Tsai, T. W., Li, C., Zhu, J. (2021). Mice: Mixture of contrastive experts for unsupervised image clustering. In: ICLR"},{"key":"6451_CR36","unstructured":"Vakilian, A., Yalciner, M. (2022) Improved approximation algorithms for individually fair clustering. In: AISTATS, PMLR, pp. 8758\u20138779"},{"key":"6451_CR37","unstructured":"Van Den\u00a0Oord, A., Vinyals, O., Kavukcuoglu, K. (2017). Neural discrete representation learning. NeurIPS pp. 6309\u20136318"},{"issue":"12","key":"6451_CR38","first-page":"201","volume":"11","author":"P Vincent","year":"2010","unstructured":"Vincent, P., Larochelle, H., Lajoie, I., et al. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 201.","journal-title":"Journal of Machine Learning Research"},{"key":"6451_CR39","unstructured":"Wu, C., Ioannidis, S., Sznaier, M., et\u00a0al. (2018) Iterative spectral method for alternative clustering. In: AISTATS, pp 115\u2013123"},{"key":"6451_CR40","unstructured":"Wu, C., Miller, J., Chang, Y., et\u00a0al. (2019). Solving interpretable kernel dimensionality reduction. NeurIPS pp 7915\u20137925"},{"key":"6451_CR41","unstructured":"Wu, S., Yuksekgonul, M., Zhang, L., et\u00a0al. (2023) Discover and cure: Concept-aware mitigation of spurious correlation. In: ICML"},{"key":"6451_CR42","unstructured":"Xiao, H., Rasul, K., Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747"},{"key":"6451_CR43","unstructured":"Xie, J., Girshick, R., Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. 
In: ICML, pp 478\u2013487"}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-023-06451-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-023-06451-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-023-06451-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,10]],"date-time":"2024-05-10T15:09:16Z","timestamp":1715353756000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-023-06451-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,27]]},"references-count":43,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,6]]}},"alternative-id":["6451"],"URL":"https:\/\/doi.org\/10.1007\/s10994-023-06451-5","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"type":"print","value":"0885-6125"},{"type":"electronic","value":"1573-0565"}],"subject":[],"published":{"date-parts":[[2023,12,27]]},"assertion":[{"value":"1 June 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 August 2023","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 October 2023","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 December 2023","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no financial or non-financial 
interests to disclose that are relevant to the content of this article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to participate"}},{"value":"Not applicable.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to publishing"}}]}}