{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,10]],"date-time":"2026-06-10T16:31:46Z","timestamp":1781109106240,"version":"3.54.1"},"reference-count":23,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2015,3,23]],"date-time":"2015-03-23T00:00:00Z","timestamp":1427068800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Traditional centroid-based clustering algorithms for heterogeneous data with numerical and non-numerical features result in different levels of inaccurate clustering. This is because the Hamming distance used for dissimilarity measurement of non-numerical values does not provide optimal distances between different values, and problems arise from attempts to combine the Euclidean distance and Hamming distance. In this study, the mutual information (MI)-based unsupervised feature transformation (UFT), which can transform non-numerical features into numerical features without information loss, was utilized with the conventional k-means algorithm for heterogeneous data clustering. For the original non-numerical features, UFT can provide numerical values which preserve the structure of the original non-numerical features and have the property of continuous values at the same time. 
Experiments and analysis of real-world datasets showed that the integrated UFT-k-means clustering algorithm outperformed others for heterogeneous data with both numerical and non-numerical features.<\/jats:p>","DOI":"10.3390\/e17031535","type":"journal-article","created":{"date-parts":[[2015,3,23]],"date-time":"2015-03-23T12:17:00Z","timestamp":1427113020000},"page":"1535-1548","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":27,"title":["Clustering Heterogeneous Data with k-Means by Mutual Information-Based Unsupervised Feature Transformation"],"prefix":"10.3390","volume":"17","author":[{"given":"Min","family":"Wei","sequence":"first","affiliation":[{"name":"Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Tommy","family":"Chow","sequence":"additional","affiliation":[{"name":"Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Rosa","family":"Chan","sequence":"additional","affiliation":[{"name":"Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2015,3,23]]},"reference":[{"key":"ref_1","unstructured":"MacQueen, J. (1967). Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1023\/A:1009769707641","article-title":"Extensions to the k-means algorithm for clustering large data sets with categorical values","volume":"2","author":"Huang","year":"1998","journal-title":"Data Min. Knowl. 
Discov."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"446","DOI":"10.1109\/91.784206","article-title":"A fuzzy k-modes algorithm for clustering categorical data","volume":"7","author":"Huang","year":"1999","journal-title":"IEEE Trans. Fuzzy Syst."},{"key":"ref_4","unstructured":"Arthur, D., and Vassilvitskii, S. (2007, January 7\u20139). k-means++: The advantages of careful seeding. New Orleans, LA, USA."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Zhang, T., Ramakrishnan, R., and Livny, M. (1996, January 4\u20136). BIRCH: An efficient data clustering method for very large databases. Montreal, PQ, Canada.","DOI":"10.1145\/233269.233324"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Guha, S., Rastogi, R., and Shim, K. (1998, January 1\u20134). CURE: An efficient clustering algorithm for large databases. Seattle, WA, USA.","DOI":"10.1145\/276304.276312"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Barbar\u00e1, D., Li, Y., and Couto, J. (2002, January 4\u20139). COOLCAT: an entropy-based algorithm for categorical clustering. 
McLean, VA, USA.","DOI":"10.1145\/584792.584888"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1016\/j.neucom.2011.11.001","article-title":"A two-stage genetic algorithm for automatic clustering","volume":"81","author":"He","year":"2012","journal-title":"Neurocomputing"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"3273","DOI":"10.3390\/e16063273","article-title":"On clustering histograms with k-means by using mixed \u03b1-divergences","volume":"16","author":"Nielsen","year":"2014","journal-title":"Entropy"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"865","DOI":"10.3390\/e14050865","article-title":"Entropic approach to multiscale clustering analysis","volume":"14","author":"Insolia","year":"2012","journal-title":"Entropy"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"673","DOI":"10.1109\/TKDE.2002.1019208","article-title":"Unsupervised learning with mixed numeric and nominal data","volume":"14","author":"Li","year":"2002","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"3832","DOI":"10.1016\/j.neucom.2011.07.014","article-title":"Apply extended self-organizing map to cluster and classify mixed-type data","volume":"74","author":"Hsu","year":"2011","journal-title":"Neurocomputing"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1016\/j.eswa.2005.11.017","article-title":"Mining of mixed data with application to catalog marketing","volume":"32","author":"Hsu","year":"2007","journal-title":"Expert Syst. Appl."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"882","DOI":"10.2307\/2528080","article-title":"A new similarity index based on probability","volume":"22","author":"Goodall","year":"1966","journal-title":"Biometrics"},{"key":"ref_15","unstructured":"Huang, Z. (1997, January 23\u201324). Clustering large data sets with mixed numeric and categorical values. 
Singapore, Singapore."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"8684","DOI":"10.1016\/j.eswa.2011.01.074","article-title":"A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional","volume":"38","author":"Chatzis","year":"2011","journal-title":"Expert Syst. Appl."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1016\/j.neucom.2013.04.011","article-title":"An improved k-prototypes clustering algorithm for mixed numeric and categorical data","volume":"120","author":"Ji","year":"2013","journal-title":"Neurocomputing"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1016\/j.knosys.2012.01.006","article-title":"A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data","volume":"30","author":"Ji","year":"2012","journal-title":"Knowl.-Based Syst"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"416","DOI":"10.1016\/j.patcog.2011.07.006","article-title":"SpectralCAT: Categorical spectral clustering of numerical and nominal data","volume":"45","author":"David","year":"2012","journal-title":"Pattern Recognit."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Flach, P. (2012). Machine Learning: The Art and Science of Algorithms that Make Sense of Data, Cambridge University Press.","DOI":"10.1017\/CBO9780511973000"},{"key":"ref_21","unstructured":"McLachlan, G.J., and Basford, K.E. (1988). Mixture Models. Inference and Applications to Clustering, CRC Press."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1016\/S0304-4076(98)00009-8","article-title":"Initial conditions and moment restrictions in dynamic panel data models","volume":"87","author":"Blundell","year":"1998","journal-title":"J. Econ."},{"key":"ref_23","unstructured":"Bache, K., and Lichman, M. 
Available online: http:\/\/archive.ics.uci.edu\/ml."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/17\/3\/1535\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T20:43:49Z","timestamp":1760215429000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/17\/3\/1535"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,3,23]]},"references-count":23,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2015,3]]}},"alternative-id":["e17031535"],"URL":"https:\/\/doi.org\/10.3390\/e17031535","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,3,23]]}}}