{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T10:09:35Z","timestamp":1760609375143,"version":"build-2065373602"},"reference-count":28,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2021,6,15]],"date-time":"2021-06-15T00:00:00Z","timestamp":1623715200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["No. 61502135"],"award-info":[{"award-number":["No. 61502135"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine, biology, etc. Designing appropriate similarity measurements plays an important role in clustering these datasets. Many traditional measurements treat various attributes equally when measuring the similarity. However, different attributes may contribute differently as the amount of information they contained could vary a lot. In this paper, we propose a similarity measurement with entropy-based weighting for clustering mixed datasets. The numerical data are first transformed into categorical data by an automatic categorization technique. Then, an entropy-based weighting strategy is applied to denote the different importances of various attributes. We incorporate the proposed measurement into an iterative clustering algorithm, and extensive experiments show that this algorithm outperforms OCIL and K-Prototype methods with 2.13% and 4.28% improvements, respectively, in terms of accuracy on six mixed datasets from UCI.<\/jats:p>","DOI":"10.3390\/a14060184","type":"journal-article","created":{"date-parts":[[2021,6,15]],"date-time":"2021-06-15T11:00:33Z","timestamp":1623754833000},"page":"184","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["A Similarity Measurement with Entropy-Based Weighting for Clustering Mixed Numerical and Categorical Datasets"],"prefix":"10.3390","volume":"14","author":[{"given":"Xia","family":"Que","sequence":"first","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China"}]},{"given":"Siyuan","family":"Jiang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China"}]},{"given":"Jiaoyun","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3317-5299","authenticated-orcid":false,"given":"Ning","family":"An","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China"}]}],"member":"1968","published-online":{"date-parts":[[2021,6,15]]},"reference":[{"key":"ref_1","first-page":"1","article-title":"Data Mining: Concepts and Techniques","volume":"5","author":"Jiawei","year":"2006","journal-title":"Data Min. Concepts Model. Methods Algorithms Second Ed."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Rodoshi, R.T., Kim, T., and Choi, W. (2020). Resource Management in Cloud Radio Access Network: Conventional and New Approaches. Sensors, 20.","DOI":"10.3390\/s20092708"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Khorraminezhad, L., Leclercq, M., Droit, A., Bilodeau, J.F., and Rudkowska, I. (2020). Statistical and Machine-Learning Analyses in Nutritional Genomics Studies. Nutrients, 12.","DOI":"10.3390\/nu12103140"},{"key":"ref_4","first-page":"281","article-title":"Some Methods for Classification and Analysis of Multivariate Observations","volume":"1","author":"Macqueen","year":"1967","journal-title":"Berkeley Symp. Math. Stat. Probab."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1016\/j.asoc.2016.06.019","article-title":"K-Harmonic means type clustering algorithm for mixed datasets","volume":"48","author":"Ahmad","year":"2016","journal-title":"Appl. Soft Comput."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1111\/j.2517-6161.1977.tb01600.x","article-title":"Maximum Likelihood from Incomplete Data via the EM Algorithm","volume":"39","author":"Dempster","year":"1977","journal-title":"J. R. Stat. Soc."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"120","DOI":"10.1016\/j.knosys.2011.07.011","article-title":"A dissimilarity measure for the k-Modes clustering algorithm","volume":"26","author":"Cao","year":"2012","journal-title":"Knowl. Based Syst."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"345","DOI":"10.1016\/S0306-4379(00)00022-3","article-title":"ROCK: A robust clustering algorithm for categorical attributes","volume":"25","author":"Guha","year":"1999","journal-title":"Inf. Syst."},{"key":"ref_9","unstructured":"Huang, Z. (1997, January 23\u201324). Clustering large data sets with mixed numeric and categorical values. Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"31883","DOI":"10.1109\/ACCESS.2019.2903568","article-title":"Survey of State-of-the-Art Mixed Data Clustering Algorithms","volume":"7","author":"Ahmad","year":"2019","journal-title":"IEEE Access"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1023\/A:1009769707641","article-title":"Extensions to the k-means Algorithm for Clustering Large Data Sets with Categorical Values","volume":"2","author":"Huang","year":"1998","journal-title":"Data Min. Knowl. Discov."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"2228","DOI":"10.1016\/j.patcog.2013.01.027","article-title":"Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number","volume":"45","author":"Cheung","year":"2013","journal-title":"Pattern Recognit."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"416","DOI":"10.1016\/j.patcog.2011.07.006","article-title":"SpectralCAT: Categorical spectral clustering of numerical and nominal data","volume":"45","author":"David","year":"2012","journal-title":"Pattern Recognit."},{"key":"ref_14","first-page":"849","article-title":"On spectral clustering: Analysis and an algorithm","volume":"14","author":"Ng","year":"2001","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"294","DOI":"10.1109\/TNN.2005.863415","article-title":"Generalizing self-organizing map for categorical data","volume":"17","author":"Hsu","year":"2006","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"331","DOI":"10.1080\/0308107021000013635","article-title":"A new method for measuring uncertainty and fuzziness in rough set theory","volume":"31","author":"Liang","year":"2002","journal-title":"Int. J. Gen. Syst."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"503","DOI":"10.1109\/TPAMI.2007.53","article-title":"On the impact of dissimilarity measure in k-modes clustering algorithm","volume":"29","author":"Ng","year":"2007","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_18","first-page":"2628","article-title":"Non-mode clustering of categorical data with attributes weighting","volume":"14","author":"Chen","year":"2013","journal-title":"J. Softw."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"2843","DOI":"10.1016\/j.patcog.2011.04.024","article-title":"A novel attribute weighting algorithm for clustering high-dimensional categorical data","volume":"44","author":"Bai","year":"2011","journal-title":"Pattern Recognit."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"503","DOI":"10.1016\/j.datak.2007.03.016","article-title":"A k-mean clustering algorithm for mixed numeric and categorical data","volume":"63","author":"Ahmad","year":"2007","journal-title":"Data Knowl. Eng."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1109\/TKDE.2005.11","article-title":"Interpretable Hierarchical Clustering by Constructing an Unsupervised Decision Tree","volume":"17","author":"Basak","year":"2005","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_22","first-page":"194","article-title":"Supervised and Unsupervised Discretization of Continuous Features","volume":"2","author":"Dougherty","year":"1995","journal-title":"Mach. Learn. Proc."},{"key":"ref_23","unstructured":"Grzymala-Busse, J.W. (2002). Data reduction: Discretization of numerical attributes. Handbook of Data Mining and Knowledge Discovery, Oxford University Press, Inc."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1023\/A:1021394316112","article-title":"A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering","volume":"25","author":"Jung","year":"2003","journal-title":"J. Glob. Optim."},{"key":"ref_25","first-page":"4684","article-title":"A heuristic method for finding the optimal number of clusters with application in medical data","volume":"2008","author":"Bayati","year":"2008","journal-title":"Conf. Proc. IEEE Eng. Med. Biol. Soc."},{"key":"ref_26","unstructured":"(2021, June 15). UCI Machine Learning Repository. Available online: http:\/\/archive.ics.uci.edu\/ml."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Zhu, L., Miao, L., and Zhang, D. (2012). Iterative Laplacian Score for Feature Selection. Chinese Conference on Pattern Recognition, Springer.","DOI":"10.1007\/978-3-642-33506-8_11"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. European Conference on Machine Learning, Springer.","DOI":"10.1007\/3-540-57868-4_57"}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/14\/6\/184\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:14:14Z","timestamp":1760163254000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/14\/6\/184"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,6,15]]},"references-count":28,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2021,6]]}},"alternative-id":["a14060184"],"URL":"https:\/\/doi.org\/10.3390\/a14060184","relation":{},"ISSN":["1999-4893"],"issn-type":[{"type":"electronic","value":"1999-4893"}],"subject":[],"published":{"date-parts":[[2021,6,15]]}}}