{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,2]],"date-time":"2026-06-02T09:35:44Z","timestamp":1780392944502,"version":"3.54.1"},"reference-count":66,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,2,22]],"date-time":"2025-02-22T00:00:00Z","timestamp":1740182400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,2,22]],"date-time":"2025-02-22T00:00:00Z","timestamp":1740182400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100003032","name":"Association Nationale de la Recherche et de la Technologie","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100003032","id-type":"DOI","asserted-by":"publisher"}]},{"name":"SolutionData Group"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Clustering algorithms play a pivotal role in data mining, offering powerful tools for uncovering hidden patterns and structures within datasets. These algorithms aim to divide data points into coherent groups based on similarities or dissimilarities, making it easier to explore and understand complex data. Clustering algorithms typically rely on similarity measures to assess the likeness between data points. Consequently, selecting a suitable similarity measure is crucial for achieving satisfactory clustering outcomes. However, this decision can pose significant challenges, especially for non-experts, given the plethora of similarity measures available in the literature and their performance which is closely linked to the specific dataset, clustering algorithm, and cluster validity index employed. This difficulty is even more important when considering mixed data clustering. Mixed data refers to heterogeneous data characterized by both numerical and categorical attributes. In such a context, the same similarity measure cannot be used for both types of attributes due to their different nature. Commonly, two similarity measures are combined, one for numerical attributes and one for categorical attributes. This adds a layer of complexity to the problem since it requires the selection of two similarity measures instead of just one. This paper introduces SIMREC, a similarity measure recommendation system for mixed data clustering. The system uses meta-learning to mine the relationship between dataset characteristics and similarity measures performances for different mixed data clustering algorithms and cluster validity indices. Therefore, given a mixed dataset, a mixed data clustering algorithm, and a cluster validity index, the system can recommend suitable pairs of numerical and categorical similarity measures based on the characteristics of the dataset. We implemented the proposed system using 130 pairs of similarity measures (10 numerical and 13 categorical), 4 commonly used mixed data clustering algorithms (K-Prototypes, LSH-K-Prototypes, K-Medoids, and Hierarchical Clustering), and three cluster validity indices (Silhouette, Clustering Accuracy, and Adjusted Rand Index). Our experiments on 185 publicly available mixed datasets show that the pairs of similarity measures recommended by SIMREC outperform the baseline pairs, including classically used pairs of similarity measures in the literature.<\/jats:p>","DOI":"10.1186\/s40537-024-01052-y","type":"journal-article","created":{"date-parts":[[2025,2,22]],"date-time":"2025-02-22T16:12:51Z","timestamp":1740240771000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Simrec: a similarity measure recommendation system for mixed data clustering algorithms"],"prefix":"10.1186","volume":"12","author":[{"given":"Abdoulaye","family":"Diop","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Nabil","family":"El-Malki","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Max","family":"Chevalier","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Andr\u00e9","family":"P\u00e9ninou","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Geoffrey","family":"Roman-Jimenez","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Olivier","family":"Teste","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,2,22]]},"reference":[{"key":"1052_CR1","doi-asserted-by":"publisher","first-page":"203","DOI":"10.1016\/j.ins.2018.10.043","volume":"477","author":"B.A Pimentel","year":"2019","unstructured":"Pimentel B.A, Carvalho A.C.P.L.F. A new data characterization for selecting clustering algorithms using meta-learning. Inform Sci. 2019;477:203\u201319. .","journal-title":"Inform Sci"},{"key":"1052_CR2","doi-asserted-by":"publisher","first-page":"181","DOI":"10.1016\/j.ins.2014.12.044","volume":"301","author":"DG Ferrari","year":"2015","unstructured":"Ferrari DG, Castro LN. Clustering algorithm selection by meta-learning systems: A new distance-based problem characterization and ranking combination methods. Inform Sci. 2015;301:181\u201394. .","journal-title":"Inform Sci"},{"key":"1052_CR3","doi-asserted-by":"publisher","first-page":"473","DOI":"10.1016\/j.ins.2021.06.033","volume":"574","author":"I Gabbay","year":"2021","unstructured":"Gabbay I, Shapira B, Rokach L. Isolation forests and landmarking-based representations for clustering algorithm recommendation using meta-learning. Inform Sci. 2021;574:473\u201389. .","journal-title":"Inform Sci"},{"key":"1052_CR4","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2008.4634333","author":"M Souto","year":"2008","unstructured":"Souto M, Prud\u00eancio R, Soares R, Araujo D, Costa I, Ludermir T, Schliep A. Ranking and selecting clustering algorithms using a meta-learning approach. Int Joint Conf Neural Network. 2008. https:\/\/doi.org\/10.1109\/IJCNN.2008.4634333.","journal-title":"Int Joint Conf Neural Network"},{"key":"1052_CR5","doi-asserted-by":"publisher","first-page":"105682","DOI":"10.1016\/j.knosys.2020.105682","volume":"195","author":"B.A Pimentel","year":"2020","unstructured":"Pimentel B.A, Carvalho A.C.P.L.F. A Meta-learning approach for recommending the number of clusters for clustering algorithms. Knowledge-Based Syst. 2020;195:105682. .","journal-title":"Knowledge-Based Syst"},{"issue":"1","key":"1052_CR6","first-page":"7","volume":"15","author":"X Zhu","year":"2020","unstructured":"Zhu X, Li Y, Wang J, Zheng T, Fu J. Automatic Recommendation of a Distance Measure for Clustering Algorithms. ACM Trans Knowled Disc Data. 2020;15(1):7\u20131722. .","journal-title":"ACM Trans Knowled Disc Data"},{"key":"1052_CR7","unstructured":"Alves, G., Couceiro, M., Napoli, A.: Similarity Measure Selection for Categorical Data Clustering (2019). https:\/\/hal.archives-ouvertes.fr\/hal-02399640 Accessed 24 06 2021"},{"key":"1052_CR8","unstructured":"Halawani, S.M., Alhaddad, M., Ahmad, A.: A study of digital mammograms by using clustering algorithms. JSIR Vol.71(09) [September 2012] (2012). Accepted: 2012-08-31T09:32:15Z Publisher: NISCAIR-CSIR, India. Accessed 04 06 2022"},{"issue":"28","key":"1052_CR9","doi-asserted-by":"publisher","first-page":"4548","DOI":"10.1002\/sim.7371","volume":"36","author":"D McParland","year":"2017","unstructured":"McParland D, Phillips CM, Brennan L, Roche HM, Gormley IC. Clustering high-dimensional mixed data to uncover sub-phenotypes: joint analysis of phenotypic and genotypic data. Statistics Med. 2017;36(28):4548\u201369. https:\/\/doi.org\/10.1002\/sim.7371 (https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/sim.7371. Accessed 2022-04-06).","journal-title":"Statistics Med"},{"key":"1052_CR10","doi-asserted-by":"publisher","unstructured":"Cheng M, Xin Y, Tian Y, Wang C, Yang Y.: Customer behavior pattern discovering based on mixed data clustering. In: 2009 International Conference on Computational Intelligence and Software Engineering, 2009;1\u2013 4 . https:\/\/doi.org\/10.1109\/CISE.2009.5366556","DOI":"10.1109\/CISE.2009.5366556"},{"key":"1052_CR11","doi-asserted-by":"publisher","unstructured":"Kassi ML, Berrado A, Benabbou L, Benabdelkader K. Towards a new framework for clustering in a mixed data space: case of gasoline service stations segmentation in Morocco. In: 2015 IEEE\/ACS 12th International Conference of Computer Systems and Applications (AICCSA), 2015; 1\u2013 6 . https:\/\/doi.org\/10.1109\/AICCSA.2015.7507121.","DOI":"10.1109\/AICCSA.2015.7507121"},{"issue":"3","key":"1052_CR12","doi-asserted-by":"publisher","first-page":"1177","DOI":"10.1016\/j.eswa.2007.08.049","volume":"35","author":"C-C Hsu","year":"2008","unstructured":"Hsu C-C, Huang Y-P. Incremental clustering of mixed data based on distance hierarchy. Expert Syst with Appl: An Int J. 2008;35(3):1177\u201385. .","journal-title":"Expert Syst with Appl: An Int J"},{"key":"1052_CR13","doi-asserted-by":"publisher","first-page":"747628","DOI":"10.1155\/2015\/747628","volume":"2015","author":"K Niu","year":"2015","unstructured":"Niu K, Niu Z, Su Y, Wang C, Lu H, Guan J. A coupled user clustering algorithm based on mixed data for web-based learning systems. Mathematical Prob Eng. 2015;2015:747628. .","journal-title":"Mathematical Prob Eng"},{"key":"1052_CR14","doi-asserted-by":"publisher","first-page":"31883","DOI":"10.1109\/ACCESS.2019.2903568","volume":"7","author":"A Ahmad","year":"2019","unstructured":"Ahmad A, Khan SS. Survey of state-of-the-art mixed data clustering algorithms. IEEE Access. 2019;7:31883\u2013902.","journal-title":"IEEE Access"},{"issue":"1","key":"1052_CR15","doi-asserted-by":"publisher","first-page":"167","DOI":"10.1007\/s10844-011-0187-y","volume":"39","author":"F Barcelo-Rico","year":"2012","unstructured":"Barcelo-Rico F, Diez J-L. Geometrical codification for clustering mixed categorical and numerical databases. J Intel Inform Syst. 2012;39(1):167\u201385. .","journal-title":"J Intel Inform Syst"},{"key":"1052_CR16","doi-asserted-by":"crossref","unstructured":"Diop, A., El\u00a0Malki, N., Chevalier, M., Peninou, A., Teste, O.: Impact of similarity measures on clustering mixed data. In: Proceedings of the 34th International Conference on Scientific and Statistical Database Management. SSDBM \u201922, pp. 1\u2013 12. Association for Computing Machinery, New York, NY, USA ( 2022).","DOI":"10.1145\/3538712.3538742"},{"issue":"9","key":"1052_CR17","doi-asserted-by":"publisher","first-page":"177","DOI":"10.3390\/a12090177","volume":"12","author":"W Budiaji","year":"2019","unstructured":"Budiaji W, Leisch F. Simple K-medoids partitioning algorithm for mixed variable data. Algorithms. 2019;12(9):177. .","journal-title":"Algorithms"},{"issue":"3","key":"1052_CR18","doi-asserted-by":"publisher","first-page":"283","DOI":"10.1023\/A:1009769707641","volume":"2","author":"Z Huang","year":"1998","unstructured":"Huang Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining Knowled Disc. 1998;2(3):283\u2013304. .","journal-title":"Data Mining Knowled Disc"},{"issue":"4","key":"1052_CR19","doi-asserted-by":"publisher","first-page":"221","DOI":"10.1089\/big.2018.0175","volume":"7","author":"H.A Abu\u00a0Alfeilat","year":"2019","unstructured":"Abu\u00a0Alfeilat H.A, Hassanat A.B.A, Lasassmeh O, Tarawneh A.S, Alhasanat M.B, Eyal\u00a0Salman H.S, Prasath V.B.S. Effects of distance measure choice on K-nearest neighbor classifier performance: a review. Big Data. 2019;7(4):221\u201348. .","journal-title":"Big Data"},{"key":"1052_CR20","doi-asserted-by":"crossref","unstructured":"Boriah S, Chandola V, Kumar V. Similarity measures for categorical data: A comparative evaluation. Soc Indust Appl Mathe. 2008.https:\/\/epubs.siam.org\/doi\/abs\/10.1137\/1.9781611972788.22","DOI":"10.1137\/1.9781611972788.22"},{"issue":"1","key":"1052_CR21","first-page":"43","volume":"8","author":"S Choi","year":"2009","unstructured":"Choi S, Cha S-H, Tappert C. A survey of binary similarity and distance measures. J Syst Cybern Inf. 2009;8(1):43.","journal-title":"J Syst Cybern Inf"},{"key":"1052_CR22","unstructured":"Vanschoren, J.: Meta-Learning: A Survey. arXiv. arXiv:1810.03548 [cs, stat] (2018). Accessed 28 11 2022"},{"key":"1052_CR23","unstructured":"Diop, A.: Similarity measure recommendation system for mixed data clustering (SIMREC). Accessed: 07 06 2024"},{"issue":"3","key":"1052_CR24","doi-asserted-by":"publisher","first-page":"233","DOI":"10.1007\/s41060-020-00216-2","volume":"10","author":"S Behzadi","year":"2020","unstructured":"Behzadi S, M\u00fcller NS, Plant C, B\u00f6hm C. Clustering of mixed-type data considering concept hierarchies: problem specification and algorithm. Int J Data Sci Anal. 2020;10(3):233\u201348.","journal-title":"Int J Data Sci Anal"},{"issue":"3","key":"1052_CR25","doi-asserted-by":"publisher","first-page":"1456","DOI":"10.1002\/wics.1456","volume":"11","author":"M Velden","year":"2019","unstructured":"Velden M, Iodice D\u2019Enza A, Markos A. Distance-based clustering of mixed data. WIREs Comput Statis. 2019;11(3):1456.","journal-title":"WIREs Comput Statis"},{"issue":"3","key":"1052_CR26","doi-asserted-by":"publisher","first-page":"1535","DOI":"10.3390\/e17031535","volume":"17","author":"M Wei","year":"2015","unstructured":"Wei M, Chow TWS, Chan RHM. Clustering heterogeneous data with k-Means by mutual information-based unsupervised feature transformation. Entropy. 2015;17(3):1535\u201348.","journal-title":"Entropy"},{"key":"1052_CR27","doi-asserted-by":"crossref","unstructured":"Zhu, C., Zhang, Q., Cao, L., Abrahamyan, A.: Mix2Vec: Unsupervised Mixed Data Representation. In: 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), 2020;  https:\/\/ieeexplore.ieee.org\/document\/9260035 Accessed 26 06 2024","DOI":"10.1109\/DSAA49011.2020.00024"},{"key":"1052_CR28","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1109\/ACCESS.2022.3232372","volume":"11","author":"Y Lee","year":"2023","unstructured":"Lee Y, Park C, Kang S. Deep embedded clustering framework for mixed data. IEEE Access. 2023;11:33\u201340.","journal-title":"IEEE Access"},{"key":"1052_CR29","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11597","author":"S Jian","year":"2018","unstructured":"Jian S, Hu L, Cao L, Lu K. Metric-based auto-instructor for learning mixed data representation. Proc AAAI Conf Art Intel. 2018. https:\/\/doi.org\/10.1609\/aaai.v32i1.11597.","journal-title":"Proc AAAI Conf Art Intel"},{"key":"1052_CR30","doi-asserted-by":"publisher","first-page":"104123","DOI":"10.1016\/j.chemolab.2020.104123","volume":"204","author":"K Balaji","year":"2020","unstructured":"Balaji K, Lavanya K, Mary AG. Clustering of mixed datasets using deep learning algorithm. Chem Intel Lab Syst. 2020;204:104123.","journal-title":"Chem Intel Lab Syst"},{"key":"1052_CR31","unstructured":"He, Z., Xu, X., Deng, S.: Clustering Mixed Numeric and Categorical Data: A Cluster Ensemble Approach.  (2005). arXiv: cs\/0509011. Accessed 26 01 2002"},{"key":"1052_CR32","first-page":"19","volume":"42","author":"J Suguna","year":"2012","unstructured":"Suguna J, Selvi M. Ensemble fuzzy clustering for mixed numeric and categorical data. Int J Comput Appl. 2012;42:19\u201323. .","journal-title":"Int J Comput Appl"},{"key":"1052_CR33","unstructured":"Cheeseman, P., Stutz, J.: Bayesian classification(AutoClass):theory and results. Adva Knowled Disc Data Min 1997"},{"issue":"3","key":"1052_CR34","doi-asserted-by":"publisher","first-page":"659","DOI":"10.1016\/j.csda.2004.03.001","volume":"48","author":"I Moustaki","year":"2005","unstructured":"Moustaki I, Papageorgiou I. Latent class models for mixed variables with applications in Archaeometry. Comput Statis Data Anal. 2005;48(3):659\u201375 .","journal-title":"Comput Statis Data Anal"},{"issue":"2","key":"1052_CR35","doi-asserted-by":"publisher","first-page":"155","DOI":"10.1007\/s11634-016-0238-x","volume":"10","author":"D McParland","year":"2016","unstructured":"McParland D, Gormley IC. Model based clustering for mixed data: clustMD. Adv Data Anal Class. 2016;10(2):155\u201369.","journal-title":"Adv Data Anal Class"},{"issue":"2","key":"1052_CR36","doi-asserted-by":"publisher","first-page":"119","DOI":"10.1111\/j.1475-4754.1983.tb00671.x","volume":"25","author":"G Philip","year":"1983","unstructured":"Philip G, Ottaway BS. Mixed data cluster analysis: an illustration using cypriot hooked-tang weapons. Archaeometry. 1983;25(2):119\u201333.","journal-title":"Archaeometry"},{"key":"1052_CR37","doi-asserted-by":"publisher","first-page":"338","DOI":"10.22271\/chemi.2020.v8.i4f.10087","volume":"8","author":"S Bishnoi","year":"2020","unstructured":"Bishnoi S, Hooda BK. A survey of distance measures for mixed variables. Int J Chem Stud. 2020;8:338\u201343.","journal-title":"Int J Chem Stud"},{"issue":"5","key":"1052_CR38","doi-asserted-by":"publisher","first-page":"657","DOI":"10.1109\/TPAMI.2005.95","volume":"27","author":"JZ Huang","year":"2005","unstructured":"Huang JZ, Ng MK, Rong H, Li Z. Automated variable weighting in k-means type clustering. IEEE Trans Pattern Anal Machine Intel. 2005;27(5):657\u201368. .","journal-title":"IEEE Trans Pattern Anal Machine Intel"},{"issue":"2","key":"1052_CR39","doi-asserted-by":"publisher","first-page":"503","DOI":"10.1016\/j.datak.2007.03.016","volume":"63","author":"A Ahmad","year":"2007","unstructured":"Ahmad A, Dey L. A k-mean clustering algorithm for mixed numeric and categorical data. Data Knowled Eng. 2007;63(2):503\u201327.  .","journal-title":"Data Knowled Eng"},{"key":"1052_CR40","doi-asserted-by":"publisher","first-page":"590","DOI":"10.1016\/j.neucom.2013.04.011","volume":"120","author":"J Ji","year":"2013","unstructured":"Ji J, Bai T, Zhou C, Ma C, Wang Z. An improved k-prototypes clustering algorithm for mixed numeric and categorical data. Neurocomputing. 2013;120:590\u20136.","journal-title":"Neurocomputing"},{"key":"1052_CR41","doi-asserted-by":"publisher","first-page":"226","DOI":"10.1016\/j.procs.2015.10.077","volume":"70","author":"S Harikumar","year":"2015","unstructured":"Harikumar S, Pv S. K-Medoid clustering for heterogeneous datasets. Procedia Comput Sci. 2015;70:226\u201337.","journal-title":"Procedia Comput Sci"},{"key":"1052_CR42","doi-asserted-by":"publisher","first-page":"513","DOI":"10.1016\/j.ins.2019.07.100","volume":"505","author":"P D\u2019Urso","year":"2019","unstructured":"D\u2019Urso P, Massari R. Fuzzy clustering of mixed data. Inform Sci. 2019;505:513\u201334.","journal-title":"Inform Sci"},{"issue":"1","key":"1052_CR43","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3390\/stats5010001","volume":"5","author":"F Mbuga","year":"2022","unstructured":"Mbuga F, Tortora C. Spectral clustering of mixed-type data. Stats. 2022;5(1):1\u201311.","journal-title":"Stats"},{"key":"1052_CR44","doi-asserted-by":"publisher","first-page":"46","DOI":"10.1016\/j.patrec.2017.07.001","volume":"97","author":"M Du","year":"2017","unstructured":"Du M, Ding S, Xue Y. A novel density peaks clustering algorithm for mixed data. Pattern Recogn Lett. 2017;97:46\u201353.","journal-title":"Pattern Recogn Lett"},{"key":"1052_CR45","doi-asserted-by":"publisher","first-page":"294","DOI":"10.1016\/j.knosys.2017.07.027","volume":"133","author":"S Ding","year":"2017","unstructured":"Ding S, Du M, Sun T, Xu X, Xue Y. An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood. Knowledge-Based Syst. 2017;133:294\u2013313.","journal-title":"Knowledge-Based Syst"},{"issue":"20","key":"1052_CR46","doi-asserted-by":"publisher","first-page":"4474","DOI":"10.1016\/j.ins.2007.05.003","volume":"177","author":"C-C Hsu","year":"2007","unstructured":"Hsu C-C, Chen C-L, Su Y-W. Hierarchical clustering of mixed data based on distance hierarchy. Inform Sci. 2007;177(20):4474\u201392.","journal-title":"Inform Sci"},{"issue":"9","key":"1052_CR47","doi-asserted-by":"publisher","first-page":"6530","DOI":"10.1109\/TNNLS.2022.3202700","volume":"34","author":"Y Zhang","year":"2023","unstructured":"Zhang Y, Cheung Y-M. Graph-based dissimilarity measurement for cluster analysis of any-type-attributed data. IEEE Trans Neural Network Learn Syst. 2023;34(9):6530\u201344.","journal-title":"IEEE Trans Neural Network Learn Syst"},{"issue":"2","key":"1052_CR48","doi-asserted-by":"publisher","first-page":"101","DOI":"10.1504\/IJDMB.2016.074682","volume":"14","author":"M Vukicevic","year":"2016","unstructured":"Vukicevic M, Radovanovic S, Delibasic B, Suknovic M. Extending meta-learning framework for clustering gene expression data with component-based algorithm design and internal evaluation measures. Int J Data Min Bioinform. 2016;14(2):101\u201319.","journal-title":"Int J Data Min Bioinform"},{"key":"#cr-split#-1052_CR49.1","doi-asserted-by":"crossref","unstructured":"Poulakis, Y., Doulkeridis, C., Kyriazis, D.: AutoClust: A Framework for Automated Clustering Based on Cluster Validity Indices. In: 2020 IEEE International Conference on Data Mining (ICDM), 2020","DOI":"10.1109\/ICDM50108.2020.00153"},{"key":"#cr-split#-1052_CR49.2","unstructured":"1ISSN: 2374-8486. https:\/\/ieeexplore.ieee.org\/document\/9338346 Accessed 03 06 2024"},{"key":"1052_CR50","doi-asserted-by":"crossref","unstructured":"ElShawi, R., Sakr, S.: TPE-AutoClust: A Tree-based Pipline Ensemble Framework for Automated Clustering. In: 2022 IEEE International Conference on Data Mining Workshops (ICDMW), 2022. ISSN: 2375-9259. https:\/\/ieeexplore.ieee.org\/document\/10031132 Accessed 03 06 2024","DOI":"10.1109\/ICDMW58026.2022.00149"},{"key":"1052_CR51","doi-asserted-by":"publisher","first-page":"824","DOI":"10.1016\/j.ins.2021.08.028","volume":"577","author":"N Cohen-Shapira","year":"2021","unstructured":"Cohen-Shapira N, Rokach L. Automatic selection of clustering algorithms using supervised graph embedding. Inform Sci. 2021;577:824\u201351. arXiv: 2011.08225 [cs, stat]. Accessed 15 05 2024.","journal-title":"Inform Sci"},{"key":"1052_CR52","volume-title":"Elements of Information Theory","author":"TM Cover","year":"1999","unstructured":"Cover TM. Elements of Information Theory. New Jersey: John Wiley & Sons; 1999."},{"key":"1052_CR53","doi-asserted-by":"publisher","DOI":"10.4304\/jcp.9.3.557-565","author":"Y Zhou","year":"2014","unstructured":"Zhou Y, Liu Y, Yang J, He X, Liu L. A taxonomy of label ranking algorithms. J Comput. 2014. https:\/\/doi.org\/10.4304\/jcp.9.3.557-565.","journal-title":"J Comput"},{"issue":"1","key":"1052_CR54","doi-asserted-by":"publisher","first-page":"12166","DOI":"10.1111\/exsy.12166","volume":"34","author":"CR S\u00e1","year":"2017","unstructured":"S\u00e1 CR, Soares C, Knobbe A, Cortez P. Label Ranking Forests. Expert Syst. 2017;34(1):12166.","journal-title":"Expert Syst"},{"issue":"17","key":"1052_CR55","doi-asserted-by":"publisher","first-page":"2914","DOI":"10.1016\/j.neucom.2011.03.034","volume":"74","author":"MM Kabir","year":"2011","unstructured":"Kabir MM, Shahjahan M, Murase K. A new local search based hybrid genetic algorithm for feature selection. Neurocomputing. 2011;74(17):2914\u201328.","journal-title":"Neurocomputing"},{"issue":"4","key":"1052_CR56","doi-asserted-by":"publisher","first-page":"491","DOI":"10.1109\/TKDE.2005.66","volume":"17","author":"H Liu","year":"2005","unstructured":"Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowled Data Eng. 2005;17(4):491\u2013502.","journal-title":"IEEE Trans Knowled Data Eng"},{"key":"1052_CR57","doi-asserted-by":"publisher","first-page":"101804","DOI":"10.1016\/j.is.2021.101804","volume":"101","author":"E Schubert","year":"2021","unstructured":"Schubert E, Rousseeuw PJ. Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms. Inform Syst. 2021;101:101804.","journal-title":"Inform Syst"},{"key":"1052_CR58","unstructured":"Toan, N.M.: Clustering algorithm for Mixed data of categorial and numerical (ordinal and nonordinal) data using LSH. https:\/\/pypi.org\/project\/lshkrepresentatives\/. Accessed: 05 11 2024"},{"key":"1052_CR59","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1016\/j.neucom.2021.08.050","volume":"463","author":"TN Mau","year":"2021","unstructured":"Mau TN, Huynh V-N. An LSH-based k-representatives clustering method for large categorical data. Neurocomputing. 2021;463:29\u201344.","journal-title":"Neurocomputing"},{"key":"1052_CR60","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1016\/0377-0427(87)90125-7","volume":"20","author":"PJ Rousseeuw","year":"1987","unstructured":"Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Mathe. 1987;20:53\u201365.","journal-title":"J Comput Appl Mathe"},{"issue":"1","key":"1052_CR61","doi-asserted-by":"publisher","first-page":"193","DOI":"10.1007\/BF01908075","volume":"2","author":"L Hubert","year":"1985","unstructured":"Hubert L, Arabie P. Comparing partitions. J Class. 1985;2(1):193\u2013218.","journal-title":"J Class"},{"issue":"1","key":"1052_CR62","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2133360.2133361","volume":"6","author":"D Ienco","year":"2012","unstructured":"Ienco D, Pensa RG, Meo R. From context to distance: learning dissimilarity for categorical data clustering. ACM Trans Knowl Discov Data. 2012;6(1):1\u20131125.","journal-title":"ACM Trans Knowl Discov Data"},{"issue":"2","key":"1052_CR63","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1145\/2641190.2641198","volume":"15","author":"J Vanschoren","year":"2013","unstructured":"Vanschoren J, Rijn JN, Bischl B, Torgo L. Openml: networked science in machine learning. SIGKDD Explorations. 2013;15(2):49\u201360.","journal-title":"SIGKDD Explorations"},{"issue":"6","key":"1052_CR64","doi-asserted-by":"publisher","first-page":"80","DOI":"10.2307\/3001968","volume":"1","author":"F Wilcoxon","year":"1945","unstructured":"Wilcoxon F. Individual comparisons by ranking methods. Bio Bull. 1945;1(6):80\u20133.","journal-title":"Bio Bull"},{"key":"1052_CR65","doi-asserted-by":"crossref","unstructured":"Jomaa, H.S., Schmidt-Thieme, L., Grabocka, J.: Dataset2Vec: learning dataset meta-features. arXiv. arXiv:1905.11063 [cs, stat] (2021).","DOI":"10.1007\/s10618-021-00737-9"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-024-01052-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-024-01052-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-024-01052-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,22]],"date-time":"2025-02-22T16:14:58Z","timestamp":1740240898000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-024-01052-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,22]]},"references-count":66,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1052"],"URL":"https:\/\/doi.org\/10.1186\/s40537-024-01052-y","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,22]]},"assertion":[{"value":"26 June 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 December 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 February 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"43"}}