{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T12:07:47Z","timestamp":1774526867584,"version":"3.50.1"},"reference-count":47,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2021,7,7]],"date-time":"2021-07-07T00:00:00Z","timestamp":1625616000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Detecting outliers is a widely studied problem in many disciplines, including statistics, data mining, and machine learning. All anomaly detection activities are aimed at identifying cases of unusual behavior compared to most observations. There are many methods to deal with this issue, which are applicable depending on the size of the data set, the way it is stored, and the type of attributes and their values. Most of them focus on traditional datasets with a large number of quantitative attributes. While solutions for detecting outliers in quantitative datasets are numerous, detecting outliers in data containing only qualitative variables remains a problem with few research solutions. This article was designed to compare three different categorical data clustering algorithms: the K-modes algorithm, derived from MacQueen\u2019s K-means algorithm, and the STIRR and ROCK algorithms. The comparison concerned the method of dividing the set into clusters and, in particular, the outliers detected by the algorithms. During the research, the authors analyzed the clusters detected by the indicated algorithms, using several datasets that differ in terms of the number of objects and variables. They also conducted experiments on the parameters of the algorithms. 
The presented study made it possible to check whether the algorithms similarly detect outliers in the data and how much they depend on individual parameters and parameters of the set, such as the number of variables, tuples, and categories of a qualitative variable.<\/jats:p>","DOI":"10.3390\/e23070869","type":"journal-article","created":{"date-parts":[[2021,7,7]],"date-time":"2021-07-07T12:31:25Z","timestamp":1625661085000},"page":"869","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Qualitative Data Clustering to Detect Outliers"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7238-1170","authenticated-orcid":false,"given":"Agnieszka","family":"Nowak-Brzezi\u0144ska","sequence":"first","affiliation":[{"name":"Institute of Computer Science, Faculty of Science and Technology, University of Silesia, Bankowa 12, 40-007 Katowice, Poland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1509-5909","authenticated-orcid":false,"given":"Weronika","family":"\u0141azarz","sequence":"additional","affiliation":[{"name":"Institute of Computer Science, Faculty of Science and Technology, University of Silesia, Bankowa 12, 40-007 Katowice, Poland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,7,7]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Denzin, N.K. (2017). The Research Act: A Theoretical Introduction to Sociological Methods, Transaction Publishers.","DOI":"10.4324\/9781315134543"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"301","DOI":"10.1016\/j.ins.2019.02.019","article-title":"Exploration of rule-based knowledge bases: A knowledge engineers support","volume":"485","year":"2019","journal-title":"Inf. 
Sci."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"2065491","DOI":"10.1155\/2018\/2065491","article-title":"Enhancing the efficiency of a decision support system through the clustering of complex rule-based knowledge bases and modification of the inference algorithm","volume":"2018","year":"2018","journal-title":"Complex."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Nowak-Brzezi\u0144ska, A., and Hory\u0144, C. (2020). Exploration of Outliers in If-Then Rule-Based Knowledge Bases. Entropy, 22.","DOI":"10.3390\/e22101096"},{"key":"ref_5","first-page":"1","article-title":"Procedures for detecting outlying observations in samples","volume":"11","author":"Grubbs","year":"1974","journal-title":"Ballist. Res. Lab. Aberd. Proving Ground"},{"key":"ref_6","first-page":"919","article-title":"The Windows-Users and Intruder simulations Logs Dataset (WUIL): An Experimental Framework for Masquerade Detection Mechanisms","volume":"41","author":"Monroy","year":"2014","journal-title":"Expert Syst. Appl. Methods Appl. Artif. Comput. Intell."},{"key":"ref_7","unstructured":"Carletti, M., Terzi, M., and Susto, G.A. (2020). Interpretable Anomaly Detection with DIFFI: Depth-based Feature Importance for the Isolation Forest. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Liu, F.T., Ting, K.M., and Zhou, Z.-H. (2009, January 6\u20139). Isolation Forest. Proceedings of the IEEE International Conference on Data Mining, Miami, FL, USA.","DOI":"10.1109\/ICDM.2008.17"},{"key":"ref_9","first-page":"142","article-title":"A review of anomaly detection systems in cloud networks and survey of cloud security measures in cloud storage applications","volume":"6","author":"Sari","year":"2015","journal-title":"J. Inf. Secur."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Ma, M.X., Ngan, H.Y., and Liu, W. (2016, January 14\u201318). Density-based Outlier Detection by Local Outlier Factor on Largescale Traffic Data. 
Proceedings of the IS&T International Symposium on Electronic Imaging 2016, Image Processing: Machine Vision Applications IXAt, San Francisco, CA, USA.","DOI":"10.2352\/ISSN.2470-1173.2016.14.IPMVA-385"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Ieva, F., and Paganoni, A.M. (2015). Detecting and visualizing outliers in provider profiling via funnel plots and mixedeffect models. Health Care Manag. Sci., 166\u2013172.","DOI":"10.1007\/s10729-013-9264-9"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Ijaz, M.F., Alfian, G., Syafrudin, M., and Rhee, J. (2018). Hybrid Prediction Model for Type 2 Diabetes and Hypertension Using DBSCAN-Based Outlier Detection, Synthetic Minority Over Sampling Technique (SMOTE), and Random Forest. Appl. Sci., 8.","DOI":"10.3390\/app8081325"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"89","DOI":"10.1016\/j.chemolab.2013.06.004","article-title":"Dynamic process monitoring using adaptive local outlier factor","volume":"127","author":"Ma","year":"2013","journal-title":"Chemom. Intell. Lab. Syst."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1191\/1478088706qp063oa","article-title":"Using thematic analysis in psychology","volume":"3","author":"Braun","year":"2006","journal-title":"Qual. Res. Psychol."},{"key":"ref_15","first-page":"79","article-title":"General Foundations for Studying Masking and Swamping Robustness of Outlier Identifiers","volume":"20","author":"Wang","year":"2013","journal-title":"Stat. Methodol."},{"key":"ref_16","unstructured":"Loureiro, A., Torgo, L., and Soares, C. (2004, January 3\u20134). Outlier detection using clustering methods: A data cleaning application. 
Proceedings of the KDNet Symposium on Knowledge-Based Systems for the Public Sector, Bonn, Germany."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"691","DOI":"10.1016\/S0167-8655(00)00131-8","article-title":"Two-phase clustering process for outliers detection","volume":"22","author":"Jiang","year":"2001","journal-title":"Pattern Recognit. Lett."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Hawkins, S., He, H., Williams, G., and Baxter, R. (2002). Outlier Detection Using Replicator Neural Networks. Int. Conf. Data Warehous. Knowl. Discov., 170\u2013180.","DOI":"10.1007\/3-540-46145-0_17"},{"key":"ref_19","unstructured":"Jordaan, E., and Smits, G. (2004, January 25\u201329). Robust outlier detection using SVM regression. Proceedings of the IEEE International Joint Conference on Neural Networks, Budapest, Hungary."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Primartha, R., and Tama, B.A. (2017, January 1\u20132). Anomaly detection using random forest: A performance revisited. Proceedings of the 2017 International Conference on Data and Software Engineering (ICoDSE), Palembang, Indonesia.","DOI":"10.1109\/ICODSE.2017.8285847"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Viet, H.N., Van, Q.N., Trang, L.L.T., and Nathan, S. (2018, January 25\u201327). Using deep learning model for network scanning detection. Proceedings of the 4th International Conference on Frontiers of Educational Technologies, Moscow, Russia.","DOI":"10.1145\/3233347.3233379"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Breunig, M.M., Kriegel, H.P., Ng, R.T., and Sander, J. (2000, January 16\u201318). LOF: Identifying Density-Based Local Outliers. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA. 
Available online: https:\/\/www.dbs.ifi.lmu.de\/Publikationen\/Papers\/LOF.pdf.","DOI":"10.1145\/342009.335388"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"212","DOI":"10.1080\/00401706.1999.10485670","article-title":"A fast algorithm for the Minimum Covariance Determinant estimator","volume":"41","author":"Rousseeuw","year":"1999","journal-title":"Technometrics"},{"key":"ref_24","unstructured":"Ester, M., Kriegel, H.P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd, 226\u2013231."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Khan, M.M.R., Siddique, A.B., Arif, R.B., and Oishe, M.R. (2018, January 13\u201315). DBSCAN: Adaptive Density-Based Spatial Clustering of Applications with Noise for Identifying Clusters with Varying Densities. Proceedings of the 4th International Conference on Electrical Engineering and Information & Communication Technology (iCEEiCT), Dhaka, Bangladesh.","DOI":"10.1109\/CEEICT.2018.8628138"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Yang, Y., Guan, X., and You, J. (2002, January 23\u201326). CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada.","DOI":"10.1145\/775047.775149"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Ganti, V., Gehrke, J., and Ramakrishnan, R. (1999, January 15\u201318). CACTUS: Clustering Categorical Data Using Summaries. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA.","DOI":"10.1145\/312129.312201"},{"key":"ref_28","unstructured":"MacQueen, J.B. (1965, January 18-21). Some Methods for Classification and Analysis of Multivariate Observations. 
Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA."},{"key":"ref_29","unstructured":"Kaufman, L., and Rousseeuw, P.J. (2005). Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, Inc."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"345","DOI":"10.1016\/S0306-4379(00)00022-3","article-title":"ROCK: A Robust Clustering Algorithm for Categorical Attributes","volume":"25","author":"Guha","year":"2000","journal-title":"Inf. Syst."},{"key":"ref_31","unstructured":"Myeong-Hun, J., Cai, Y., Sullivan, C.J., and Wang, S. (November, January 31). Data depth based clustering analysis. Proceedings of the 24th SIGSPATIAL International Conference on Advances in Geographic Information Systems, New York, NY, USA."},{"key":"ref_32","first-page":"27","article-title":"Fuzzy based clustering algorithm for privacy preserving datamining","volume":"7","author":"Kumar","year":"2011","journal-title":"Int. J. Bus. Inf. Syst."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"126","DOI":"10.1080\/00207217.2012.687191","article-title":"Fuzzy logic based clustering in wireless sensor networks: A survey","volume":"100","author":"Ashutosh","year":"2013","journal-title":"Int. J. Electron."},{"key":"ref_34","first-page":"1001","article-title":"A Unified Framework for Model-based Clustering","volume":"4","author":"Zhong","year":"2003","journal-title":"J. Mach. Learn. Res. USA"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"977","DOI":"10.1093\/bioinformatics\/17.10.977","article-title":"Model-based clustering and data transformations for gene expression data","volume":"17","author":"Yeung","year":"2001","journal-title":"Bioinformatics"},{"key":"ref_36","unstructured":"Huang, Z. (1998, January 27\u201331). A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. 
Proceedings of the Data Mining and Knowledge Discovery, New York, NY, USA."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"222","DOI":"10.1007\/s007780050005","article-title":"Clustering Categorical Data: An Approach Based on Dynamical Systems","volume":"8","author":"Gibson","year":"2000","journal-title":"Vldb J. USA"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"544","DOI":"10.1080\/00029890.1991.11995755","article-title":"Gram-Schmidt Orthogonalization by Gauss Elimination","volume":"98","author":"Pursell","year":"1991","journal-title":"Am. Math. Mon. USA"},{"key":"ref_39","unstructured":"Zwittera, M., and Soklic, M. (2020, May 23). Primary Tumor Dataset, University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia, Published in UCI Machine Learning Repository. Available online: https:\/\/archive.ics.uci.edu\/ml\/datasets\/Primary+Tumor."},{"key":"ref_40","unstructured":"Zwittera, M., and Soklic, M. (2020, May 23). Lymphography Dataset, University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia, Published in UCI Machine Learning Repository. Available online: https:\/\/archive.ics.uci.edu\/ml\/datasets\/Lymphography."},{"key":"ref_41","unstructured":"Schlimmer, J. (2020, May 23). Congressional Voting Records Dataset, University Medical Centre, Congressional Quarterly Almanac, 98th Congress, Second Session 1984, Volume XL: Congressional Quarterly Inc. Washington, 1985, Published in UCI Machine Learning Repository. Available online: https:\/\/archive.ics.uci.edu\/ml\/datasets\/congressional+voting+records."},{"key":"ref_42","unstructured":"Bohanec, M., and Rajkovic, V. (2020, May 23). Car Evaluation Dataset, Expert System for Decision Making. 1990, Published in UCI Machine Learning Repository. Available online: https:\/\/archive.ics.uci.edu\/ml\/datasets\/Car+Evaluation."},{"key":"ref_43","unstructured":"Cios, K.J., and Kurgan, L.A. (2021, June 09). SPECT Heart Dataset in UCI Machine Learning Repository. 
Available online: https:\/\/archive.ics.uci.edu\/ml\/datasets\/spect+heart."},{"key":"ref_44","unstructured":"(2021, June 09). Effects on Personality Due to Covid-19, Published in Kaggle. Available online: https:\/\/www.kaggle.com\/anushiagrawal\/effects-on-personality-due-to-covid19."},{"key":"ref_45","unstructured":"(2021, June 09). Phishing Website Detector Dataset, Published in Kaggle. Available online: https:\/\/www.kaggle.com\/eswarchandt\/phishing-website-detector."},{"key":"ref_46","unstructured":"(2021, June 09). Japanese Credit Screening Dataset, Published in Data World. Available online: https:\/\/data.world\/uci\/japanese-credit-screening."},{"key":"ref_47","unstructured":"(2021, June 09). Bank Data for Cash Deposit, Published in Kaggle. Available online: https:\/\/www.kaggle.com\/raosuny\/success-of-bank-telemarketing-data."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/23\/7\/869\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:27:23Z","timestamp":1760164043000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/23\/7\/869"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,7]]},"references-count":47,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2021,7]]}},"alternative-id":["e23070869"],"URL":"https:\/\/doi.org\/10.3390\/e23070869","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,7]]}}}