{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,5]],"date-time":"2026-05-05T01:37:38Z","timestamp":1777945058617,"version":"3.51.4"},"reference-count":28,"publisher":"SAGE Publications","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["SCS"],"published-print":{"date-parts":[[2023,10,9]]},"abstract":"<jats:p>Data cleaning, also referred to as data cleansing, constitutes a pivotal phase in data processing subsequent to data collection. Its primary objective is to identify and eliminate incomplete data, duplicates, outdated information, anomalies, missing values, and errors. The influence of data quality on the effectiveness of machine learning (ML) models is widely acknowledged, prompting data scientists to dedicate substantial effort to data cleaning prior to model training. This study accentuates critical facets of data cleaning and the utilization of outlier detection algorithms. Additionally, our investigation encompasses the evaluation of prominent outlier detection algorithms through benchmarking, seeking to identify an efficient algorithm boasting consistent performance. As the culmination of our research, we introduce an innovative algorithm centered on the fusion of Isolation Forest and clustering techniques. By leveraging the strengths of both methods, this proposed algorithm aims to enhance outlier detection outcomes. This work endeavors to elucidate the multifaceted importance of data cleaning, underscored by its symbiotic relationship with ML models. Furthermore, our exploration of outlier detection methodologies aligns with the broader objective of refining data processing and analysis paradigms. Through the convergence of theoretical insights, algorithmic exploration, and innovative proposals, this study contributes to the advancement of data cleaning and outlier detection techniques in the realm of contemporary data-driven environments.<\/jats:p>","DOI":"10.3233\/scs-230008","type":"journal-article","created":{"date-parts":[[2023,10,10]],"date-time":"2023-10-10T12:16:53Z","timestamp":1696940213000},"page":"125-140","source":"Crossref","is-referenced-by-count":42,"title":["Data cleaning survey and challenges \u2013 improving outlier detection algorithm in machine learning"],"prefix":"10.1177","volume":"2","author":[{"given":"Sanae","family":"Borrohou","sequence":"first","affiliation":[{"name":"IDS Team, Abdelmalek Essaadi University, Tangier, Morocco"}]},{"given":"Rachida","family":"Fissoune","sequence":"additional","affiliation":[{"name":"IDS Team, Abdelmalek Essaadi University, Tangier, Morocco"}]},{"given":"Hassan","family":"Badir","sequence":"additional","affiliation":[{"name":"IDS Team, Abdelmalek Essaadi University, Tangier, Morocco"}]}],"member":"179","reference":[{"key":"10.3233\/SCS-230008_ref1","doi-asserted-by":"crossref","unstructured":"J.M.\u00a0Wing, The data life cycle, Harvard Data Science Review 1(1) (2019), 6.","DOI":"10.1162\/99608f92.e26845b4"},{"issue":"21","key":"10.3233\/SCS-230008_ref2","first-page":"15","article-title":"Qualitative data analysis: An overview of data reduction, data display, and interpretation","volume":"10","author":"Mezmir","year":"2020","journal-title":"Research on humanities and social sciences"},{"issue":"4","key":"10.3233\/SCS-230008_ref6","first-page":"3","article-title":"Data cleaning: Problems and current approaches","volume":"23","author":"Rahm","year":"2000","journal-title":"IEEE Data Eng. Bull."},{"issue":"5","key":"10.3233\/SCS-230008_ref8","doi-asserted-by":"publisher","first-page":"453","DOI":"10.1515\/revce-2015-0022","article-title":"Data cleaning in the process industries","volume":"31","author":"Xu","year":"2015","journal-title":"Reviews in Chemical Engineering"},{"key":"10.3233\/SCS-230008_ref9","unstructured":"C.\u00a0Xu et al., Data cleaning: Overview and emerging challenges, in: Proceedings of the 2016 International Conference on Management of Data, 2016."},{"key":"10.3233\/SCS-230008_ref10","doi-asserted-by":"publisher","first-page":"731","DOI":"10.1016\/j.procs.2019.11.177","article-title":"A review on data cleansing methods for big data","volume":"161","author":"Ridzuan","year":"2019","journal-title":"Procedia Computer Science"},{"issue":"1","key":"10.3233\/SCS-230008_ref12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.4108\/trans.sis.2013.01-03.e2","article-title":"Advancements of outlier detection: A survey","volume":"13","author":"Zhang","year":"2013","journal-title":"ICST Transactions on Scalable Information Systems"},{"key":"10.3233\/SCS-230008_ref13","doi-asserted-by":"crossref","unstructured":"D.\u00a0Divya and S.S.\u00a0Babu, Methods to Detect Different Types of Outliers, 2016 International Conference on Data Mining and Advanced Computing (SAPIENCE), IEEE, 2016.","DOI":"10.1109\/SAPIENCE.2016.7684114"},{"key":"10.3233\/SCS-230008_ref14","doi-asserted-by":"crossref","unstructured":"H.C.\u00a0Mandhare and S.R.\u00a0Idate, A Comparative Study of Cluster Based Outlier Detection, Distance Based Outlier Detection and Density Based Outlier Detection Techniques, 2017 International Conference on Intelligent Computing and Control Systems (ICICCS), IEEE, 2017.","DOI":"10.1109\/ICCONS.2017.8250601"},{"key":"10.3233\/SCS-230008_ref15","doi-asserted-by":"publisher","first-page":"107964","DOI":"10.1109\/ACCESS.2019.2932769","article-title":"Progress in outlier detection techniques: A survey","volume":"7","author":"Wang","year":"2019","journal-title":"Ieee Access"},{"key":"10.3233\/SCS-230008_ref16","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2021.107354"},{"key":"10.3233\/SCS-230008_ref17","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2021.108115"},{"key":"10.3233\/SCS-230008_ref18","doi-asserted-by":"crossref","unstructured":"C.C.\u00a0Aggarwal, Outlier Analysis, 2nd edn, 2016.","DOI":"10.1007\/978-3-319-47578-3"},{"key":"10.3233\/SCS-230008_ref20","doi-asserted-by":"crossref","unstructured":"H.H.\u00a0Mohamed, E-Clean: A Data Cleaning Framework for Patient Data, 2011 First International Conference on Informatics and Computational Intelligence, IEEE, 2011.","DOI":"10.1109\/ICI.2011.21"},{"key":"10.3233\/SCS-230008_ref21","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-015-3994-4_8"},{"key":"10.3233\/SCS-230008_ref22","unstructured":"V.\u00a0Barnett and T.\u00a0Lewis, Outliers in Statistical Data, Vol.\u00a03, Wiley, New York, 1994."},{"key":"10.3233\/SCS-230008_ref23","doi-asserted-by":"crossref","unstructured":"R.J.A.Little and D.B.\u00a0Rubin, Statistical Analysis with Missing Data, Vol.\u00a0793, John Wiley & Sons, 2019.","DOI":"10.1002\/9781119482260"},{"key":"10.3233\/SCS-230008_ref24","unstructured":"J.\u00a0Han, J.\u00a0Pei and M.\u00a0Kamber, Data Mining: Concepts and Techniques. 2011, 1999."},{"issue":"12","key":"10.3233\/SCS-230008_ref26","doi-asserted-by":"publisher","first-page":"2451","DOI":"10.1109\/TSMC.2017.2718220","article-title":"Efficient outlier detection for high-dimensional data","volume":"48","author":"Liu","year":"2017","journal-title":"IEEE Transactions on Systems, Man, and Cybernetics: Systems"},{"key":"10.3233\/SCS-230008_ref27","doi-asserted-by":"crossref","unstructured":"R.\u00a0Bansal, N.\u00a0Gaur and S.N.\u00a0Singh, Outlier detection: Applications and techniques in data mining, in: 2016 6th International Conference-Cloud System and Big Data Engineering (Confluence), IEEE, 2016.","DOI":"10.1109\/CONFLUENCE.2016.7508146"},{"issue":"3","key":"10.3233\/SCS-230008_ref28","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/1541880.1541882","article-title":"Anomaly detection: A survey","volume":"41","author":"Chandola","year":"2009","journal-title":"ACM computing surveys (CSUR)"},{"key":"10.3233\/SCS-230008_ref30","doi-asserted-by":"crossref","unstructured":"S.\u00a0Maguerra, A.\u00a0Boulmakoul and H.\u00a0Badir, in: Time Framework: A Type Level and Algebra Driven Design Approach, 2021 International Conference on Data Analytics for Business and Industry (ICDABI), Sakheer, Bahrain, 2021, pp.\u00a025\u201326.","DOI":"10.1109\/ICDABI53623.2021.9655926"},{"key":"10.3233\/SCS-230008_ref31","doi-asserted-by":"publisher","DOI":"10.1080\/00401706.1969.10490657"},{"key":"10.3233\/SCS-230008_ref33","doi-asserted-by":"publisher","first-page":"731","DOI":"10.1016\/j.procs.2019.11.177","article-title":"A review on data cleansing methods for big data","volume":"161","author":"Ridzuan","year":"2019","journal-title":"Procedia Computer Science"},{"key":"10.3233\/SCS-230008_ref34","doi-asserted-by":"crossref","unstructured":"F.T.\u00a0Liu, K.M.\u00a0Ting and Z.-H.\u00a0Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008.","DOI":"10.1109\/ICDM.2008.17"},{"key":"10.3233\/SCS-230008_ref35","doi-asserted-by":"crossref","unstructured":"M.M.\u00a0Breunig et al., LOF: Identifying density-based local outliers, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000.","DOI":"10.1145\/342009.335388"},{"key":"10.3233\/SCS-230008_ref36","unstructured":"B.\u00a0Sch\u00f6lkopf, R.\u00a0Herbrich and A.J.\u00a0Smola, A generalized representer theorem, in: Computational Learning Theory: 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, July 16\u201319, 2001 Proceedings 14, Springer, Berlin Heidelberg, 2001."},{"issue":"1","key":"10.3233\/SCS-230008_ref37","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1348\/000711005X48266","article-title":"K-means clustering: A half-century synthesis","volume":"59","author":"Steinley","year":"2006","journal-title":"British Journal of Mathematical and Statistical Psychology"}],"container-title":["Journal of Smart Cities and Society"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/SCS-230008","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T04:58:50Z","timestamp":1777611530000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/SCS-230008"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,9]]},"references-count":28,"journal-issue":{"issue":"3"},"URL":"https:\/\/doi.org\/10.3233\/scs-230008","relation":{},"ISSN":["2772-3585","2772-3577"],"issn-type":[{"value":"2772-3585","type":"electronic"},{"value":"2772-3577","type":"print"}],"subject":[],"published":{"date-parts":[[2023,10,9]]}}}