{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T22:17:45Z","timestamp":1780525065915,"version":"3.54.1"},"reference-count":24,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2025,5,5]],"date-time":"2025-05-05T00:00:00Z","timestamp":1746403200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"PRR - Plano de Recupera\u00e7\u00e3o e Resili\u00eancia pela Uni\u00e3o Europeia","award":["C644943391-00000051"],"award-info":[{"award-number":["C644943391-00000051"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Data cleaning remains one of the most time-consuming and critical steps in modern data science, directly influencing the reliability and accuracy of downstream analytics. In this paper, we present a comprehensive evaluation of five widely used data cleaning tools\u2014OpenRefine, Dedupe, Great Expectations, TidyData (PyJanitor), and a baseline Pandas pipeline\u2014applied to large-scale, messy datasets spanning three domains (healthcare, finance, and industrial telemetry). We benchmark each tool on dataset sizes ranging from 1 million to 100 million records, measuring execution time, memory usage, error detection accuracy, and scalability under increasing data volumes. Additionally, we assess qualitative aspects such as usability and ease of integration, reflecting real-world adoption concerns. We incorporate recent findings on parallelized data cleaning and highlight how domain-specific anomalies (e.g., negative amounts in finance, sensor corruption in industrial telemetry) can significantly impact tool choice. Our findings reveal that no single solution excels across all metrics; while Dedupe provides robust duplicate detection and Great Expectations offers in-depth rule-based validation, tools like TidyData and baseline Pandas pipelines demonstrate strong scalability and flexibility under chunk-based ingestion. The choice of tool ultimately depends on domain-specific requirements (e.g., approximate matching in finance and strict auditing in healthcare) and the magnitude of available computational resources. By highlighting each framework\u2019s strengths and limitations, this study offers data practitioners clear, evidence-driven guidance for selecting and combining tools to tackle large-scale data cleaning challenges.<\/jats:p>","DOI":"10.3390\/data10050068","type":"journal-article","created":{"date-parts":[[2025,5,5]],"date-time":"2025-05-05T21:42:09Z","timestamp":1746481329000},"page":"68","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2118-1440","authenticated-orcid":false,"given":"Pedro","family":"Martins","sequence":"first","affiliation":[{"name":"Research Center in Digital Services, Polytechnic of Viseu, 3504-510 Viseu, Portugal"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3916-5182","authenticated-orcid":false,"given":"Filipe","family":"Cardoso","sequence":"additional","affiliation":[{"name":"Polytechnic Institute of Santar\u00e9m, Escola Superior de Gest\u00e3o e Tecnologia de Santar\u00e9m, 2001-904 Santar\u00e9m, Portugal"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1745-8937","authenticated-orcid":false,"given":"Paulo","family":"V\u00e1z","sequence":"additional","affiliation":[{"name":"Research Center in Digital Services, Polytechnic of Viseu, 3504-510 Viseu, Portugal"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7285-8282","authenticated-orcid":false,"given":"Jos\u00e9","family":"Silva","sequence":"additional","affiliation":[{"name":"Research Center in Digital Services, Polytechnic of Viseu, 3504-510 Viseu, Portugal"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9011-0734","authenticated-orcid":false,"given":"Maryam","family":"Abbasi","sequence":"additional","affiliation":[{"name":"Applied Research Institute, Polytechnic of Coimbra, 3045-093 Coimbra, Portugal"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2025,5,5]]},"reference":[{"key":"ref_1","first-page":"3","article-title":"Data Cleaning: Problems and Current Approaches","volume":"23","author":"Rahm","year":"2000","journal-title":"Bull. IEEE Comput. Soc. Tech. Comm. Data Eng."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Kandel, S., Paepcke, A., Hellerstein, J.M., and Heer, J. (2011, January 7\u201312). Wrangler: Interactive Visual Specification of Data Transformation Scripts. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), Vancouver, BC, Canada.","DOI":"10.1145\/1978942.1979444"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"271","DOI":"10.1177\/1473871611415994","article-title":"Research Directions in Data Wrangling: Visualizations and Transformations for Usable and Credible Data","volume":"10","author":"Kandel","year":"2012","journal-title":"Inf. Vis."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TKDE.2007.250581","article-title":"Duplicate Record Detection: A Survey","volume":"19","author":"Elmagarmid","year":"2007","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"993","DOI":"10.14778\/2994509.2994518","article-title":"Detecting data errors: Where are we and what needs to be done?","volume":"9","author":"Abedjan","year":"2016","journal-title":"Proc. VLDB Endow."},{"key":"ref_6","unstructured":"Ahmadi, F., Mandirali, Y., and Abedjan, Z. (2023). Accelerating the Data Cleaning Systems Raha and Baran through Task and Data Parallelism. Proc. VLDB Endow., Available online: https:\/\/vldb.org\/workshops\/2024\/proceedings\/QDB\/QDB-1.pdf."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Chu, X., Ilyas, I.F., Krishnan, S., and Wang, J. (July, January 26). Data Cleaning: Overview and Emerging Challenges. Proceedings of the 2016 International Conference on Management of Data (SIGMOD), San Francisco, CA, USA.","DOI":"10.1145\/2882903.2912574"},{"key":"ref_8","unstructured":"Stonebraker, M., Bruckner, D., Ilyas, I.F., Beskales, G., Cherniack, M., Zdonik, S.B., Pagan, A., and Xu, S. (2013, January 6\u20139). Data curation at scale: The data tamer system. Proceedings of the CIDR 2013, Asilomar, CA, USA."},{"key":"ref_9","unstructured":"Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business, Oxford University Press."},{"key":"ref_10","unstructured":"DataMade, Inc. (2025, March 10). Dedupe: A Library for de-Duplicating and Finding Matches in Messy Data. Available online: https:\/\/github.com\/datamade\/dedupe."},{"key":"ref_11","first-page":"113","article-title":"A Survey on Data Cleaning Tools and Techniques in the Big Data Era","volume":"13","author":"Singh","year":"2022","journal-title":"Int. J. Adv. Comput. Sci. Appl."},{"key":"ref_12","unstructured":"Papastergios, V., and Gounaris, A. (2024). A Survey of Open-Source Data Quality Tools: Shedding Light on the Materialization of Data Quality Dimensions in Practice. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Ehrlinger, L., and W\u00f6\u00df, W. (2022). A survey of data quality measurement and monitoring tools. Front. Big Data, 5.","DOI":"10.3389\/fdata.2022.850611"},{"key":"ref_14","unstructured":"Great Expectations Contributors (2025, March 10). Great Expectations: Always Know What to Expect from Your Data. Available online: https:\/\/greatexpectations.io\/."},{"key":"ref_15","unstructured":"Rekatsinas, T., Chu, X., Ilyas, I.F., and R\u00e9, C. (2017, January 14\u201319). Holoclean: Holistic Data Repairs with Probabilistic Inference. Proceedings of the 2017 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA."},{"key":"ref_16","first-page":"416","article-title":"BoostClean: Automated Error Detection and Repair for Machine Learning","volume":"12","author":"Roost","year":"2018","journal-title":"Proc. VLDB Endow."},{"key":"ref_17","first-page":"1418","article-title":"Raha: A Configuration-Free Error Detector","volume":"13","author":"Ike","year":"2020","journal-title":"Proc. VLDB Endow."},{"key":"ref_18","first-page":"1210","article-title":"Baran: Efficient Error Detection for Large-Scale Data","volume":"16","author":"Wei","year":"2023","journal-title":"Proc. VLDB Endow."},{"key":"ref_19","unstructured":"Tamr Inc. (2025, April 21). Tamr Product Overview. Available online: https:\/\/www.tamr.com."},{"key":"ref_20","unstructured":"McKinney, W. (July, January 28). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference (SciPy), Austin, TX, USA."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Mohammed, S., Budach, L., Feuerpfeil, M., Ihde, N., Nathansen, A., Noack, N., Patzlaff, H., Naumann, F., and Harmouch, H. (2024). The Effects of Data Quality on Machine Learning Performance. arXiv.","DOI":"10.1016\/j.is.2025.102549"},{"key":"ref_22","unstructured":"OpenRefine Community (2025, March 10). OpenRefine: A Free, Open Source Power Tool for Working with Messy Data. Available online: https:\/\/openrefine.org\/."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Resnick, D., and Ravent\u00f3s, N.A. (2024). Efficiency and Privacy in Record Linkage: Evaluating a Novel Blocking Technique Implemented on Cryptographic Longterm Keys. Int. J. Popul. Data Sci., 9.","DOI":"10.23889\/ijpds.v9i5.2534"},{"key":"ref_24","unstructured":"(2025, March 10). pyjanitor-devs.; Macken, Carl.; Contributors. pyjanitor: Clean APIs for Data Cleaning in Python. Available online: https:\/\/github.com\/pyjanitor-devs\/pyjanitor."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/5\/68\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:27:14Z","timestamp":1760030834000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/5\/68"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,5]]},"references-count":24,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2025,5]]}},"alternative-id":["data10050068"],"URL":"https:\/\/doi.org\/10.3390\/data10050068","relation":{},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,5]]}}}