{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T11:48:18Z","timestamp":1776426498493,"version":"3.51.2"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,3,19]],"date-time":"2024-03-19T00:00:00Z","timestamp":1710806400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["SFRH\/BD\/135719\/2018, UIDB\/50021\/2020, UIDB\/00408\/2020, and UIDP\/00408\/2020"],"award-info":[{"award-number":["SFRH\/BD\/135719\/2018, UIDB\/50021\/2020, UIDB\/00408\/2020, and UIDP\/00408\/2020"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Data and Information Quality"],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>The existence of large amounts of data increases the probability of occurring data quality problems. A data cleaning process that corrects these problems is usually an iterative process, because it may need to be re-executed and refined to produce high-quality data. Moreover, due to the specificity of some data quality problems and the limitation of data cleaning programs to cover all problems, often a user has to be involved during the program executions by manually repairing data. However, there is no data cleaning framework that appropriately supports this involvement in such an iterative process, a form of human-in-the-loop, to clean structured data. Moreover, data preparation tools that somehow involve the user in data cleaning processes have not been evaluated with real users to assess their effort.<\/jats:p>\n          <jats:p>Therefore, we propose Cleenex, a data cleaning framework with support for user involvement during an iterative data cleaning process, and conduct two data cleaning experimental evaluations: an assessment of the Cleenex components that support the user when manually repairing data with a simulated user; and a comparison, in terms of user involvement, of data preparation tools with real users.<\/jats:p>\n          <jats:p>Results show that Cleenex components reduce the user effort when manually cleaning data during a data cleaning process, for example, the number of tuples visualized is reduced in 99%. Moreover, when performing data cleaning tasks with Cleenex, real users need less time\/effort (e.g., half the clicks) and, based on questionnaires, prefer it to the other tools used for comparison, OpenRefine and Pentaho Data Integration.<\/jats:p>","DOI":"10.1145\/3648476","type":"journal-article","created":{"date-parts":[[2024,2,15]],"date-time":"2024-02-15T11:51:27Z","timestamp":1707997887000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Cleenex: Support for User Involvement during an Iterative Data Cleaning Process"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3247-5524","authenticated-orcid":false,"given":"Jo\u00e3o L. M.","family":"Pereira","sequence":"first","affiliation":[{"name":"VISTA Lab, Algoritmi Center, University of \u00c9vora, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3559-828X","authenticated-orcid":false,"given":"Manuel J.","family":"Fonseca","sequence":"additional","affiliation":[{"name":"LASIGE, Faculdade de Ci\u00eancias, Universidade de Lisboa, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0688-3521","authenticated-orcid":false,"given":"Ant\u00f3nia","family":"Lopes","sequence":"additional","affiliation":[{"name":"LASIGE, Faculdade de Ci\u00eancias, Universidade de Lisboa, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9330-3910","authenticated-orcid":false,"given":"Helena","family":"Galhardas","sequence":"additional","affiliation":[{"name":"INESC-ID and Instituto Superior T\u00e9cnico, Universidade de Lisboa, Portugal"}]}],"member":"320","published-online":{"date-parts":[[2024,3,19]]},"reference":[{"key":"e_1_3_3_2_2","volume-title":"aiDM@SIGMOD","author":"Abdelaal Mohamed","year":"2023","unstructured":"Mohamed Abdelaal, Rashmi Koparde, and Harald Schoening. 2023. AutoCure: Automated tabular data curation technique for ML pipelines. In aiDM@SIGMOD."},{"key":"e_1_3_3_3_2","volume-title":"WebDB@SIGMOD","author":"Assadi Ahmad","year":"2018","unstructured":"Ahmad Assadi, Tova Milo, and Slava Novgorodov. 2018. Cleaning data with constraints and experts. In WebDB@SIGMOD."},{"issue":"15","key":"e_1_3_3_4_2","first-page":"48","article-title":"A survey of data quality tools.","volume":"14","author":"Barateiro Jos\u00e9","year":"2005","unstructured":"Jos\u00e9 Barateiro and Helena Galhardas. 2005. A survey of data quality tools. Daten.-Spektr. 14, 15-21 (2005), 48.","journal-title":"Daten.-Spektr."},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/1634.1636"},{"key":"e_1_3_3_6_2","volume-title":"PODS","author":"Bertossi Leopoldo","year":"2019","unstructured":"Leopoldo Bertossi. 2019. Database repairs and consistent query answering: Origins and further developments. In PODS."},{"key":"e_1_3_3_7_2","volume-title":"ICDE","author":"Bohannon Philip","year":"2007","unstructured":"Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. 2007. Conditional functional dependencies for data cleaning. In ICDE."},{"key":"e_1_3_3_8_2","unstructured":"Loreto Bravo Wenfei Fan and Shuai Ma. 2007. Extending dependencies with conditions. In VLDB (2007) 243\u2013254."},{"key":"e_1_3_3_9_2","volume-title":"ICDE","author":"Chiang Fei","year":"2011","unstructured":"Fei Chiang and Renee J. Miller. 2011. A unified model for data and constraint repair. In ICDE."},{"key":"e_1_3_3_10_2","volume-title":"HDKM","author":"Christen Peter","year":"2008","unstructured":"Peter Christen. 2008. FEBRL\u2014A freely available record linkage system with a graphical user interface. In HDKM."},{"key":"e_1_3_3_11_2","unstructured":"Gao Cong Wenfei Fan Floris Geerts Xibei Jia and Shuai Ma. 2007. Improving data quality: Consistency and accuracy. In VLDB (2007) 315\u2013326."},{"key":"e_1_3_3_12_2","volume-title":"SIGMOD","author":"Dallachiesa Michele","year":"2013","unstructured":"Michele Dallachiesa, Amr Ebaid, Ahmed Eldawy, Ahmed Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, and Nan Tang. 2013. NADEEF: A commodity data cleaning system. In SIGMOD."},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.2307\/249008"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1147\/rd.216.0534"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-010-0206-6"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-01892-3"},{"key":"e_1_3_3_17_2","volume-title":"VLDB","author":"Galhardas Helena","year":"2001","unstructured":"Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, and Cristian-Augustin Saita. 2001. Declarative data cleaning: Language, model, and algorithms. In VLDB."},{"key":"e_1_3_3_18_2","volume-title":"DaWak","author":"Galhardas Helena","year":"2011","unstructured":"Helena Galhardas, Ant\u00f3nia Lopes, and Emanuel Santos. 2011. Support for user involvement in data cleaning. In DaWak."},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.14778\/2536360.2536363"},{"key":"e_1_3_3_20_2","volume-title":"SIGMOD","author":"Giannakopoulou Stella","year":"2020","unstructured":"Stella Giannakopoulou, Manos Karpathiotakis, and Anastasia Ailamaki. 2020. Cleaning denial constraint violations through relaxation. In SIGMOD."},{"key":"e_1_3_3_21_2","volume-title":"SIGMOD","author":"He Jian","year":"2016","unstructured":"Jian He, Enzo Veltri, Donatello Santoro, Guoliang Li, Giansalvatore Mecca, Paolo Papotti, and Nan Tang. 2016. Interactive and deterministic data cleaning. In SIGMOD."},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2020.101608"},{"key":"e_1_3_3_23_2","volume-title":"CHI","author":"Kandel Sean","year":"2011","unstructured":"Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2011. Wrangler: Interactive visual specification of data transformation scripts. In CHI."},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.14778\/3430915.3430917"},{"key":"e_1_3_3_25_2","volume-title":"HILDA@SIGMOD","author":"Krishnan Sanjay","year":"2016","unstructured":"Sanjay Krishnan, Daniel Haas, Michael J. Franklin, and Eugene Wu. 2016. Towards reliable interactive data cleaning: A user survey and recommendations. In HILDA@SIGMOD."},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_3_3_27_2","volume-title":"Research Methods in Human-Computer Interaction (2nd ed.)","author":"Lazar Jonathan","year":"2017","unstructured":"Jonathan Lazar, Jinjuan Heidi Feng, and Harry Hochheiser. 2017. Research Methods in Human-Computer Interaction (2nd ed.). Morgan Kaufmann."},{"key":"e_1_3_3_28_2","first-page":"707","volume-title":"Soviet Physics Doklady","author":"Levenshtein Vladimir I.","year":"1966","unstructured":"Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10. 707\u2013710."},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407801"},{"key":"e_1_3_3_30_2","first-page":"156","volume-title":"Big Data: The Next Frontier for Innovation, Competition, and Productivity","author":"Manyika J.","year":"2011","unstructured":"J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Byers. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. Technical Report. McKinsey Global Institute. 156 pages."},{"key":"e_1_3_3_31_2","volume-title":"SIGMOD","author":"Musleh Mashaal","year":"2020","unstructured":"Mashaal Musleh, Mourad Ouzzani, Nan Tang, and AnHai Doan. 2020. CoClean: Collaborative data cleaning. In SIGMOD."},{"key":"e_1_3_3_32_2","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574258"},{"key":"e_1_3_3_33_2","volume-title":"Workshop on Data and Information Quality","author":"Oliveira Paulo","year":"2005","unstructured":"Paulo Oliveira, F\u00e1tima Rodrigues, and Helena Galhardas. 2005. A taxonomy of data quality problems. In Workshop on Data and Information Quality."},{"key":"e_1_3_3_34_2","volume-title":"CLEENEX: Iterative Data Cleaning with User Intervention","author":"Ormonde C\u00e1tia Borges","year":"2017","unstructured":"C\u00e1tia Borges Ormonde. 2017. CLEENEX: Iterative Data Cleaning with User Intervention. Master\u2019s Thesis. Instituto Superior T\u00e9cnico, Universidade de Lisboa."},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570694"},{"key":"e_1_3_3_36_2","volume-title":"Towards Effective and Effortless Data Cleaning: From Automatic Approaches to User Involvement","author":"Pereira Jo\u00e3o L. M.","year":"2023","unstructured":"Jo\u00e3o L. M. Pereira. 2023. Towards Effective and Effortless Data Cleaning: From Automatic Approaches to User Involvement. Ph. D. Dissertation. Instituto Superior T\u00e9cnico, Universidade de Lisboa."},{"key":"e_1_3_3_37_2","volume-title":"INFORUM","author":"Pereira Jo\u00e3o L. M.","year":"2017","unstructured":"Jo\u00e3o L. M. Pereira and Helena Galhardas. 2017. Approximate duplicate elimination using state-of-the-art tools: A comparison. In INFORUM."},{"key":"e_1_3_3_38_2","volume-title":"CIKM","author":"Personnaz Aur\u00e9lien","year":"2021","unstructured":"Aur\u00e9lien Personnaz, Sihem Amer-Yahia, Laure Berti-Equille, Maximilian Fabricius, and Srividya Subramanian. 2021. DORA the explorer: Exploring very large data with interactive deep reinforcement learning. In CIKM."},{"key":"e_1_3_3_39_2","volume-title":"IEEE Big Data","author":"Pham Minh","year":"2019","unstructured":"Minh Pham, Craig A. Knoblock, and Jay Pujara. 2019. Learning data transformations with minimal user effort. In IEEE Big Data."},{"issue":"4","key":"e_1_3_3_40_2","first-page":"3","article-title":"Data cleaning: Problems and current approaches","volume":"23","author":"Rahm Erhard","year":"2000","unstructured":"Erhard Rahm and Hong Hai Do. 2000. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23, 4 (2000), 3\u201313.","journal-title":"IEEE Data Eng. Bull."},{"issue":"13","key":"e_1_3_3_41_2","first-page":"2263","article-title":"ICARUS: Minimizing human effort in iterative data completion","volume":"11","author":"Rahman Protiva","year":"2018","unstructured":"Protiva Rahman, Courtney Hebert, and Arnab Nandi. 2018. ICARUS: Minimizing human effort in iterative data completion. PVLDB 11, 13 (2018), 2263\u20132276.","journal-title":"PVLDB"},{"key":"e_1_3_3_42_2","volume-title":"VLDB","author":"Raman Vijayshankar","year":"2001","unstructured":"Vijayshankar Raman and Joseph M. Hellerstein. 2001. Potter\u2019s wheel: An interactive data cleaning system. In VLDB."},{"key":"e_1_3_3_43_2","volume-title":"HILDA@SIGMOD","author":"R\u00e4th Timo","year":"2023","unstructured":"Timo R\u00e4th, Ngozichukwuka Onah, and Kai-Uwe Sattler. 2023. Interactive data cleaning for real-time streaming applications. In HILDA@SIGMOD."},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476301"},{"key":"e_1_3_3_46_2","volume-title":"HILDA@SIGMOD","author":"Rezig El Kindi","year":"2019","unstructured":"El Kindi Rezig, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref, and Michael Stonebraker. 2019. Towards an end-to-end human-centric data cleaning framework. In HILDA@SIGMOD."},{"key":"e_1_3_3_47_2","volume-title":"ICDE","author":"Saha Barna","year":"2014","unstructured":"Barna Saha and Divesh Srivastava. 2014. Data quality: The other face of big data. In ICDE."},{"key":"e_1_3_3_48_2","volume-title":"SIGMOD","author":"Salimi Babak","year":"2019","unstructured":"Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Interventional fairness: Causal database repair for algorithmic fairness. In SIGMOD."},{"key":"e_1_3_3_49_2","volume-title":"Interaction Design: Beyond Human Computer Interaction (2nd ed.)","author":"Sharp Helen","year":"2007","unstructured":"Helen Sharp, Yvonne Rogers, and Jenny Preece. 2007. Interaction Design: Beyond Human Computer Interaction (2nd ed.). John Wiley & Sons, Inc., Hoboken, NJ."},{"key":"e_1_3_3_50_2","article-title":"How to create a business case for data quality improvement","author":"Moore Susan","year":"2018","unstructured":"Susan Moore. 2018. How to create a business case for data quality improvement. Gartner. https:\/\/www.gartner.com\/smarterwithgartner\/how-to-create-a-business-case-for-data-quality-improvement","journal-title":"Gartner"},{"issue":"3","key":"e_1_3_3_51_2","first-page":"6","article-title":"Gartner warns firms of \u201cdirty data.\u201d","volume":"41","author":"Swartz Nikki","year":"2007","unstructured":"Nikki Swartz. 2007. Gartner warns firms of \u201cdirty data.\u201d Inf. Manag. J. 41, 3 (2007), 6.","journal-title":"Inf. Manag. J."},{"key":"e_1_3_3_52_2","volume-title":"SIGMOD","author":"Thirumuruganathan Saravanan","year":"2017","unstructured":"Saravanan Thirumuruganathan, Laure Berti-Equille, Mourad Ouzzani, Jorge-Arnulfo Quiane-Ruiz, and Nan Tang. 2017. UGuide: User-guided discovery of FD-detectable errors. In SIGMOD."},{"key":"e_1_3_3_53_2","volume-title":"ICDE","author":"Volkovs Maksims","year":"2014","unstructured":"Maksims Volkovs, Fei Chiang, Jaroslaw Szlichta, and Ren\u00e9e J. Miller. 2014. Continuous data cleaning. In ICDE."},{"key":"e_1_3_3_54_2","volume-title":"TRL@NeurIPS","author":"Vos David","year":"2022","unstructured":"David Vos, Till D\u00f6hmen, and Sebastian Schelter. 2022. Towards parameter-efficient automation of data wrangling tasks with prefix-tuning. In TRL@NeurIPS."},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2962152"},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.14778\/1952376.1952378"},{"key":"e_1_3_3_57_2","volume-title":"SIGMOD","author":"Yu Zhuoran","year":"2019","unstructured":"Zhuoran Yu and Xu Chu. 2019. PIClean: A probabilistic and interactive data cleaning system. In SIGMOD."}],"container-title":["Journal of Data and Information Quality"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3648476","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3648476","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:50:19Z","timestamp":1750287019000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3648476"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,19]]},"references-count":56,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3648476"],"URL":"https:\/\/doi.org\/10.1145\/3648476","relation":{},"ISSN":["1936-1955","1936-1963"],"issn-type":[{"value":"1936-1955","type":"print"},{"value":"1936-1963","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,19]]},"assertion":[{"value":"2023-03-14","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-25","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-03-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}