{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T20:21:35Z","timestamp":1774729295685,"version":"3.50.1"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,6,24]],"date-time":"2024-06-24T00:00:00Z","timestamp":1719187200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Data and Information Quality"],"published-print":{"date-parts":[[2024,6,30]]},"abstract":"<jats:p>In this work, we address the challenging and open problem of involving non-expert users in the data repairing problem as first-class citizens. Despite a large number of proposals that have been devoted to cleaning data from the point of view of expert users (IT staff and data scientists), there is a lack of studies from the perspective of non-expert ones. Given a set of available data quality rules, we exploit machine learning techniques to guide the user to identify the dirty values for each violation and repair them. We show that with a low user effort, it is possible to identify the values in tuples that can be trusted and the ones that are most likely errors. We show experimentally how this machine learning approach leads to a unique clean solution with high quality in scenarios where other approaches fail.<\/jats:p>","DOI":"10.1145\/3665930","type":"journal-article","created":{"date-parts":[[2024,5,25]],"date-time":"2024-05-25T07:57:44Z","timestamp":1716623864000},"page":"1-31","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["BUNNI: Learning Repair Actions in Rule-driven Data Cleaning"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1189-1481","authenticated-orcid":false,"given":"Giansalvatore","family":"Mecca","sequence":"first","affiliation":[{"name":"DIMIE, Universita degli Studi della Basilicata, Potenza, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0651-4128","authenticated-orcid":false,"given":"Paolo","family":"Papotti","sequence":"additional","affiliation":[{"name":"EURECOM, Sophia Antipolis, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5651-8584","authenticated-orcid":false,"given":"Donatello","family":"Santoro","sequence":"additional","affiliation":[{"name":"DiMIE, Universita degli Studi della Basilicata, Potenza, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9947-8909","authenticated-orcid":false,"given":"Enzo","family":"Veltri","sequence":"additional","affiliation":[{"name":"Universita degli Studi della Basilicata, Potenza, Italy"}]}],"member":"320","published-online":{"date-parts":[[2024,6,24]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","unstructured":"Marcelo Arenas Leopoldo E. Bertossi and Jan Chomicki. 1999. Consistent query answers in inconsistent databases. In Proceedings of the 18th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM 68\u201379. DOI:10.1145\/303976.303983","DOI":"10.1145\/303976.303983"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.14778\/2850578.2850579"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00544"},{"key":"e_1_3_3_5_2","volume-title":"Pattern Recognition and Machine Learning","author":"Bishop Christopher M.","year":"2007","unstructured":"Christopher M. Bishop. 2007. Pattern Recognition and Machine Learning (5th ed.). Springer. https:\/\/www.worldcat.org\/oclc\/71008143"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066175"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626730"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453980"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767833"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.14778\/2536258.2536262"},{"key":"e_1_3_3_11_2","first-page":"315","volume-title":"Proceedings of the 2007 33rd International Conference on Very Large Data Bases","author":"Cong Gao","year":"2007","unstructured":"Gao Cong, Wenfei Fan, Floris Geerts, Xibei Jia, and Shuai Ma. 2007. Improving data quality: Consistency and accuracy. In Proceedings of the 2007 33rd International Conference on Very Large Data Bases. ACM, 315\u2013326. http:\/\/www.vldb.org\/conf\/2007\/papers\/research\/p315-cong.pdf"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465327"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData.2014.7004207"},{"issue":"1","key":"e_1_3_3_14_2","first-page":"5","article-title":"BayesWipe: A scalable probabilistic framework for improving data quality","volume":"8","author":"De Sushovan","year":"2016","unstructured":"Sushovan De, Yuheng Hu, Venkata Vamsikrishna Meduri, Yi Chen, and Subbarao Kambhampati. 2016. BayesWipe: A scalable probabilistic framework for improving data quality. Journal of Data and Information Quality 8, 1 (2016), 5.","journal-title":"Journal of Data and Information Quality"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/n19-1423"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.14778\/2536274.2536280"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.2200\/S00439ED1V01Y201207DTM030"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/1366102.1366103"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2010.154"},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544848"},{"key":"e_1_3_3_21_2","article-title":"Towards certain fixes with editing rules and master data","volume":"3","author":"Fan Wenfei","year":"2010","unstructured":"Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu. 2010. Towards certain fixes with editing rules and master data. VLDB Journal 3, 2 (2010), 173\u2013184.","journal-title":"VLDB Journal"},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989373"},{"issue":"2","key":"e_1_3_3_23_2","article-title":"Towards certain fixes with editing rules and master data","volume":"21","author":"Fan Wenfei","year":"2012","unstructured":"Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu. 2012. Towards certain fixes with editing rules and master data. VLDB Journal 21, 2 (2012), 213\u2013238.","journal-title":"VLDB Journal"},{"key":"e_1_3_3_24_2","unstructured":"Helena Galhardas Daniela Florescu Dennis E. Shasha Eric Simon and Cristian-Augustin Saita. 2001. Declarative data cleaning: Language model and algorithms. In Proceedingsof the 27th International Conference on Very Large Databases (VLDB \u201901). 371\u2013380. http:\/\/www.vldb.org\/conf\/2001\/P371.pdf"},{"key":"e_1_3_3_25_2","unstructured":"Susan Garavaglia and Asha Sharma. 1998. A smart guide to dummy variables: Four applications and a macro. In Proceedings of the Northeast SAS Users Group Conference. 43."},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-019-00586-5"},{"key":"e_1_3_3_27_2","volume-title":"Proceedings 27th International Conference on Extending Database Technology (EDBT \u201924)","author":"Glavic Boris","year":"2024","unstructured":"Boris Glavic, Giansalvatore Mecca, Ren\u00e9e J. Miller, Paolo Papotti, Donatello Santoro, and Enzo Veltri. 2024. Similarity measures for incomplete database instances. In Proceedings 27th International Conference on Extending Database Technology (EDBT \u201924)."},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","unstructured":"Lukasz Golab Howard J. Karloff Flip Korn Barna Saha and Divesh Srivastava. 2012. Discovering conservation rules. In Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE \u201912). IEEE 738\u2013749. DOI:10.1109\/ICDE.2012.105","DOI":"10.1109\/ICDE.2012.105"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/1656274.1656278"},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915242"},{"key":"e_1_3_3_31_2","unstructured":"Jeffrey Heer Joseph M. Hellerstein and Sean Kandel. 2015. Predictive interaction for data transformation. In Proceedings of the 7th Biennial Conference on Innovative Data Systems Research (CIDR \u201915). http:\/\/cidrdb.org\/cidr2015\/Papers\/CIDR15_Paper27.pdf"},{"key":"e_1_3_3_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2015.7113269"},{"key":"e_1_3_3_33_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2022.102158"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/1978942.1979444"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2747646"},{"key":"e_1_3_3_36_2","series-title":"Frontiers in Artificial Intelligence and Applications","first-page":"3","volume-title":"Emerging Artificial Intelligence Applications in Computer Engineering\u2014Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies","author":"Kotsiantis Sotiris B.","year":"2007","unstructured":"Sotiris B. Kotsiantis. 2007. Supervised machine learning: A review of classification techniques. In Emerging Artificial Intelligence Applications in Computer Engineering\u2014Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies. Frontiers in Artificial Intelligence and Applications, Vol. 160. IOS Press, 3\u201324. http:\/\/www.booksonline.iospress.nl\/Content\/View.aspx?piid=6950"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-00063-9_29"},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407801"},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807178"},{"key":"e_1_3_3_42_2","first-page":"653","article-title":"Relational dependency networks","author":"Neville Jennifer","year":"2007","unstructured":"Jennifer Neville and David Jensen. 2007. Relational dependency networks. Journal of Machine Learning Research 8 (March 2007), 653\u2013692.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_3_43_2","first-page":"841","article-title":"On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes","volume":"2","author":"Ng Andrew Y.","year":"2002","unstructured":"Andrew Y. Ng and Michael I. Jordan. 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems 2 (2002), 841\u2013848.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824086"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1016\/0004-3702(86)90072-X"},{"key":"e_1_3_3_46_2","first-page":"Article 140, 67","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (2020), Article 140, 67 pages.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_3_47_2","first-page":"381","volume-title":"Proceedings of the 27th International Conference on Very Large Data Bases (VLDB \u201901)","author":"Raman Vijayshankar","year":"2001","unstructured":"Vijayshankar Raman and Joseph M. Hellerstein. 2001. Potter\u2019s Wheel: An interactive data cleaning system. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB \u201901). 381\u2013390. http:\/\/www.vldb.org\/conf\/2001\/P381.pdf"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472291"},{"key":"e_1_3_3_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/775047.775087"},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.2200\/S00429ED1V01Y201207AIM018"},{"key":"e_1_3_3_52_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.datak.2013.06.003"},{"key":"e_1_3_3_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00041"},{"key":"e_1_3_3_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2014.6816655"},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610494"},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.5555\/2832415.2832489"},{"key":"e_1_3_3_57_2","volume-title":"Proceedings of Machine Learning and Systems 2020 (MLSys \u201920)","author":"Wu Richard","year":"2020","unstructured":"Richard Wu, Aoqian Zhang, Ihab F. Ilyas, and Theodoros Rekatsinas. 2020. Attention-based learning for missing data imputation in HoloClean. In Proceedings of Machine Learning and Systems 2020 (MLSys \u201920). https:\/\/proceedings.mlsys.org\/book\/307.pdf"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2463706"},{"key":"e_1_3_3_59_2","doi-asserted-by":"publisher","DOI":"10.14778\/1952376.1952378"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","unstructured":"Jian Zhou Zhixu Li Binbin Gu Qing Xie Jia Zhu Xiangliang Zhang and Guoliang Li. 2016. CrowdAidRepair: A crowd-aided interactive data repairing method. In Database Systems for Advanced Applications. Lecture Notes in Computer Science Vol. 9642. Springer 51\u201366. DOI:10.1007\/978-3-319-32025-0_4","DOI":"10.1007\/978-3-319-32025-0_4"}],"container-title":["Journal of Data and Information Quality"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3665930","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3665930","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:44:27Z","timestamp":1750290267000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3665930"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,24]]},"references-count":59,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,6,30]]}},"alternative-id":["10.1145\/3665930"],"URL":"https:\/\/doi.org\/10.1145\/3665930","relation":{},"ISSN":["1936-1955","1936-1963"],"issn-type":[{"value":"1936-1955","type":"print"},{"value":"1936-1963","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,24]]},"assertion":[{"value":"2023-05-04","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-05-15","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-06-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}