{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T16:40:12Z","timestamp":1758904812730,"version":"3.44.0"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"4","funder":[{"DOI":"10.13039\/501100001809","name":"NSFC","doi-asserted-by":"crossref","award":["62302241"],"award-info":[{"award-number":["62302241"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,9,22]]},"abstract":"<jats:p>Erroneous data frequently arise in practical scenarios due to a variety of factors, severely degrading data quality and impeding downstream applications. A widely adopted strategy for error detection is to detect conflicts based on integrity constraints and identify the minimum number of errors, thereby ensuring that the remaining cells satisfy the constraints. However, the minimum change principle may not be applicable in practical scenarios, since errors can occur simultaneously or irregularly. Therefore, this study employs Bayesian statistics to identify erroneous attribute values in conflicting cells that violate inter-attribute dependencies, rather than simply relying on the minimum change principle. This approach ensures that our work neither misses multiple erroneous attribute values conflicting with each other nor mistakenly detects outliers without errors. Furthermore, to address the efficiency issues commonly encountered in constraint-based data cleaning methods, we design 1) parallel conflict detection and error determination methods with the guaranteed parallel scalability, and 2) efficient incremental error detection strategies that can also be executed in parallel with such guarantees. Experiments conducted on various datasets demonstrate the superiority of our error detection methods in terms of both effectiveness and efficiency.<\/jats:p>","DOI":"10.1145\/3749174","type":"journal-article","created":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T17:17:03Z","timestamp":1758647823000},"page":"1-26","source":"Crossref","is-referenced-by-count":0,"title":["Minimum Change\u2260 Best Cleaning: Parallel and Incremental Error Detection under Integrity Constraints"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-9105-1961","authenticated-orcid":false,"given":"Jiahui","family":"Chen","sequence":"first","affiliation":[{"name":"College of Computer Science, Nankai University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-7398-2972","authenticated-orcid":false,"given":"Yu","family":"Sun","sequence":"additional","affiliation":[{"name":"College of Computer Science, Nankai University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9503-2755","authenticated-orcid":false,"given":"Shaoxu","family":"Song","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5852-0426","authenticated-orcid":false,"given":"Haiwei","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Computer Science, Nankai University, Tianjin, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5876-6856","authenticated-orcid":false,"given":"Xiaojie","family":"Yuan","sequence":"additional","affiliation":[{"name":"College of Computer Science, Nankai University, Tianjin, China"}]}],"member":"320","published-online":{"date-parts":[[2025,9,23]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"https:\/\/github.com\/twinklelittlestars\/PED"},{"key":"e_1_2_1_2_1","unstructured":"https:\/\/mimic.mit.edu\/"},{"key":"e_1_2_1_3_1","unstructured":"https:\/\/lunadong.com\/fusiondatasets"},{"key":"e_1_2_1_4_1","unstructured":"https:\/\/github.com\/sjyk\/activedetect"},{"key":"e_1_2_1_5_1","unstructured":"https:\/\/github.com\/HoloClean\/holoclean"},{"key":"e_1_2_1_6_1","unstructured":"https:\/\/github.com\/BigDaMa\/raha"},{"key":"e_1_2_1_7_1","unstructured":"https:\/\/github.com\/densitysrepair\/densitysrepair"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994518"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/2850578.2850579"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920870"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3196959.3196966"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3517804.3526230"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3651600"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2020.2974602"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ic.2004.04.007"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536258.2536262"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544847"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544847"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2749431"},{"key":"e_1_2_1_20_1","unstructured":"Wenfei Fan. 2009. Constraint-Driven Database Repair."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-010-0206-6"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1412331.1412337"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/3457390.3457400"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319888"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196889"},{"key":"e_1_2_1_26_1","unstructured":"James Joyce. 2003. Bayes' theorem. (2003)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2747646"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.219"},{"key":"e_1_2_1_29_1","volume-title":"BoostClean: Automated Error Detection and Repair for Machine Learning. CoRR","author":"Krishnan Sanjay","year":"2017","unstructured":"Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, and Eugene Wu. 2017. BoostClean: Automated Error Detection and Repair for Machine Learning. CoRR, Vol. abs\/1711.01299 (2017). arXiv:1711.01299 http:\/\/arxiv.org\/abs\/1711.01299"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/3654621.3654624"},{"key":"e_1_2_1_32_1","doi-asserted-by":"crossref","unstructured":"Dennis Victor Lindley. 1972. Bayesian statistics: A review. SIAM.","DOI":"10.1137\/1.9781611970654"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1102351.1102418"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3097983.3098062"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807178"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3358129"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3675034.3675051"},{"key":"e_1_2_1_40_1","unstructured":"Clement Pit-Claudel Zelda Mariet Rachael Harding and Sam Madden. 2016. Outlier detection in heterogeneous datasets using automatic tuple expansion. (2016)."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3236187.3236193"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2019.2905548"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476301"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000824.2000826"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00068"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2023.3294401"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3565816.3565828"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2463706"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380568"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3749174","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T16:20:33Z","timestamp":1758903633000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3749174"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,22]]},"references-count":50,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,9,22]]}},"alternative-id":["10.1145\/3749174"],"URL":"https:\/\/doi.org\/10.1145\/3749174","relation":{},"ISSN":["2836-6573"],"issn-type":[{"type":"electronic","value":"2836-6573"}],"subject":[],"published":{"date-parts":[[2025,9,22]]}}}