{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T00:05:29Z","timestamp":1767830729455,"version":"3.49.0"},"reference-count":54,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,7]]},"abstract":"<jats:p>\n            Data cleaning is an essential technique to enhance data quality. Despite the proposal of various algorithms with different cleaning strategies, current automated cleaning technologies still fall short of practical requirements when dealing with large-scale data containing mixed errors. This paper presents\n            <jats:bold>UniClean<\/jats:bold>\n            to efficiently solve the mixed error cleaning problem with three key technical contributions. (1) A unified construction and extension method for cleaners, enabling cleaning methods to easily utilize various cleaners to perform cleaning tasks. (2) Three optimization strategies to achieve efficiency-oriented cleaning preparation. (3) A cleaning algorithm based on an optimized cleaning process to effectively clean mixed errors. UniClean achieves a time complexity of\n            <jats:italic toggle=\"yes\">O<\/jats:italic>\n            (|\n            <jats:italic toggle=\"yes\">D<\/jats:italic>\n            <jats:sub>error<\/jats:sub>\n            |\n            <jats:sup>4<\/jats:sup>\n            \u00b7 |\n            <jats:italic toggle=\"yes\">Op<\/jats:italic>\n            | +\n            <jats:italic toggle=\"yes\">|D|<\/jats:italic>\n            \u00b7 |\n            <jats:italic toggle=\"yes\">D<\/jats:italic>\n            <jats:sub>error<\/jats:sub>\n            |), significantly enhancing scalability. Experiments on public and large-scale enterprise datasets demonstrate that UniClean achieves over 40% improvement across five metrics, compared to five state-of-the-art cleaning methods, and delivers more than 30% gains in\n            <jats:italic toggle=\"yes\">F1<\/jats:italic>\n            and\n            <jats:italic toggle=\"yes\">REDR<\/jats:italic>\n            on complex datasets, while completing the cleaning process within hours even for millions of records.\n          <\/jats:p>","DOI":"10.14778\/3749646.3749681","type":"journal-article","created":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T17:55:06Z","timestamp":1757008506000},"page":"4117-4130","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["UniClean: A Scalable Data Cleaning Solution for Mixed Errors Based on Unified Cleaners and Optimized Cleaning Workflow"],"prefix":"10.14778","volume":"18","author":[{"given":"Xiaoou","family":"Ding","sequence":"first","affiliation":[{"name":"Harbin Institute of Technology"}]},{"given":"Zekai","family":"Qian","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}]},{"given":"Hongzhi","family":"Wang","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}]},{"given":"Siying","family":"Chen","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}]},{"given":"Yafeng","family":"Tang","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}]},{"given":"Hongbin","family":"Su","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}]},{"given":"Huan","family":"Hu","sequence":"additional","affiliation":[{"name":"Huawei Cloud Computing Technologies Co., Ltd., China"}]},{"given":"Chen","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University"}]}],"member":"320","published-online":{"date-parts":[[2025,9,4]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.48786\/EDBT.2023.43"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994518"},{"key":"e_1_2_1_3_1","volume-title":"Foundations of Databases","author":"Abiteboul Serge","unstructured":"Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/2850578.2850579"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544854"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/3007263.3007320"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536258.2536262"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544847"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of Workshops at the 50th International Conference on Very Large Data Bases, VLDB 2024","author":"Cortes Carolina","year":"2024","unstructured":"Carolina Cortes, Camila Sanz, Lorena Etcheverry, and Adriana Marotta. 2024. Data Quality Management for Responsible AI in Data Lakes. In Proceedings of Workshops at the 50th International Conference on Very Large Data Bases, VLDB 2024, Guangzhou, China, August 26\u201330, 2024. VLDB.org. https:\/\/vldb.org\/workshops\/2024\/proceedings\/TaDA\/TaDA.13.pdf"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465327"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE60146.2024.00283"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE60146.2024.00283"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE60146.2024.00282"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE60146.2024.00271"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/3681954.3682015"},{"key":"e_1_2_1_16_1","volume-title":"UniClean: A Multi-Signal Fusion Pipeline for Optimizing Data Cleaning Workflow. In 2025 IEEE 41st International Conference on Data Engineering (ICDE). IEEE Computer Society, 4600\u20134603","author":"Ding Xiaoou","year":"2025","unstructured":"Xiaoou Ding, Zekai Qian, Hongzhi Wang, Zhe Sun, Siying Chen, Hongbin Su, and Huan Hu. 2025. UniClean: A Multi-Signal Fusion Pipeline for Optimizing Data Cleaning Workflow. In 2025 IEEE 41st International Conference on Data Engineering (ICDE). IEEE Computer Society, 4600\u20134603."},{"key":"e_1_2_1_17_1","unstructured":"Xiaoou Ding Zekai Qian Hongzhi Wang Zhe Sun Siying Chen Hongbin Su and Huan Hu. 2025. UniClean Demo (ICDE Demo 2025). https:\/\/youtu.be\/BGYHj0gMN2g Video demonstration."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/3704965.3704987"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3685800.3685879"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352066"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2020.2992456"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.14778\/3236187.3236198"},{"key":"e_1_2_1_23_1","unstructured":"Equifax. 2022. Equifax Statement on Recent Coding Issue. https:\/\/www.equifax.com\/newsroom\/all-news\/-\/story\/equifax-statement-on-recent-coding-issue\/ Accessed: 2025-01-01."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.2200\/S00439ED1V01Y201207DTM030"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1366102.1366103"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687674"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452750"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00258"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389775"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00303"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-015-3994-4"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319888"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/PROC.1982.12425"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE53745.2022.00231"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2747646"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/1514894.1514901"},{"key":"e_1_2_1_37_1","volume-title":"Adaptive Data Transformations for QaaS. In Conference on Innovative Data Systems Research (CIDR). https:\/\/vldb.org\/cidrdb\/papers\/2025\/p20-koutsoukos.pdf","author":"Koutsoukos Dimitrios","year":"2025","unstructured":"Dimitrios Koutsoukos, Renato Marroqu\u00edn, Ingo M\u00fcller, and Ana Klimovic. 2025. Adaptive Data Transformations for QaaS. In Conference on Innovative Data Systems Research (CIDR). https:\/\/vldb.org\/cidrdb\/papers\/2025\/p20-koutsoukos.pdf"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00009"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3421424.3421431"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415484"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407801"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3675034.3675051"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1016\/0169-2070(95)00625-7"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476390"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-55705-2_14"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476301"},{"key":"e_1_2_1_49_1","unstructured":"Manasi S. 2021. How to Improve Your Data Quality. Gartner Research. https:\/\/www.gartner.com\/smarterwithgartner\/how-to-improve-your-data-quality [Accessed: 2025-01-01]."},{"key":"e_1_2_1_50_1","doi-asserted-by":"crossref","unstructured":"Shaoxu Song Aoqian Zhang Jianmin Wang and Philip S. Yu. [n. d.]. SCREEN: Stream Data Cleaning under Speed Constraints. In SIGMOD. 827\u2013841.","DOI":"10.1145\/2723372.2723730"},{"key":"e_1_2_1_51_1","unstructured":"Motley Fool Transcribing. 2022. Unity Software Inc. (U) Q1 2022 Earnings Call Transcript. https:\/\/www.fool.com\/earnings\/call-transcripts\/2022\/05\/11\/unity-software-inc-u-q1-2022-earnings-call-transcr Accessed: 2025-01-01."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.14778\/3554821.3554848"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2463706"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.5441\/002\/EDBT.2020.28"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3749646.3749681","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,5]],"date-time":"2025-09-05T03:41:06Z","timestamp":1757043666000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3749646.3749681"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7]]},"references-count":54,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025,7]]}},"alternative-id":["10.14778\/3749646.3749681"],"URL":"https:\/\/doi.org\/10.14778\/3749646.3749681","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,7]]},"assertion":[{"value":"2025-09-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}