{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,15]],"date-time":"2025-12-15T14:02:09Z","timestamp":1765807329483,"version":"3.41.0"},"reference-count":16,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2016,5,9]],"date-time":"2016-05-09T00:00:00Z","timestamp":1462752000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGMOD Rec."],"published-print":{"date-parts":[[2016,5,9]]},"abstract":"<jats:p>For big data, data quality problems are more serious. A big data cleaning system requires scalability and the ability to handle mixed errors. Motivated by this, we develop Cleanix, a prototype system for cleaning relational Big Data. Cleanix takes data integrated from multiple data sources and cleans them on a shared-nothing machine cluster. The backend system is built on top of an extensible and flexible data-parallel substrate, the Hyracks framework. Cleanix supports various data cleaning tasks such as abnormal value detection and correction, incomplete data filling, de-duplication, and conflict resolution. 
In this paper, we show the organization, data cleaning algorithms as well as the design of Cleanix.<\/jats:p>","DOI":"10.1145\/2935694.2935702","type":"journal-article","created":{"date-parts":[[2016,5,11]],"date-time":"2016-05-11T12:11:38Z","timestamp":1462968698000},"page":"35-40","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["Cleanix"],"prefix":"10.1145","volume":"44","author":[{"given":"Hongzhi","family":"Wang","sequence":"first","affiliation":[{"name":"Harbin Institute of Technology"}]},{"given":"Mingda","family":"Li","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}]},{"given":"Yingyi","family":"Bu","sequence":"additional","affiliation":[{"name":"University of California, Irvine"}]},{"given":"Jianzhong","family":"Li","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}]},{"given":"Hong","family":"Gao","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}]},{"given":"Jiacheng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Tsinghua University"}]}],"member":"320","published-online":{"date-parts":[[2016,5,9]]},"reference":[{"volume-title":"Springer","year":"2007","author":"Herzog Thomas N.","key":"e_1_2_1_1_1"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.14778\/3402755.3402774"},{"key":"e_1_2_1_3_1","first-page":"371","volume-title":"VLDB","author":"Galhardas Helena","year":"2001"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767921"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.9"},{"issue":"4","key":"e_1_2_1_6_1","first-page":"3","article-title":"Problems and current approaches","volume":"23","author":"Rahm Erhard","year":"2000","journal-title":"IEEE Data Eng. 
Bull."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2007.367920"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1862919.1862924"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066175"},{"key":"e_1_2_1_10_1","first-page":"315","volume-title":"Proceedings of the 33rd International Conference on Very Large Data Bases","author":"Cong Gao","year":"2007"},{"issue":"3","key":"e_1_2_1_11_1","first-page":"11","article-title":"Corroborating information from web sources","volume":"34","author":"Marian Am\u00e9lie","year":"2011","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687690"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2661829.2661837"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2247596.2247598"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/0304-3975(92)90143-4"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.5555\/1884017.1884103"}],"container-title":["ACM SIGMOD Record"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2935694.2935702","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2935694.2935702","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:39:56Z","timestamp":1750217996000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2935694.2935702"}},"subtitle":["a Parallel Big Data Cleaning 
System"],"short-title":[],"issued":{"date-parts":[[2016,5,9]]},"references-count":16,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2016,5,9]]}},"alternative-id":["10.1145\/2935694.2935702"],"URL":"https:\/\/doi.org\/10.1145\/2935694.2935702","relation":{},"ISSN":["0163-5808"],"issn-type":[{"type":"print","value":"0163-5808"}],"subject":[],"published":{"date-parts":[[2016,5,9]]},"assertion":[{"value":"2016-05-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}