{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T09:02:38Z","timestamp":1775638958822,"version":"3.50.1"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,7]]},"abstract":"<jats:p>\n            A large class of data repair algorithms rely on integrity constraints to detect and repair errors. A well-studied class of constraints is Functional Dependencies (FDs, for short). Although there has been an increased interest in developing general data cleaning systems for a myriad of data errors, scalability has been left behind. This is because current systems assume data cleaning is performed offline and in one iteration. However, developing data science pipelines is highly iterative and requires efficient cleaning techniques to scale to millions of records in seconds\/minutes, not days. In our efforts to re-think the data cleaning stack and bring it to the era of data science, we introduce\n            <jats:italic>Horizon<\/jats:italic>\n            , an end-to-end FD repair system to address two key challenges: (1) Accuracy: Most existing FD repair techniques aim to produce repairs that minimize changes to the data that may lead to incorrect combinations of attribute values (or patterns).\n            <jats:italic>Horizon<\/jats:italic>\n            leverages the interaction between the data patterns induced by the various FDs, and subsequently selects repairs that preserve the most frequent patterns found in the original data, and hence leading to a better repair accuracy. 
(2) Scalability: Existing data cleaning systems struggle when dealing with large-scale real-world datasets.\n            <jats:italic>Horizon<\/jats:italic>\n            features a linear-time repair algorithm that scales to millions of records, and is orders-of-magnitude faster than state-of-the-art cleaning algorithms. A benchmark of\n            <jats:italic>Horizon<\/jats:italic>\n            against state-of-the-art cleaning systems on multiple datasets and metrics shows that\n            <jats:italic>Horizon<\/jats:italic>\n            consistently outperforms existing techniques in repair quality and scalability.\n          <\/jats:p>","DOI":"10.14778\/3476249.3476301","type":"journal-article","created":{"date-parts":[[2021,10,27]],"date-time":"2021-10-27T16:46:23Z","timestamp":1635353183000},"page":"2546-2554","source":"Crossref","is-referenced-by-count":36,"title":["Horizon"],"prefix":"10.14778","volume":"14","author":[{"given":"El Kindi","family":"Rezig","sequence":"first","affiliation":[{"name":"MIT CSAIL"}]},{"given":"Mourad","family":"Ouzzani","sequence":"additional","affiliation":[{"name":"Qatar Computing Research Institute"}]},{"given":"Walid G.","family":"Aref","sequence":"additional","affiliation":[{"name":"Purdue University"}]},{"given":"Ahmed K.","family":"Elmagarmid","sequence":"additional","affiliation":[{"name":"Qatar Computing Research Institute"}]},{"given":"Ahmed R.","family":"Mahmood","sequence":"additional","affiliation":[{"name":"Purdue University"}]},{"given":"Michael","family":"Stonebraker","sequence":"additional","affiliation":[{"name":"MIT CSAIL"}]}],"member":"320","published-online":{"date-parts":[[2021,10,27]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"New York City Open Data. https:\/\/opendata.cityofnewyork.us.  New York City Open Data. 
https:\/\/opendata.cityofnewyork.us."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.5555\/551350"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/170036.170072"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/2850578.2850579"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2016.03.016"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920870"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544854"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066175"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767833"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544847"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824109"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.5555\/1325851.1325890"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1559845.1559895"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465327"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2992787"},{"key":"e_1_2_1_16_1","volume-title":"Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang.","author":"Deng Dong","year":"2017","unstructured":"Dong Deng , Raul Castro Fernandez , Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017 . The Data Civilizer System. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8--11, 2017, Online Proceedings . www.cidrdb.org. http:\/\/cidrdb.org\/cidr2017\/papers\/p44-deng-cidr17.pdf Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. 
Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017. The Data Civilizer System. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8--11, 2017, Online Proceedings. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2017\/papers\/p44-deng-cidr17.pdf"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1366102.1366103"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536360.2536363"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453900"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2016.2637928"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000045"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2015.7113269"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1514894.1514901"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407801"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_26_1","volume-title":"CIDR 2020, 10th Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, January 12--15, 2020, Online Proceedings. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2020\/papers\/p35-rezig-cidr20","author":"Rezig El Kindi","year":"2020","unstructured":"El Kindi Rezig , Lei Cao , Giovanni Simonini , Maxime Schoemans , Samuel Madden , Nan Tang , Mourad Ouzzani , and Michael Stonebraker . 2020 . Dagger: A Data (not code) Debugger . In CIDR 2020, 10th Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, January 12--15, 2020, Online Proceedings. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2020\/papers\/p35-rezig-cidr20 .pdf El Kindi Rezig, Lei Cao, Giovanni Simonini, Maxime Schoemans, Samuel Madden, Nan Tang, Mourad Ouzzani, and Michael Stonebraker. 2020. Dagger: A Data (not code) Debugger. 
In CIDR 2020, 10th Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, January 12--15, 2020, Online Proceedings. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2020\/papers\/p35-rezig-cidr20.pdf"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-018-0510-0"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/SWAT.1971.10"},{"key":"e_1_2_1_29_1","first-page":"16","article-title":"MODELDB: Opportunities and Challenges in Managing Machine Learning Models","volume":"41","author":"Vartak Manasi","year":"2018","unstructured":"Manasi Vartak and Samuel Madden. 2018. MODELDB: Opportunities and Challenges in Managing Machine Learning Models. IEEE Data Eng. Bull. 41, 4 (2018), 16--25. http:\/\/sites.computer.org\/debull\/A18dec\/p16.pdf","journal-title":"IEEE Data Eng. 
Bull."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610505"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3041761"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2463706"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3476249.3476301","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:08:12Z","timestamp":1672222092000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3476249.3476301"}},"subtitle":["scalable dependency-driven data cleaning"],"short-title":[],"issued":{"date-parts":[[2021,7]]},"references-count":32,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2021,7]]}},"alternative-id":["10.14778\/3476249.3476301"],"URL":"https:\/\/doi.org\/10.14778\/3476249.3476301","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,7]]}}}