{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,18]],"date-time":"2025-02-18T18:10:35Z","timestamp":1739902235280,"version":"3.37.3"},"reference-count":34,"publisher":"Association for Computing Machinery (ACM)","issue":"13","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,9]]},
"abstract":"<jats:p>In this paper, we present a new method for learned data cleaning. In contrast to existing methods, our method learns to clean data in the latent space. The main idea is that we (1) shape the latent space such that we know the area where clean data resides and (2) learn latent operators trained on error repair (Lopster) which shift erroneous data (e.g., table rows with noise, outliers, or missing values) in their latent representation back to a \"clean\" region, thus abstracting the complexities of the input domain. When formulating data cleaning as a simple shift operation in latent space, we can repair all types of errors using the same method which makes it more robust than other methods. Importantly, with our method, we can handle errors that are unseen during the training of our error repair model. We do not rely on an external error detection method as seen in the state-of-the-art, instead, we handle both detection and repair within the Lopster framework. In our evaluation, we show that our approach outperforms existing cleaning methods even when trained on only a subset of the errors that occur in the dirty data.<\/jats:p>",
"DOI":"10.14778\/3704965.3704983","type":"journal-article","created":{"date-parts":[[2025,2,18]],"date-time":"2025-02-18T17:22:57Z","timestamp":1739899377000},"page":"4786-4798","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Generalizable Data Cleaning of Tabular Data in Latent Space"],"prefix":"10.14778","volume":"17",
"author":[{"given":"Eduardo","family":"Reis","sequence":"first","affiliation":[{"name":"Technical University of Darmstadt"}]},{"given":"Mohamed","family":"Abdelaal","sequence":"additional","affiliation":[{"name":"Software AG"}]},{"given":"Carsten","family":"Binnig","sequence":"additional","affiliation":[{"name":"Technical University of Darmstadt &amp; DFKI"}]}],"member":"320","published-online":{"date-parts":[[2025,2,18]]},
"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the 26th International Conference on Extending Database Technology (EDBT).","author":"Abdelaal Mohamed","year":"2023","unstructured":"Mohamed Abdelaal, Christian Hammacher, and Harald Schoening. 2023. REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines. In Proceedings of the 26th International Conference on Extending Database Technology (EDBT)."},
{"key":"e_1_2_1_2_1","volume-title":"27th International Conference on Extending Database Technology (EDBT).","author":"Abdelaal Mohamed","year":"2024","unstructured":"Mohamed Abdelaal, Tim Ktitarev, Daniel Staedtler, and Harald Schoening. 2024. SAGED: Meta learning-powered Error Detection Technique for Tabular Data. In 27th International Conference on Extending Database Technology (EDBT)."},
{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476300"},
{"key":"e_1_2_1_4_1","first-page":"1","article-title":"DataWig: Missing value imputation for tables","volume":"20","author":"Biessmann Felix","year":"2019","unstructured":"Felix Biessmann, Tammo Rukat, Phillipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, and David Salinas. 2019. DataWig: Missing value imputation for tables. Journal of Machine Learning Research 20, 175 (2019), 1--6.","journal-title":"Journal of Machine Learning Research"},
{"key":"e_1_2_1_5_1","volume-title":"Addressing the Topological Defects of Disentanglement via Distributed Operators. CoRR abs\/2102.05623","author":"Bouchacourt Diane","year":"2021","unstructured":"Diane Bouchacourt, Mark Ibrahim, and St\u00e9phane Deny. 2021. Addressing the Topological Defects of Disentanglement via Distributed Operators. CoRR abs\/2102.05623 (2021). arXiv:2102.05623 https:\/\/arxiv.org\/abs\/2102.05623"},
{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.14778\/3430915.3430921"},
{"key":"e_1_2_1_7_1","volume-title":"International Conference on Machine Learning. PMLR, 2761--2770","author":"Dupont Emilien","year":"2020","unstructured":"Emilien Dupont, Miguel Bautista Martin, Alex Colburn, Aditya Sankar, Josh Susskind, and Qi Shan. 2020. Equivariant neural rendering. In International Conference on Machine Learning. PMLR, 2761--2770."},
{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319888"},
{"key":"e_1_2_1_9_1","volume-title":"Robust self-supervised learning with lie groups. arXiv preprint arXiv:2210.13356","author":"Ibrahim Mark","year":"2022","unstructured":"Mark Ibrahim, Diane Bouchacourt, and Ari Morcos. 2022. Robust self-supervised learning with lie groups. arXiv preprint arXiv:2210.13356 (2022)."},
{"key":"e_1_2_1_10_1","volume-title":"Varun Manjunatha, and Mohit Iyyer.","author":"Iida Hiroshi","year":"2021","unstructured":"Hiroshi Iida, Dung Ngoc Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained Representations of Tabular Data. In NAACL."},
{"key":"e_1_2_1_11_1","volume-title":"Quantised Transforming Auto-Encoders: Achieving Equivariance to Arbitrary Transformations in Deep Networks. arXiv preprint arXiv:2111.12873","author":"Jiao Jianbo","year":"2021","unstructured":"Jianbo Jiao and Jo\u00e3o F Henriques. 2021. Quantised Transforming Auto-Encoders: Achieving Equivariance to Arbitrary Transformations in Deep Networks. arXiv preprint arXiv:2111.12873 (2021)."},
{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/3430915.3430917"},
{"key":"e_1_2_1_13_1","first-page":"28585","article-title":"Topographic vaes learn equivariant capsules","volume":"34","author":"Anderson Keller T","year":"2021","unstructured":"T Anderson Keller and Max Welling. 2021. Topographic vaes learn equivariant capsules. Advances in Neural Information Processing Systems 34 (2021), 28585--28597.","journal-title":"Advances in Neural Information Processing Systems"},
{"key":"e_1_2_1_14_1","volume-title":"BoostClean: Automated Error Detection and Repair for Machine Learning. CoRR abs\/1711.01299","author":"Krishnan Sanjay","year":"2017","unstructured":"Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, and Eugene Wu. 2017. BoostClean: Automated Error Detection and Repair for Machine Learning. CoRR abs\/1711.01299 (2017). arXiv:1711.01299 http:\/\/arxiv.org\/abs\/1711.01299"},
{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},
{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00210"},
{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/3450980.3450989"},
{"key":"e_1_2_1_18_1","volume-title":"Picket: guarding against corrupted data in tabular data during learning and inference. The VLDB Journal","author":"Liu Zifan","year":"2022","unstructured":"Zifan Liu, Zhechun Zhou, and Theodoros Rekatsinas. 2022. Picket: guarding against corrupted data in tabular data during learning and inference. The VLDB Journal (2022), 1--29."},
{"key":"e_1_2_1_19_1","unstructured":"Mohammad Mahdavi and Ziawasch Abedjan. 2021. Semi-Supervised Data Cleaning with Raha and Baran. In CIDR."},
{"key":"e_1_2_1_20_1","volume-title":"Finding the Best k for the Dimension of the Latent Space in Autoencoders","author":"Ngoc Kien Mai","unstructured":"Kien Mai Ngoc and Myunggwon Hwang. 2020. Finding the Best k for the Dimension of the Latent Space in Autoencoders. In Computational Collective Intelligence, Ngoc Thanh Nguyen, Bao Hung Hoang, Cong Phap Huynh, Dosam Hwang, Bogdan Trawi\u0144ski, and Gottfried Vossen (Eds.). Springer International Publishing, Cham, 453--464."},
{"key":"e_1_2_1_21_1","volume-title":"Can foundation models wrangle your data? arXiv preprint arXiv:2205.09911","author":"Narayan Avanika","year":"2022","unstructured":"Avanika Narayan, Ines Chami, Laurel Orr, Simran Arora, and Christopher R\u00e9. 2022. Can foundation models wrangle your data? arXiv preprint arXiv:2205.09911 (2022)."},
{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447541"},
{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3358129"},
{"key":"e_1_2_1_24_1","volume-title":"Automatic Data Repair: Are We Ready to Deploy? arXiv preprint arXiv:2310.00711","author":"Ni Wei","year":"2023","unstructured":"Wei Ni, Xiaoye Miao, Xiangyu Zhao, Yangyang Wu, and Jianwei Yin. 2023. Automatic Data Repair: Are We Ready to Deploy? arXiv preprint arXiv:2310.00711 (2023)."},
{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570694"},
{"key":"e_1_2_1_26_1","volume-title":"Outlier Detection in Heterogeneous Datasets using Automatic Tuple Expansion. (02","author":"Pit-Claudel Cl\u00e9ment","year":"2016","unstructured":"Cl\u00e9ment Pit-Claudel, Zelda Mariet, Rachael Harding, and Sam Madden. 2016. Outlier Detection in Heterogeneous Datasets using Automatic Tuple Expansion. (02 2016)."},
{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/3157794.3157797"},
{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},
{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476301"},
{"key":"e_1_2_1_30_1","volume-title":"DataVinci: Learning Syntactic and Semantic String Repairs. arXiv preprint arXiv:2308.10922","author":"Singh Mukul","year":"2023","unstructured":"Mukul Singh, Jos\u00e9 Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, and Gust Verbruggen. 2023. DataVinci: Learning Syntactic and Semantic String Repairs. arXiv preprint arXiv:2308.10922 (2023)."},
{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision. 5726--5735","author":"Worrall Daniel E","year":"2017","unstructured":"Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. 2017. Interpretable transformations with encoder-decoder networks. In Proceedings of the IEEE International Conference on Computer Vision. 5726--5735."},
{"key":"e_1_2_1_32_1","volume-title":"Proceedings of Machine Learning and Systems, I. Dhillon, D. Papailiopoulos, and V. Sze (Eds.)","volume":"2","author":"Wu Richard","year":"2020","unstructured":"Richard Wu, Aoqian Zhang, Ihab Ilyas, and Theodoros Rekatsinas. 2020. Attention-based Learning for Missing Data Imputation in HoloClean. In Proceedings of Machine Learning and Systems, I. Dhillon, D. Papailiopoulos, and V. Sze (Eds.), Vol. 2. 307--325. https:\/\/proceedings.mlsys.org\/paper\/2020\/file\/202cb962ac59075b964b07152d234b70-Paper.pdf"},
{"key":"e_1_2_1_33_1","volume-title":"Wen tau Yih, and Sebastian Riedel","author":"Yin Pengcheng","year":"2020","unstructured":"Pengcheng Yin, Graham Neubig, Wen tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. ArXiv abs\/2005.08314 (2020)."},
{"key":"e_1_2_1_34_1","unstructured":"Haochen Zhang Yuyang Dong Chuan Xiao and Masafumi Oyamada. 2023. Large Language Models as Data Preprocessors. arXiv:2308.16361 [cs.AI]"}],
"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3704965.3704983","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,18]],"date-time":"2025-02-18T17:32:01Z","timestamp":1739899921000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3704965.3704983"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9]]},"references-count":34,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2024,9]]}},"alternative-id":["10.14778\/3704965.3704983"],"URL":"https:\/\/doi.org\/10.14778\/3704965.3704983","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,9]]},"assertion":[{"value":"2025-02-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}