{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T18:37:45Z","timestamp":1772908665570,"version":"3.50.1"},"reference-count":57,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,5,29]],"date-time":"2024-05-29T00:00:00Z","timestamp":1716940800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100006374","name":"NSF","doi-asserted-by":"publisher","award":["OAC- 2112606"],"award-info":[{"award-number":["OAC- 2112606"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100006374","name":"NIH","doi-asserted-by":"publisher","award":["54HG012510"],"award-info":[{"award-number":["54HG012510"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,5,29]]},"abstract":"<jats:p>Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce OTClean, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.<\/jats:p>","DOI":"10.1145\/3654963","type":"journal-article","created":{"date-parts":[[2024,5,30]],"date-time":"2024-05-30T09:44:53Z","timestamp":1717062293000},"page":"1-26","source":"Crossref","is-referenced-by-count":10,"title":["OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-9537-2898","authenticated-orcid":false,"given":"Alireza","family":"Pirhadi","sequence":"first","affiliation":[{"name":"University of Western Ontario, London, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-0278-4665","authenticated-orcid":false,"given":"Mohammad Hossein","family":"Moslemi","sequence":"additional","affiliation":[{"name":"University of Western Ontario, London, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1423-9624","authenticated-orcid":false,"given":"Alexander","family":"Cloninger","sequence":"additional","affiliation":[{"name":"University of California San Diego, San Diego, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3386-7079","authenticated-orcid":false,"given":"Mostafa","family":"Milani","sequence":"additional","affiliation":[{"name":"University of Western Ontario, London, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8763-8354","authenticated-orcid":false,"given":"Babak","family":"Salimi","sequence":"additional","affiliation":[{"name":"University of California San Diego, San Diego, California, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,5,30]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"2023. Adult Data Set. https:\/\/archive.ics.uci.edu\/ml\/datasets\/adult"},{"key":"e_1_2_2_2_1","unstructured":"2023. The Boston Housing Dataset. https:\/\/www.cs.toronto.edu\/~delve\/data\/boston\/bostonDetail.html"},{"key":"e_1_2_2_3_1","unstructured":"2023. Car Evaluation Data Set. https:\/\/archive.ics.uci.edu\/ml\/datasets\/CarEvaluation UCI Machine Learning Repository."},{"key":"e_1_2_2_4_1","unstructured":"2023. COMPAS Analysis. https:\/\/github.com\/propublica\/compas-analysis\/"},{"key":"e_1_2_2_5_1","volume-title":"Karthikeyan Natesan Ramamurthy, and Murat Kocaoglu","author":"Ahuja Kartik","year":"2021","unstructured":"Kartik Ahuja, Prasanna Sattigeri, Karthikeyan Shanmugam, Dennis Wei, Karthikeyan Natesan Ramamurthy, and Murat Kocaoglu. 2021. Conditionally Independent Data Generation. In UAI. 2050--2060."},{"key":"e_1_2_2_6_1","volume-title":"Invariant Risk Minimization. arXiv preprint arXiv:1907.02893","author":"Arjovsky Martin","year":"2019","unstructured":"Martin Arjovsky, L\u00e9on Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2019. Invariant Risk Minimization. arXiv preprint arXiv:1907.02893 (2019)."},{"key":"e_1_2_2_7_1","unstructured":"Martin Arjovsky Soumith Chintala and L\u00e9on Bottou. 2017. Wasserstein Generative Adversarial Networks. In ICML. 214--223."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1147376.1147391"},{"key":"e_1_2_2_9_1","doi-asserted-by":"crossref","unstructured":"Emily Black Samuel Yeom and Matt Fredrikson. 2020. FlipTest: Fairness Testing via Optimal Transport. In FAccT. 111--121.","DOI":"10.1145\/3351095.3372845"},{"key":"e_1_2_2_10_1","doi-asserted-by":"crossref","unstructured":"Philip Bohannon Wenfei Fan Floris Geerts Xibei Jia and Anastasios Kementsietsidis. 2006. Conditional Functional Dependencies for Data Cleaning. In ICDE. 746--755.","DOI":"10.1109\/ICDE.2007.367920"},{"key":"e_1_2_2_11_1","volume-title":"Convex Optimization","author":"Boyd Stephen","unstructured":"Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press."},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1561\/2200000016"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10676-013-9321-6"},{"key":"e_1_2_2_14_1","volume-title":"Karthikeyan Natesan Ramamurthy, and Kush R Varshney","author":"Calmon Flavio","year":"2017","unstructured":"Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. 2017. Optimized Pre-Processing for Discrimination Prevention. NeurIPS 30 (2017)."},{"key":"e_1_2_2_15_1","volume-title":"Fairness in Machine Learning: A Survey. Computing Surveys","author":"Caton Simon","year":"2023","unstructured":"Simon Caton and Christian Haas. 2023. Fairness in Machine Learning: A Survey. Computing Surveys (2023)."},{"key":"e_1_2_2_16_1","first-page":"7321","article-title":"Fair regression with wasserstein barycenters","volume":"33","author":"Chzhen Evgenii","year":"2020","unstructured":"Evgenii Chzhen, Christophe Denis, Mohamed Hebiri, Luca Oneto, and Massimiliano Pontil. 2020. Fair regression with wasserstein barycenters. Advances in Neural Information Processing Systems 33 (2020), 7321--7331.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_17_1","volume-title":"Elements of Information Theory","author":"Cover Thomas M","unstructured":"Thomas M Cover. 1999. Elements of Information Theory. John Wiley & Sons."},{"key":"e_1_2_2_18_1","volume-title":"Sinkhorn Distances: Lightspeed Computation of Optimal Transport. NeurIPS 26","author":"Cuturi Marco","year":"2013","unstructured":"Marco Cuturi. 2013. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. NeurIPS 26 (2013)."},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/2350229.2350279"},{"key":"e_1_2_2_20_1","volume-title":"Foundations of Data Quality Management","author":"Fan Wenfei","unstructured":"Wenfei Fan and Floris Geerts. 2022. Foundations of Data Quality Management. Springer Nature."},{"key":"e_1_2_2_21_1","doi-asserted-by":"crossref","unstructured":"Michael Feldman Sorelle A. Friedler John Moeller Carlos Scheidegger and Suresh Venkatasubramanian. 2015. Certifying and Removing Disparate Impact. In KDD. 259--268.","DOI":"10.1145\/2783258.2783311"},{"key":"e_1_2_2_22_1","volume-title":"Approximate Nonnegative Matrix Factorization via Alternating Minimization. arXiv preprint math\/0402229","author":"Finesso Lorenzo","year":"2004","unstructured":"Lorenzo Finesso and Peter Spreij. 2004. Approximate Nonnegative Matrix Factorization via Alternating Minimization. arXiv preprint math\/0402229 (2004)."},{"key":"e_1_2_2_23_1","volume-title":"Learning with a Wasserstein Loss. NeurIPS 28","author":"Frogner Charlie","year":"2015","unstructured":"Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. 2015. Learning with a Wasserstein Loss. NeurIPS 28 (2015)."},{"key":"e_1_2_2_24_1","doi-asserted-by":"crossref","unstructured":"Sainyam Galhotra Romila Pradhan and Babak Salimi. 2021. Explaining Black-Box Algorithms using Probabilistic Contrastive Counterfactuals. In SIGMOD. 577--590.","DOI":"10.1145\/3448016.3458455"},{"key":"e_1_2_2_25_1","volume-title":"Gamboa Fabrice, and Jean-Michel Loubes.","author":"Gordaliza Paula","year":"2019","unstructured":"Paula Gordaliza, Eustasio Del Barrio, Gamboa Fabrice, and Jean-Michel Loubes. 2019. Obtaining Fairness using Optimal Transport Theory. In ICML. 2357--2365."},{"key":"e_1_2_2_26_1","volume-title":"Mar","author":"Guyon Isabelle","year":"2003","unstructured":"Isabelle Guyon and Andr\u00e9 Elisseeff. 2003. An Introduction to Variable and Feature Selection. JMLR 3, Mar (2003), 1157--1182."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2945386"},{"key":"e_1_2_2_28_1","first-page":"1","article-title":"Algorithms for Nonnegative Matrix Factorization with the Kullback--Leibler Divergence","volume":"87","author":"Khanh Hien Le Thi","year":"2021","unstructured":"Le Thi Khanh Hien and Nicolas Gillis. 2021. Algorithms for Nonnegative Matrix Factorization with the Kullback--Leibler Divergence. SISC 87, 3 (2021), 1--32.","journal-title":"SISC"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patter.2021.100241"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517841"},{"key":"e_1_2_2_31_1","volume-title":"Hyperimpute: Generalized Iterative Imputation with Automatic Model Selection. In ICML. 9916--9937.","author":"Jarrett Daniel","year":"2022","unstructured":"Daniel Jarrett, Bogdan C Cebere, Tennison Liu, Alicia Curth, and Mihaela van der Schaar. 2022. Hyperimpute: Generalized Iterative Imputation with Automatic Model Selection. In ICML. 9916--9937."},{"key":"e_1_2_2_32_1","unstructured":"Ray Jiang Aldo Pacchiano Tom Stepleton Heinrich Jiang and Silvia Chiappa. 2020. Wasserstein fair classification. In Uncertainty in artificial intelligence. PMLR 862--872."},{"key":"e_1_2_2_33_1","doi-asserted-by":"crossref","unstructured":"Solmaz Kolahi and Laks VS Lakshmanan. 2009. On Approximating Optimum Repairs for Functional Dependency Violations. In ICDT. 53--62.","DOI":"10.1145\/1514894.1514901"},{"key":"e_1_2_2_34_1","volume-title":"Probabilistic Graphical Models: Principles and Techniques","author":"Koller Daphne","unstructured":"Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT press."},{"key":"e_1_2_2_35_1","first-page":"292","article-title":"Toward Optimal Feature Selection","volume":"96","author":"Koller Daphne","year":"1996","unstructured":"Daphne Koller, Mehran Sahami, et al. 1996. Toward Optimal Feature Selection. In ICML, Vol. 96. 292.","journal-title":"ICML"},{"key":"e_1_2_2_36_1","volume-title":"Algorithms for Non-Negative Matrix Factorization. NeurIPS 13","author":"Lee Daniel","year":"2000","unstructured":"Daniel Lee and H Sebastian Seung. 2000. Algorithms for Non-Negative Matrix Factorization. NeurIPS 13 (2000)."},{"key":"e_1_2_2_37_1","volume-title":"The Shapley Value of Inconsistency Measures for Functional Dependencies. LMCS 18","author":"Livshits Ester","year":"2022","unstructured":"Ester Livshits and Benny Kimelfeld. 2022. The Shapley Value of Inconsistency Measures for Functional Dependencies. LMCS 18 (2022)."},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3360904"},{"key":"e_1_2_2_39_1","volume-title":"Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij.","author":"Magliacane Sara","year":"2018","unstructured":"Sara Magliacane, Thijs Van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij. 2018. Domain Adaptation by Using Causal Inference to Predict Invariant Conditional Distributions. NeurIPS 31 (2018)."},{"key":"e_1_2_2_40_1","first-page":"1948","article-title":"Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning","volume":"13","author":"Mahdavi Mohammad","year":"2020","unstructured":"Mohammad Mahdavi and Ziawasch Abedjan. 2020. Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning. VLDB 13, 12 (2020), 1948--1961.","journal-title":"VLDB"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1214\/09-SS057"},{"key":"e_1_2_2_42_1","doi-asserted-by":"crossref","unstructured":"Ofir Pele and Michael Werman. 2009. Fast and Robust Earth Mover's Distances. In ICCV. 460--467.","DOI":"10.1109\/ICCV.2009.5459199"},{"key":"e_1_2_2_43_1","volume-title":"Efficient Conditionally Invariant Representation Learning. ICLR","author":"Pogodin Roman","year":"2023","unstructured":"Roman Pogodin, Namrata Deka, Yazhe Li, Danica J Sutherland, Victor Veitch, and Arthur Gretton. 2023. Efficient Conditionally Invariant Representation Learning. ICLR (2023)."},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.5555\/3291125.3291161"},{"key":"e_1_2_2_45_1","volume-title":"Data Management for Causal Algorithmic Fairness. Data Engineering","author":"Salimi Babak","year":"2019","unstructured":"Babak Salimi, Bill Howe, and Dan Suciu. 2019. Data Management for Causal Algorithmic Fairness. Data Engineering (2019), 24."},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3422648.3422657"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319901"},{"key":"e_1_2_2_48_1","unstructured":"Nian Si Karthyek Murthy Jose Blanchet and Viet Anh Nguyen. 2021. Testing Group Fairness via Optimal Transport Projections. In ICML. 9649--9659."},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i04.5771"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177703591"},{"key":"e_1_2_2_51_1","doi-asserted-by":"crossref","unstructured":"Antonio Torralba and Alexei A Efros. 2011. Unbiased Look at Dataset Bias. In CVPR. 1521--1528.","DOI":"10.1109\/CVPR.2011.5995347"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1017501703105"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1287\/mnsc.12.11.798"},{"key":"e_1_2_2_54_1","unstructured":"Kilian Q Weinberger and Gerald Tesauro. 2007. Metric Learning for Kernel Regression. In AISTATS. 612--619."},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/3468.895901"},{"key":"e_1_2_2_56_1","volume-title":"SCODED: Statistical Constraint Oriented Data Error Detection. In SIGMOD. 845--860.","author":"Yan Jing Nathan","year":"2020","unstructured":"Jing Nathan Yan, Oliver Schulte, MoHan Zhang, Jiannan Wang, and Reynold Cheng. 2020. SCODED: Statistical Constraint Oriented Data Error Detection. In SIGMOD. 845--860."},{"key":"e_1_2_2_57_1","volume-title":"GAIN: Missing Data Imputation using Generative Adversarial Nets. In ICML. 5689--5698.","author":"Yoon Jinsung","year":"2018","unstructured":"Jinsung Yoon, James Jordon, and Mihaela Schaar. 2018. GAIN: Missing Data Imputation using Generative Adversarial Nets. In ICML. 5689--5698."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3654963","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3654963","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T14:42:46Z","timestamp":1755787366000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3654963"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,29]]},"references-count":57,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,5,29]]}},"alternative-id":["10.1145\/3654963"],"URL":"https:\/\/doi.org\/10.1145\/3654963","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,29]]}}}