{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T22:38:17Z","timestamp":1778279897872,"version":"3.51.4"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,10]]},"abstract":"<jats:p>Datasets may include errors, and specifically violations of integrity constraints, for various reasons. Standard techniques for \"minimalcost\" database repairing resolve these violations by aiming for a minimum change in the data, and in the process, may sway representations of different sub-populations. For instance, the repair may end up deleting more females than males, or more tuples from a certain age group or race, due to varying levels of inconsistency in different sub-populations. Such repaired data can mislead consumers when used for analytics, and can lead to biased decisions for downstream machine learning tasks. We study the \"cost of representation\" in subset repairs for functional dependencies. In simple terms, we target the question of how many additional tuples have to be deleted if we want to satisfy not only the integrity constraints but also representation constraints for given sub-populations. We study the complexity of this problem and compare it with the complexity of optimal subset repairs without representations. While the problem is NP-hard in general, we give polynomial-time algorithms for special cases, and efficient heuristics for general cases. We perform a suite of experiments that show the effectiveness of our algorithms in computing or approximating the cost of representation.<\/jats:p>","DOI":"10.14778\/3705829.3705860","type":"journal-article","created":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T23:21:06Z","timestamp":1740784866000},"page":"475-487","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["The Cost of Representation by Subset Repairs"],"prefix":"10.14778","volume":"18","author":[{"given":"Yuxi","family":"Liu","sequence":"first","affiliation":[{"name":"Duke University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fangzhu","family":"Shen","sequence":"additional","affiliation":[{"name":"Duke University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kushagra","family":"Ghosh","sequence":"additional","affiliation":[{"name":"Duke University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Amir","family":"Gilad","sequence":"additional","affiliation":[{"name":"Hebrew University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Benny","family":"Kimelfeld","sequence":"additional","affiliation":[{"name":"Technion"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sudeepa","family":"Roy","sequence":"additional","affiliation":[{"name":"Duke University"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,2,28]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Kolaitis","author":"Afrati Foto N.","year":"2009","unstructured":"Foto N. Afrati and Phokion G. Kolaitis. 2009. Repair Checking in Inconsistent Databases: Algorithms and Complexity. In ICDT. 31--41."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/0196-6774(81)90020-1"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1137\/0213006"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00224-012-9402-7"},{"key":"e_1_2_1_5_1","volume-title":"Cleaning data with Swipe. arXiv preprint arXiv:2403.19378","author":"Boeckling Toon","year":"2024","unstructured":"Toon Boeckling and Antoon Bronselaer. 2024. Cleaning data with Swipe. arXiv preprint arXiv:2403.19378 (2024)."},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Philip Bohannon Wenfei Fan Floris Geerts Xibei Jia and Anastasios Kementsietsidis. 2007. Conditional functional dependencies for data cleaning. In ICDE.","DOI":"10.1109\/ICDE.2007.367920"},{"key":"e_1_2_1_7_1","unstructured":"US Census Bureau. [n.d.]. PUMS Documentation. https:\/\/www.census.gov\/programs-surveys\/acs\/microdata\/documentation.html Section: Government."},{"key":"e_1_2_1_8_1","volume-title":"Karthikeyan Natesan Ramamurthy, and Kush R. Varshney","author":"Calmon Fl\u00e5vio P.","year":"2017","unstructured":"Fl\u00e5vio P. Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney. 2017. Optimized Pre-Processing for Discrimination Prevention. In NIPS. 3992--4001."},{"key":"e_1_2_1_9_1","unstructured":"Priyam Choksi. 2023. Credit Card Transactions Dataset. https:\/\/www.kaggle.com\/datasets\/priyamchoksi\/credit-card-transactions-dataset Accessed: 2024-08-04."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ic.2004.04.007"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1089\/big.2016.0047"},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","unstructured":"Xu Chu Ihab F. Ilyas and Paolo Papotti. 2013. Holistic data cleaning: Putting violations into context. In ICDE. 458--469.","DOI":"10.1109\/ICDE.2013.6544847"},{"key":"e_1_2_1_13_1","first-page":"216","article-title":"Uniform guidelines on employee selection procedures","volume":"1","author":"Equal Employment Opportunity Commission et al.","year":"1990","unstructured":"Equal Employment Opportunity Commission et al. 1990. Uniform guidelines on employee selection procedures. Fed Register 1 (1990), 216--243.","journal-title":"Fed Register"},{"key":"e_1_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Michele Dallachiesa Amr Ebaid Ahmed Eldawy Ahmed K. Elmagarmid Ihab F. Ilyas Mourad Ouzzani and Nan Tang. 2013. NADEEF: a commodity data cleaning system. In SIGMOD. 541--552.","DOI":"10.1145\/2463676.2465327"},{"key":"e_1_2_1_15_1","volume-title":"Retiring Adult: New Datasets for Fair Machine Learning. arXiv:2108.04884 [cs.LG]","author":"Ding Frances","year":"2022","unstructured":"Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. 2022. Retiring Adult: New Datasets for Fair Machine Learning. arXiv:2108.04884 [cs.LG]"},{"key":"e_1_2_1_16_1","volume-title":"Zemel","author":"Dwork Cynthia","year":"2012","unstructured":"Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. 2012. Fairness through awareness. In ITCS. ACM, 214--226."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-39940-9_1274"},{"key":"e_1_2_1_18_1","volume-title":"Kolaitis","author":"Fagin Ronald","year":"2015","unstructured":"Ronald Fagin, Benny Kimelfeld, and Phokion G. Kolaitis. 2015. Dichotomies in the Complexity of Preferred Repairs. In PODS. 3--15."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.5555\/1085304.1085309"},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","unstructured":"Michael Feldman Sorelle A. Friedler John Moeller Carlos Scheidegger and Suresh Venkatasubramanian. 2015. Certifying and Removing Disparate Impact. In SIGKDD. 259--268.","DOI":"10.1145\/2783258.2783311"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536360.2536363"},{"key":"e_1_2_1_22_1","doi-asserted-by":"crossref","unstructured":"Amir Gilad Daniel Deutch and Sudeepa Roy. 2020. On Multiple Semantics for Declarative Database Repairs. In SIGMOD. 817--831.","DOI":"10.1145\/3318464.3389721"},{"key":"e_1_2_1_23_1","unstructured":"Stefan Grafberger Julia Stoyanovich and Sebastian Schelter. 2021. Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines. In CIDR."},{"key":"e_1_2_1_24_1","volume-title":"Julia Stoyanovich, and Sebastian Schelter.","author":"Guha Shubha","year":"2023","unstructured":"Shubha Guha, Falaah Arif Khan, Julia Stoyanovich, and Sebastian Schelter. 2023. Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making. In ICDE. IEEE, 3747--3754."},{"key":"e_1_2_1_25_1","unstructured":"LLC Gurobi Optimization. 2024. Gurobi Optimizer. https:\/\/www.gurobi.com Accessed: 2024-03-01."},{"key":"e_1_2_1_26_1","unstructured":"Moritz Hardt Eric Price and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In NeurIPS Daniel D. Lee Masashi Sugiyama Ulrike von Luxburg Isabelle Guyon and Roman Garnett (Eds.). 3315--3323."},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Sean Kandel Andreas Paepcke Joseph M. Hellerstein and Jeffrey Heer. 2011. Wrangler: interactive visual specification of data transformation scripts. In CHI Desney S. Tan Saleema Amershi Bo Begole Wendy A. Kellogg and Manas Tungare (Eds.). ACM 3363--3372.","DOI":"10.1145\/1978942.1979444"},{"key":"e_1_2_1_28_1","first-page":"1","article-title":"Inherent Trade-Offs in the Fair Determination of Risk Scores","volume":"67","author":"Kleinberg Jon M.","year":"2017","unstructured":"Jon M. Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2017. Inherent Trade-Offs in the Fair Determination of Risk Scores. In ITCS (LIPIcs), Vol. 67. 43:1--43:23.","journal-title":"ITCS (LIPIcs)"},{"key":"e_1_2_1_29_1","volume-title":"Lakshmanan","author":"Kolahi Solmaz","year":"2009","unstructured":"Solmaz Kolahi and Laks V. S. Lakshmanan. 2009. On Approximating Optimum Repairs for Functional Dependency Violations. In ICDT. ACM, 53--62."},{"key":"e_1_2_1_30_1","doi-asserted-by":"crossref","unstructured":"Nick Koudas Avishek Saha Divesh Srivastava and Suresh Venkatasubramanian. 2009. Metric functional dependencies. In ICDE.","DOI":"10.1109\/ICDE.2009.219"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_2_1_32_1","unstructured":"Jeff Larson Surya Mattu Lauren Kirchner and Julia Angwin. 2016. How We Analyzed the COMPAS Recidivism Algorithm. https:\/\/www.propublica.org\/article\/how-we-analyzed-the-compas-recidivism-algorithm. Accessed: 2023-04-07."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/3626292.3626295"},{"key":"e_1_2_1_34_1","volume-title":"Kenneth Lyons, Weiyi Meng, and Divesh Srivastava.","author":"Li Xian","year":"2015","unstructured":"Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. 2015. Truth finding on the deep web: Is the problem solved? arXiv preprint arXiv:1503.00303 (2015)."},{"key":"e_1_2_1_35_1","unstructured":"Yuxi Liu Fangzhu Shen Kushagra Ghosh Amir Gilad Benny Kimelfeld and Sudeepa Roy. 2024. The Cost of Representation by Subset Repairs. arXiv:2410.16501 [cs.DB] https:\/\/arxiv.org\/abs\/2410.16501"},{"key":"e_1_2_1_36_1","doi-asserted-by":"crossref","unstructured":"Ester Livshits and Benny Kimelfeld. 2017. Counting and Enumerating (Preferred) Database Repairs. In PODS. 289--301.","DOI":"10.1145\/3034786.3056107"},{"key":"e_1_2_1_37_1","first-page":"1","article-title":"The Shapley Value of Inconsistency Measures for Functional Dependencies","volume":"186","author":"Livshits Ester","year":"2021","unstructured":"Ester Livshits and Benny Kimelfeld. 2021. The Shapley Value of Inconsistency Measures for Functional Dependencies. In ICDT, Vol. 186. 15:1--15:19.","journal-title":"ICDT"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3360904"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcss.2020.10.001"},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","unstructured":"Ester Livshits Rina Kochirgan Segev Tsur Ihab F. Ilyas Benny Kimelfeld and Sudeepa Roy. 2021. Properties of Inconsistency Measures for Databases. In SIGMOD. 1182--1194.","DOI":"10.1145\/3448016.3457310"},{"key":"e_1_2_1_41_1","volume-title":"Bertossi","author":"Lopatenko Andrei","year":"2007","unstructured":"Andrei Lopatenko and Leopoldo E. Bertossi. 2007. Complexity of Consistent Query Answering in Databases Under Cardinality-Based and Incremental Repair Semantics. In ICDT. 179--193."},{"key":"e_1_2_1_42_1","volume-title":"Counting Database Repairs that Satisfy Conjunctive Queries with Self-Joins","author":"Maslowski Dany","unstructured":"Dany Maslowski and Jef Wijsen. 2014. Counting Database Repairs that Satisfy Conjunctive Queries with Self-Joins. In ICDT, Nicole Schweikardt, Vassilis Christophides, and Vincent Leroy (Eds.). OpenProceedings.org, 155--164."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407809"},{"key":"e_1_2_1_44_1","doi-asserted-by":"crossref","unstructured":"Michael J. Muller Ingrid Lange Dakuo Wang David Piorkowski Jason Tsay Q. Vera Liao Casey Dugan and Thomas Erickson. 2019. How Data Science Workers Work with Data: Discovery Capture Curation Design Creation. In CHI. ACM 126.","DOI":"10.1145\/3290605.3300356"},{"key":"e_1_2_1_45_1","volume-title":"Fairness in rankings and recommendations: an overview. The VLDB Journal","author":"Pitoura Evaggelia","year":"2022","unstructured":"Evaggelia Pitoura, Kostas Stefanidis, and Georgia Koutrika. 2022. Fairness in rankings and recommendations: an overview. The VLDB Journal (2022), 1--28."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_47_1","first-page":"1","article-title":"A Formal Framework for Probabilistic Unclean Databases. In ICDT (LIPIcs), Vol. 127","volume":"6","author":"Sa Christopher De","year":"2019","unstructured":"Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher R\u00e9, and Theodoros Rekatsinas. 2019. A Formal Framework for Probabilistic Unclean Databases. In ICDT (LIPIcs), Vol. 127. Schloss Dagstuhl - Leibniz-Zentrum f\u00fcr Informatik, 6:1--6:18.","journal-title":"Schloss Dagstuhl - Leibniz-Zentrum f\u00fcr Informatik"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3461535.3463474"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319901"},{"key":"e_1_2_1_50_1","unstructured":"Sebastian Schelter Yuxuan He Jatin Khilnani and Julia Stoyanovich. 2020. Fair-Prep: Promoting Data to a First-Class Citizen in Studies on Fairness-Enhancing Interventions. In EDBT. OpenProceedings.org 395--398."},{"key":"e_1_2_1_51_1","volume-title":"Fairness-Aware Range Queries for Selecting Unbiased Data","author":"Shetiya Suraj","unstructured":"Suraj Shetiya, Ian P. Swift, Abolfazl Asudeh, and Gautam Das. 2022. Fairness-Aware Range Queries for Selecting Unbiased Data. In ICDE. IEEE, 1423--1436."},{"key":"e_1_2_1_52_1","doi-asserted-by":"crossref","unstructured":"Camelia Simoiu Sam Corbett-Davies and Sharad Goel. 2017. The problem of infra-marginality in outcome tests for discrimination. (2017).","DOI":"10.1214\/17-AOAS1058"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/800070.802186"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3705829.3705860","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T23:23:08Z","timestamp":1740784988000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3705829.3705860"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10]]},"references-count":53,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,10]]}},"alternative-id":["10.14778\/3705829.3705860"],"URL":"https:\/\/doi.org\/10.14778\/3705829.3705860","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,10]]},"assertion":[{"value":"2025-02-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}