{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,19]],"date-time":"2026-05-19T07:14:56Z","timestamp":1779174896985,"version":"3.51.4"},"reference-count":79,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,7]]},"abstract":"<jats:p>Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: it meets desired distribution requirements. Whether data is collected through some experiment or obtained from some data provider, the data from any single source may not meet the desired distribution requirements. Therefore, a union of data from multiple sources is often required. In this paper, we study how to acquire such data in the most cost effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation based strategy with a reward function that captures the cost and approximations of group distributions in each data source. 
Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.<\/jats:p>","DOI":"10.14778\/3476249.3476299","type":"journal-article","created":{"date-parts":[[2021,10,27]],"date-time":"2021-10-27T16:46:23Z","timestamp":1635353183000},"page":"2519-2532","source":"Crossref","is-referenced-by-count":27,"title":["Tailoring data source distributions for fairness-aware data integration"],"prefix":"10.14778","volume":"14","author":[{"given":"Fatemeh","family":"Nargesian","sequence":"first","affiliation":[{"name":"University of Rochester"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Abolfazl","family":"Asudeh","sequence":"additional","affiliation":[{"name":"University of Illinois at Chicago"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"H. V.","family":"Jagadish","sequence":"additional","affiliation":[{"name":"University of Michigan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,10,27]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n.d.]. Dawex: Sell buy and share data. https:\/\/www.dawex.com\/en."},{"key":"e_1_2_1_2_1","unstructured":"[n.d.]. WorldQuant. https:\/\/www.worldquant.com."},{"key":"e_1_2_1_3_1","unstructured":"[n.d.]. Xignite. https:\/\/aws.amazon.com\/solutionspace\/financialservices\/solutions\/xignite-market-data-cloudplatform."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1561\/2200000024"},{"key":"e_1_2_1_5_1","unstructured":"2020. Data Broker Registry. https:\/\/oag.ca.gov\/data-brokers."},{"key":"e_1_2_1_6_1","unstructured":"July 2020. 
Google Flights API: Incorporate Travel Data into Your App. The Rapid API Blog."},{"key":"e_1_2_1_7_1","unstructured":"June 2021. Airborne Flights database. U.S. Department of Transportation https:\/\/www.transtats.bts.gov."},{"key":"e_1_2_1_8_1","unstructured":"June 2021. The Socrata Open Data API. ftp:\/\/ftp.funet.fi\/pub\/mirrors\/ftp.imdb.com\/pub\/frozendata\/."},{"key":"e_1_2_1_9_1","unstructured":"June 2021. The Texas Tribune Data set. https:\/\/salaries.texastribune.org."},{"key":"e_1_2_1_10_1","volume-title":"EDBT\/ICDT Workshops.","author":"Accinelli Chiara","year":"2020"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/3291264.3291269"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3300079"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415566"},{"key":"e_1_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Abolfazl Asudeh Zhongjun Jin and H. V. Jagadish. 2019. Assessing and Remedying Coverage for a Given Dataset. In ICDE. 554--565.","DOI":"10.1109\/ICDE.2019.00056"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457315"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/2904483.2904491"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/2983200.2983205"},{"key":"e_1_2_1_18_1","unstructured":"Solon Barocas Moritz Hardt and Arvind Narayanan. 2019. 
Fairness and machine learning: Limitations and opportunities. fairmlbook.org."},{"key":"e_1_2_1_19_1","first-page":"671","article-title":"Big data's disparate impact","volume":"104","author":"Barocas Solon","year":"2016","journal-title":"Calif. L. Rev."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007730.1007735"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313685"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/3294996.3295155"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/1622407.1622416"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/3327144.3327272"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.14778\/3397230.3397235"},{"key":"e_1_2_1_27_1","volume-title":"Amazon scraps secret AI recruiting tool that showed bias against women","author":"Dastin Jeffrey"},{"key":"e_1_2_1_28_1","volume-title":"Diversity in big data: A review. Big data 5, 2","author":"Drosou Marina","year":"2017"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2783311"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3287560.3287589"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3314117"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3320244"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/3157382.3157469"},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Wassily Hoeffding. 1994. Probability Inequalities for sums of Bounded Random Variables. 
409--426.","DOI":"10.1007\/978-1-4612-0865-5_26"},{"key":"e_1_2_1_35_1","first-page":"333","article-title":"Methods of weighting for unit non-response","volume":"40","author":"Holt David","year":"1991","journal-title":"Journal of the Royal Statistical Society: Series D (The Statistician)"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3384689"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3384689"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-011-0463-8"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2010.50"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33486-3_3"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.5555\/2775993.2776000"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2770870"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407855"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3326467.3326480"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915235"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407821"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/564691.564721"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/1454159.1454163"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.5555\/211390"},{"key":"e_1_2_1_50_1","unstructured":"M. Mulshine. 2015. A major flaw in Google's algorithm allegedly tagged two black people's faces with the word 'gorillas'. Business Insider."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_2_1_52_1","volume-title":"Contributions to the theory of testing statistical hypotheses. 
Statistical Research Memoirs","author":"Neyman Jerzy","year":"1936"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.3389\/fdata.2019.00013"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380606"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.aap.2019.05.014"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.14778\/2336664.2336665"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213846"},{"key":"e_1_2_1_58_1","volume-title":"Data Distillation: Towards Omni-Supervised Learning. In CVPR. 4119--4128.","author":"Radosavovic Ilija","year":"2018"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2899403"},{"key":"e_1_2_1_60_1","unstructured":"Adam Rose. 2010. Are Face-Detection Cameras Racist? Time Business."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3186549.3186559"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3422648.3422657"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319901"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3383129"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2593664"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.14778\/2350229.2350232"},{"key":"e_1_2_1_67_1","unstructured":"N. Singer. 2013. A data broker offers a peek behind the curtain. The New York Times."},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1561\/2200000068"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415570"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3357853"},{"key":"e_1_2_1_71_1","unstructured":"Tess Townsend. 2017. 
Most engineers are white and so are the faces they use to train software. Recode."},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/3294052.3322192"},{"key":"e_1_2_1_73_1","volume-title":"Conference on Learning Theory. PMLR","author":"Woodworth Blake","year":"2017"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3193568"},{"key":"e_1_2_1_75_1","volume-title":"Manuel Gomez Rodriguez, and Krishna P Gummadi","author":"Zafar M. B.","year":"2015"},{"key":"e_1_2_1_76_1","volume-title":"Manuel Gomez Rodriguez, and Krishna P Gummadi","author":"Zafar Muhammad Bilal","year":"2017"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.5555\/3042817.3042973"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452787"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183739"},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994534"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3476249.3476299","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:08:52Z","timestamp":1672222132000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3476249.3476299"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7]]},"references-count":79,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2021,7]]}},"alternative-id":["10.14778\/3476249.3476299"],"URL":"https:\/\/doi.org\/10.14778\/3476249.3476299","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,7]]}}}