{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T16:41:44Z","timestamp":1758127304776,"version":"3.44.0"},"reference-count":84,"publisher":"Association for Computing Machinery (ACM)","issue":"8","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,4]]},"abstract":"<jats:p>Data practitioners often sample their datasets to produce representative subsets for their downstream tasks. When entities in a dataset can be partitioned into multiple groups, stratified sampling is commonly used to produce subsets that match a target group distribution, e.g., to select a balanced subset for training a machine learning model. However, real-world data frequently contains duplicates \u2014 multiple representations of the same real-world entity \u2014 that can bias sampling, necessitating deduplication.<\/jats:p>\n          <jats:p>\n            We define\n            <jats:italic toggle=\"yes\">deduplicated sampling<\/jats:italic>\n            as the task of producing a clean sample of a dirty dataset according to a target group distribution. The na\u00efve approach to deduplicated sampling would first deduplicate the entire dataset upfront, then perform sampling\n            <jats:italic toggle=\"yes\">ex post.<\/jats:italic>\n            However, that approach might be prohibitively expensive for large datasets and time\/resource constraints.\n            <jats:italic toggle=\"yes\">Deduplicated sampling on-demand<\/jats:italic>\n            with RadlER is a novel approach to produce a clean sample by focusing the cleaning effort only on entities required to appear in that sample. 
Our experimental evaluation, performed on multiple datasets from different domains, demonstrates that RadlER consistently outperforms baseline approaches, providing data scientists with an efficient solution to quickly produce a clean sample of a dirty dataset according to a target group distribution.\n          <\/jats:p>","DOI":"10.14778\/3742728.3742742","type":"journal-article","created":{"date-parts":[[2025,9,3]],"date-time":"2025-09-03T13:32:53Z","timestamp":1756906373000},"page":"2482-2495","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Deduplicated Sampling On-Demand"],"prefix":"10.14778","volume":"18","author":[{"given":"Luca","family":"Zecchini","sequence":"first","affiliation":[{"name":"BIFOLD &amp; TU Berlin, Berlin, Germany"}]},{"given":"Vasilis","family":"Efthymiou","sequence":"additional","affiliation":[{"name":"Harokopio University, Athens, Greece"}]},{"given":"Felix","family":"Naumann","sequence":"additional","affiliation":[{"name":"Hasso Plattner Institute, Potsdam, Germany"}]},{"given":"Giovanni","family":"Simonini","sequence":"additional","affiliation":[{"name":"University of Modena and Reggio Emilia, Italy"}]}],"member":"320","published-online":{"date-parts":[[2025,9,3]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.48786\/edbt.2025.10"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.14778\/2556549.2556567"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/2850583.2850587"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2024.102506"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-01851-0"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3442200"},{"volume-title":"Pattern Recognition and Machine Learning","author":"Bishop Christopher M.","key":"e_1_2_1_7_1","unstructured":"Christopher M. Bishop. 2006. 
Pattern Recognition and Machine Learning. Springer. https:\/\/link.springer.com\/book\/9780387310732"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1456650.1456651"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-60626-7_7"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137641"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3616865"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-31164-2"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3418896"},{"key":"e_1_2_1_14_1","article-title":"The Measure and Mismeasure of Fairness","volume":"312","author":"Corbett-Davies Sam","year":"2023","unstructured":"Sam Corbett-Davies, Johann D. Gaebler, Hamed Nilforoshan, Ravi Shroff, and Sharad Goel. 2023. The Measure and Mismeasure of Fairness. Journal of Machine Learning Research (JMLR) 24, Article 312 (2023), 117 pages. http:\/\/jmlr.org\/papers\/v24\/22-1511.html","journal-title":"Journal of Machine Learning Research (JMLR) 24, Article"},{"key":"e_1_2_1_15_1","volume-title":"Donatella Firmani, Maurizio Mazzei, Paolo Merialdo, Federico Piai, and Divesh Srivastava.","author":"Crescenzi Valter","year":"2021","unstructured":"Valter Crescenzi, Andrea De Angelis, Donatella Firmani, Maurizio Mazzei, Paolo Merialdo, Federico Piai, and Divesh Srivastava. 2021. Alaska: A Flexible Benchmark for Data Integration Tasks. arXiv:2101.11259"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3035960"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3615952.3615965"},{"key":"e_1_2_1_18_1","unstructured":"Luca Deck Jan-Laurin M\u00fcller Conradin Braun Domenique Zipperling and Niklas K\u00fchl. 2024. Implications of the AI Act for Non-Discrimination Law and Algorithmic Fairness. 
arXiv:2403.20089"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00026"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/n19-1423"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-01853-4"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2090236.2090255"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/3236187.3236198"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/BigData59044.2023.10386556"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3482105"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-01892-3"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-33455-9_5"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/2876473.2876474"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5441\/002\/edbt.2019.66"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.48786\/edbt.2023.07"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389775"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2588576"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.3926\/jiem.2011.v4n2.p168-193"},{"key":"e_1_2_1_34_1","volume-title":"Ilyas","author":"Heidari Alireza","year":"2020","unstructured":"Alireza Heidari, Shrinu Kushagra, and Ihab F. Ilyas. 2020. On sampling from data with duplicate records. 
arXiv:2008.10549"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3310205"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254556.2254659"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994535"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.14778\/3377369.3377379"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1017\/9781108684163"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137833"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3626292.3626295"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/3421424.3421431"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3617838.3617841"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1007\/s12027-024-00785-w"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3709715"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1002\/int.22415"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380597"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3457607"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2501511.2501523"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196926"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574258"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476299"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-01835-0"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3529337.3529356"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2013.54"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3377455"},{"key":"e_1_2_1_58_1","doi-asse
rted-by":"publisher","DOI":"10.1109\/TKDE.2014.2359666"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457284"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.14778\/3583140.3583163"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-42941-5_20"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.48786\/edbt.2024.03"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.48786\/edbt.2025.42"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.14778\/3503585.3503595"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/3639326"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/3494672"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-021-00697-y"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-66917-5_19"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611525"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588433"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2018.00015"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.14778\/3523210.3523226"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.14778\/3149193.3149199"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1355"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476294"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","unstructured":"Steven K. Thompson. 2012. Sampling. John Wiley & Sons. 
10.1002\/9781118162934","DOI":"10.1002\/9781118162934"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1080\/15544771003697247"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1134\/S0005117920100082"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610505"},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-013-0315-0"},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.43"},{"key":"e_1_2_1_82_1","volume-title":"Proceedings of the International Workshop on Challenges and Experiences from Data Integration to Knowledge Graphs (DI2KG @ VLDB). https:\/\/ceur-ws.org\/Vol-2726\/paper3.pdf","author":"Zecchini Luca","year":"2020","unstructured":"Luca Zecchini, Giovanni Simonini, and Sonia Bergamaschi. 2020. Entity Resolution on Camera Records without Machine Learning. In Proceedings of the International Workshop on Challenges and Experiences from Data Integration to Knowledge Graphs (DI2KG @ VLDB). 
https:\/\/ceur-ws.org\/Vol-2726\/paper3.pdf"},{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611540.3611612"},{"key":"e_1_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132847.3132938"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3742728.3742742","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,3]],"date-time":"2025-09-03T13:33:52Z","timestamp":1756906432000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3742728.3742742"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4]]},"references-count":84,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2025,4]]}},"alternative-id":["10.14778\/3742728.3742742"],"URL":"https:\/\/doi.org\/10.14778\/3742728.3742742","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2025,4]]},"assertion":[{"value":"2025-09-03","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}