{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T18:35:08Z","timestamp":1764873308949},"reference-count":27,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2009,8]]},"abstract":"<jats:p>One of the most prominent data quality problems is the existence of duplicate records. Current duplicate elimination procedures usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. Furthermore, replacing the input dirty data with one possible clean instance may result in unrecoverable errors, for example, identification and merging of possible duplicate records in health care systems.<\/jats:p>\n          <jats:p>In this paper, we treat duplicate detection procedures as data processing tasks with uncertain outcomes. We concentrate on a family of duplicate detection algorithms that are based on parameterized clustering. We propose a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. We show how to efficiently support relational queries under our model, and to allow new types of queries on the set of possible repairs. 
We give an experimental study illustrating the scalability and the efficiency of our techniques in different configurations.<\/jats:p>","DOI":"10.14778\/1687627.1687695","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"598-609","source":"Crossref","is-referenced-by-count":22,"title":["Modeling and querying possible repairs in duplicate detection"],"prefix":"10.14778","volume":"2","author":[{"given":"George","family":"Beskales","sequence":"first","affiliation":[{"name":"University of Waterloo"}]},{"given":"Mohamed A.","family":"Soliman","sequence":"additional","affiliation":[{"name":"University of Waterloo"}]},{"given":"Ihab F.","family":"Ilyas","sequence":"additional","affiliation":[{"name":"University of Waterloo"}]},{"given":"Shai","family":"Ben-David","sequence":"additional","affiliation":[{"name":"University of Waterloo"}]}],"member":"320","published-online":{"date-parts":[[2009,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Business objects http:\/\/www.businessobjects.com."},{"key":"e_1_2_1_2_1","unstructured":"Oracle data integrator http:\/\/www.oracle.com\/technology\/products\/oracle-data-integrator."},{"key":"e_1_2_1_3_1","unstructured":"PostgreSQL database system http:\/\/www.postgresql.org."},
{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/38714.38724"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.35"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/303976.303983"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1137\/1018115"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1217299.1217304"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2005.125"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247530"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/11508069_15"},{"key":"e_1_2_1_13_1","unstructured":"P. Christen and T. Churches. Febrl. freely extensible biomedical record linkage http:\/\/datamining.anu.edu.au\/projects."},{"key":"e_1_2_1_14_1","volume-title":"the 25th Annual SAS Users Group International Conference","author":"Yuan Y. C.","year":"2002","unstructured":"Y. C. Yuan. Multiple imputation for missing data: Concepts and new development. In the 25th Annual SAS Users Group International Conference, 2002."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-006-0004-3"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.9"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1969.10501049"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1080\/15427951.2004.10129093"},{"key":"e_1_2_1_19_1","volume-title":"VLDB","author":"Galhardas H.","year":"2001","unstructured":"H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB, 2001."},
{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/276304.276312"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1634.1886"},{"key":"e_1_2_1_23_1","volume-title":"Algorithms for Clustering Data","author":"Jain A. K.","year":"1988","unstructured":"A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall College Div, 1988."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/951949.952157"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/347090.347154"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/844128.844142"},{"key":"e_1_2_1_27_1","volume-title":"Jour. of Official Statistics","author":"Mulry M. H.","year":"2006","unstructured":"M. H. Mulry, S. L. Bean, D. M. Bauder, D. Wagner, T. Mule, and R. J. Petroni. Evaluation of estimates of census duplication using administrative records information. Jour. of Official Statistics, 2006."},
{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2006.63"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2005.11"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1687627.1687695","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:30:09Z","timestamp":1672227009000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1687627.1687695"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,8]]},"references-count":27,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2009,8]]}},"alternative-id":["10.14778\/1687627.1687695"],"URL":"https:\/\/doi.org\/10.14778\/1687627.1687695","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2009,8]]}}}