{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T19:04:32Z","timestamp":1772910272935,"version":"3.50.1"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2017,10]]},"abstract":"<jats:p>\n            Entity matching (EM) is a critical part of data integration. We study how to\n            <jats:italic>synthesize entity matching rules<\/jats:italic>\n            from positive-negative matching examples. The core of our solution is\n            <jats:italic>program synthesis<\/jats:italic>\n            , a powerful tool to automatically generate rules (or programs) that satisfy a given high-level specification, via a predefined grammar. This grammar describes a\n            <jats:italic>General Boolean Formula<\/jats:italic>\n            (\n            <jats:bold>GBF<\/jats:bold>\n            ) that can include arbitrary attribute matching predicates combined by conjunctions (\u2227), disjunctions (\u2228) and negations (\u00ac), and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. The rules in the form of\n            <jats:bold>GBF<\/jats:bold>\n            are more concise than traditional EM rules represented in Disjunctive Normal Form (\n            <jats:bold>DNF<\/jats:bold>\n            ). Consequently, they are more interpretable than decision trees and other machine learning algorithms that output deep trees with many branches. We present a new synthesis algorithm that, given only positive-negative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that we outperform other interpretable rules (e.g., decision trees with low depth) in effectiveness, and are comparable with non-interpretable tools (e.g., decision trees with high depth, gradient-boosting trees, random forests and SVM).\n          <\/jats:p>","DOI":"10.14778\/3149193.3149199","type":"journal-article","created":{"date-parts":[[2017,12,12]],"date-time":"2017-12-12T18:33:38Z","timestamp":1513103618000},"page":"189-202","source":"Crossref","is-referenced-by-count":95,"title":["Synthesizing entity matching rules by examples"],"prefix":"10.14778","volume":"11","author":[{"given":"Rohit","family":"Singh","sequence":"first","affiliation":[{"name":"CSAIL and Uber AI Labs"}]},{"given":"Venkata Vamsikrishna","family":"Meduri","sequence":"additional","affiliation":[{"name":"Arizona State University"}]},{"given":"Ahmed","family":"Elmagarmid","sequence":"additional","affiliation":[{"name":"Qatar Computing Research Institute, HBKU, Qatar"}]},{"given":"Samuel","family":"Madden","sequence":"additional","affiliation":[{"name":"CSAIL"}]},{"given":"Paolo","family":"Papotti","sequence":"additional","affiliation":[{"name":"EURECOM, France"}]},{"given":"Jorge-Arnulfo","family":"Quian\u00e9-Ruiz","sequence":"additional","affiliation":[{"name":"Qatar Computing Research Institute, HBKU, Qatar"}]},{"given":"Armando","family":"Solar-Lezama","sequence":"additional","affiliation":[{"name":"CSAIL"}]},{"given":"Nan","family":"Tang","sequence":"additional","affiliation":[{"name":"Qatar Computing Research Institute, HBKU, Qatar"}]}],"member":"320","published-online":{"date-parts":[[2017,10]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Datasketch: Minhash lsh. https:\/\/ekzhu.github.io\/datasketch\/lsh.html.  Datasketch: Minhash lsh. https:\/\/ekzhu.github.io\/datasketch\/lsh.html."},{"key":"e_1_2_1_2_1","unstructured":"Scikit learn: Support vector machines in practice. http:\/\/scikit-learn.org\/stable\/modules\/svm.html.  Scikit learn: Support vector machines in practice. http:\/\/scikit-learn.org\/stable\/modules\/svm.html."},{"key":"e_1_2_1_3_1","unstructured":"Standardization or mean removal and variance scaling. http:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html.  Standardization or mean removal and variance scaling. http:\/\/scikit-learn.org\/stable\/modules\/preprocessing.html."},{"key":"e_1_2_1_4_1","unstructured":"Tuning the hyper-parameters of an estimator. http:\/\/scikit-learn.org\/stable\/modules\/grid_search.html.  Tuning the hyper-parameters of an estimator. http:\/\/scikit-learn.org\/stable\/modules\/grid_search.html."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994518"},{"key":"e_1_2_1_6_1","first-page":"1","volume-title":"Dependable Software Systems Engineering","author":"Alur R.","year":"2015","unstructured":"R. Alur , R. Bod\u00edk , E. Dallal , D. Fisman , P. Garg , G. Juniwal , H. Kress-Gazit , P. Madhusudan , M. M. K. Martin , M. Raghothaman , S. Saha , S. A. Seshia , R. Singh , A. Solar-Lezama , E. Torlak , and A. Udupa . Syntax-guided synthesis . In Dependable Software Systems Engineering , pages 1 -- 25 . 2015 . R. Alur, R. Bod\u00edk, E. Dallal, D. Fisman, P. Garg, G. Juniwal, H. Kress-Gazit, P. Madhusudan, M. M. K. Martin, M. Raghothaman, S. Saha, S. A. Seshia, R. Singh, A. Solar-Lezama, E. Torlak, and A. Udupa. Syntax-guided synthesis. In Dependable Software Systems Engineering, pages 1--25. 2015."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/956750.956759"},{"key":"e_1_2_1_8_1","volume-title":"Introduction to boosted trees","author":"Chen T.","year":"2014","unstructured":"T. Chen . Introduction to boosted trees . University of Washington Computer Science , 2014 . T. Chen. Introduction to boosted trees. University of Washington Computer Science, 2014."},{"key":"e_1_2_1_9_1","first-page":"827","volume-title":"Rule-based information extraction is dead! long live rule-based information extraction systems! In EMNLP","author":"Chiticariu L.","year":"2013","unstructured":"L. Chiticariu , Y. Li , and F. R. Reiss . Rule-based information extraction is dead! long live rule-based information extraction systems! In EMNLP , pages 827 -- 832 , 2013 . L. Chiticariu, Y. Li, and F. R. Reiss. Rule-based information extraction is dead! long live rule-based information extraction systems! In EMNLP, pages 827--832, 2013."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/775047.775116"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2488388.2488415"},{"key":"e_1_2_1_12_1","volume-title":"CIDR","author":"Deng D.","year":"2017","unstructured":"D. Deng , R. C. Fernandez , Z. Abedjan , S. Wang , M. Stonebraker , A. K. Elmagarmid , I. F. Ilyas , S. Madden , M. Ouzzani , and N. Tang . The data civilizer system . In CIDR , 2017 . D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang. The data civilizer system. In CIDR, 2017."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.5555\/1287369.1287422"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISMVL.2011.40"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2594511"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.9"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1969.10501049"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/2876473.2876474"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/358669.358692"},{"key":"e_1_2_1_20_1","first-page":"175","volume-title":"CAV","volume":"4","author":"Ganzinger H.","year":"2004","unstructured":"H. Ganzinger , G. Hagen , R. Nieuwenhuis , A. Oliveras , and C. Tinelli . Dpll (t): Fast decision procedures . In CAV , volume 4 , pages 175 -- 188 . Springer , 2004 . H. Ganzinger, G. Hagen, R. Nieuwenhuis, A. Oliveras, and C. Tinelli. Dpll (t): Fast decision procedures. In CAV, volume 4, pages 175--188. Springer, 2004."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742784"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2588576"},{"key":"e_1_2_1_23_1","volume-title":"ML for Complex Systems NIPS 2016 Workshop. https:\/\/sites.google.com\/site\/nips2016interpretml.","unstructured":"Interpretable ML for Complex Systems NIPS 2016 Workshop. https:\/\/sites.google.com\/site\/nips2016interpretml. Interpretable ML for Complex Systems NIPS 2016 Workshop. https:\/\/sites.google.com\/site\/nips2016interpretml."},{"key":"e_1_2_1_24_1","first-page":"3","volume-title":"QDB\/MUD","author":"K\u00f6pcke H.","year":"2008","unstructured":"H. K\u00f6pcke and E. Rahm . Training selection for tuning entity matching . In QDB\/MUD , pages 3 -- 12 , 2008 . H. K\u00f6pcke and E. Rahm. Training selection for tuning entity matching. In QDB\/MUD, pages 3--12, 2008."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920904"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939874"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1011"},{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9781139924801","volume-title":"Mining of massive datasets","author":"Leskovec J.","year":"2014","unstructured":"J. Leskovec , A. Rajaraman , and J. D. Ullman . Mining of massive datasets . Cambridge university press , 2014 . J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of massive datasets. Cambridge university press, 2014."},{"key":"e_1_2_1_29_1","first-page":"356","volume-title":"Proc. of the 16th International Florida Artificial Intelligence Research Society Conference","author":"Musicant D. R.","year":"2003","unstructured":"D. R. Musicant , V. Kumar , and A. Ozgur . Optimizing f-measure with support vector machines . In Proc. of the 16th International Florida Artificial Intelligence Research Society Conference , pages 356 -- 360 , 2003 . D. R. Musicant, V. Kumar, and A. Ozgur. Optimizing f-measure with support vector machines. In Proc. of the 16th International Florida Artificial Intelligence Research Society Conference, pages 356--360, 2003."},{"key":"e_1_2_1_30_1","first-page":"354","volume-title":"EDBT","author":"Panahi F.","year":"2017","unstructured":"F. Panahi , W. Wu , A. Doan , and J. F. Naughton . Towards interactive debugging of rule-based entity matching . In EDBT , pages 354 -- 365 , 2017 . F. Panahi, W. Wu, A. Doan, and J. F. Naughton. Towards interactive debugging of rule-based entity matching. In EDBT, pages 354--365, 2017."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/775047.775087"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807285"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3058739"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2006.65"},{"key":"e_1_2_1_36_1","volume-title":"Statistical Genomics: Methods and Protocols","author":"Smith T. C.","year":"2016","unstructured":"T. C. Smith and E. Frank . Statistical Genomics: Methods and Protocols , chapter Introducing Machine Learning Concepts with WEKA. Springer , 2016 . T. C. Smith and E. Frank. Statistical Genomics: Methods and Protocols, chapter Introducing Machine Learning Concepts with WEKA. Springer, 2016."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-10672-9_3"},{"key":"e_1_2_1_38_1","volume-title":"Program sketching. STTT, 15(5--6):475--495","author":"Solar-Lezama A.","year":"2013","unstructured":"A. Solar-Lezama . Program sketching. STTT, 15(5--6):475--495 , 2013 . A. Solar-Lezama. Program sketching. STTT, 15(5--6):475--495, 2013."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/2350229.2350263"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.14778\/2021017.2021020"},{"key":"e_1_2_1_41_1","volume-title":"CoRR","author":"Wang J.","year":"2014","unstructured":"J. Wang , H. T. Shen , J. Song , and J. Ji . Hashing for similarity search: A survey . CoRR , 2014 . J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. CoRR, 2014."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3149193.3149199","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:33:43Z","timestamp":1672220023000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3149193.3149199"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,10]]},"references-count":41,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2017,10]]}},"alternative-id":["10.14778\/3149193.3149199"],"URL":"https:\/\/doi.org\/10.14778\/3149193.3149199","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2017,10]]}}}