{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T09:16:58Z","timestamp":1773739018674,"version":"3.50.1"},"reference-count":47,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2009,8]]},"abstract":"<jats:p>\n            The presence of duplicate records is a major data quality concern in large databases. To detect duplicates,\n            <jats:italic>entity resolution<\/jats:italic>\n            also known as\n            <jats:italic>duplication detection<\/jats:italic>\n            or\n            <jats:italic>record linkage<\/jats:italic>\n            is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by the recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection, perform extremely well in terms of both accuracy and scalability.\n          <\/jats:p>","DOI":"10.14778\/1687627.1687771","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"1282-1293","source":"Crossref","is-referenced-by-count":143,"title":["Framework for evaluating clustering algorithms in duplicate detection"],"prefix":"10.14778","volume":"2","author":[{"given":"Oktie","family":"Hassanzadeh","sequence":"first","affiliation":[{"name":"University of Toronto"}]},{"given":"Fei","family":"Chiang","sequence":"additional","affiliation":[{"name":"University of Toronto"}]},{"given":"Hyun Chul","family":"Lee","sequence":"additional","affiliation":[{"name":"Thoora Inc."}]},{"given":"Ren\u00e9e J.","family":"Miller","sequence":"additional","affiliation":[{"name":"University of Toronto"}]}],"member":"320","published-online":{"date-parts":[[2009,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1060590.1060692"},{"key":"e_1_2_1_3_1","first-page":"918","volume-title":"Proc. of the Int'l Conf. on Very Large Data Bases (VLDB)","author":"Arasu A.","year":"2006","unstructured":"A. Arasu , V. Ganti , and R. Kaushik . Efficient Exact Set-Similarity Joins . In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB) , pages 918 -- 929 , 2006 . A. Arasu, V. Ganti, and R. Kaushik. Efficient Exact Set-Similarity Joins. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 918--929, 2006."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.7155\/jgaa.00084"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:MACH.0000033116.57574.95"},{"key":"e_1_2_1_6_1","first-page":"806","volume-title":"Proc. of the Int'l Conf. on Very Large Data Bases (VLDB)","author":"Bansal N.","year":"2007","unstructured":"N. Bansal , F. Chiang , N. Koudas , and F. W. Tompa . Seeking Stable Clusters In The Blogosphere . In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB) , pages 806 -- 817 , Vienna, Austria , 2007 . N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa. Seeking Stable Clusters In The Blogosphere. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 806--817, Vienna, Austria, 2007."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242591"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611972764.5"},{"issue":"2","key":"e_1_2_1_9_1","first-page":"4","volume":"29","author":"Bhattacharya I.","year":"2006","unstructured":"I. Bhattacharya and L. Getoor . Collective Entity Resolution in Relational Data. IEEE Data Engineering Bulletin , 29 ( 2 ): 4 -- 12 , 2006 . I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. IEEE Data Engineering Bulletin, 29(2):4--12, 2006.","journal-title":"Collective Entity Resolution in Relational Data. IEEE Data Engineering Bulletin"},{"key":"e_1_2_1_10_1","first-page":"568","volume-title":"Experiments on Graph Clustering Algorithms. In The 11th Europ. Symp. Algorithms","author":"Brandes U.","year":"2003","unstructured":"U. Brandes , M. Gaertler , and D. Wagner . Experiments on Graph Clustering Algorithms. In The 11th Europ. Symp. Algorithms , pages 568 -- 579 . Springer-Verlag , 2003 . U. Brandes, M. Gaertler, and D. Wagner. Experiments on Graph Clustering Algorithms. In The 11th Europ. Symp. Algorithms, pages 568--579. Springer-Verlag, 2003."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-7-488"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcss.2004.10.012"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2005.125"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1265530.1265545"},{"key":"e_1_2_1_16_1","first-page":"73","volume-title":"Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03)","author":"Cohen W. W.","year":"2003","unstructured":"W. W. Cohen , P. Ravikumar , and S. E. Fienberg . A Comparison of String Distance Metrics for Name-Matching Tasks . In Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03) , pages 73 -- 78 , Acapulco, Mexico , 2003 . W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pages 73--78, Acapulco, Mexico, 2003."},{"key":"e_1_2_1_17_1","volume-title":"Introduction to Algorithms","author":"Cormen T. H.","year":"1990","unstructured":"T. H. Cormen , C. E. Leiserson , and R. L. Rivest . Introduction to Algorithms . McGraw Hill and MIT Press , 1990 . T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw Hill and MIT Press, 1990."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF01890115"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.tcs.2006.05.008"},{"key":"e_1_2_1_20_1","first-page":"1277","article-title":"Algorithm for Solution of a Problem Of Maximum Flow","volume":"11","author":"Dinic E. A.","year":"1970","unstructured":"E. A. Dinic . Algorithm for Solution of a Problem Of Maximum Flow in Networks with Power Estimation. Soviet Math. Dokl , 11 : 1277 -- 1280 , 1970 . E. A. Dinic. Algorithm for Solution of a Problem Of Maximum Flow in Networks with Power Estimation. Soviet Math. Dokl, 11:1277--1280, 1970.","journal-title":"Networks with Power Estimation. Soviet Math. Dokl"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/321694.321699"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:VISI.0000022288.19776.77"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2007.05.018"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1080\/15427951.2004.10129093"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.4153\/CJM-1956-045-5"},{"key":"e_1_2_1_26_1","first-page":"721","volume-title":"Proc. of the Int'l Conf. on Very Large Data Bases (VLDB)","author":"Gibson D.","year":"2005","unstructured":"D. Gibson , R. Kumar , and A. Tomkins . Discovering Large Dense Subgraphs in Massive Graphs . In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB) , pages 721 -- 732 , 2005 . D. Gibson, R. Kumar, and A. Tomkins. Discovering Large Dense Subgraphs in Massive Graphs. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 721--732, 2005."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/48014.61051"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1012801612483"},{"key":"e_1_2_1_29_1","volume-title":"Benchmarking Declarative Approximate Selection Predicates. Master's thesis","author":"Hassanzadeh O.","year":"2007","unstructured":"O. Hassanzadeh . Benchmarking Declarative Approximate Selection Predicates. Master's thesis , University of Toronto , February 2007 . O. Hassanzadeh. Benchmarking Declarative Approximate Selection Predicates. Master's thesis, University of Toronto, February 2007."},{"key":"e_1_2_1_31_1","first-page":"11","volume-title":"Proc. of the International Workshop on Quality in Databases (QDB)","author":"Hassanzadeh O.","year":"2007","unstructured":"O. Hassanzadeh , M. Sadoghi , and R. J. Miller . Accuracy of Approximate String Joins Using Grams . In Proc. of the International Workshop on Quality in Databases (QDB) , pages 11 -- 18 , Vienna, Austria , 2007 . O. Hassanzadeh, M. Sadoghi, and R. J. Miller. Accuracy of Approximate String Joins Using Grams. In Proc. of the International Workshop on Quality in Databases (QDB), pages 11--18, Vienna, Austria, 2007."},{"key":"e_1_2_1_32_1","first-page":"129","volume-title":"Proc. of the Int'l Workshop on the Web and Databases (WebDB)","author":"Haveliwala T. H.","year":"2000","unstructured":"T. H. Haveliwala , A. Gionis , and P. Indyk . Scalable Techniques for Clustering the Web . In Proc. of the Int'l Workshop on the Web and Databases (WebDB) , pages 129 -- 134 , Dallas, Texas, USA , 2000 . T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. In Proc. of the Int'l Workshop on the Web and Databases (WebDB), pages 129--134, Dallas, Texas, USA, 2000."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009761603038"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1101\/gr.9.11.1106"},{"key":"e_1_2_1_35_1","volume-title":"Algorithms for Clustering Data","author":"Jain A.","year":"1988","unstructured":"A. Jain and R. Dubes . Algorithms for Clustering Data . Prentice Hall , 1988 . A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/331499.331504"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/990308.990313"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-7-456"},{"key":"e_1_2_1_39_1","volume-title":"Graph Clustering with Restricted Neighbourhood Search. Master's thesis","author":"King A. D.","year":"2004","unstructured":"A. D. King . Graph Clustering with Restricted Neighbourhood Search. Master's thesis , University of Toronto , 2004 . A. D. King. Graph Clustering with Restricted Neighbourhood Search. Master's thesis, University of Toronto, 2004."},{"key":"e_1_2_1_40_1","volume-title":"Introduction to Clustering Large and High-Dimensional Data","author":"Kogan J.","year":"2007","unstructured":"J. Kogan . Introduction to Clustering Large and High-Dimensional Data . Cambridge Univ. Press , 2007 . J. Kogan. Introduction to Clustering Large and High-Dimensional Data. Cambridge Univ. Press, 2007."},{"key":"e_1_2_1_41_1","first-page":"303","volume-title":"Proc. of the Int'l Conf. on Very Large Data Bases (VLDB)","author":"Li C.","year":"2007","unstructured":"C. Li , B. Wang , and X. Yang . VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams . In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB) , pages 303 -- 314 , Vienna, Austria , 2007 . C. Li, B. Wang, and X. Yang. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 303--314, Vienna, Austria, 2007."},{"key":"e_1_2_1_42_1","first-page":"727","volume-title":"Pelleg. X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters. In Proc. of the Int'l Conf. on Machine Learning","author":"A. M.","year":"2000","unstructured":"A. M. D. Pelleg. X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters. In Proc. of the Int'l Conf. on Machine Learning , pages 727 -- 734 , San Francisco, CA, USA , 2000 . A. M. D. Pelleg. X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters. In Proc. of the Int'l Conf. on Machine Learning, pages 727--734, San Francisco, CA, USA, 2000."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007652"},{"key":"e_1_2_1_45_1","first-page":"526","volume-title":"Correlation Clustering: Maximizing Agreements Via Semidefinite Programming. In Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)","author":"Swamy C.","year":"2004","unstructured":"C. Swamy . Correlation Clustering: Maximizing Agreements Via Semidefinite Programming. In Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 526 -- 527 , New Orleans, Louisiana, USA , 2004 . C. Swamy. Correlation Clustering: Maximizing Agreements Via Semidefinite Programming. In Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 526--527, New Orleans, Louisiana, USA, 2004."},{"key":"e_1_2_1_46_1","first-page":"465","volume-title":"Symp. on Foundations of Computer Science (FOCS)","author":"Umans C.","year":"1999","unstructured":"C. Umans . Hardness of Approximating Sigma2p Minimization Problems . In Symp. on Foundations of Computer Science (FOCS) , pages 465 -- 474 , 1999 . C. Umans. Hardness of Approximating Sigma2p Minimization Problems. In Symp. on Foundations of Computer Science (FOCS), pages 465--474, 1999."},{"key":"e_1_2_1_47_1","volume-title":"University of Utrecht","author":"van Dongen S.","year":"2000","unstructured":"S. van Dongen . Graph Clustering By Flow Simulation. PhD thesis , University of Utrecht , 2000 . S. van Dongen. Graph Clustering By Flow Simulation. PhD thesis, University of Utrecht, 2000."},{"key":"e_1_2_1_48_1","volume-title":"Graph Clustering With Overlap. Master's thesis","author":"Whitney J. A.","year":"2006","unstructured":"J. A. Whitney . Graph Clustering With Overlap. Master's thesis , University of Toronto , 2006 . J. A. Whitney. Graph Clustering With Overlap. Master's thesis, University of Toronto, 2006."},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-00887-0_13"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2005.845141"},{"key":"e_1_2_1_51_1","volume-title":"Clustering of Large Data Sets","author":"Zupan J.","year":"1982","unstructured":"J. Zupan . Clustering of Large Data Sets . Research Studies Press , 1982 . J. Zupan. Clustering of Large Data Sets. Research Studies Press, 1982."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1687627.1687771","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:37:09Z","timestamp":1672227429000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1687627.1687771"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,8]]},"references-count":47,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2009,8]]}},"alternative-id":["10.14778\/1687627.1687771"],"URL":"https:\/\/doi.org\/10.14778\/1687627.1687771","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2009,8]]}}}