{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T03:18:53Z","timestamp":1762917533241,"version":"3.41.0"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"1-2","license":[{"start":{"date-parts":[[2014,9,4]],"date-time":"2014-09-04T00:00:00Z","timestamp":1409788800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Data and Information Quality"],"published-print":{"date-parts":[[2014,9,4]]},"abstract":"<jats:p>Duplicates in a database are one of the prime causes of poor data quality and are at the same time among the most difficult data quality problems to alleviate. To detect and remove such duplicates, many commercial and academic products and methods have been developed. The evaluation of such systems is usually in need of pre-classified results. Such gold standards are often expensive to come by (much manual classification is necessary), not representative (too small or too synthetic), and proprietary and thus preclude repetition (company-internal data). This lament has been uttered in many papers and even more paper reviews.<\/jats:p><jats:p>The proposed <jats:italic>annealing standard<\/jats:italic> is a structured set of duplicate detection results, some of which are manually verified and some of which are merely validated by many classifiers. As more and more classifiers are evaluated against the annealing standard, more and more results are verified and validation becomes more and more confident. We formally define gold, silver, and the annealing standard and their maintenance. Experiments show how quickly an annealing standard converges to a gold standard. 
Finally, we provide an annealing standard for 750,000 CDs to the duplicate detection community.<\/jats:p>","DOI":"10.1145\/2629687","type":"journal-article","created":{"date-parts":[[2014,9,5]],"date-time":"2014-09-05T19:12:56Z","timestamp":1409944376000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Reach for gold"],"prefix":"10.1145","volume":"5","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2608-8859","authenticated-orcid":false,"given":"Tobias","family":"Vogel","sequence":"first","affiliation":[{"name":"Hasso Plattner Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Arvid","family":"Heise","sequence":"additional","affiliation":[{"name":"Hasso Plattner Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Uwe","family":"Draisbach","sequence":"additional","affiliation":[{"name":"Hasso Plattner Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dustin","family":"Lange","sequence":"additional","affiliation":[{"name":"Hasso Plattner Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Felix","family":"Naumann","sequence":"additional","affiliation":[{"name":"Hasso Plattner Institute"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2014,9,4]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation.","author":"Baxter Rohan","year":"2003","unstructured":"Rohan Baxter , Peter Christen , and Tim Churches . 2003 . A comparison of fast blocking methods for record linkage . In Proceedings of the KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation. Rohan Baxter, Peter Christen, and Tim Churches. 2003. A comparison of fast blocking methods for record linkage. 
In Proceedings of the KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/956750.956759"},{"volume-title":"Proceedings of the KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation.","author":"Bilenko Mikhail","key":"e_1_2_1_3_1","unstructured":"Mikhail Bilenko and Raymond J. Mooney . 2003b. On evaluation and training-set construction for duplicate detection . In Proceedings of the KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation. Mikhail Bilenko and Raymond J. Mooney. 2003b. On evaluation and training-set construction for duplicate detection. In Proceedings of the KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1018054314350"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the VLDB Workshop on Efficiency and Effectiveness of XML Tools and Techniques (EEXTT).","author":"Bressan St\u00e9phane","year":"2002","unstructured":"St\u00e9phane Bressan , Mong Li Lee , Ying Guang Li , Zo\u00e9 Lacroix , and Ullas Nambiar . 2002 . The XOO7 benchmark . In Proceedings of the VLDB Workshop on Efficiency and Effectiveness of XML Tools and Techniques (EEXTT). St\u00e9phane Bressan, Mong Li Lee, Ying Guang Li, Zo\u00e9 Lacroix, and Ullas Nambiar. 2002. The XOO7 benchmark. In Proceedings of the VLDB Workshop on Efficiency and Effectiveness of XML Tools and Techniques (EEXTT)."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/11508069_15"},{"volume-title":"Quality Measures in Data Mining, Fabrice Guillet and Howard J","author":"Christen Peter","key":"e_1_2_1_7_1","unstructured":"Peter Christen and Karl Goiser . 2007. Quality and complexity measures for data linkage and deduplication . In Quality Measures in Data Mining, Fabrice Guillet and Howard J . Hamilton (Eds.), Springer , 127--151. Peter Christen and Karl Goiser. 2007. 
Quality and complexity measures for data linkage and deduplication. In Quality Measures in Data Mining, Fabrice Guillet and Howard J. Hamilton (Eds.), Springer, 127--151."},{"volume-title":"Proceedings of the Brazilian Symposium on Databases.","author":"Cota Ricardo G.","key":"e_1_2_1_8_1","unstructured":"Ricardo G. Cota , Marcos Andr\u00e9 Gon\u00e7alves , and Alberto H. F. Laender . 2007. A heuristic-based hierarchical clustering method for author name disambiguation in digital libraries . In Proceedings of the Brazilian Symposium on Databases. Ricardo G. Cota, Marcos Andr\u00e9 Gon\u00e7alves, and Alberto H. F. Laender. 2007. A heuristic-based hierarchical clustering method for author name disambiguation in digital libraries. In Proceedings of the Brazilian Symposium on Databases."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1018006431188"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066168"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the International Workshop on Quality in Databases (QDB).","author":"Draisbach Uwe","year":"2010","unstructured":"Uwe Draisbach and Felix Naumann . 2010 . DuDe: The duplicate detection toolkit . In Proceedings of the International Workshop on Quality in Databases (QDB). Uwe Draisbach and Felix Naumann. 2010. DuDe: The duplicate detection toolkit. In Proceedings of the International Workshop on Quality in Databases (QDB)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.9"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the COLING Workshop on Semantic Annotation and Intelligent Content.","author":"Erdmann Michael","year":"2000","unstructured":"Michael Erdmann , Alexander Maedche , Hans-Peter Schnurr , and Steffen Staab . 2000 . From manual to semi-automatic semantic annotation: About ontology-based text annotation tools . In Proceedings of the COLING Workshop on Semantic Annotation and Intelligent Content. 
Michael Erdmann, Alexander Maedche, Hans-Peter Schnurr, and Steffen Staab. 2000. From manual to semi-automatic semantic annotation: About ontology-based text annotation tools. In Proceedings of the COLING Workshop on Semantic Annotation and Intelligent Content."},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the International Workshop on Data Mining in Bioinformatics.","author":"Forman George","year":"2002","unstructured":"George Forman . 2002 . Incremental machine learning to reduce biochemistry lab costs in the search for drug discovery . In Proceedings of the International Workshop on Data Mining in Bioinformatics. George Forman. 2002. Incremental machine learning to reduce biochemistry lab costs in the search for drug discovery. In Proceedings of the International Workshop on Data Mining in Bioinformatics."},{"key":"e_1_2_1_15_1","volume-title":"Schapire","author":"Freund Yoav","year":"1996","unstructured":"Yoav Freund and Robert E . Schapire . 1996 . Experiments with a new boosting algorithm. In Proceedings of the International Conference on Machine Learning. Morgan Kaufmann , 148--156. Yoav Freund and Robert E. Schapire. 1996. Experiments with a new boosting algorithm. In Proceedings of the International Conference on Machine Learning. Morgan Kaufmann, 148--156."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007330508534"},{"key":"e_1_2_1_17_1","unstructured":"Jim Gray (Ed.). 1991. The Benchmark Handbook for Database and Transaction Processing Systems. Morgan Kaufmann Publishers. Jim Gray (Ed.). 1991. The Benchmark Handbook for Database and Transaction Processing Systems . 
Morgan Kaufmann Publishers."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1656274.1656278"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687771"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/3120676.3120732"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICMLA.2010.79"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.datak.2009.10.003"},{"volume-title":"Proceedings of the BioCreative III Workshop.","author":"Lu Zhiyong","key":"e_1_2_1_23_1","unstructured":"Zhiyong Lu and W. John Wilbur . 2010. Overview of BioCreative III Gene Normalization . In Proceedings of the BioCreative III Workshop. Zhiyong Lu and W. John Wilbur. 2010. Overview of BioCreative III Gene Normalization. In Proceedings of the BioCreative III Workshop."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920871"},{"volume-title":"An Introduction to Duplicate Detection (Synthesis Lectures on Data Management)","author":"Naumann Felix","key":"e_1_2_1_25_1","unstructured":"Felix Naumann and Melanie Herschel . 2010. An Introduction to Duplicate Detection (Synthesis Lectures on Data Management) . Morgan and Claypool Publishers . Felix Naumann and Melanie Herschel. 2010. An Introduction to Duplicate Detection (Synthesis Lectures on Data Management). Morgan and Claypool Publishers."},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the International Workshop on Data Quality in Cooperative Information Systems (DQCIS).","author":"Neiling Mattis","year":"2003","unstructured":"Mattis Neiling , Steffen Jurk , Hans- J. Lenz , and Felix Naumann . 2003 . Object identification quality . In Proceedings of the International Workshop on Data Quality in Cooperative Information Systems (DQCIS). Mattis Neiling, Steffen Jurk, Hans-J. Lenz, and Felix Naumann. 2003. Object identification quality. 
In Proceedings of the International Workshop on Data Quality in Cooperative Information Systems (DQCIS)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1017\/S1930297500002205","article-title":"Running experiments on Amazon Mechanical Turk","volume":"5","author":"Paolacci Gabriele","year":"2010","unstructured":"Gabriele Paolacci , Jesse Chandler , and Panagiotis G. Ipeirotis . 2010 . Running experiments on Amazon Mechanical Turk . Judgment Decision Making 5 , 5 (2010), 411 -- 419 . Gabriele Paolacci, Jesse Chandler, and Panagiotis G. Ipeirotis. 2010. Running experiments on Amazon Mechanical Turk. Judgment Decision Making 5, 5 (2010), 411--419.","journal-title":"Judgment Decision Making"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/5326.983933"},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the VLDB Workshop on Efficiency and Effectiveness of XML Tools and Techniques (EEXTT).","author":"Rahm Erhard","year":"2002","unstructured":"Erhard Rahm and Timo B\u00f6hme . 2002 . XMach-1: A multi-user benchmark for XML data management . In Proceedings of the VLDB Workshop on Efficiency and Effectiveness of XML Tools and Techniques (EEXTT). Erhard Rahm and Timo B\u00f6hme. 2002. XMach-1: A multi-user benchmark for XML data management. In Proceedings of the VLDB Workshop on Efficiency and Effectiveness of XML Tools and Techniques (EEXTT)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0219720010004562"},{"volume-title":"Readings in Information Retrieval","author":"Salton Gerard","key":"e_1_2_1_31_1","unstructured":"Gerard Salton and Chris Buckley . 1997. Improving retrieval performance by relevance feedback . In Readings in Information Retrieval , Morgan Kaufmann Publishers Inc . Gerard Salton and Chris Buckley. 1997. Improving retrieval performance by relevance feedback. 
In Readings in Information Retrieval, Morgan Kaufmann Publishers Inc."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/775047.775087"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/1287369.1287455"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/130385.130417"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1401890.1401965"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the Workshop on Computer-Aided Language Processing (CALP).","author":"Simov Kiril","year":"2007","unstructured":"Kiril Simov , Petya Osenova , Alexander Simov , Anelia Tincheva , and Borislav Kirilov . 2007 . A system for a semi-automatic ontology annotation . In Proceedings of the Workshop on Computer-Aided Language Processing (CALP). Kiril Simov, Petya Osenova, Alexander Simov, Anelia Tincheva, and Borislav Kirilov. 2007. A system for a semi-automatic ontology annotation. In Proceedings of the Workshop on Computer-Aided Language Processing (CALP)."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/3121445.3121478"},{"volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 254--263","author":"Snow Rion","key":"e_1_2_1_38_1","unstructured":"Rion Snow , Brendan O\u2019Connor , Daniel Jurafsky , and Andrew Y. Ng . 2008. Cheap and fast---but is it good?: Evaluating non-expert annotations for natural language tasks . In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 254--263 . Rion Snow, Brendan O\u2019Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast---but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 
254--263."},{"key":"e_1_2_1_39_1","volume-title":"Proceedings of the SIGMOD International Workshop on Information Quality for Information Systems (IQIS).","author":"Weis Melanie","year":"2006","unstructured":"Melanie Weis , Felix Naumann , and Franziska Brosy . 2006 . A duplicate detection benchmark for XML (and relational) data . In Proceedings of the SIGMOD International Workshop on Information Quality for Information Systems (IQIS). Melanie Weis, Felix Naumann, and Franziska Brosy. 2006. A duplicate detection benchmark for XML (and relational) data. In Proceedings of the SIGMOD International Workshop on Information Quality for Information Systems (IQIS)."},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the Neural Information Processing Systems Conference (NIPS).","author":"Welinder Peter","year":"2010","unstructured":"Peter Welinder , Steve Branson , Serge Belongie , and Pietro Perona . 2010 . The multidimensional wisdom of crowds . In Proceedings of the Neural Information Processing Systems Conference (NIPS). Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. 2010. The multidimensional wisdom of crowds. 
In Proceedings of the Neural Information Processing Systems Conference (NIPS)."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1559845.1559870"}],"container-title":["Journal of Data and Information Quality"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2629687","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2629687","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T06:13:30Z","timestamp":1750227210000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2629687"}},"subtitle":["An annealing standard to evaluate duplicate detection results"],"short-title":[],"issued":{"date-parts":[[2014,9,4]]},"references-count":41,"journal-issue":{"issue":"1-2","published-print":{"date-parts":[[2014,9,4]]}},"alternative-id":["10.1145\/2629687"],"URL":"https:\/\/doi.org\/10.1145\/2629687","relation":{},"ISSN":["1936-1955","1936-1963"],"issn-type":[{"type":"print","value":"1936-1955"},{"type":"electronic","value":"1936-1963"}],"subject":[],"published":{"date-parts":[[2014,9,4]]},"assertion":[{"value":"2013-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-09-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}