{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,25]],"date-time":"2025-11-25T06:50:17Z","timestamp":1764053417032,"version":"3.41.2"},"reference-count":31,"publisher":"Emerald","issue":"2","license":[{"start":{"date-parts":[[2016,4,4]],"date-time":"2016-04-04T00:00:00Z","timestamp":1459728000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2016,4,4]]},"abstract":"<jats:sec><jats:title content-type=\"abstract-heading\">Purpose<\/jats:title><jats:p>\u2013 The purpose of this paper is to describe a large-scale algorithm for generating a catalogue of scientific publication records (citations) from a crowd-sourced data, demonstrate how to learn an optimal combination of distance metrics for duplicate detection and introduce a parallel duplicate clustering algorithm.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Design\/methodology\/approach<\/jats:title><jats:p>\u2013 The authors developed the algorithm and compared it with state-of-the art systems tackling the same problem. The authors used benchmark data sets (3k data points) to test the effectiveness of our algorithm and a real-life data ( &gt; 90 million) to test the efficiency and scalability of our algorithm.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Findings<\/jats:title><jats:p>\u2013 The authors show that duplicate detection can be improved by an additional step we call duplicate clustering. The authors also show how to improve the efficiency of map\/reduce similarity calculation algorithm by introducing a sampling step. 
Finally, the authors find that the system is comparable to the state-of-the art systems for duplicate detection, and that it can scale to deal with hundreds of million data points.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Research limitations\/implications<\/jats:title><jats:p>\u2013 Academic researchers can use this paper to understand some of the issues of transitivity in duplicate detection, and its effects on digital catalogue generations.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Practical implications<\/jats:title><jats:p>\u2013 Industry practitioners can use this paper as a use case study for generating a large-scale real-life catalogue generation system that deals with millions of records in a scalable and efficient way.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-heading\">Originality\/value<\/jats:title><jats:p>\u2013 In contrast to other similarity calculation algorithms developed for m\/r frameworks the authors present a specific variant of similarity calculation that is optimized for duplicate detection of bibliographic records by extending previously proposed e-algorithm based on inverted index creation. In addition, the authors are concerned with more than duplicate detection, and investigate how to group detected duplicates. The authors develop distinct algorithms for duplicate detection and duplicate clustering and use the canopy clustering idea for multi-pass clustering. 
The work extends the current state-of-the-art by including the duplicate clustering step and demonstrate new strategies for speeding up m\/r similarity calculations.<\/jats:p><\/jats:sec>","DOI":"10.1108\/prog-02-2015-0021","type":"journal-article","created":{"date-parts":[[2016,3,21]],"date-time":"2016-03-21T10:45:31Z","timestamp":1458557131000},"page":"138-156","source":"Crossref","is-referenced-by-count":2,"title":["De-duplicating a large crowd-sourced catalogue of bibliographic records"],"prefix":"10.1108","volume":"50","author":[{"given":"Ilija","family":"Subasic","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nebojsa","family":"Gvozdenovic","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kris","family":"Jack","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"140","reference":[{"key":"key2020121620252457100_b1","doi-asserted-by":"crossref","unstructured":"Alabduljalil, M.A. , Tang, X. and Yang, T. (2013), \u201cOptimizing parallel algorithms for all pairs similarity search\u201d, Proceedings of the Sixth ACM International Conference on Web Search and Data Mining\u2019, ACM, pp. 203-212.","DOI":"10.1145\/2433396.2433422"},{"key":"key2020121620252457100_b3","doi-asserted-by":"crossref","unstructured":"Bilenko, M. and Mooney, R.J. (2003), \u201cAdaptive duplicate detection using learnable string similarity measures\u201d, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 39-48.","DOI":"10.1145\/956750.956759"},{"key":"key2020121620252457100_b4","unstructured":"Broder, A.Z. (1997), \u201cOn the resemblance and containment of documents\u201d, Compression and Complexity of SEQUENCES 1997: SEQUENCES \u201897\u2019, IEEE Computer Society , Washington, DC, pp. 
21-29, available at: http:\/\/ieeexplore.ieee.org\/lpdocs\/epic03\/wrapper.htm?arnumber=666900 (accessed 23 February 2016)."},{"key":"key2020121620252457100_b5","unstructured":"Broder, A.Z. (2000), \u201cIdentifying and filtering near-duplicate documents\u201d, Proceeding SEQUENCES \u201897 Proceedings of the Compression and Complexity of Sequences 1997, IEEE Computer Society, Washington, DC , p. 21, available at: http:\/\/dl.acm.org\/citation.cfm?id=830043"},{"key":"key2020121620252457100_b6","doi-asserted-by":"crossref","unstructured":"Charikar, M.S. (2002), \u201cSimilarity estimation techniques from rounding algorithms\u201d, technical report, New York, NY, available at: http:\/\/doi.acm.org\/10.1145\/509907.509965 (accessed 23 February 2016).","DOI":"10.1145\/509907.509965"},{"key":"key2020121620252457100_b7","unstructured":"Christen, P. (2008), \u201cFebrl a Freely available record linkage system with a graphical user interface\u201d, Second Australasian Workshop on Health Data and Knowledge Management HDKM 2008, ACS, Las Vegas, pp. 17-25."},{"key":"key2020121620252457100_b8","doi-asserted-by":"crossref","unstructured":"Christen, P. (2012), Data Matching \u2013 Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Data-Centric Systems and Applications , Springer, Berlin.","DOI":"10.1007\/978-3-642-31164-2"},{"key":"key2020121620252457100_b9","unstructured":"Cohen, W.W. , Ravikumar, P. and Fienberg, S.E. (2003), \u201cA comparison of string distance metrics for name-matching tasks\u201d, in Kambhampati, S. and Knoblock, C.A. (Eds), Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03) , American Association for Artificial Intelligence, Acapulco, pp. 73-78."},{"key":"key2020121620252457100_b10","doi-asserted-by":"crossref","unstructured":"Elmagarmid, A. , Ipeirotis, P. and Verykios, V. (2007), \u201cDuplicate record detection: a survey\u201d, IEEE Transactions on Knowledge and Data Engineering , Vol. 
19 No. 1, pp. 1-16, available at: http:\/\/scholar.google.co.uk\/scholar?cluster=15791964806923690987 & hl=en#1","DOI":"10.1109\/TKDE.2007.250581"},{"key":"key2020121620252457100_b11","doi-asserted-by":"crossref","unstructured":"Elsayed, T. , Lin, J. and Oard, D.W. (2008), \u201cPairwise document similarity in large collections with MapReduce\u201d, Stroudsburg, PA, pp. 265-268.","DOI":"10.3115\/1557690.1557767"},{"key":"key2020121620252457100_b12","unstructured":"Hall, M.A. (1999), \u201cCorrelation-based feature selection for machine learning\u201d, PhD thesis, University of Waikato, Hamilton."},{"key":"key2020121620252457100_b13","doi-asserted-by":"crossref","unstructured":"Hammerton, J.A. , Granitzer, M. , Harvey, D. , Hristakeva, M. and Jack, K. (2012), \u201cOn generating large-scale ground truth datasets for the deduplication of bibliographic records\u201d, International Conference on Web Intelligence, Mining and Semantics 2012, 13-15 June, Craiova.","DOI":"10.1145\/2254129.2254153"},{"key":"key2020121620252457100_b14","doi-asserted-by":"crossref","unstructured":"Jaro, M.A. (1995), \u201cProbabilistic linkage of large public health data files\u201d, Statistics in Medicine , Vol. 14 Nos 5-7, pp. 491-498, available at: http:\/\/dx.doi.org\/10.1002\/sim.4780140510","DOI":"10.1002\/sim.4780140510"},{"key":"key2020121620252457100_b15","doi-asserted-by":"crossref","unstructured":"Kolb, L. and Rahm, E. (2012), \u201cParallel entity resolution with Dedoop\u201d, Datenbank-Spektrum , Vol. 13 No. 1, pp. 23-32, available at: www.springerlink.com\/index\/10.1007\/s13222-012-0110-x","DOI":"10.1007\/s13222-012-0110-x"},{"key":"key2020121620252457100_b16","doi-asserted-by":"crossref","unstructured":"K\u00f6pcke, H. , Thor, A. and Rahm, E. (2010), \u201cEvaluation of entity resolution approaches on real-world match problems\u201d, VLDB Endowment , Vol. 3 No. 1, pp. 
484-493.","DOI":"10.14778\/1920841.1920904"},{"key":"key2020121620252457100_b17","unstructured":"Levenshtein, V.I. (1965), \u201cBinary codes capable of correcting deletions, insertions, and reversals\u201d, Doklady Akademii Nauk SSSR , Vol. 163 No. 4, pp. 845-848."},{"key":"key2020121620252457100_b18","doi-asserted-by":"crossref","unstructured":"Lin, J. (2009), \u201cBrute force and indexed approaches to pairwise document similarity comparisons with MapReduce\u201d, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval \u2013 SIGIR \u201809, p. 155, available at: http:\/\/portal.acm.org\/citation.cfm?doid=1571941.1571970 (accessed 23 February 2016).","DOI":"10.1145\/1571941.1571970"},{"key":"key2020121620252457100_b19","doi-asserted-by":"crossref","unstructured":"McCallum, A. , Nigam, K. and Ungar, L.H. (2000), \u201cEfficient clustering of high-dimensional data sets with application to reference matching\u201d, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining \u2013 KDD \u201800\u2019, ACM Press, New York, NY, pp. 169-178, available at: http:\/\/dl.acm.org\/citation.cfm?id=347123 http:\/\/portal.acm.org\/citation.cfm?doid=347090.347123 (accessed 23 February 2016).","DOI":"10.1145\/347090.347123"},{"key":"key2020121620252457100_b20","doi-asserted-by":"crossref","unstructured":"Naumann, F. and Herschel, M. (2010), An Introduction to Duplicate Detection , Vol. 2, synthesis edn, Morgan & Claypool Publishers, Seattle, WA.","DOI":"10.2200\/S00262ED1V01Y201003DTM003"},{"key":"key2020121620252457100_b21","unstructured":"Rogers, D.J. and Tanimoto, T.T. (1960), \u201cA computer program for classifying plants\u201d, Science , Vol. 132 No. 3434, pp. 1115-1118, available at: www.ncbi.nlm.nih.gov\/pubmed\/17790723"},{"key":"key2020121620252457100_b22","unstructured":"Sadowski, C. and Levin, G. 
(2007), \u201cSimHash: hash-based similarity detection\u201d, technical report, Google, available at: http:\/\/scholar.google.com\/scholar?hl=en & btnG=Search & q=intitle:SimHash:+Hash-based+Similarity+Detection#0; http:\/\/simhash.googlecode.com\/svn-history\/r37\/trunk\/paper\/WithBib.pdf (accessed 23 February 2016)."},{"key":"key2020121620252457100_b23","unstructured":"Salton, G. and Buckley, C. (1988), \u201cTerm-weighting approaches in automatic text retrieval\u201d, Information Processing & Management , Vol. 24 No. 5, pp. 513-523, available at: www.sciencedirect.com\/science\/article\/pii\/0306457388900210"},{"key":"key2020121620252457100_b24","doi-asserted-by":"crossref","unstructured":"Sarawagi, S. and Bhamidipaty, A. (2002), \u201cInteractive deduplication using active learning\u201d, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD \u201802\u2019, ACM, New York, NY, pp. 269-278, available at: http:\/\/doi.acm.org\/10.1145\/775047.775087 (accessed 23 February 2016).","DOI":"10.1145\/775047.775087"},{"key":"key2020121620252457100_b25","doi-asserted-by":"crossref","unstructured":"Sood, S. and Loguinov, D. (2011), \u201cProbabilistic near-duplicate detection using SimHash\u201d, Proceedings of the 20th ACM International Conference on Information and Knowledge Management \u2013 CIKM \u201811\u2019, ACM Press, No. 1, New York, NY, pp. 1117-1127, available at: http:\/\/dl.acm.org\/citation.cfm?doid=2063576.2063737 (accessed 23 February 2016).","DOI":"10.1145\/2063576.2063737"},{"key":"key2020121620252457100_b502","unstructured":"Stadwinkel, S. 
(2014), \u201cReal-time data de-duplication using locality-sensitive hashing powered by storm and riak\u201d, technical talk, Buzzwords Berlin 2015, available at: https:\/\/berlinbuzzwords.de\/session\/real-time-data-de-duplication-using-locality-sensitive-hashing-powered-storm-and-riak (accessed 23 February 2016)."},{"key":"key2020121620252457100_b27","unstructured":"Wilcke, N. (2015), \u201cDduP \u2013 towards a deduplication framework utilising Apache Spark\u201d, master thesis, Universit\u00e4t Hamburg, Fachbereich Informatik, Hamburg."},{"key":"key2020121620252457100_b28","unstructured":"Wu, Y. , Zhang, Q. and Huang, X. (2011), \u201cEfficient near-duplicate detection for Q & A forum\u201d, 5th International Joint Conference on Natural Language Processing, AFNLP, pp. 1001-1009."},{"key":"key2020121620252457100_b29","unstructured":"Yang, B. , Kim, H.J. , Shim, J. , Lee, D. and Lee, S. (2015), \u201cFast and scalable vector similarity joins with MapReduce\u201d, Journal of Intelligent Information Systems , pp. 1-25, available at: http:\/\/link.springer.com\/article\/10.1007%2Fs10844-015-0363-6"},{"key":"key2020121620252457100_b30","unstructured":"Zadeh, R.B. and Goel, A. (2012), \u201cDimension independent similarity computation\u201d, Journal of Machine Learning Research , Vol. 1 No. 14, pp. 1605-1626, available at: http:\/\/arxiv.org\/abs\/1206.2082"},{"key":"key2020121620252457100_frd1","unstructured":"Baxter, R. , Christen, P. and Churches, T. (2003), \u201cA comparison of fast blocking methods for record linkage\u201d, KDD 2003 Workshops , Vol. 3, pp. 25-27, available at: http:\/\/citeseerx.ist.psu.edu\/viewdoc\/summary?doi=10.1.1.10.4563 (accessed 23 February 2016)."},{"key":"key2020121620252457100_frd2","doi-asserted-by":"crossref","unstructured":"Tax, D.M.J. , Breukelen, M.V. , Duin, R.P.W. and Kittler, J. (2000), \u201cCombining multiple classifiers by averaging or by multiplying?\u201d, Pattern Recognition , Vol. 33 No. 9, pp. 
1475-1485.","DOI":"10.1016\/S0031-3203(99)00138-7"}],"container-title":["Program"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/www.emeraldinsight.com\/doi\/full-xml\/10.1108\/PROG-02-2015-0021","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/PROG-02-2015-0021\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/PROG-02-2015-0021\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T21:57:23Z","timestamp":1753394243000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/dta\/article\/50\/2\/138-156\/324371"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,4,4]]},"references-count":31,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2016,4,4]]}},"alternative-id":["10.1108\/PROG-02-2015-0021"],"URL":"https:\/\/doi.org\/10.1108\/prog-02-2015-0021","relation":{},"ISSN":["0033-0337"],"issn-type":[{"type":"print","value":"0033-0337"}],"subject":[],"published":{"date-parts":[[2016,4,4]]}}}