{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2022,11,28]],"date-time":"2022-11-28T05:43:05Z","timestamp":1669614185590},"reference-count":25,"publisher":"IGI Global","issue":"4","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2013,10,1]]},"abstract":"<p>Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster a so-called proximity graph is computed, and the cluster border is detected based on the proximity graph. However, the computation of the proximity graph is expensive and the state-of-the-art GPC algorithms only approximate the proximity graph using a sampling technique. Further, the quality of GPC clusters has never been compared to standard clustering techniques like k-means, density-based, or hierarchical clustering. In this article the authors propose two efficient algorithms, PG-DS and PG-SM, for the exact computation of proximity graphs. The authors experimentally show that their solutions are faster even if the sampling-based algorithms use very small sample sizes. The authors provide a thorough experimental evaluation of GPC and conclude that it is very efficient and shows good clustering quality in comparison to the standard techniques. 
These results open a new perspective on string clustering in settings where no knowledge about the input data is available.<\/p>","DOI":"10.4018\/ijkbo.2013100105","type":"journal-article","created":{"date-parts":[[2014,2,12]],"date-time":"2014-02-12T14:33:26Z","timestamp":1392215606000},"page":"84-104","source":"Crossref","is-referenced-by-count":3,"title":["Clustering with Proximity Graphs"],"prefix":"10.4018","volume":"3","author":[{"given":"Michail","family":"Kazimianec","sequence":"first","affiliation":[{"name":"Faculty of Economics, Vilnius University, Vilnius, Lithuania"}]},{"given":"Nikolaus","family":"Augsten","sequence":"additional","affiliation":[{"name":"Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy"}]}],"member":"2432","reference":[{"key":"ijkbo.2013100105-0","doi-asserted-by":"crossref","unstructured":"Augsten, N., B\u00f6hlen, M. H., Dyreson, C. E., & Gamper, J. (2008). Approximate joins for data-centric XML. In Proceedings of the 24th International Conference on Data Engineering (ICDE), (pp. 814\u2013823).","DOI":"10.1109\/ICDE.2008.4497490"},{"key":"ijkbo.2013100105-1","doi-asserted-by":"publisher","DOI":"10.1145\/1670243.1670247"},{"key":"ijkbo.2013100105-2","doi-asserted-by":"crossref","unstructured":"Behm, A., Ji, S., Li, C., & Lu, J. (2009). Space-constrained gram-based indexing for efficient approximate string search. In Proceedings of the 25th International Conference on Data Engineering (ICDE), (pp. 604\u2013615).","DOI":"10.1109\/ICDE.2009.32"},{"key":"ijkbo.2013100105-3","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2003.1234765"},{"key":"ijkbo.2013100105-4","doi-asserted-by":"crossref","unstructured":"Chaudhuri, S., Ganti, V., & Motwani, R. (2005). Robust identification of fuzzy duplicates. In Proceedings of the 21st International Conference on Data Engineering (ICDE), (pp. 
865\u2013876).","DOI":"10.1109\/ICDE.2005.125"},{"key":"ijkbo.2013100105-5","unstructured":"Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD), (pp. 226\u2013231)."},{"key":"ijkbo.2013100105-6","doi-asserted-by":"crossref","unstructured":"Flanagan, J. A., M\u00e4ntyjarvi, J., & Himberg, J. (2002). Unsupervised clustering of symbol strings and context recognition. In Proceedings of the International Conference on Data Mining (ICDM) (pp. 171-178).","DOI":"10.1109\/ICDM.2002.1183900"},{"key":"ijkbo.2013100105-7","unstructured":"Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., & Srivastava, D. (2001). Approximate string joins in a database (almost) for free. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), (pp. 491\u2013500)."},{"key":"ijkbo.2013100105-8","author":"J.Han","year":"2006","journal-title":"Data mining: Concepts and techniques"},{"key":"ijkbo.2013100105-9","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2003.1232265"},{"key":"ijkbo.2013100105-10","author":"L.Kaufman","year":"1990","journal-title":"Finding groups in data - An introduction to cluster analysis"},{"key":"ijkbo.2013100105-11","doi-asserted-by":"publisher","DOI":"10.1145\/146370.146380"},{"key":"ijkbo.2013100105-12","doi-asserted-by":"publisher","DOI":"10.1016\/0196-6774(89)90010-2"},{"key":"ijkbo.2013100105-13","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions and reversals.","volume":"10","author":"V.Levenshtein","year":"1966","journal-title":"Soviet Physics, Doklady"},{"key":"ijkbo.2013100105-14","doi-asserted-by":"crossref","unstructured":"Li, C., Lu, J., & Lu, Y. (2008). Efficient merging and filtering algorithms for approximate string searches. 
In Proceedings of the 24th International Conference on Data Engineering (ICDE), (pp. 257\u2013266).","DOI":"10.1109\/ICDE.2008.4497434"},{"key":"ijkbo.2013100105-15","unstructured":"Li, C., Wang, B., & Yang, X. (2007). Vgram: Improving performance of approximate queries on string collections using variable-length grams. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), (pp. 303\u2013314)."},{"key":"ijkbo.2013100105-16","unstructured":"MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, (pp. 281\u2013297)."},{"key":"ijkbo.2013100105-17","unstructured":"Mazeika, A., & B\u00f6hlen, M. H. (2006). Cleansing databases of misspelled proper nouns. In Proceedings of the 1st International VLDB Workshop on Clean Databases, (CleanDB)."},{"key":"ijkbo.2013100105-18","doi-asserted-by":"publisher","DOI":"10.1145\/375360.375365"},{"issue":"4","key":"ijkbo.2013100105-19","first-page":"28","article-title":"Using q-grams in a dbms for approximate string processing.","volume":"24","author":"L.Pietarinen","year":"2001","journal-title":"A Quarterly Bulletin of the Computer Society of the IEEE Technical Committee on Data Engineering"},{"key":"ijkbo.2013100105-20","author":"C. J. V.Rijsbergen","year":"1979","journal-title":"Information retrieval"},{"key":"ijkbo.2013100105-21","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2004.08.004"},{"key":"ijkbo.2013100105-22","doi-asserted-by":"publisher","DOI":"10.1016\/0304-3975(92)90143-4"},{"key":"ijkbo.2013100105-23","doi-asserted-by":"publisher","DOI":"10.1016\/0001-8708(76)90202-4"},{"key":"ijkbo.2013100105-24","doi-asserted-by":"crossref","unstructured":"Xiao, C., Wang, W., Lin, X., & Yu, J. X. (2008). Efficient similarity joins for near duplicate detection. In Proceeding of the 17th International Conference on World Wide Web (WWW), (pp. 
131\u2013140).","DOI":"10.1145\/1367497.1367516"}],"container-title":["International Journal of Knowledge-Based Organizations"],"original-title":[],"language":"ng","link":[{"URL":"https:\/\/www.igi-global.com\/viewtitle.aspx?TitleId=101195","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,6,1]],"date-time":"2022-06-01T14:52:56Z","timestamp":1654095176000},"score":1,"resource":{"primary":{"URL":"https:\/\/services.igi-global.com\/resolvedoi\/resolve.aspx?doi=10.4018\/ijkbo.2013100105"}},"subtitle":["Exact and Efficient Algorithms"],"short-title":[],"issued":{"date-parts":[[2013,10,1]]},"references-count":25,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2013,10]]}},"URL":"https:\/\/doi.org\/10.4018\/ijkbo.2013100105","relation":{},"ISSN":["2155-6393","2155-6407"],"issn-type":[{"value":"2155-6393","type":"print"},{"value":"2155-6407","type":"electronic"}],"subject":[],"published":{"date-parts":[[2013,10,1]]}}}