{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T13:28:30Z","timestamp":1772198910856,"version":"3.50.1"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2005,10,11]],"date-time":"2005-10-11T00:00:00Z","timestamp":1128988800000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/www.springer.com\/tdm"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Med Inform Decis Mak"],"published-print":{"date-parts":[[2005,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making, by computing a value reflecting identity proximity.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods<\/jats:title><jats:p>The proposed method is in three steps. The first step is to standardise and to index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match similar pair records, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. 
And the third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>The batch analysis of 300,000 \"supposedly\" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e.: real-time) proximity detection when inserting a new identity.<\/jats:p><\/jats:sec>","DOI":"10.1186\/1472-6947-5-32","type":"journal-article","created":{"date-parts":[[2005,10,11]],"date-time":"2005-10-11T18:14:25Z","timestamp":1129054465000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":40,"title":["Medical record linkage in health information systems by approximate string matching and clustering"],"prefix":"10.1186","volume":"5","author":[{"given":"Erik A","family":"Sauleau","sequence":"first","affiliation":[]},{"given":"Jean-Philippe","family":"Paumier","sequence":"additional","affiliation":[]},{"given":"Antoine","family":"Buemi","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2005,10,11]]},"reference":[{"key":"83_CR1","doi-asserted-by":"publisher","first-page":"697","DOI":"10.1080\/01621459.1995.10476563","volume":"90","author":"TR Belin","year":"1995","unstructured":"Belin TR, Rubin DB: A method for calibrating false match rates in record linkage. Journal of the American Statistical Association. 
1995, 90: 697-707.","journal-title":"Journal of the American Statistical Association"},{"key":"83_CR2","doi-asserted-by":"publisher","first-page":"563","DOI":"10.1145\/368996.369026","volume":"5","author":"HB Newcombe","year":"1962","unstructured":"Newcombe HB, Kennedy JM: Record linkage: making maximum use of the discriminating power of identifying information. Communications of the ACM. 1962, 5: 563-566. 10.1145\/368996.369026.","journal-title":"Communications of the ACM"},{"key":"83_CR3","doi-asserted-by":"publisher","first-page":"52","DOI":"10.1007\/BF01074755","volume":"4","author":"T Vintsyuk","year":"1968","unstructured":"Vintsyuk T: Speech discrimination by dynamic programming. Cybernetics. 1968, 4: 52-58. 10.1007\/BF01074755.","journal-title":"Cybernetics"},{"key":"83_CR4","doi-asserted-by":"publisher","first-page":"359","DOI":"10.1016\/0196-6774(80)90016-4","volume":"1","author":"P Sellers","year":"1980","unstructured":"Sellers P: The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms. 1980, 1: 359-373. 10.1016\/0196-6774(80)90016-4.","journal-title":"Journal of Algorithms"},{"key":"83_CR5","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781316135228","volume-title":"Flexible pattern matching in strings","author":"G Navarro","year":"2002","unstructured":"Navarro G, Raffinot M: Flexible pattern matching in strings. 2002, Cambridge, Cambridge University Press"},{"key":"83_CR6","first-page":"31","volume":"33","author":"G Navarro","year":"2001","unstructured":"Navarro G: A guided tour to approximate string matching. Association of Computing Machinery Computing Surveys. 2001, 33: 31-88.","journal-title":"Association of Computing Machinery Computing Surveys"},{"key":"83_CR7","first-page":"707","volume":"10","author":"V Levenhstein","year":"1966","unstructured":"Levenhstein V: Binary code capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. 
1966, 10: 707-710.","journal-title":"Soviet Physics Doklady"},{"key":"83_CR8","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","volume":"14","author":"TF Smith","year":"1981","unstructured":"Smith TF, Waterman MS: Identification of common molecular subsequences. Journal of Molecular Biology. 1981, 14: 195-197. 10.1016\/0022-2836(81)90087-5.","journal-title":"Journal of Molecular Biology"},{"key":"83_CR9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1111\/j.2517-6161.1977.tb01600.x","volume":"39","author":"AP Dempster","year":"1977","unstructured":"Dempster AP, Laird N, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, series B. 1977, 39: 1-38.","journal-title":"Journal of the Royal Statistical Society, series B"},{"key":"83_CR10","volume-title":"Statistical research report series N\u00b005","author":"WE Winkler","year":"2000","unstructured":"Winkler WE: Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Statistical research report series N\u00b005. 2000, US bureau of census, Washington DC"},{"key":"83_CR11","volume-title":"Handbook of record linkage: methods for health and statistical studies, administration and business","author":"HB Newcombe","year":"1988","unstructured":"Newcombe HB: Handbook of record linkage: methods for health and statistical studies, administration and business. 1988, Oxford, Oxford University Press"},{"key":"83_CR12","doi-asserted-by":"publisher","first-page":"1183","DOI":"10.1080\/01621459.1969.10501049","volume":"64","author":"I Fellegi","year":"1969","unstructured":"Fellegi I, Sunter A: A theory for record linkage. Journal of the American Statistical Association. 
1969, 64: 1183-1210.","journal-title":"Journal of the American Statistical Association"},{"key":"83_CR13","doi-asserted-by":"publisher","first-page":"414","DOI":"10.1080\/01621459.1989.10478785","volume":"84","author":"MA Jaro","year":"1989","unstructured":"Jaro MA: Advances in record linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association. 1989, 84: 414-420.","journal-title":"Journal of the American Statistical Association"},{"key":"83_CR14","doi-asserted-by":"publisher","first-page":"491","DOI":"10.1002\/sim.4780140510","volume":"14","author":"MA Jaro","year":"1995","unstructured":"Jaro MA: Probabilistic linkage of large public health data files. Statistics in Medicine. 1995, 14: 491-498.","journal-title":"Statistics in Medicine"},{"key":"83_CR15","volume-title":"Clustering algorithms","author":"J Hartigan","year":"1975","unstructured":"Hartigan J: Clustering algorithms. 1975, New York, John Wiley and Sons"},{"key":"83_CR16","volume-title":"Cluster analysis","author":"B Everitt","year":"1993","unstructured":"Everitt B: Cluster analysis. 1993, London, Edward Arnold, 3","edition":"3"},{"key":"83_CR17","first-page":"149","volume":"42","author":"P Eades","year":"1984","unstructured":"Eades P: A heuristic for graph drawing. Congressus Numerantium. 1984, 42: 149-160.","journal-title":"Congressus Numerantium"},{"key":"83_CR18","doi-asserted-by":"publisher","first-page":"1129","DOI":"10.1002\/spe.4380211102","volume":"21","author":"T Fruchterman","year":"1991","unstructured":"Fruchterman T, Reingold E: Graph drawing by force-directed placement. Software-Practice and Experience. 1991, 21: 1129-1164.","journal-title":"Software-Practice and Experience"},{"key":"83_CR19","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1016\/0020-0190(89)90102-6","volume":"31","author":"T Kamada","year":"1989","unstructured":"Kamada T, Kawai S: An algorithm for drawing general undirected graphs. 
Information Processing Letters. 1989, 31: 7-15. 10.1016\/0020-0190(89)90102-6.","journal-title":"Information Processing Letters"},{"key":"83_CR20","volume-title":"SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tuscon","author":"AE Monge","year":"1997","unstructured":"Monge AE, Elkan CP: An efficient domain-independent algorithm for detecting approximately duplicate database records. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tuscon. 1997"},{"key":"83_CR21","volume-title":"Identifying and merging related bibliographic records","author":"JA Hylthon","year":"1996","unstructured":"Hylthon JA: Identifying and merging related bibliographic records. 1996, Master of Engineering in Electrical and Computer Sciences. MIT"},{"key":"83_CR22","first-page":"269","volume-title":"Current Topics in Computational Molecular Biology","author":"R Sharan","year":"2002","unstructured":"Sharan R, Shamir R: Algorithmic approaches to clustering gene expression data. Current Topics in Computational Molecular Biology. Edited by: Jiang T. 2002, The MIT Press, 269-300."},{"key":"83_CR23","volume-title":"ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage and Object consolidation","author":"A Baxter","year":"2003","unstructured":"Baxter A, Christen P, Churches T: A comparison of fast blocking methods for record linkage. ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage and Object consolidation. 2003, Washington DC"},{"key":"83_CR24","volume-title":"ACM SIGMOD International Conference on Management of Data, San Jose","author":"M Hernandez","year":"1995","unstructured":"Hernandez M, Stolfo S: The merge\/purge problem for large databases. ACM SIGMOD International Conference on Management of Data, San Jose. 
1995"},{"key":"83_CR25","volume-title":"Information Retrieval: Algorithms and Data Structures","author":"R Baeza-Yates","year":"1992","unstructured":"Baeza-Yates R, Frakes WB: Information Retrieval: Algorithms and Data Structures. 1992, Englewood Cliffs, Prentice-Hall"},{"key":"83_CR26","volume-title":"Sixth International Conference on Knowledge Discovery and Data Mining, Boston","author":"AK McCallum","year":"2000","unstructured":"McCallum AK, Nigam K, Ungar LH: Efficient clustering of high-dimensional datasets with application to reference matching. Sixth International Conference on Knowledge Discovery and Data Mining, Boston. 2000"},{"key":"83_CR27","volume-title":"ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage and Object consolidation","author":"W Cohen","year":"2003","unstructured":"Cohen W, Ravikumar P, Fienberg S: A comparison of string metrics for matching names and records. ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage and Object consolidation. 2003, Washington DC"},{"key":"83_CR28","volume-title":"Approximate string comparison and its effect on an advanced record linkage system","author":"EH Porter","year":"1997","unstructured":"Porter EH, Winkler WE: Approximate string comparison and its effect on an advanced record linkage system. 1997, Washington DC, US Census Bureau"},{"key":"83_CR29","doi-asserted-by":"publisher","first-page":"175","DOI":"10.1016\/S0020-0190(00)00142-3","volume":"76","author":"E Hartuv","year":"2000","unstructured":"Hartuv E, Shamir R: A clustering algorithm based on graph connectivity. Information Processing Letters. 2000, 76: 175-181. 10.1016\/S0020-0190(00)00142-3.","journal-title":"Information Processing Letters"},{"key":"83_CR30","volume-title":"Eighth International Conference on Intelligent Systems for Molecular Biology, La Jolla","author":"R Sharan","year":"2000","unstructured":"Sharan R, Shamir R: CLICK: a clustering algorithm with application to gene expression analysis. 
Eighth International Conference on Intelligent Systems for Molecular Biology, La Jolla. 2000"},{"key":"83_CR31","doi-asserted-by":"publisher","first-page":"281","DOI":"10.1089\/106652799318274","volume":"6","author":"A Ben-Dor","year":"1999","unstructured":"Ben-Dor A, Shamir R, Yakhini Z: Clustering gene expression patterns. Journal of Computational Biology. 1999, 6: 281-297. 10.1089\/106652799318274.","journal-title":"Journal of Computational Biology"},{"key":"83_CR32","volume-title":"IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison","author":"M Pavan","year":"2003","unstructured":"Pavan M, Pellilo M: A new graph-theoretic approach to clustering and segmentation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison. 2003"},{"key":"83_CR33","first-page":"93","volume":"12","author":"H Kawaji","year":"2001","unstructured":"Kawaji H, Yamaguchi Y, Matsuda H, Hashimoto A: A graph-based clustering method for a large set of sequences using a graph partitioning algorithm. Genome Informatics. 2001, 12: 93-102.","journal-title":"Genome Informatics"},{"key":"83_CR34","volume-title":"Statistical research report series N\u00b007","author":"WE Yancey","year":"2000","unstructured":"Yancey WE: Frequency-dependent probability measures for record linkage. Statistical research report series N\u00b007. 2000, US bureau of census, Washington DC"},{"key":"83_CR35","volume-title":"Proceedings of the Survey Research Methods Section","author":"WE Winkler","year":"2000","unstructured":"Winkler WE: Machine learning, information retrieval and record linkage. Proceedings of the Survey Research Methods Section. 2000, American Statistical Association"},{"key":"83_CR36","volume-title":"Sixth World Meeting of the International Society for Bayesian Analysis, Hersonissos, Greece","author":"M Fortini","year":"2000","unstructured":"Fortini M, Liseo B, Nuccitelli A, Scanu M: On bayesian record linkage. 
Sixth World Meeting of the International Society for Bayesian Analysis, Hersonissos, Greece. 2000"},{"key":"83_CR37","unstructured":"Kilss B, Alvey W, Eds: Record linkage techniques \u2013 1985. Proceedings of the Workshop on Exact Matching Methodologies. 1985, Arlington, US Internal Revenue Service"},{"key":"83_CR38","first-page":"14","volume":"23","author":"AE Monge","year":"2000","unstructured":"Monge AE: Matching algorithm within a duplicate detection system. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2000, 23: 14-20.","journal-title":"Bulletin of the IEEE Computer Society Technical Committee on Data Engineering"},{"key":"83_CR39","doi-asserted-by":"crossref","first-page":"271","DOI":"10.1055\/s-0038-1634527","volume":"37","author":"C Quantin","year":"1998","unstructured":"Quantin C, Bouzelat H, Allaert FA, Benhamiche AM, Faivre J, Dusserre L: Automatic record hash coding and linkage for epidemiological follow-up data confidentiality. Methods on Information in Medicine. 
1998, 37: 271-277.","journal-title":"Methods on Information in Medicine"}],"container-title":["BMC Medical Informatics and Decision Making"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/1472-6947-5-32.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/1472-6947-5-32\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/1472-6947-5-32","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1472-6947-5-32.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,4]],"date-time":"2025-01-04T22:08:05Z","timestamp":1736028485000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcmedinformdecismak.biomedcentral.com\/articles\/10.1186\/1472-6947-5-32"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2005,10,11]]},"references-count":39,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2005,12]]}},"alternative-id":["83"],"URL":"https:\/\/doi.org\/10.1186\/1472-6947-5-32","relation":{},"ISSN":["1472-6947"],"issn-type":[{"value":"1472-6947","type":"electronic"}],"subject":[],"published":{"date-parts":[[2005,10,11]]},"assertion":[{"value":"6 May 2005","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 October 2005","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 October 2005","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"32"}}