{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T09:33:10Z","timestamp":1772184790568,"version":"3.50.1"},"reference-count":21,"publisher":"Oxford University Press (OUP)","issue":"11","license":[{"start":{"date-parts":[[2024,9,20]],"date-time":"2024-09-20T00:00:00Z","timestamp":1726790400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"DOI":"10.13039\/100006093","name":"Patient-Centered Outcomes Research Institute","doi-asserted-by":"publisher","award":["ME-2018C1-11287"],"award-info":[{"award-number":["ME-2018C1-11287"]}],"id":[{"id":"10.13039\/100006093","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,11,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objectives<\/jats:title>\n                  <jats:p>Accurate record linkage (RL) enables consolidation and de-duplication of data from disparate datasets, resulting in more comprehensive and complete patient data. However, conducting RL with low quality or unfit data can waste institutional resources on poor linkage results. We aim to evaluate data linkability to enhance the effectiveness of record linkage.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>We describe a systematic approach using data fitness (\u201clinkability\u201d) measures, defined as metrics that characterize the availability, discriminatory power, and distribution of potential variables for RL. 
We used the isolation forest algorithm to detect abnormal linkability values from 188 sites in Indiana and Colorado, and manually reviewed the data to understand the cause of anomalies.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Result<\/jats:title>\n                  <jats:p>We calculated 10 linkability metrics for 11 potential linkage variables (LVs) across 188 sites for a total of 20\u00a0680 linkability metrics. Potential LVs such as first name, last name, date of birth, and sex have low missing data rates, while Social Security Number varies widely in completeness among all sites. We investigated anomalous linkability values to identify the cause of many records having identical values in certain LVs, issues with placeholder values disguising data missingness, and orphan records.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Discussion<\/jats:title>\n                  <jats:p>The fitness of a variable for RL is determined by its availability and its discriminatory power to uniquely identify individuals. 
These results highlight the need for awareness of placeholder values, which inform the selection of variables and methods to optimize RL performance.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Conclusion<\/jats:title>\n                  <jats:p>Evaluating linkability measures using the isolation forest algorithm to highlight anomalous findings can help identify fitness-for-use issues that must be addressed before initiating the RL process to ensure high-quality linkage outcomes.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocae248","type":"journal-article","created":{"date-parts":[[2024,9,20]],"date-time":"2024-09-20T08:14:34Z","timestamp":1726820074000},"page":"2651-2659","source":"Crossref","is-referenced-by-count":1,"title":["Linkability measures to assess the data characteristics for record linkage"],"prefix":"10.1093","volume":"31","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6787-1407","authenticated-orcid":false,"given":"Toan C","family":"Ong","sequence":"first","affiliation":[{"name":"Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus , Aurora, CO 80045,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andrew","family":"Hill","sequence":"additional","affiliation":[{"name":"Colorado School of Public Health, University of Colorado Denver , Denver, CO 80045,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4786-6875","authenticated-orcid":false,"given":"Michael G","family":"Kahn","sequence":"additional","affiliation":[{"name":"University of Colorado Anschutz Medical Campus , Aurora, CO 80045,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lauren R","family":"Lembcke","sequence":"additional","affiliation":[{"name":"Regenstrief Institute , Indianapolis, IN 45202,","place":["United 
States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lisa M","family":"Schilling","sequence":"additional","affiliation":[{"name":"Division of General Internal Medicine, University of Colorado Anschutz Medical Campus , Aurora, CO 80045,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8093-6639","authenticated-orcid":false,"given":"Shaun J","family":"Grannis","sequence":"additional","affiliation":[{"name":"Regenstrief Institute , Indianapolis, IN 45202,","place":["United States"]},{"name":"Department of Family Medicine, Indiana University , Indianapolis, IN 46202,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2024,9,20]]},"reference":[{"key":"2025030506163388300_ocae248-B1","author":"Herzog","year":"2007"},{"key":"2025030506163388300_ocae248-B2","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1007\/978-3-642-31164-2_2","volume-title":"Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection","author":"Christen","year":"2012"},{"key":"2025030506163388300_ocae248-B3","doi-asserted-by":"publisher","first-page":"505","DOI":"10.1093\/jamia\/ocz232","article-title":"A hybrid approach to record linkage using a combination of deterministic and probabilistic methodology","volume":"27","author":"Ong","year":"2020","journal-title":"J Am Med Inform Assoc"},{"key":"2025030506163388300_ocae248-B4","doi-asserted-by":"publisher","first-page":"419","DOI":"10.1016\/j.asej.2019.08.009","article-title":"Cost-aware load balancing for multilingual record linkage using MapReduce","volume":"11","author":"Medhat","year":"2020","journal-title":"Ain Shams Eng J"},{"key":"2025030506163388300_ocae248-B5","doi-asserted-by":"publisher","first-page":"539","DOI":"10.1007\/s11222-017-9746-6","article-title":"A note on using the F-measure for evaluating record linkage 
algorithms","volume":"28","author":"Hand","year":"2018","journal-title":"Stat Comput"},{"key":"2025030506163388300_ocae248-B6","doi-asserted-by":"publisher","first-page":"276","DOI":"10.3414\/ME15-01-0152","article-title":"A simple sampling method for estimating the accuracy of large scale record linkage projects","volume":"55","author":"Boyd","year":"2016","journal-title":"Methods Inf Med"},{"key":"2025030506163388300_ocae248-B7","first-page":"2100","author":"Qahtan","year":"2018"},{"key":"2025030506163388300_ocae248-B8","first-page":"413","author":"Liu","year":"2008"},{"key":"2025030506163388300_ocae248-B9","doi-asserted-by":"publisher","first-page":"379","DOI":"10.1002\/j.1538-7305.1948.tb01338.x","article-title":"A mathematical theory of communication","volume":"27","author":"Shannon","year":"1948","journal-title":"Bell Syst Tech J"},{"key":"2025030506163388300_ocae248-B10","doi-asserted-by":"publisher","first-page":"145","DOI":"10.1109\/18.61115","article-title":"Divergence measures based on the Shannon entropy","volume":"37","author":"Lin","year":"1991","journal-title":"IEEE Trans Inform Theory"},{"key":"2025030506163388300_ocae248-B11","author":"R\u00e9nyi","year":"1961"},{"key":"2025030506163388300_ocae248-B12","doi-asserted-by":"publisher","first-page":"1367","DOI":"10.1214\/009053604000000553","article-title":"Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory","volume":"32","author":"Gr\u00fcnwald","year":"2004","journal-title":"Ann Stat"},{"key":"2025030506163388300_ocae248-B13","doi-asserted-by":"publisher","first-page":"1479","DOI":"10.1109\/TKDE.2019.2947676","article-title":"Extended isolation forest","volume":"33","author":"Hariri","year":"2021","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"2025030506163388300_ocae248-B14","doi-asserted-by":"publisher","first-page":"7225","DOI":"10.1039\/C6AY01574C","article-title":"Representative subset selection and outlier detection via isolation 
forest","volume":"8","author":"Chen","year":"2016","journal-title":"Anal Methods"},{"key":"2025030506163388300_ocae248-B15","doi-asserted-by":"publisher","first-page":"407","DOI":"10.1016\/j.ijmedinf.2006.09.004","article-title":"An \u2018Honest Broker\u2019 mechanism to maintain privacy for patient care and academic medical research","volume":"76","author":"Boyd","year":"2007","journal-title":"Int J Med Inform"},{"key":"2025030506163388300_ocae248-B16","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1007\/978-3-319-20810-7_7","volume-title":"Data and Applications Security and Privacy XXIX","author":"Lazrig","year":"2015"},{"key":"2025030506163388300_ocae248-B17","doi-asserted-by":"publisher","first-page":"322","DOI":"10.4338\/ACI-2016-11-RA-0196","article-title":"The building blocks of interoperability. A multisite analysis of patient demographic attributes available for matching","volume":"8","author":"Culbertson","year":"2017","journal-title":"Appl Clin Inform"},{"key":"2025030506163388300_ocae248-B18","doi-asserted-by":"crossref","first-page":"1431","DOI":"10.1590\/S0102-311X2010000700022","article-title":"Accuracy of a probabilistic record linkage strategy applied to identify deaths among cases reported to the Brazilian AIDS surveillance database","volume":"26","author":"Fonseca","year":"2010","journal-title":"Cad Saude Publica"},{"key":"2025030506163388300_ocae248-B19","first-page":"31","article-title":"The discrimination power of dependency structures in record linkage","volume":"19","author":"Thibaudeau","year":"1993","journal-title":"Surv Methodol"},{"key":"2025030506163388300_ocae248-B20","doi-asserted-by":"publisher","first-page":"83","DOI":"10.1186\/s12911-017-0478-5","article-title":"Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets","volume":"17","author":"Brown","year":"2017","journal-title":"BMC Med Inform Decis 
Mak"},{"key":"2025030506163388300_ocae248-B21","doi-asserted-by":"publisher","first-page":"85","DOI":"10.1186\/1472-6947-14-85","article-title":"Optimal strategy for linkage of datasets containing a statistical linkage key and datasets with full personal identifiers","volume":"14","author":"Taylor","year":"2014","journal-title":"BMC Med Inform Decis Mak"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/31\/11\/2651\/59813694\/ocae248.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/31\/11\/2651\/59813694\/ocae248.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,5]],"date-time":"2025-03-05T06:16:46Z","timestamp":1741155406000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/31\/11\/2651\/7762307"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,20]]},"references-count":21,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2024,9,20]]},"published-print":{"date-parts":[[2024,11,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocae248","relation":{},"ISSN":["1067-5027","1527-974X"],"issn-type":[{"value":"1067-5027","type":"print"},{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,11]]},"published":{"date-parts":[[2024,9,20]]}}}