{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T16:09:24Z","timestamp":1761581364491},"reference-count":25,"publisher":"Oxford University Press (OUP)","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2016,3,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: The increasing diversity of data available to the biomedical scientist holds promise for better understanding of diseases and discovery of new treatments for patients. In order to provide a complete picture of a biomedical question, data from many different origins needs to be combined into a unified representation. During this data integration process, inevitable errors and ambiguities present in the initial sources compromise the quality of the resulting data warehouse, and greatly diminish the scientific value of the content. Expensive and time-consuming manual curation is then required to improve the quality of the information. However, it becomes increasingly difficult to dedicate and optimize the resources for data integration projects as available repositories are growing both in size and in number every day.<\/jats:p>\n               <jats:p>Results: We present a new generic methodology to identify problematic records, causing what we describe as \u2018data hairball\u2019 structures. The approach is graph-based and relies on two metrics traditionally used in social sciences: the graph density and the betweenness centrality. We evaluate and discuss these measures and show their relevance for flexible, optimized and automated data curation and linkage. 
The methodology focuses on information coherence and correctness to improve the scientific meaningfulness of data integration endeavors, such as knowledge bases and large data warehouses.<\/jats:p>\n               <jats:p>Contact: \u00a0samuel.croset@roche.com<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btv644","type":"journal-article","created":{"date-parts":[[2015,11,11]],"date-time":"2015-11-11T01:08:24Z","timestamp":1447204104000},"page":"918-925","source":"Crossref","is-referenced-by-count":12,"title":["Flexible data integration and curation using a graph-based approach"],"prefix":"10.1093","volume":"32","author":[{"given":"Samuel","family":"Croset","sequence":"first","affiliation":[{"name":"Roche Innovation Center Basel, F. Hoffmann-La Roche AG, CH-4070 Basel, Switzerland"}]},{"given":"Joachim","family":"Rupp","sequence":"additional","affiliation":[{"name":"Roche Innovation Center Basel, F. Hoffmann-La Roche AG, CH-4070 Basel, Switzerland"}]},{"given":"Martin","family":"Romacker","sequence":"additional","affiliation":[{"name":"Roche Innovation Center Basel, F. 
Hoffmann-La Roche AG, CH-4070 Basel, Switzerland"}]}],"member":"286","published-online":{"date-parts":[[2015,11,10]]},"reference":[{"key":"2023020111551202400_btv644-B1","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1007\/978-3-319-11964-9_7","article-title":"Scientific lenses to support multiple views over linked chemistry data","volume-title":"The Semantic Web\u2013ISWC 2014","author":"Batchelor","year":"2014"},{"key":"2023020111551202400_btv644-B2","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1038\/scientificamerican0501-34","article-title":"The semantic web","volume":"284","author":"Berners-Lee","year":"2001","journal-title":"Scientific American"},{"key":"2023020111551202400_btv644-B3","doi-asserted-by":"crossref","DOI":"10.1145\/1376616.1376746","article-title":"Freebase: a collaboratively created graph database for structuring human knowledge","author":"Bollacker","year":"2008"},{"key":"2023020111551202400_btv644-B4","doi-asserted-by":"crossref","first-page":"163","DOI":"10.1080\/0022250X.2001.9990249","article-title":"A faster algorithm for betweenness centrality*","volume":"25","author":"Brandes","year":"2001","journal-title":"Journal of Mathematical Sociology"},{"key":"2023020111551202400_btv644-B5","doi-asserted-by":"crossref","DOI":"10.1145\/2623330.2623623","article-title":"Knowledge vault: A web-scale approach to probabilistic knowledge fusion","author":"Dong","year":"2014"},{"key":"2023020111551202400_btv644-B6","article-title":"Graphstream: A tool for bridging the gap between complex systems and dynamic graphs","author":"Dutot","year":"2007"},{"key":"2023020111551202400_btv644-B7","doi-asserted-by":"crossref","first-page":"1183","DOI":"10.1080\/01621459.1969.10501049","article-title":"A theory for record linkage","volume":"64","author":"Fellegi","year":"1969","journal-title":"J. Am. Stat. 
Assoc."},{"key":"2023020111551202400_btv644-B8","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1023\/A:1009761603038","article-title":"Real-world data is dirty: Data cleansing and the merge\/purge problem","volume":"2","author":"Hern\u00e1ndez","year":"1998","journal-title":"Data Mining Knowled. Discov."},{"key":"2023020111551202400_btv644-B9","doi-asserted-by":"crossref","first-page":"D580","DOI":"10.1093\/nar\/gkr1097","article-title":"Identifiers. org and miriam registry: community resources to provide persistent identification","volume":"40","author":"Juty","year":"2012","journal-title":"Nucleic Acids Res."},{"key":"2023020111551202400_btv644-B10","doi-asserted-by":"crossref","first-page":"813","DOI":"10.1038\/nrd2156","article-title":"Life after statin patent expiries","volume":"5","author":"Kidd","year":"2006","journal-title":"Nature Reviews Drug Discovery"},{"key":"2023020111551202400_btv644-B11","article-title":"Parallel worlds of public and commercial bioactive chemistry data","author":"Lipinski","year":"2014","journal-title":"J. Med. Chem"},{"key":"2023020111551202400_btv644-B12","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1038\/498255a","article-title":"Biology: The big challenges of big data","volume":"498","author":"Marx","year":"2013","journal-title":"Nature"},{"key":"2023020111551202400_btv644-B13","doi-asserted-by":"crossref","first-page":"1123","DOI":"10.1021\/ed100697w","article-title":"Chemspider: an online chemical information resource","volume":"87","author":"Pence","year":"2010","journal-title":"J. Chem. Educ."},{"key":"2023020111551202400_btv644-B14","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1055\/s-0038-1634828","article-title":"Record linkage strategies. part i: Estimating information and evaluating approaches","volume":"30","author":"Roos","year":"1991","journal-title":"Methods Inform. 
Med."},{"key":"2023020111551202400_btv644-B15","article-title":"Introducing the knowledge graph: things, not strings","volume-title":"Official Google Blog","author":"Singhal","year":"2012"},{"key":"2023020111551202400_btv644-B16","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1186\/1479-5876-8-68","article-title":"Effective knowledge management in translational medicine","volume":"8","author":"Szalma","year":"2010","journal-title":"J. Trans. Med."},{"key":"2023020111551202400_btv644-B17","doi-asserted-by":"crossref","first-page":"2499","DOI":"10.1021\/ci400099q","article-title":"Estimating error rates in bioactivity databases","volume":"53","author":"Tiikkainen","year":"2013","journal-title":"J. Chem. Inform. Model."},{"key":"2023020111551202400_btv644-B18","doi-asserted-by":"crossref","first-page":"210","DOI":"10.1055\/s-0038-1634840","article-title":"Record linkage strategies: Part ii. portable software and deterministic matching","volume":"30","author":"Wajda","year":"1991","journal-title":"Methods Inform. Med."},{"key":"2023020111551202400_btv644-B19","author":"Wikipedia","year":"2014"},{"key":"2023020111551202400_btv644-B20","author":"Wikipedia","year":"2014"},{"key":"2023020111551202400_btv644-B21","doi-asserted-by":"crossref","first-page":"1188","DOI":"10.1016\/j.drudis.2012.05.016","article-title":"Open phacts: semantic interoperability for drug discovery","volume":"17","author":"Williams","year":"2012","journal-title":"Drug Discov. Today"},{"key":"2023020111551202400_btv644-B22","doi-asserted-by":"crossref","first-page":"685","DOI":"10.1016\/j.drudis.2012.02.013","article-title":"Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation","volume":"17","author":"Williams","year":"2012","journal-title":"Drug Discov. 
Today"},{"key":"2023020111551202400_btv644-B23","doi-asserted-by":"crossref","DOI":"10.1109\/IJCNN.2011.6033192","article-title":"Beyond probabilistic record linkage: Using neural networks and complex features to improve genealogical record linkage","author":"Wilson","year":"2011"},{"key":"2023020111551202400_btv644-B24","first-page":"355","article-title":"Matching and record linkage","volume":"1","author":"Winkler","year":"1995","journal-title":"Business Survey Methods"},{"key":"2023020111551202400_btv644-B25","doi-asserted-by":"crossref","first-page":"313","DOI":"10.1002\/wics.1317","article-title":"Matching and record linkage","volume":"6","author":"Winkler","year":"2014","journal-title":"Wiley Interdisciplinary Reviews: Computational Statistics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/32\/6\/918\/49018284\/bioinformatics_32_6_918.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/32\/6\/918\/49018284\/bioinformatics_32_6_918.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,1]],"date-time":"2023-02-01T22:15:44Z","timestamp":1675289744000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/32\/6\/918\/1743746"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,11,10]]},"references-count":25,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2016,3,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btv644","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2016,3,15]]},"published":{"date-parts":[[2015,11,10]]}}}