{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T02:42:58Z","timestamp":1771555378626,"version":"3.50.1"},"reference-count":32,"publisher":"Emerald","issue":"4","license":[{"start":{"date-parts":[[2020,6,26]],"date-time":"2020-06-26T00:00:00Z","timestamp":1593129600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["DTA"],"published-print":{"date-parts":[[2020,6,26]]},"abstract":"<jats:sec><jats:title content-type=\"abstract-subheading\">Purpose<\/jats:title><jats:p>Several online services offer functionalities to access information from \u201cbig research graphs\u201d (e.g. Google Scholar, OpenAIRE, Microsoft Academic Graph), which correlate scholarly\/scientific communication entities such as publications, authors, datasets, organizations, projects, funders, etc. Depending on the target users, access can vary from search and browse content to the consumption of statistics for monitoring and provision of feedback. Such graphs are populated over time as aggregations of multiple sources and therefore suffer from major entity-duplication problems. Although deduplication of graphs is a known and actual problem, existing solutions are dedicated to specific scenarios, operate on flat collections, local topology-drive challenges and cannot therefore be re-used in other contexts.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Design\/methodology\/approach<\/jats:title><jats:p>This work presents GDup, an integrated, scalable, general-purpose system that can be customized to address deduplication over arbitrary large information graphs. The paper presents its high-level architecture, its implementation as a service used within the OpenAIRE infrastructure system and reports numbers of real-case experiments.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Findings<\/jats:title><jats:p>GDup provides the functionalities required to deliver a fully-fledged entity deduplication workflow over a generic input graph. The system offers out-of-the-box Ground Truth management, acquisition of feedback from data curators and algorithms for identifying and merging duplicates, to obtain an output disambiguated graph.<\/jats:p><\/jats:sec><jats:sec><jats:title content-type=\"abstract-subheading\">Originality\/value<\/jats:title><jats:p>To our knowledge GDup is the only system in the literature that offers an integrated and general-purpose solution for the deduplication graphs, while targeting big data scalability issues. GDup is today one of the key modules of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission, National funders and institutions.<\/jats:p><\/jats:sec>","DOI":"10.1108\/dta-09-2019-0163","type":"journal-article","created":{"date-parts":[[2020,6,29]],"date-time":"2020-06-29T05:51:36Z","timestamp":1593409896000},"page":"409-435","source":"Crossref","is-referenced-by-count":8,"title":["Entity deduplication in big data graphs for scholarly communication"],"prefix":"10.1108","volume":"54","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7291-3210","authenticated-orcid":false,"given":"Paolo","family":"Manghi","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9613-6639","authenticated-orcid":false,"given":"Claudio","family":"Atzori","sequence":"additional","affiliation":[]},{"given":"Michele","family":"De Bonis","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1112-1292","authenticated-orcid":false,"given":"Alessia","family":"Bardi","sequence":"additional","affiliation":[]}],"member":"140","reference":[{"key":"key2020082512233687900_ref001","doi-asserted-by":"publisher","first-page":"952","DOI":"10.1109\/ICDE.2009.43","article-title":"Large-scale deduplication with constraints using dedupalog","year":"2017"},{"key":"key2020082512233687900_ref002","doi-asserted-by":"publisher","article-title":"gdup: a big graph entity deduplication system - Release 1","year":"2017","DOI":"10.5281\/zenodo.292980"},{"key":"key2020082512233687900_ref003","doi-asserted-by":"publisher","first-page":"142","DOI":"10.1109\/BDCAT.2018.00025","article-title":"Gdup: De-duplication of scholarly communication big graphs","year":"2018"},{"key":"key2020082512233687900_ref004","doi-asserted-by":"publisher","article-title":"gDup: an integrated and scalable graph deduplication system","year":"2016","DOI":"10.5281\/zenodo.1454880"},{"key":"key2020082512233687900_ref005","article-title":"Deduplication and group detection using links","year":"2004"},{"key":"key2020082512233687900_ref006","first-page":"4","article-title":"Bigtable: A distributed storage system for structured data","volume":"26","year":"2008","journal-title":"ACM Transactions on Computer Systems (TOCS)"},{"key":"key2020082512233687900_ref007","first-page":"1065","article-title":"Febrl-: an open source data cleaning, deduplication and record linkage system with a graphical user interface","year":"2008"},{"key":"key2020082512233687900_ref008","first-page":"73","article-title":"A comparison of string metrics for matching names and records","volume-title":"Kdd Workshop on Data Cleaning and Object Consolidation","year":"2003"},{"issue":"1","key":"key2020082512233687900_ref009","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1145\/1327452.1327492","article-title":"Mapreduce: simplified data processing on large clusters","volume":"51","year":"2008","journal-title":"Communications of the ACM"},{"issue":"328","key":"key2020082512233687900_ref010","doi-asserted-by":"crossref","first-page":"1183","DOI":"10.1080\/01621459.1969.10501049","article-title":"A theory for record linkage","volume":"64","year":"1969","journal-title":"Journal of the American Statistical Association"},{"key":"key2020082512233687900_ref011","volume-title":"HBase: The Definitive Guide","year":"2011"},{"key":"key2020082512233687900_ref012","article-title":"Reconciliation of rdf* and property graphs","year":"2014"},{"key":"key2020082512233687900_ref013","article-title":"Gradoop: Scalable graph data management and analytics with hadoop","year":"2015"},{"key":"key2020082512233687900_ref014","first-page":"440","article-title":"FRIL: A tool for comparative record linkage","year":"2008"},{"issue":"5","key":"key2020082512233687900_ref015","doi-asserted-by":"crossref","first-page":"999","DOI":"10.1109\/TVCG.2008.55","article-title":"Interactive entity resolution in relational data: A visual analytic tool and its evaluation","volume":"14","year":"2008","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"issue":"1","key":"key2020082512233687900_ref016","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1007\/s13222-012-0110-x","article-title":"Parallel entity resolution with Dedoop","volume":"13","year":"2013","journal-title":"Datenbank-Spektrum"},{"key":"key2020082512233687900_ref017","article-title":"Parallel sorted neighborhood blocking with mapreduce","year":"2010"},{"issue":"1-2","key":"key2020082512233687900_ref018","doi-asserted-by":"crossref","first-page":"484","DOI":"10.14778\/1920841.1920904","article-title":"Evaluation of entity resolution approaches on real-world match problems","volume":"3","year":"2010","journal-title":"Proceedings of the VLDB Endowment"},{"key":"key2020082512233687900_ref019","doi-asserted-by":"crossref","unstructured":"La Bruzzo, S., Manghi, P. and Mannocci, A. (2019), \u201cOpenaire's doiboost - boosting crossref for research\u201d, in Manghi, P., Candela, L. and Silvello, G. (Eds), Digital Libraries: Supporting Open Science, Springer International Publishing, Cham, ISBN 978-3-030-11226-4, pp. 133-143.","DOI":"10.1007\/978-3-030-11226-4_11"},{"issue":"4","key":"key2020082512233687900_ref020","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1145\/2094114.2094118","article-title":"Parallel data processing with mapreduce: a survey","volume":"40","year":"2012","journal-title":"AcM sIGMoD Record"},{"key":"key2020082512233687900_ref021","doi-asserted-by":"publisher","first-page":"78","article-title":"Design patterns for efficient graph algorithms in mapreduce","DOI":"10.1145\/1830252.1830263"},{"issue":"1","key":"key2020082512233687900_ref022","first-page":"31","article-title":"An infrastructure for managing ec funded research output-the openaire project","volume":"6","year":"2010","journal-title":"The Grey Journal (TGJ): An International Journal on Grey Literature"},{"key":"key2020082512233687900_ref023","first-page":"168","article-title":"The data model of the openaire scientific communication e-infrastructure","volume-title":"Metadata and Semantics Research","year":"2012"},{"issue":"2","key":"key2020082512233687900_ref024","doi-asserted-by":"crossref","first-page":"114","DOI":"10.1504\/IJMSO.2012.050014","article-title":"De-duplication of aggregation authority files","volume":"7","year":"2012","journal-title":"International Journal of Metadata, Semantics and Ontologies"},{"issue":"4","key":"key2020082512233687900_ref025","doi-asserted-by":"publisher","first-page":"322","DOI":"10.1108\/PROG-08-2013-0045","article-title":"The d-net software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures","volume":"48","year":"2014","journal-title":"Program"},{"key":"key2020082512233687900_ref031","first-page":"169","article-title":"Efficient clustering of high-dimensional data sets with application to reference matching","year":"2000"},{"key":"key2020082512233687900_ref026","volume-title":"Graph Databases: New Opportunities for Connected Data","year":"2015"},{"issue":"6","key":"key2020082512233687900_ref027","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1002\/bult.2010.1720360610","article-title":"Constructions from dots and lines","volume":"36","year":"2010","journal-title":"Bulletin of the American Society for Information Science and Technology"},{"key":"key2020082512233687900_ref033","first-page":"576","author":"Gangemi, A., Navigli, R., Vidal, M.-E., Hitzler, P., Troncy, R., Hollink, L., Tordai, A. and Alam, M.","year":"2018","journal-title":"The Semantic Web"},{"issue":"2","key":"key2020082512233687900_ref028","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pone.0191917","article-title":"Reducing vertices in property graphs","volume":"13","year":"2018","journal-title":"PloS One"},{"issue":"1","key":"key2020082512233687900_ref029","doi-asserted-by":"publisher","first-page":"166","DOI":"10.1109\/TKDE.2015.2468711","article-title":"Semantic-aware blocking for entity resolution","volume":"28","year":"2016","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"issue":"1","key":"key2020082512233687900_ref030","doi-asserted-by":"publisher","first-page":"18","DOI":"10.1109\/TBDATA.2016.2641460","article-title":"Big scholarly data: a survey","volume":"3","year":"2017","journal-title":"IEEE Transactions on Big Data"}],"container-title":["Data Technologies and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/DTA-09-2019-0163\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/DTA-09-2019-0163\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T23:15:21Z","timestamp":1753398921000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/dta\/article\/54\/4\/409-435\/20969"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,6,26]]},"references-count":32,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2020,6,26]]}},"alternative-id":["10.1108\/DTA-09-2019-0163"],"URL":"https:\/\/doi.org\/10.1108\/dta-09-2019-0163","relation":{},"ISSN":["2514-9288"],"issn-type":[{"value":"2514-9288","type":"print"}],"subject":[],"published":{"date-parts":[[2020,6,26]]}}}