{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T10:21:55Z","timestamp":1777890115905,"version":"3.51.4"},"reference-count":27,"publisher":"SAGE Publications","license":[{"start":{"date-parts":[[2023,12,27]],"date-time":"2023-12-27T00:00:00Z","timestamp":1703635200000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["SW"],"published-print":{"date-parts":[[2023,12,27]]},"abstract":"<jats:p>Wikidata is a massive Knowledge Graph (KG), including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100\u00a0GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers only require a limited range of data from Wikidata focusing on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Specific tools and several approaches have been developed for subsetting, which have not been evaluated yet. In this paper, we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific for Wikidata subsetting \u2013 WDSub, KGTK, WDumper, and WDF \u2013 in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. Results show that all four tools have a minimum of 99.96% accuracy in extracting defined items and 99.25% in extracting statements. 
The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, multiple subset use cases have been defined and the extracted subsets have been analyzed, obtaining valuable information about the variety and quality of Wikidata, which would otherwise not be possible through the public Wikidata SPARQL endpoint.<\/jats:p>","DOI":"10.3233\/sw-233491","type":"journal-article","created":{"date-parts":[[2023,12,29]],"date-time":"2023-12-29T12:23:31Z","timestamp":1703852611000},"page":"1-27","source":"Crossref","is-referenced-by-count":3,"title":["Wikidata subsetting: Approaches, tools, and evaluation"],"prefix":"10.1177","author":[{"given":"Seyed Amir","family":"Hosseini Beghaeiraveri","sequence":"first","affiliation":[{"name":"School of Mathematical and Computer Science, Heriot-Watt University, Edinburgh, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jose Emilio","family":"Labra Gayo","sequence":"additional","affiliation":[{"name":"University of Oviedo, Oviedo, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andra","family":"Waagmeester","sequence":"additional","affiliation":[{"name":"Micelio, Belgium"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ammar","family":"Ammar","sequence":"additional","affiliation":[{"name":"Dept of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Carolina","family":"Gonzalez","sequence":"additional","affiliation":[{"name":"The Scripps Research Institute, US"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Denise","family":"Slenter","sequence":"additional","affiliation":[{"name":"Dept of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sabah","family":"Ul-Hasan","sequence":"additional","affiliation":[{"name":"The Scripps Research Institute, US"},{"name":"Hologic Inc, 
US"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Egon","family":"Willighagen","sequence":"additional","affiliation":[{"name":"Dept of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fiona","family":"McNeill","sequence":"additional","affiliation":[{"name":"School of Informatics, The University of Edinburgh, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alasdair J.G.","family":"Gray","sequence":"additional","affiliation":[{"name":"School of Mathematical and Computer Science, Heriot-Watt University, Edinburgh, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","reference":[{"key":"10.3233\/SW-233491_ref1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-09917-5_16"},{"key":"10.3233\/SW-233491_ref3","unstructured":"S.A.H.\u00a0Beghaeiraveri, Towards automated technologies in the referencing quality of Wikidata, in: Companion Proceedings of the Web Conference 2022, 2022, https:\/\/www2022.thewebconf.org\/PaperFiles\/8.pdf."},{"key":"10.3233\/SW-233491_ref4","unstructured":"S.A.H.\u00a0Beghaeiraveri, A.\u00a0Gray and F.\u00a0McNeill, Reference statistics in Wikidata topical subsets, in: Proceedings of the 2nd Wikidata Workshop (Wikidata 2021), CEUR Workshop Proceedings, CEUR, Virtual Conference, Vol.\u00a02982, 2021, ISSN: 1613-0073, https:\/\/researchportal.hw.ac.uk\/files\/53252708\/Reference_Statistics_in_Wikidata_Topical_Subsets_corrected_version.pdf."},{"key":"10.3233\/SW-233491_ref5","unstructured":"S.A.H.\u00a0Beghaeiraveri, A.J.G.\u00a0Gray and F.J.\u00a0McNeill, Experiences of using WDumper to create topical subsets from Wikidata, in: CEUR Workshop Proceedings, Vols\u00a02873, CEUR-WS, 2021, p.\u00a013, ISSN: 1613\u20130073, 
https:\/\/researchportal.hw.ac.uk\/files\/45184682\/paper13.pdf."},{"key":"10.3233\/SW-233491_ref6","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.8015611"},{"key":"10.3233\/SW-233491_ref7","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.8015689"},{"key":"10.3233\/SW-233491_ref8","doi-asserted-by":"publisher","DOI":"10.1093\/database\/baw015"},{"key":"10.3233\/SW-233491_ref10","unstructured":"M.\u00a0Cutcher, M.\u00a0Personick and B.\u00a0Thompson, The Bigdata\u00ae RDF graph database, in: Linked Data Management, Chapman and Hall\/CRC, 2014, 46 pp. ISBN 978-0-429-10245-5."},{"key":"10.3233\/SW-233491_ref11","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-88361-4_37"},{"key":"10.3233\/SW-233491_ref13","doi-asserted-by":"publisher","first-page":"22","DOI":"10.1016\/j.websem.2013.01.002","article-title":"Binary RDF representation for publication and exchange (HDT)","volume":"19","author":"Fern\u00e1ndez","year":"2013","journal-title":"Journal of Web Semantics"},{"key":"10.3233\/SW-233491_ref15","unstructured":"D.\u00a0Henselmann and A.\u00a0Harth, Constructing demand-driven Wikidata subsets, in: Wikidata@ ISWC, 2021."},{"key":"10.3233\/SW-233491_ref16","doi-asserted-by":"crossref","unstructured":"F.\u00a0Ilievski, D.\u00a0Garijo, H.\u00a0Chalupsky, N.T.\u00a0Divvala, Y.\u00a0Yao, C.\u00a0Rogers, R.\u00a0Li, J.\u00a0Liu, A.\u00a0Singh and D.\u00a0Schwabe, KGTK: A toolkit for large knowledge graph manipulation and analysis, in: International Semantic Web Conference, Springer, 2020, pp.\u00a0278\u2013293, https:\/\/arxiv.org\/pdf\/2006.00088.pdf.","DOI":"10.1007\/978-3-030-62466-8_18"},{"key":"10.3233\/SW-233491_ref17","first-page":"680","volume-title":"Cskg: The Commonsense Knowledge Graph, in: European Semantic Web Conference","author":"Ilievski","year":"2021"},{"issue":"8","key":"10.3233\/SW-233491_ref19","doi-asserted-by":"publisher","first-page":"100","DOI":"10.1016\/j.patter.2020.100136","article-title":"Dataset reuse: toward translating 
principles to practice","volume":"1","author":"Koesten","year":"2020","journal-title":"Patterns"},{"key":"10.3233\/SW-233491_ref21","unstructured":"J.E.\u00a0Labra-Gayo, WShEx: A language to describe and validate Wikibase entities, in: Proceedings of the 3rd Wikidata Workshop 2022 Co-Located with the 21st International Semantic Web Conference (ISWC2022), Vols\u00a0Vol-3262, 2022."},{"key":"10.3233\/SW-233491_ref26","doi-asserted-by":"publisher","DOI":"10.37044\/osf.io\/n7qku"},{"key":"10.3233\/SW-233491_ref27","doi-asserted-by":"publisher","DOI":"10.37044\/osf.io\/wu9et"},{"key":"10.3233\/SW-233491_ref28","first-page":"1","volume-title":"Validating RDF Data","author":"Labra-Gayo","year":"2017"},{"issue":"1","key":"10.3233\/SW-233491_ref29","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1186\/s13326-017-0136-y","article-title":"RDFIO: Extending semantic MediaWiki for interoperable biomedical data management","volume":"8","author":"Lampa","year":"2017","journal-title":"Journal of Biomedical Semantics"},{"key":"10.3233\/SW-233491_ref33","unstructured":"N.\u00a0Mimouni, J.-C.\u00a0Moissinac and A.\u00a0Vu, Knowledge base completion with analogical inference on context graphs, in: Semapro 2019, 2019."},{"key":"10.3233\/SW-233491_ref34","unstructured":"L.\u00a0Pintscher, Wikidata EntitySchemas Telegram Group, 2022, Message: https:\/\/t.me\/joinchat\/ZeRz5wPDxpNkZGVk, https:\/\/t.me\/c\/1540810474\/327."},{"key":"10.3233\/SW-233491_ref37","doi-asserted-by":"publisher","DOI":"10.1145\/2815072.2815073"},{"key":"10.3233\/SW-233491_ref38","doi-asserted-by":"crossref","unstructured":"K.\u00a0Shenoy, F.\u00a0Ilievski, D.\u00a0Garijo, D.\u00a0Schwabe and P.\u00a0Szekely, A study of the quality of Wikidata, in: Journal of Web Semantics, Vol.\u00a072, Elsevier, 2022, p.\u00a0100679.","DOI":"10.1016\/j.websem.2021.100679"},{"issue":"10","key":"10.3233\/SW-233491_ref41","doi-asserted-by":"publisher","first-page":"78","DOI":"10.1145\/2629489","article-title":"Wikidata: A 
free collaborative knowledgebase","volume":"57","author":"Vrande\u010di\u0107","year":"2014","journal-title":"Communications of the ACM"},{"key":"10.3233\/SW-233491_ref42","unstructured":"A.\u00a0Waagmeester et al., Wikidata:WikiProject Schemas\/Subsetting \u2013 Wikidata, 2019, https:\/\/www.wikidata.org\/wiki\/Wikidata:WikiProject_Schemas\/Subsetting \u2013 accessed 31 December 2020."},{"key":"10.3233\/SW-233491_ref43","doi-asserted-by":"publisher","DOI":"10.7554\/eLife.52614"},{"key":"10.3233\/SW-233491_ref49","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2016.18"}],"container-title":["Semantic Web"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/SW-233491","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T05:27:11Z","timestamp":1777613231000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/SW-233491"}},"subtitle":[],"editor":[{"given":"Lucie-Aim\u00e9e","family":"Kaffee","sequence":"additional","affiliation":[{"name":"University of Southampton, United Kingdom"}],"role":[{"role":"editor","vocabulary":"crossref"}]},{"given":"Simon","family":"Razniewski","sequence":"additional","affiliation":[{"name":"Max Planck Institute for Informatics, Germany"}],"role":[{"role":"editor","vocabulary":"crossref"}]},{"given":"Pavlos","family":"Vougiouklis","sequence":"additional","affiliation":[{"name":"Huawei Technologies, United Kingdom"}],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2023,12,27]]},"references-count":27,"URL":"https:\/\/doi.org\/10.3233\/sw-233491","relation":{},"ISSN":["2210-4968","1570-0844"],"issn-type":[{"value":"2210-4968","type":"electronic"},{"value":"1570-0844","type":"print"}],"subject":[],"published":{"date-parts":[[2023,12,27]]}}}