{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,31]],"date-time":"2026-01-31T04:56:09Z","timestamp":1769835369096,"version":"3.49.0"},"reference-count":53,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2020,11,13]],"date-time":"2020-11-13T00:00:00Z","timestamp":1605225600000},"content-version":"vor","delay-in-days":317,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001871","name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","doi-asserted-by":"publisher","award":["PTDC\/EEI-ESS\/4633\/2014"],"award-info":[{"award-number":["PTDC\/EEI-ESS\/4633\/2014"]}],"id":[{"id":"10.13039\/501100001871","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001871","name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","doi-asserted-by":"publisher","award":["SFRH\/BD\/145377\/2019"],"award-info":[{"award-number":["SFRH\/BD\/145377\/2019"]}],"id":[{"id":"10.13039\/501100001871","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001871","name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","doi-asserted-by":"publisher","award":["UIDB\/00408\/2020"],"award-info":[{"award-number":["UIDB\/00408\/2020"]}],"id":[{"id":"10.13039\/501100001871","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001871","name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","doi-asserted-by":"publisher","award":["UIDP\/00408\/2020"],"award-info":[{"award-number":["UIDP\/00408\/2020"]}],"id":[{"id":"10.13039\/501100001871","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein\u2013protein interactions, associations between diseases and genes, cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial.<\/jats:p>\n               <jats:p>We present a collection of 21 benchmark data sets that aim at circumventing the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, and explore proxy similarities calculated based on protein sequence similarity, protein family similarity, protein\u2013protein interactions and phenotype-based gene similarity. Data sets have varying sizes and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures.<\/jats:p>\n               <jats:p>Database URL: https:\/\/github.com\/liseda-lab\/kgsim-benchmark.<\/jats:p>","DOI":"10.1093\/database\/baaa078","type":"journal-article","created":{"date-parts":[[2020,8,24]],"date-time":"2020-08-24T19:26:29Z","timestamp":1598297189000},"source":"Crossref","is-referenced-by-count":10,"title":["A Collection of Benchmark Data Sets for Knowledge Graph-based Similarity in the Biomedical Domain"],"prefix":"10.1093","volume":"2020","author":[{"given":"Carlota","family":"Cardoso","sequence":"first","affiliation":[{"name":"Departamento de inform\u00e1tica, LASIGE Faculdade de Ci\u00eancias da Universidade de Lisboa, 1749 - 016 Lisboa, Portugal"}]},{"given":"Rita T","family":"Sousa","sequence":"additional","affiliation":[{"name":"Departamento de inform\u00e1tica, LASIGE Faculdade de Ci\u00eancias da Universidade de Lisboa, 1749 - 016 Lisboa, Portugal"}]},{"given":"Sebastian","family":"K\u00f6hler","sequence":"additional","affiliation":[{"name":"Ada Health GmbH Karl-Liebknecht-Str. 1. 10178 Berlin"}]},{"given":"Catia","family":"Pesquita","sequence":"additional","affiliation":[{"name":"Departamento de inform\u00e1tica, LASIGE Faculdade de Ci\u00eancias da Universidade de Lisboa, 1749 - 016 Lisboa, Portugal"}]}],"member":"286","published-online":{"date-parts":[[2020,11,11]]},"reference":[{"key":"2020111216461401700_R1","doi-asserted-by":"crossref","first-page":"167","DOI":"10.3233\/SW-140134","article-title":"DBpedia\u2014a large-scale, multilingual knowledge base extracted from Wikipedia","volume":"6","author":"Lehmann","year":"2015","journal-title":"Semant. Web."},{"key":"2020111216461401700_R2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.2200\/S00639ED1V01Y201504HLT027","article-title":"Semantic similarity from natural language and ontology analysis","volume":"8","author":"Harispe","year":"2015","journal-title":"Synth. Lect. Hum. Lang. Technol."},{"key":"2020111216461401700_R3","article-title":"Gene Ontology enrichment improves performances of functional similarity of genes","volume":"8","author":"Liu","year":"2018","journal-title":"Sci. Rep."},{"key":"2020111216461401700_R4","doi-asserted-by":"crossref","first-page":"69","DOI":"10.1093\/bioinformatics\/btr610","article-title":"Gene Ontology-driven inference of protein\u2013protein interactions using inducers","volume":"28","author":"Maetschke","year":"2011","journal-title":"Bioinformatics"},{"key":"2020111216461401700_R5","doi-asserted-by":"crossref","DOI":"10.1186\/1471-2105-11-562","article-title":"An improved method for scoring protein-protein interactions using semantic similarity within the Gene Ontology","volume":"11","author":"Jain","year":"2010","journal-title":"BMC Bioinform."},{"key":"2020111216461401700_R6","first-page":"131","article-title":"Drug-target interaction prediction using semantic similarity and edge partitioning","author":"Palma","year":"2014"},{"key":"2020111216461401700_R7","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1000443","article-title":"Semantic similarity in biomedical ontologies","volume":"5","author":"Pesquita","year":"2009","journal-title":"PLoS Comput. Biol."},{"key":"2020111216461401700_R8","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene Ontology: tool for the unification of biology","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat. Genet."},{"key":"2020111216461401700_R9","doi-asserted-by":"crossref","first-page":"457","DOI":"10.1016\/j.ajhg.2009.09.003","article-title":"Clinical diagnostics in human genetics with semantic similarity searches in ontologies","volume":"85","author":"K\u00f6hler","year":"2009","journal-title":"Am. J. Hum. Genet."},{"key":"2020111216461401700_R10","doi-asserted-by":"publisher","first-page":"256","DOI":"10.1093\/bib\/bbl027","article-title":"Bio-ontologies: current trends and future directions","volume":"7","author":"Bodenreider","year":"2006","journal-title":"Brief. Bioinform"},{"key":"2020111216461401700_R11","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1007\/978-1-4939-3743-1_12","volume-title":"The Gene Ontology Handbook","author":"Pesquita","year":"2017"},{"key":"2020111216461401700_R12","doi-asserted-by":"crossref","first-page":"569","DOI":"10.1093\/bib\/bbr066","article-title":"Semantic similarity analysis of protein data: assessment with biological features and issues","volume":"13","author":"Guzzi","year":"2011","journal-title":"Brief. Bioinform."},{"key":"2020111216461401700_R13","doi-asserted-by":"crossref","first-page":"662","DOI":"10.1101\/gr.461403","article-title":"The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro","volume":"13","author":"Camon","year":"2003","journal-title":"Genome Res."},{"key":"2020111216461401700_R14","doi-asserted-by":"publisher","first-page":"368","DOI":"10.1016\/j.ygeno.2013.04.010","article-title":"A novel insight into Gene Ontology semantic similarity","volume":"101","author":"Xu","year":"2013","journal-title":"Genomics"},{"key":"2020111216461401700_R15","doi-asserted-by":"publisher","DOI":"10.1186\/s12859-016-1160-0","article-title":"TopoICSim: a new semantic similarity measure based on Gene Ontology","volume":"17","author":"Ehsani","year":"2016","journal-title":"BMC Bioinform."},{"key":"2020111216461401700_R16","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-8-475","article-title":"Predicting Gene Ontology functions from protein\u2019s regional surface structures","volume":"8","author":"Liu","year":"2007","journal-title":"BMC Bioinform."},{"key":"2020111216461401700_R17","doi-asserted-by":"publisher","DOI":"10.1186\/s12918-016-0361-5","article-title":"Interspecies gene function prediction using semantic similarity","volume":"10","author":"Yu","year":"2016","journal-title":"BMC Syst. Biol."},{"key":"2020111216461401700_R18","doi-asserted-by":"publisher","first-page":"1116","DOI":"10.1093\/bioinformatics\/bty751","article-title":"Improving protein function prediction using protein sequence and GO-term similarities","volume":"35","author":"Makrodimitris","year":"2018","journal-title":"Bioinformatics"},{"key":"2020111216461401700_R19","doi-asserted-by":"publisher","DOI":"10.1186\/s12859-018-2152-z","article-title":"An improved approach to infer protein-protein interaction based on a hierarchical vector space model","volume":"19","author":"Zhang","year":"2018","journal-title":"BMC Bioinform."},{"key":"2020111216461401700_R20","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-8-262","article-title":"False positive reduction in protein-protein interaction predictions using Gene Ontology annotations","volume":"8","author":"Mahdavi","year":"2007","journal-title":"BMC Bioinform."},{"key":"2020111216461401700_R21","first-page":"531","author":"Al-Mubaid","year":"2008"},{"key":"2020111216461401700_R22","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1109\/CBMS.2005.29","article-title":"An ontology-driven clustering method for supporting gene expression analysis","author":"Wang","year":"2005"},{"key":"2020111216461401700_R23","doi-asserted-by":"publisher","first-page":"555","DOI":"10.1109\/CBMS.2006.100","article-title":"Incorporating Gene Ontology in clustering gene expression data","author":"Kustra","year":"2006"},{"key":"2020111216461401700_R24","doi-asserted-by":"crossref","first-page":"D1018","DOI":"10.1093\/nar\/gky1105","article-title":"Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources","volume":"47","author":"K\u00f6hler","year":"2018","journal-title":"Nucleic Acids Res."},{"key":"2020111216461401700_R25","doi-asserted-by":"publisher","DOI":"10.1186\/s12859-018-2064-y","article-title":"A new method to measure the semantic similarity from query phenotypic abnormalities to diseases based on the human phenotype ontology","volume":"19","author":"Gong","year":"2018","journal-title":"BMC Bioinform."},{"key":"2020111216461401700_R26","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-15-248","article-title":"Clinical phenotype-based gene prioritization: an initial study using semantic similarity and the Human Phenotype Ontology","volume":"15","author":"Masino","year":"2014","journal-title":"BMC Bioinform."},{"key":"2020111216461401700_R27","doi-asserted-by":"publisher","DOI":"10.1186\/s12918-019-0697-8","article-title":"Predicting disease-related phenotypes using an integrated phenotype similarity measurement based on HPO","volume":"13","author":"Xue","year":"2019","journal-title":"BMC Syst. Biol"},{"key":"2020111216461401700_R28","doi-asserted-by":"crossref","first-page":"e119","DOI":"10.1093\/nar\/gkr538","article-title":"A whole-phenome approach to disease gene discovery","volume":"39","author":"Hoehndorf","year":"2011","journal-title":"Nucleic Acids Res."},{"key":"2020111216461401700_R29","doi-asserted-by":"crossref","first-page":"288","DOI":"10.1016\/j.jbi.2006.06.004","article-title":"Measures of semantic similarity and relatedness in the biomedical domain","volume":"40","author":"Pedersen","year":"2007","journal-title":"J. Biomed. Inform."},{"key":"2020111216461401700_R30","doi-asserted-by":"publisher","first-page":"D267","DOI":"10.1093\/nar\/gkh061","article-title":"The Unified Medical Language System (UMLS): integrating biomedical terminology","volume":"32","author":"Bodenreider","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"2020111216461401700_R31","first-page":"33","article-title":"Conference v2. 0: An uncertain version of the OAEI conference benchmark","author":"Cheatham","year":"2014"},{"key":"2020111216461401700_R32","article-title":"Crowdsourcing the verification of relationships in biomedical ontologies","author":"Mortensen","year":"2013"},{"key":"2020111216461401700_R33","article-title":"CESSM: Collaborative Evaluation of Semantic Similarity Measures","volume":"157","author":"Pesquita","year":"2009","journal-title":"JB2009 Challenges Bioinforma."},{"key":"2020111216461401700_R34","doi-asserted-by":"crossref","first-page":"D427","DOI":"10.1093\/nar\/gky995","article-title":"The Pfam protein families database in 2019","volume":"47","author":"El-Gebali","year":"2018","journal-title":"Nucleic Acids Res."},{"key":"2020111216461401700_R35","doi-asserted-by":"crossref","first-page":"304","DOI":"10.1093\/nar\/28.1.304","article-title":"The ENZYME database in 2000","volume":"28","author":"Bairoch","year":"2000","journal-title":"Nucleic Acids Res."},{"key":"2020111216461401700_R36","doi-asserted-by":"publisher","DOI":"10.1101\/459107","article-title":"A new family of similarity measures for scoring confidence of protein interactions using Gene Ontology","author":"Paul","year":"2018","journal-title":"BioRxiv."},{"key":"2020111216461401700_R37","doi-asserted-by":"publisher","DOI":"10.1186\/s12864-019-6272-2","article-title":"GO2Vec: transforming GO terms and proteins to vector representations via graph embeddings","volume":"20","author":"Zhong","year":"2019","journal-title":"BMC Genomics"},{"key":"2020111216461401700_R38","first-page":"pp. 246","article-title":"MateTee: a semantic similarity metric based on translation embeddings for knowledge graphs","author":"Morales","year":"2017"},{"key":"2020111216461401700_R39","article-title":"Determining similarity of scientific entities in annotation datasets","author":"Palma","year":"2014","journal-title":"Database."},{"key":"2020111216461401700_R40","first-page":"2787","volume-title":"Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 Advances in Neural Information Processing Systems (NIPS\u201913)","author":"Bordes","year":"2013"},{"key":"2020111216461401700_R41","first-page":"926","volume-title":"Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1 Advances in Neural Information Processing Systems (NIPS\u201913)","author":"Socher","year":"2013"},{"key":"2020111216461401700_R42","doi-asserted-by":"crossref","first-page":"186","DOI":"10.1007\/978-3-319-46547-0_20","volume-title":"The Semantic Web\u2014ISWC 2016","author":"Ristoski","year":"2016"},{"key":"2020111216461401700_R43","article-title":"Open Graph Benchmark: datasets for machine learning on graphs","author":"Hu","year":"2020","journal-title":"arXiv."},{"key":"2020111216461401700_R44","first-page":"1089","article-title":"An intrinsic information content metric for semantic similarity in WordNet","author":"Seco","year":"2004"},{"key":"2020111216461401700_R45","first-page":"448","article-title":"Using information content to evaluate semantic similarity in a taxonomy","author":"Resnik","year":"1995"},{"key":"2020111216461401700_R46","doi-asserted-by":"crossref","DOI":"10.1186\/1471-2105-9-S5-S4","article-title":"Metrics for GO based protein semantic similarity: a systematic evaluation","volume":"9","author":"Pesquita","year":"2008","journal-title":"BMC Bioinform."},{"key":"2020111216461401700_R47","doi-asserted-by":"publisher","first-page":"905","DOI":"10.1109\/TCBB.2017.2695542","article-title":"Investigating correlation between protein sequence similarity and semantic similarity using Gene Ontology annotations","volume":"15","author":"Ikram","year":"2018","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinforma"},{"key":"2020111216461401700_R48","doi-asserted-by":"crossref","DOI":"10.1186\/s12859-019-3296-1","article-title":"Evolving knowledge graph similarity for supervised learning in complex biomedical domains","volume":"21","author":"Sousa","year":"2020","journal-title":"BMC Bioinform."},{"key":"2020111216461401700_R49","doi-asserted-by":"crossref","first-page":"D789","DOI":"10.1093\/nar\/gku1205","article-title":"OMIM.org: Online Mendelian Inheritance in Man (OMIM\u00ae), an online catalog of human genes and genetic disorders","volume":"43","author":"Amberger","year":"2014","journal-title":"Nucleic Acids Res."},{"key":"2020111216461401700_R50","doi-asserted-by":"publisher","first-page":"42","DOI":"10.1002\/humu.22204","article-title":"VariBench: a benchmark database for variations","volume":"34","author":"Sasidharan Nair","year":"2013","journal-title":"Hum. Mutat"},{"key":"2020111216461401700_R51","doi-asserted-by":"crossref","first-page":"2610","DOI":"10.1093\/bioinformatics\/btq483","article-title":"Simple sequence-based kernels do not predict protein\u2013protein interactions","volume":"26","author":"Yu","year":"2010","journal-title":"Bioinformatics."},{"key":"2020111216461401700_R52","doi-asserted-by":"crossref","first-page":"225","DOI":"10.1093\/bib\/bbl004","article-title":"Automated protein function prediction\u2014the genomic challenge","volume":"7","author":"Friedberg","year":"2006","journal-title":"Brief. Bioinform."},{"key":"2020111216461401700_R53","doi-asserted-by":"crossref","first-page":"i38","DOI":"10.1093\/bioinformatics\/bti1016","article-title":"Kernel methods for predicting protein\u2013protein interactions","volume":"21","author":"Ben-Hur","year":"2005","journal-title":"Bioinformatics."}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaa078\/34283820\/baaa078.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaa078\/34283820\/baaa078.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,11,12]],"date-time":"2020-11-12T21:46:28Z","timestamp":1605217588000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baaa078\/5979744"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,1,1]]},"references-count":53,"URL":"https:\/\/doi.org\/10.1093\/database\/baaa078","relation":{},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020]]},"published":{"date-parts":[[2020,1,1]]},"article-number":"baaa078"}}