{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:10Z","timestamp":1772138050508,"version":"3.50.1"},"reference-count":33,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2022,1,5]],"date-time":"2022-01-05T00:00:00Z","timestamp":1641340800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Fraunhofer Cluster of Excellence \u2018Cognitive Internet Technologies\u2019 and the Defense Advanced Research Projects Agency","award":["HR00111990009"],"award-info":[{"award-number":["HR00111990009"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,3,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models. However, representations based on a single modality are inherently limited.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs (KGs). This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations in a shared embedding space. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against three baseline models trained on either one of the modalities (i.e. text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.084 (i.e. from 0.881 to 0.965). Finally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>We make the source code and the Python package of STonKGs available at GitHub (https:\/\/github.com\/stonkgs\/stonkgs) and PyPI (https:\/\/pypi.org\/project\/stonkgs\/). The pre-trained STonKGs models and the task-specific classification models are respectively available at https:\/\/huggingface.co\/stonkgs\/stonkgs-150k and https:\/\/zenodo.org\/communities\/stonkgs.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac001","type":"journal-article","created":{"date-parts":[[2022,1,3]],"date-time":"2022-01-03T10:36:31Z","timestamp":1641206191000},"page":"1648-1656","source":"Crossref","is-referenced-by-count":22,"title":["STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6392-9306","authenticated-orcid":false,"given":"Helena","family":"Balabin","sequence":"first","affiliation":[{"name":"Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing , 53757 Sankt Augustin, Germany"},{"name":"Department of Bonn-Rhein-Sieg, University of Applied Sciences , 53757 Sankt Augustin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4423-4370","authenticated-orcid":false,"given":"Charles Tapley","family":"Hoyt","sequence":"additional","affiliation":[{"name":"Laboratory of Systems Pharmacology, Harvard Medical School , Boston, MA 02115, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7212-7700","authenticated-orcid":false,"given":"Colin","family":"Birkenbihl","sequence":"additional","affiliation":[{"name":"Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing , 53757 Sankt Augustin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9439-5346","authenticated-orcid":false,"given":"Benjamin M","family":"Gyori","sequence":"additional","affiliation":[{"name":"Laboratory of Systems Pharmacology, Harvard Medical School , Boston, MA 02115, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6095-2466","authenticated-orcid":false,"given":"John","family":"Bachman","sequence":"additional","affiliation":[{"name":"Laboratory of Systems Pharmacology, Harvard Medical School , Boston, MA 02115, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9896-3531","authenticated-orcid":false,"given":"Alpha Tom","family":"Kodamullil","sequence":"additional","affiliation":[{"name":"Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing , 53757 Sankt Augustin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5563-5458","authenticated-orcid":false,"given":"Paul G","family":"Pl\u00f6ger","sequence":"additional","affiliation":[{"name":"Department of Bonn-Rhein-Sieg, University of Applied Sciences , 53757 Sankt Augustin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9012-6720","authenticated-orcid":false,"given":"Martin","family":"Hofmann-Apitius","sequence":"additional","affiliation":[{"name":"Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing , 53757 Sankt Augustin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2046-6145","authenticated-orcid":false,"given":"Daniel","family":"Domingo-Fern\u00e1ndez","sequence":"additional","affiliation":[{"name":"Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing , 53757 Sankt Augustin, Germany"},{"name":"Fraunhofer Center for Machine Learning , Sankt Augustin, Germany"},{"name":"Enveda Biosciences , Boulder, CO 80301, USA"}]}],"member":"286","published-online":{"date-parts":[[2022,1,5]]},"reference":[{"key":"2023020108580301800_btac001-B1","doi-asserted-by":"crossref","first-page":"432","DOI":"10.1093\/bioinformatics\/btv585","article-title":"Automatic semantic classification of scientific literature according to the hallmarks of cancer","volume":"32","author":"Baker","year":"2016","journal-title":"Bioinformatics"},{"key":"2023020108580301800_btac001-B2","doi-asserted-by":"crossref","first-page":"154","DOI":"10.1016\/j.websem.2009.07.002","article-title":"DBpedia\u2014a crystallization point for the Web of Data","volume":"7","author":"Bizer","year":"2009","journal-title":"J. Web Semant"},{"key":"2023020108580301800_btac001-B3","author":"Bordes","year":"2013"},{"key":"2023020108580301800_btac001-B4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s40537-019-0217-0","article-title":"Big data in healthcare: management, analysis and future prospects","volume":"6","author":"Dash","year":"2019","journal-title":"J. Big Data"},{"key":"2023020108580301800_btac001-B5","first-page":"4171","author":"Devlin","year":"2019"},{"key":"2023020108580301800_btac001-B6","doi-asserted-by":"crossref","first-page":"3679","DOI":"10.1093\/bioinformatics\/btx399","article-title":"Multimodal mechanistic signatures for neurodegenerative diseases (NeuroMMSig): a web server for mechanism enrichment","volume":"33","author":"Domingo-Fern\u00e1ndez","year":"2017","journal-title":"Bioinformatics"},{"key":"2023020108580301800_btac001-B7","doi-asserted-by":"crossref","first-page":"1859","DOI":"10.1093\/nar\/gkab012","article-title":"Human pathways in animal models: possibilities and limitations","volume":"49","author":"Doncheva","year":"2021","journal-title":"Nucleic Acids Res"},{"key":"2023020108580301800_btac001-B8","doi-asserted-by":"crossref","first-page":"100153","DOI":"10.1016\/j.patter.2020.100153","article-title":"Contextualized protein\u2013protein interactions","volume":"2","author":"Federico","year":"2021","journal-title":"Patterns"},{"key":"2023020108580301800_btac001-B9","first-page":"1","article-title":"Enriching contextualized language model from knowledge graph for biomedical information extraction","volume":"22","author":"Fei","year":"2020","journal-title":"Brief Bioinformatics"},{"key":"2023020108580301800_btac001-B10","first-page":"855","author":"Grover","year":"2016"},{"key":"2023020108580301800_btac001-B11","doi-asserted-by":"crossref","first-page":"954","DOI":"10.15252\/msb.20177651","article-title":"From word models to executable models of signaling networks using automated assembly","volume":"13","author":"Gyori","year":"2017","journal-title":"Mol. Syst. Biol"},{"key":"2023020108580301800_btac001-B12","first-page":"2281","author":"He","year":"2020"},{"key":"2023020108580301800_btac001-B13","first-page":"1","article-title":"A survey on knowledge graphs: representation, acquisition, and applications","volume":"2021","author":"Ji","year":"2021","journal-title":"IEEE Trans. Neural Netw. Learn. Syst"},{"key":"2023020108580301800_btac001-B14","author":"Kamath","year":"2021"},{"key":"2023020108580301800_btac001-B15","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2019","journal-title":"Bioinformatics"},{"key":"2023020108580301800_btac001-B16","article-title":"BioCreative V CDR task corpus: a resource for chemical disease relation extraction","volume":"2016","author":"Li","year":"2016","journal-title":"Database"},{"key":"2023020108580301800_btac001-B17","author":"Liu","year":"2019"},{"key":"2023020108580301800_btac001-B18","author":"Loshchilov","year":"2019"},{"key":"2023020108580301800_btac001-B19","author":"Mikolov","year":"2013"},{"key":"2023020108580301800_btac001-B20","volume-title":"arXiv preprint","author":"Nadkarni","year":"2021"},{"key":"2023020108580301800_btac001-B21","doi-asserted-by":"crossref","first-page":"609","DOI":"10.1093\/bib\/bby025","article-title":"Navigating the disease landscape: knowledge representations for contextualizing molecular signatures","volume":"20","author":"Saqi","year":"2019","journal-title":"Brief Bioinform"},{"key":"2023020108580301800_btac001-B22","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12864-018-5139-2","article-title":"Context-specific interactions in literature-curated protein interaction databases","volume":"19","author":"Stacey","year":"2018","journal-title":"BMC Genomics"},{"key":"2023020108580301800_btac001-B23","author":"Sun","year":"2020"},{"key":"2023020108580301800_btac001-B24","doi-asserted-by":"crossref","first-page":"138","DOI":"10.1186\/s12859-015-0564-6","article-title":"An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition","volume":"16","author":"Tsatsaronis","year":"2015","journal-title":"BMC Bioinform"},{"key":"2023020108580301800_btac001-B25","first-page":"6558","author":"Tsai","year":"2019"},{"key":"2023020108580301800_btac001-B26","first-page":"1499","author":"Toutanova","year":"2015"},{"key":"2023020108580301800_btac001-B27","first-page":"6000","author":"Vaswani","year":"2017"},{"key":"2023020108580301800_btac001-B28","doi-asserted-by":"crossref","first-page":"78","DOI":"10.1145\/2629489","article-title":"Wikidata: a free collaborative knowledgebase","volume":"57","author":"Vrande\u010di\u0107","year":"2014","journal-title":"Commun. ACM"},{"key":"2023020108580301800_btac001-B29","author":"Wang","year":"2014"},{"key":"2023020108580301800_btac001-B30","first-page":"353","author":"Wang","year":"2018"},{"key":"2023020108580301800_btac001-B31","author":"Ying","year":"2021"},{"key":"2023020108580301800_btac001-B32","first-page":"1441","author":"Zhang","year":"2019"},{"key":"2023020108580301800_btac001-B33","author":"Zaheer","year":"2020"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac001\/42237346\/btac001.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/6\/1648\/49008658\/btac001.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/6\/1648\/49008658\/btac001.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,1]],"date-time":"2023-02-01T15:31:02Z","timestamp":1675265462000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/6\/1648\/6497782"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,1,5]]},"references-count":33,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2022,3,4]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac001","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2021.08.17.456616","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,3,15]]},"published":{"date-parts":[[2022,1,5]]}}}