{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T17:14:55Z","timestamp":1765041295365},"reference-count":51,"publisher":"MIT Press","issue":"1","license":[{"start":{"date-parts":[[2022,9,20]],"date-time":"2022-09-20T00:00:00Z","timestamp":1663632000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,3,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Specialized transformers-based models (such as BioBERT and BioMegatron) are adapted for the biomedical domain based on publicly available biomedical corpora. As such, they have the potential to encode large-scale biological knowledge. We investigate the encoding and representation of biological knowledge in these models, and its potential utility to support inference in cancer precision medicine\u2014namely, the interpretation of the clinical significance of genomic alterations. We compare the performance of different transformer baselines; we use probing to determine the consistency of encodings for distinct entities; and we use clustering methods to compare and contrast the internal properties of the embeddings for genes, variants, drugs, and diseases. We show that these models do indeed encode biological knowledge, although some of this is lost in fine-tuning for specific tasks. Finally, we analyze how the models behave with regard to biases and imbalances in the dataset.<\/jats:p>","DOI":"10.1162\/coli_a_00462","type":"journal-article","created":{"date-parts":[[2022,9,20]],"date-time":"2022-09-20T19:32:21Z","timestamp":1663702341000},"page":"73-115","update-policy":"http:\/\/dx.doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":11,"title":["Transformers and the Representation of Biomedical Background Knowledge"],"prefix":"10.1162","volume":"49","author":[{"given":"Oskar","family":"Wysocki","sequence":"first","affiliation":[{"name":"Digital Experimental Cancer Medicine Team, Cancer Biomarker Centre, CRUK Manchester Institute, University of Manchester. oskar.wysocki@manchester.ac.uk"}]},{"given":"Zili","family":"Zhou","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Manchester. zili.zhou@manchester.ac.uk"}]},{"given":"Paul","family":"O\u2019Regan","sequence":"additional","affiliation":[{"name":"Digital Experimental Cancer Medicine Team, Cancer Biomarker Centre, CRUK Manchester Institute, University of Manchester. paul.oregan@digitalecmt.com"}]},{"given":"Deborah","family":"Ferreira","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Manchester. deborah.ferreira@manchester.ac.uk"}]},{"given":"Magdalena","family":"Wysocka","sequence":"additional","affiliation":[{"name":"Digital Experimental Cancer Medicine Team, Cancer Biomarker Centre, CRUK Manchester Institute, University of Manchester. magdalena.wysocka@digitalecmt.org"}]},{"given":"D\u00f3nal","family":"Landers","sequence":"additional","affiliation":[{"name":"Digital Experimental Cancer Medicine Team, Cancer Biomarker Centre, CRUK Manchester Institute, University of Manchester. donal.landers@delondraoncology.com"}]},{"given":"Andr\u00e9","family":"Freitas","sequence":"additional","affiliation":[{"name":"Idiap Research Institute Martigny, Switzerland. andre.freitas@manchester.ac.uk"}]}],"member":"281","published-online":{"date-parts":[[2023,3,1]]},"reference":[{"key":"2023030119560161400_","article-title":"Fine-grained analysis of sentence embeddings using auxiliary prediction tasks","author":"Adi","year":"2017"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"3023","DOI":"10.18653\/v1\/2021.findings-acl.266","article-title":"Probing pre-trained language models for disease knowledge","volume-title":"Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021","author":"Alghanmi","year":"2021"},{"issue":"3","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1200\/CCI.19.00077","article-title":"Open-sourced civic annotation pipeline to identify and annotate clinically relevant variants using single-molecule molecular inversion probes","author":"Barnell","year":"2019","journal-title":"JCO Clinical Cancer Informatics"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","DOI":"10.1162\/coli_a_00422","article-title":"Probing classifiers: Promises, shortcomings, and advances","author":"Belinkov","year":"2021"},{"issue":"6","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"Bbab134","DOI":"10.1093\/bib\/bbab134","article-title":"Knowledge bases and software support for variant interpretation in precision oncology","volume":"22","author":"Borchert","year":"2021","journal-title":"Briefings in Bioinformatics"},{"issue":"1","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1200\/PO.17.00011","article-title":"OncoKB: A precision oncology knowledge base","author":"Chakravarty","year":"2017","journal-title":"JCO Precision Oncology"},{"key":"2023030119560161400_","article-title":"Combining pre-trained language models and structured knowledge","author":"Colon-Hernandez","year":"2021"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"Article 200 (17 pp)","DOI":"10.1186\/s12920-019-0647-8","article-title":"Genome analysis and knowledge-driven variant interpretation with TGex","volume":"12","author":"Dahary","year":"2019","journal-title":"BMC Medical Genomics"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"Article 76 (12 pp)","DOI":"10.1186\/s13073-019-0687-x","article-title":"Standard operating procedure for curation and clinical interpretation of variants in cancer","volume":"11","author":"Danos","year":"2019","journal-title":"Genome Medicine"},{"issue":"11","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"1721","DOI":"10.1002\/humu.23651","article-title":"Adapting crowdsourced clinical cancer curation in CIViC to the ClinGen minimum variant level data community-driven standards","volume":"39","author":"Danos","year":"2018","journal-title":"Human Mutation"},{"issue":"2","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"118","DOI":"10.1158\/2159-8290.CD-14-1118","article-title":"Database of genomic biomarkers for cancer drugs and clinical targetability in solid tumors","volume":"5","author":"Dienstmann","year":"2015","journal-title":"Cancer Discovery"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"4947","DOI":"10.18653\/v1\/2021.findings-acl.438","article-title":"How transfer learning impacts linguistic knowledge in deep NLP models?","volume-title":"Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021","author":"Durrani","year":"2021"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"194","DOI":"10.18653\/v1\/2021.acl-demo.23","article-title":"Does my representation capture X? Probe-ably","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations","author":"Ferreira","year":"2021"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"238","DOI":"10.2307\/1403797","article-title":"Discriminatory analysis - Nonparametric discrimination: Consistency properties","volume":"57","author":"Fix","year":"1989","journal-title":"International Statistical Review"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"3356","DOI":"10.18653\/v1\/2020.findings-emnlp.301","article-title":"RealToxicityPrompts: Evaluating neural toxic degeneration in language models","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Gehman","year":"2020"},{"issue":"8","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"438","DOI":"10.1186\/s13059-014-0438-7","article-title":"Organizing knowledge to enable personalization of medicine in cancer","volume":"15","author":"Good","year":"2014","journal-title":"Genome Biology"},{"issue":"2","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"170","DOI":"10.1038\/ng.3774","article-title":"CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer","volume":"49","author":"Griffith","year":"2017","journal-title":"Nature Genetics"},{"issue":"1","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"vbac035","DOI":"10.1093\/bioadv\/vbac035","article-title":"MarkerGenie: An NLP-enabled text-mining system for biomedical entity relation extraction","volume":"2","author":"Gu","year":"2022","journal-title":"Bioinformatics Advances"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"107","DOI":"10.18653\/v1\/N18-2017","article-title":"Annotation artifacts in natural language inference data","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Gururangan","year":"2018"},{"issue":"1","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1186\/s13073-019-0664-4","article-title":"Variant Interpretation for Cancer (VIC): A computational tool for assessing clinical impacts of somatic variants","volume":"11","author":"He","year":"2019","journal-title":"Genome Medicine"},{"key":"2023030119560161400_","first-page":"4129","article-title":"A structural probe for finding syntax in word representations","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Hewitt","year":"2019"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/796","article-title":"Visualisation and \u2018diagnostic classifiers\u2019 reveal how recurrent and recursive neural networks process hierarchical structure","author":"Hupkes","year":"2018"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"2021","DOI":"10.18653\/v1\/D17-1215","article-title":"Adversarial examples for evaluating reading comprehension systems","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing","author":"Jia","year":"2017"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1370","article-title":"Document-level N-ary relation extraction with multiscale representation learning","author":"Jia","year":"2019","journal-title":"CoRR"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-2011","article-title":"Probing biomedical embeddings from language models","author":"Jin","year":"2019"},{"key":"2023030119560161400_","article-title":"Do transformers encode a foundational ontology? Probing abstract classes in natural language","author":"Jullien","year":"2022"},{"issue":"4","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"Biobert: A pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"issue":"1","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"Article 78 (16 pp)","DOI":"10.1186\/s13073-019-0686-y","article-title":"Text-mining clinically relevant cancer biomarkers for curation into the CIViC database","volume":"11","author":"Lever","year":"2018","journal-title":"bioRxiv Genome Medicine"},{"issue":"1","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1016\/j.jmoldx.2016.10.002","article-title":"Standards and guidelines for the interpretation and reporting of sequence variants in cancer: A joint consensus recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists","volume":"19","author":"Li","year":"2017","journal-title":"The Journal of Molecular Diagnostics"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"3428","DOI":"10.18653\/v1\/P19-1334","article-title":"Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"McCoy","year":"2019"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1109\/ICDMW.2017.12","article-title":"Accelerated hierarchical density based clustering","volume-title":"Data Mining Workshops (ICDMW), 2017 IEEE International Conference on","author":"McInnes","year":"2017"},{"issue":"11","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"205","DOI":"10.21105\/joss.00205","article-title":"hdbscan: Hierarchical density based clustering","volume":"2","author":"McInnes","year":"2017","journal-title":"The Journal of Open Source Software"},{"issue":"29","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"861","DOI":"10.21105\/joss.00861","article-title":"UMAP: Uniform manifold approximation and projection","volume":"3","author":"McInnes","year":"2018","journal-title":"The Journal of Open Source Software"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"33","DOI":"10.18653\/v1\/2020.blackboxnlp-1.4","article-title":"What happens to BERT embeddings during fine-tuning?","volume-title":"Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP","author":"Merchant","year":"2020"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"2339","DOI":"10.18653\/v1\/2020.acl-main.212","article-title":"Syntactic data augmentation increases robustness to inference heuristics","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Min","year":"2020"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"5356","DOI":"10.18653\/v1\/2021.acl-long.416","article-title":"StereoSet: Measuring stereotypical bias in pretrained language models","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Nadeem","year":"2021"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"4609","DOI":"10.18653\/v1\/2020.acl-main.420","article-title":"Information-theoretic probing for linguistic structure","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Pimentel","year":"2020"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/ICPET53277.2021.00007","article-title":"Biomedical information extraction pipeline to identify disease-gene interactions from PubMed breast cancer literature","volume-title":"2021 International Conference on Promising Electronic Technologies (ICPET)","author":"Qumsiyeh","year":"2021"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"3042","DOI":"10.18653\/v1\/2021.findings-emnlp.261","article-title":"How does fine-tuning affect the geometry of embedding space: A case study on isotropy","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Rajaee","year":"2021"},{"issue":"2","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1200\/PO.18.00098","article-title":"Comparison of treatment recommendations by molecular tumor boards worldwide","author":"Rieke","year":"2018","journal-title":"JCO Precision Oncology"},{"issue":"15","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"e2016239118 (12 pp)","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"35","DOI":"10.18653\/v1\/W18-2305","article-title":"Identifying key sentences for precision oncology using semi-supervised learning","volume-title":"Proceedings of the BioNLP 2018 Workshop","author":"\u0160eva","year":"2018"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"4700","DOI":"10.18653\/v1\/2020.emnlp-main.379","article-title":"Bio-Megatron: Larger biomedical domain language model","volume-title":"EMNLP","author":"Shin","year":"2020"},{"issue":"11","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"e1005017","DOI":"10.1371\/journal.pcbi.1005017","article-title":"Text mining genotype-phenotype relationships from biomedical literature for database curation and precision medicine","volume":"12","author":"Singhal","year":"2016","journal-title":"PLoS Computational Biology"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","DOI":"10.1101\/2020.06.26.174417","article-title":"BERTology meets biology: Interpreting attention in protein language models","author":"Vig","year":"2021"},{"issue":"4","key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"245","DOI":"10.1200\/CCI.19.00127","article-title":"Civicpy: A Python software development and analysis toolkit for the CIViC knowledgebase","author":"Wagner","year":"2020","journal-title":"JCO Clinical Cancer Informatics"},{"key":"2023030119560161400_","article-title":"Pre-trained language models in biomedical domain: A systematic survey","author":"Wang","year":"2021"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1215","article-title":"Deep probabilistic logic: A unifying framework for indirect supervision","author":"Wang","year":"2018","journal-title":"CoRR"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.bionlp-1.20","article-title":"Improving biomedical pretrained language models with knowledge","author":"Yuan","year":"2021"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"4889","DOI":"10.18653\/v1\/2020.findings-emnlp.439","article-title":"Do language embeddings capture scales?","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Zhang","year":"2020"},{"key":"2023030119560161400_","doi-asserted-by":"publisher","first-page":"5017","DOI":"10.18653\/v1\/2021.naacl-main.398","article-title":"Factual probing is [MASK]: Learning vs. learning to recall","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Zhong","year":"2021"}],"container-title":["Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/coli\/article-pdf\/49\/1\/73\/2069018\/coli_a_00462.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/coli\/article-pdf\/49\/1\/73\/2069018\/coli_a_00462.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,3,1]],"date-time":"2023-03-01T19:56:08Z","timestamp":1677700568000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/49\/1\/73\/113017\/Transformers-and-the-Representation-of-Biomedical"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023]]},"references-count":51,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,3,1]]},"published-print":{"date-parts":[[2023,3,1]]}},"URL":"https:\/\/doi.org\/10.1162\/coli_a_00462","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"value":"0891-2017","type":"print"},{"value":"1530-9312","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023]]},"published":{"date-parts":[[2023]]}}}