{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T06:39:28Z","timestamp":1774679968801,"version":"3.50.1"},"reference-count":33,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2022,3,25]],"date-time":"2022-03-25T00:00:00Z","timestamp":1648166400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"funder":[{"DOI":"10.13039\/100008460","name":"National Center for Complementary and Integrative Health","doi-asserted-by":"crossref","award":["R01AT009457"],"award-info":[{"award-number":["R01AT009457"]}],"id":[{"id":"10.13039\/100008460","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100018188","name":"University of Minnesota Clinical and Translational Science Institute","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100018188","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100006108","name":"National Center for Advancing Translational Sciences","doi-asserted-by":"publisher","award":["UL1TR002494"],"award-info":[{"award-number":["UL1TR002494"]}],"id":[{"id":"10.13039\/100006108","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,6,14]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Objective<\/jats:title><jats:p>Accurate extraction of breast cancer patients\u2019 phenotypes is important for clinical decision support and clinical research. This study developed and evaluated cancer domain pretrained CancerBERT models for extracting breast cancer phenotypes from clinical texts. We also investigated the effect of customized cancer-related vocabulary on the performance of CancerBERT models.<\/jats:p><\/jats:sec><jats:sec><jats:title>Materials and Methods<\/jats:title><jats:p>A cancer-related corpus of breast cancer patients was extracted from the electronic health records of a local hospital. We annotated named entities in 200 pathology reports and 50 clinical notes for 8 cancer phenotypes for fine-tuning and evaluation. We kept pretraining the BlueBERT model on the cancer corpus with expanded vocabularies (using both term frequency-based and manually reviewed methods) to obtain CancerBERT models. The CancerBERT models were evaluated and compared with other baseline models on the cancer phenotype extraction task.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>All CancerBERT models outperformed all other models on the cancer phenotyping NER task. Both CancerBERT models with customized vocabularies outperformed the CancerBERT with the original BERT vocabulary. The CancerBERT model with manually reviewed customized vocabulary achieved the best performance with macro F1 scores equal to 0.876 (95% CI, 0.873\u20130.879) and 0.904 (95% CI, 0.902\u20130.906) for exact match and lenient match, respectively.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusions<\/jats:title><jats:p>The CancerBERT models were developed to extract the cancer phenotypes in clinical notes and pathology reports. The results validated that using customized vocabulary may further improve the performances of domain specific BERT models in clinical NLP tasks. The CancerBERT models developed in the study would further help clinical decision support.<\/jats:p><\/jats:sec>","DOI":"10.1093\/jamia\/ocac040","type":"journal-article","created":{"date-parts":[[2022,3,9]],"date-time":"2022-03-09T20:11:19Z","timestamp":1646856679000},"page":"1208-1216","source":"Crossref","is-referenced-by-count":107,"title":["CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records"],"prefix":"10.1093","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9846-1475","authenticated-orcid":false,"given":"Sicheng","family":"Zhou","sequence":"first","affiliation":[{"name":"Institute for Health Informatics, University of Minnesota , Minneapolis, Minnesota, USA"}]},{"given":"Nan","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Statistics, University of Minnesota , Minneapolis, Minnesota, USA"}]},{"given":"Liwei","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of AI and Informatics Research, Mayo Clinic , Rochester, Minnesota, USA"}]},{"given":"Hongfang","family":"Liu","sequence":"additional","affiliation":[{"name":"Department of AI and Informatics Research, Mayo Clinic , Rochester, Minnesota, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8258-3585","authenticated-orcid":false,"given":"Rui","family":"Zhang","sequence":"additional","affiliation":[{"name":"Institute for Health Informatics, University of Minnesota , Minneapolis, Minnesota, USA"},{"name":"Department of Pharmaceutical Care & Health Systems, University of Minnesota , Minneapolis, Minnesota, USA"}]}],"member":"286","published-online":{"date-parts":[[2022,3,25]]},"reference":[{"issue":"6","key":"2022061415540899900_ocac040-B1","doi-asserted-by":"crossref","first-page":"439","DOI":"10.3322\/caac.21412","article-title":"Breast cancer statistics, 2017, racial disparity in mortality by state","volume":"67","author":"DeSantis","year":"2017","journal-title":"CA Cancer J Clin"},{"issue":"12","key":"2022061415540899900_ocac040-B2","doi-asserted-by":"crossref","first-page":"693","DOI":"10.1038\/nrclinonc.2015.123","article-title":"Precision medicine for metastatic breast cancer \u2013 limitations and solutions","volume":"12","author":"Arnedos","year":"2015","journal-title":"Nat Rev Clin Oncol"},{"issue":"1","key":"2022061415540899900_ocac040-B3","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1186\/s12976-016-0035-4","article-title":"Toward precision medicine of breast cancer","volume":"13","author":"Carels","year":"2016","journal-title":"Theor Biol Med Model"},{"issue":"1","key":"2022061415540899900_ocac040-B4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12967-017-1239-z","article-title":"Precision medicine in breast cancer: reality or utopia?","volume":"15","author":"Bettaieb","year":"2017","journal-title":"J Transl Med"},{"issue":"e1","key":"2022061415540899900_ocac040-B5","doi-asserted-by":"crossref","first-page":"e162\u20139","DOI":"10.1136\/amiajnl-2011-000583","article-title":"Portability of an algorithm to identify rheumatoid arthritis in electronic health records","volume":"19","author":"Carroll","year":"2012","journal-title":"J Am Med Inform Assoc"},{"issue":"1","key":"2022061415540899900_ocac040-B6","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1111\/cts.12514","article-title":"Electronic health record phenotypes for precision medicine: perspectives and caveats from treatment of breast cancer at a single institution","volume":"11","author":"Breitenstein","year":"2018","journal-title":"Clin Transl Sci"},{"key":"2022061415540899900_ocac040-B7","first-page":"1","author":"Zhou","year":"2019"},{"key":"2022061415540899900_ocac040-B8","author":"Devlin","year":"2018"},{"issue":"1","key":"2022061415540899900_ocac040-B9","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1093\/jamia\/ocz063","article-title":"A study of deep learning approaches for medication and adverse drug event extraction from clinical text","volume":"27","author":"Wei","year":"2020","journal-title":"J Am Med Inform Assoc"},{"issue":"5","key":"2022061415540899900_ocac040-B10","doi-asserted-by":"crossref","first-page":"239","DOI":"10.1186\/s12911-019-0931-8","article-title":"Natural language processing for populating lung cancer clinical research data","volume":"19","author":"Wang","year":"2019","journal-title":"BMC Med Inform Decis Mak"},{"key":"2022061415540899900_ocac040-B11","first-page":"953","article-title":"Assessing the utility of automatic cancer registry notifications data extraction from free-text pathology reports","volume":"2015","author":"Nguyen","year":"2015","journal-title":"AMIA Annu Symp Proc"},{"issue":"2","key":"2022061415540899900_ocac040-B12","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1007\/s10549-016-4035-1","article-title":"Using machine learning to parse breast pathology reports","volume":"161","author":"Yala","year":"2017","journal-title":"Breast Cancer Res Treat"},{"issue":"21","key":"2022061415540899900_ocac040-B13","doi-asserted-by":"crossref","first-page":"e115","DOI":"10.1158\/0008-5472.CAN-17-0615","article-title":"DeepPhe: a natural language processing system for extracting cancer phenotypes from clinical records","volume":"77","author":"Savova","year":"2017","journal-title":"Cancer Res"},{"issue":"1","key":"2022061415540899900_ocac040-B14","doi-asserted-by":"crossref","first-page":"244","DOI":"10.1109\/JBHI.2017.2700722","article-title":"Deep learning for automated extraction of primary sites from cancer pathology reports","volume":"22","author":"Qiu","year":"2018","journal-title":"IEEE J Biomed Health Inform"},{"key":"2022061415540899900_ocac040-B15","first-page":"218","article-title":"Coarse-to-fine multi-task training of convolutional neural networks for automated information extraction from cancer pathology reports","author":"Alawad","year":"2018","journal-title":"IEEE EMBS Int Conf Biomed Health Inform BHI"},{"issue":"4","key":"2022061415540899900_ocac040-B16","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"2022061415540899900_ocac040-B17","author":"Peng","year":"2019"},{"key":"2022061415540899900_ocac040-B18","author":"Gu","year":"2020"},{"issue":"7","key":"2022061415540899900_ocac040-B19","doi-asserted-by":"crossref","first-page":"1393","DOI":"10.1093\/jamia\/ocab014","article-title":"Extracting postmarketing adverse events from safety reports in the vaccine adverse event reporting system (VAERS) using deep learning","volume":"28","author":"Du","year":"2021","journal-title":"J Am Med Inform Assoc"},{"issue":"3","key":"2022061415540899900_ocac040-B20","doi-asserted-by":"crossref","first-page":"569","DOI":"10.1093\/jamia\/ocaa218","article-title":"Deep learning approaches for extracting adverse events and indications of dietary supplements from clinical text","volume":"28","author":"Fan","year":"2021","journal-title":"J Am Med Inform Assoc"},{"key":"2022061415540899900_ocac040-B21","doi-asserted-by":"crossref","first-page":"103985","DOI":"10.1016\/j.ijmedinf.2019.103985","article-title":"Extracting comprehensive clinical information for breast cancer using deep learning methods","volume":"132","author":"Zhang","year":"2019","journal-title":"Int J Med Inform"},{"key":"2022061415540899900_ocac040-B22","author":"Ma","year":"2020"},{"key":"2022061415540899900_ocac040-B23","author":"Boukkouri","year":"2020"},{"key":"2022061415540899900_ocac040-B24","author":"Beltagy","year":"2019"},{"key":"2022061415540899900_ocac040-B25","first-page":"5","author":"Klie","year":"2018"},{"key":"2022061415540899900_ocac040-B26","author":"Wu","year":"2016"},{"key":"2022061415540899900_ocac040-B27","author":"Honnibal","year":"2020"},{"key":"2022061415540899900_ocac040-B28","first-page":"1524","author":"Ritter","year":","},{"issue":"12","key":"2022061415540899900_ocac040-B29","doi-asserted-by":"crossref","first-page":"1935","DOI":"10.1093\/jamia\/ocaa189","article-title":"Clinical concept extraction using transformers","volume":"27","author":"Yang","year":"2020","journal-title":"J Am Med Inform Assoc"},{"key":"2022061415540899900_ocac040-B30","first-page":"3111","article-title":"Distributed representations of words and phrases and their compositionality","author":"Mikolov","year":"2013","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2022061415540899900_ocac040-B31","first-page":"1532","author":"Pennington","year":"2014"},{"issue":"11","key":"2022061415540899900_ocac040-B32","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"Van der Maaten","year":"2008","journal-title":"J Mach Learn Res"},{"key":"2022061415540899900_ocac040-B33","author":"Eyre","year":"2021"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/29\/7\/1208\/44062229\/ocac040.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/29\/7\/1208\/44062229\/ocac040.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,28]],"date-time":"2023-01-28T18:47:19Z","timestamp":1674931639000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/29\/7\/1208\/6554005"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,25]]},"references-count":33,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2022,3,25]]},"published-print":{"date-parts":[[2022,6,14]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocac040","relation":{},"ISSN":["1527-974X"],"issn-type":[{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,7,1]]},"published":{"date-parts":[[2022,3,25]]}}}