{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T20:22:00Z","timestamp":1770754920006,"version":"3.50.0"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,11,16]],"date-time":"2022-11-16T00:00:00Z","timestamp":1668556800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,11,16]],"date-time":"2022-11-16T00:00:00Z","timestamp":1668556800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100005006","name":"Asan Institute for Life Sciences, Asan Medical Center","doi-asserted-by":"publisher","award":["Elimination of Cancer Project Fund"],"award-info":[{"award-number":["Elimination of Cancer Project Fund"]}],"id":[{"id":"10.13039\/501100005006","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100006404","name":"Ministry of Information and Communication","doi-asserted-by":"publisher","award":["2020-0-01361"],"award-info":[{"award-number":["2020-0-01361"]}],"id":[{"id":"10.13039\/501100006404","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002701","name":"Ministry of Education","doi-asserted-by":"publisher","award":["NRF-2020S1A5B1104865"],"award-info":[{"award-number":["NRF-2020S1A5B1104865"]}],"id":[{"id":"10.13039\/501100002701","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002573","name":"Yonsei University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100002573","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Unstructured text in medical records, 
such as Electronic Health Records, contain an enormous amount of valuable information for research; however, it is difficult to extract and structure important information because of frequent typographical errors. Therefore, improving the quality of data with errors for text analysis is an essential task. To date, few prior studies have been conducted addressing this. Here, we propose a new methodology for extracting important information from unstructured medical texts by overcoming the typographical problem in surgical pathology records related to lung cancer.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods<\/jats:title><jats:p>We propose a typo correction model that considers context, based on the Masked Language Model, to solve the problem of typographical errors in real-world medical data. In addition, a word dictionary was used for the typo correction model based on PubMed abstracts. After refining the data through typo correction, fine tuning was performed on pre-trained BERT model. Next, deep learning-based Named Entity Recognition (NER) was performed. By solving the quality problem of medical data, we sought to improve the accuracy of information extraction in unstructured text data.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>We compared the performance of the proposed typo correction model based on contextual information with an existing SymSpell model. We confirmed that our proposed model outperformed the existing model in a typographical correction task. The F1-score of the model improved by approximately 5% and 9% when compared with the model without contextual information in the NCBI-disease and surgical pathology record datasets, respectively. In addition, the F1-score of NER after typo correction increased by 2% in the NCBI-disease dataset. There was a significant performance difference of approximately 25% between the before and after typo correction in the Surgical pathology record dataset. 
This confirmed that typos influenced the information extraction of the unstructured text.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>We verified that typographical errors in unstructured text negatively affect the performance of natural language processing tasks. The proposed method of a typo correction model outperformed the existing SymSpell model. This study shows that the proposed model is robust and can be applied in real-world environments by focusing on the typos that cause difficulties in analyzing unstructured medical text.<\/jats:p><\/jats:sec>","DOI":"10.1186\/s12859-022-05035-9","type":"journal-article","created":{"date-parts":[[2022,11,16]],"date-time":"2022-11-16T17:04:58Z","timestamp":1668618298000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["MLM-based typographical error correction of unstructured medical texts for named entity recognition"],"prefix":"10.1186","volume":"23","author":[{"given":"Eun Byul","family":"Lee","sequence":"first","affiliation":[]},{"given":"Go Eun","family":"Heo","sequence":"additional","affiliation":[]},{"given":"Chang Min","family":"Choi","sequence":"additional","affiliation":[]},{"given":"Min","family":"Song","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,11,16]]},"reference":[{"issue":"3","key":"5035_CR1","doi-asserted-by":"publisher","first-page":"287","DOI":"10.1093\/bib\/6.3.287","volume":"6","author":"M Scherf","year":"2005","unstructured":"Scherf M, Epple A, Werner T. The next generation of literature analysis: integration of genomic analysis into text mining. Brief Bioinform. 2005;6(3):287\u201397.","journal-title":"Brief Bioinform"},{"issue":"3","key":"5035_CR2","doi-asserted-by":"publisher","first-page":"1707","DOI":"10.1016\/j.eswa.2007.01.035","volume":"34","author":"D Delen","year":"2008","unstructured":"Delen D, Crossland MD. 
Seeding the survey and analysis of research literature with text mining. Expert Syst Appl. 2008;34(3):1707\u201320.","journal-title":"Expert Syst Appl"},{"issue":"1","key":"5035_CR3","doi-asserted-by":"publisher","first-page":"30","DOI":"10.1109\/TKDE.2010.211","volume":"24","author":"N Zhong","year":"2010","unstructured":"Zhong N, Li Y, Wu ST. Effective pattern discovery for text mining. IEEE Trans Knowl Data Eng. 2010;24(1):30\u201344.","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"5035_CR4","doi-asserted-by":"crossref","unstructured":"Chen H, Chiang RH, Storey VC. Business intelligence and analytics: From big data to big impact. MIS Quart. 2012;1165\u201388.","DOI":"10.2307\/41703503"},{"issue":"1","key":"5035_CR5","first-page":"153","volume":"5","author":"TK Das","year":"2013","unstructured":"Das TK, Kumar PM. Big data analytics: A framework for unstructured data analysis. Int J Eng Sci Technol. 2013;5(1):153.","journal-title":"Int J Eng Sci Technol"},{"issue":"2","key":"5035_CR6","doi-asserted-by":"publisher","first-page":"137","DOI":"10.1016\/j.ijinfomgt.2014.10.007","volume":"35","author":"A Gandomi","year":"2015","unstructured":"Gandomi A, Haider M. Beyond the hype: Big data concepts, methods, and analytics. Int J Inf Manage. 2015;35(2):137\u201344.","journal-title":"Int J Inf Manage"},{"issue":"3","key":"5035_CR7","doi-asserted-by":"publisher","first-page":"1314","DOI":"10.1016\/j.eswa.2014.09.024","volume":"42","author":"S Moro","year":"2015","unstructured":"Moro S, Cortez P, Rita P. Business intelligence in banking: A literature analysis from 2002 to 2013 using text mining and latent Dirichlet allocation. Expert Syst Appl. 2015;42(3):1314\u201324.","journal-title":"Expert Syst Appl"},{"key":"5035_CR8","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1016\/j.inffus.2015.08.005","volume":"28","author":"G Bello-Orgaz","year":"2016","unstructured":"Bello-Orgaz G, Jung JJ, Camacho D. Social big data: Recent achievements and new challenges. 
Inf Fusion. 2016;28:45\u201359.","journal-title":"Inf Fusion"},{"issue":"10","key":"5035_CR9","doi-asserted-by":"publisher","first-page":"1421","DOI":"10.1001\/jamaoncol.2019.1800","volume":"5","author":"KL Kehl","year":"2019","unstructured":"Kehl KL, Elmarakeby H, Nishino M, Van Allen EM, Lepisto EM, Hassett MJ, et al. Assessment of deep natural language processing in ascertaining oncologic outcomes from radiology reports. JAMA Oncol. 2019;5(10):1421\u20139.","journal-title":"JAMA Oncol"},{"issue":"23","key":"5035_CR10","doi-asserted-by":"publisher","first-page":"2293","DOI":"10.1056\/NEJMsb1609216","volume":"375","author":"RE Sherman","year":"2016","unstructured":"Sherman RE, Anderson SA, Dal Pan GJ, Gray GW, Gross T, Hunter NL, et al. Real-world evidence\u2014what is it and what can it tell us. N Engl J Med. 2016;375(23):2293\u20137.","journal-title":"N Engl J Med"},{"key":"5035_CR11","unstructured":"Hersh WR, Campbell EM, Malveau SE. Assessing the feasibility of large-scale natural language processing in a corpus of ordinary medical records: a lexical analysis. AMIA Annu Symp Proc. 1997;580."},{"key":"5035_CR12","unstructured":"Zhou L, Mahoney LM, Shakurova A, Goss F, Chang FY, Bates DW, et al. How many medication orders are entered through free-text in EHRs?-a study on hypoglycemic agents. AMIA Annu Symp Proc. 2012;1079."},{"issue":"4","key":"5035_CR13","doi-asserted-by":"publisher","first-page":"923","DOI":"10.2214\/AJR.11.6691","volume":"197","author":"S Basma","year":"2011","unstructured":"Basma S, Lord B, Jacks LM, Rizk M, Scaranelo AM. Error rates in breast imaging reports: comparison of automatic speech recognition and dictation transcription. AJR. 2011;197(4):923\u20137.","journal-title":"AJR"},{"issue":"10","key":"5035_CR14","doi-asserted-by":"publisher","first-page":"1161","DOI":"10.1093\/ajhp\/54.10.1161","volume":"54","author":"BL Lambert","year":"1997","unstructured":"Lambert BL. Predicting look-alike and sound-alike medication errors. 
Am J Health-Syst Pharm. 1997;54(10):1161\u201371.","journal-title":"Am J Health-Syst Pharm"},{"key":"5035_CR15","doi-asserted-by":"crossref","unstructured":"Ruch P. Using contextual spelling correction to improve retrieval effectiveness in degraded text collections. In COLING 2002: Proc Conf Assoc Comput Linguist Meet. 2002;19.","DOI":"10.3115\/1072228.1072337"},{"key":"5035_CR16","doi-asserted-by":"crossref","unstructured":"Britz D, Goldie A, Luong MT, Le Q. Massive exploration of neural machine translation architectures. arXiv preprint arXiv. 2017;1703.03906.","DOI":"10.18653\/v1\/D17-1151"},{"key":"5035_CR17","doi-asserted-by":"publisher","first-page":"188","DOI":"10.1016\/j.jbi.2015.04.008","volume":"55","author":"KH Lai","year":"2015","unstructured":"Lai KH, Topaz M, Goss FR, Zhou L. Automated misspelling detection and correction in clinical free-text records. J Biomed Inform. 2015;55:188\u201395.","journal-title":"J Biomed Inform"},{"key":"5035_CR18","doi-asserted-by":"publisher","first-page":"152565","DOI":"10.1109\/ACCESS.2020.3014779","volume":"8","author":"JH Lee","year":"2020","unstructured":"Lee JH, Kim M, Kwon HC. Deep Learning-Based Context-Sensitive Spelling Typing Error Correction. IEEE Access. 2020;8:152565\u201378.","journal-title":"IEEE Access"},{"issue":"12","key":"5035_CR19","doi-asserted-by":"publisher","first-page":"832","DOI":"10.1016\/j.ijmedinf.2010.09.005","volume":"79","author":"C Senger","year":"2010","unstructured":"Senger C, Kaltschmidt J, Schmitt SP, Pruszydlo MG, Haefeli WE. Misspellings in drug information system queries: characteristics of drug name spelling errors and strategies for their prevention. Int J Med Inform. 2010;79(12):832\u20139.","journal-title":"Int J Med Inform"},{"key":"5035_CR20","unstructured":"Kilicoglu H, Fiszman M, Roberts K, Demner-Fushman D. An ensemble method for spelling correction in consumer health questions. AMIA Annu Symp Proc. 
2015;727."},{"issue":"1","key":"5035_CR21","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13104-019-4073-y","volume":"12","author":"TE Workman","year":"2019","unstructured":"Workman TE, Shao Y, Divita G, Zeng-Treitler Q. An efficient prototype method to identify and correct misspellings in clinical text. BMC Res Notes. 2019;12(1):1\u20135.","journal-title":"BMC Res Notes"},{"key":"5035_CR22","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;5998\u20136008."},{"key":"5035_CR23","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K. Bert. Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv. 2018;1810.04805."},{"key":"5035_CR24","unstructured":"Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst. 2013;3111\u20139."},{"key":"5035_CR25","first-page":"135","volume":"5","author":"P Bojanowski","year":"2017","unstructured":"Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Trans Assoc Comput. 2017;5:135\u201346.","journal-title":"Trans Assoc Comput"},{"key":"5035_CR26","doi-asserted-by":"crossref","unstructured":"Pennington J, Socher R, Manning CD. Glove: Global vectors for word representation. Proc EMNLP. 2014;1532\u201343.","DOI":"10.3115\/v1\/D14-1162"},{"key":"5035_CR27","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1016\/j.jbi.2015.09.010","volume":"58","author":"Y Chen","year":"2015","unstructured":"Chen Y, Lasko TA, Mei Q, Denny JC, Xu H. A study of active learning methods for named entity recognition in clinical text. J Biomed Inform. 2015;58:11\u20138.","journal-title":"J Biomed Inform"},{"key":"5035_CR28","unstructured":"Wu Y, Xu J, Jiang M, Zhang Y, Xu H. A study of neural word embeddings for named entity recognition in clinical text. 
AMIA Annu Symp Proc. 2015;1326."},{"key":"5035_CR29","unstructured":"Wu Y, Jiang M, Xu J, Zhi D, Xu H. Clinical named entity recognition using deep learning models. AMIA Annu Symp Proc. 2017;1812."},{"issue":"12","key":"5035_CR30","first-page":"1935","volume":"27","author":"X Yang","year":"2020","unstructured":"Yang X, Bian J, Hogan WR, Wu Y. Clinical concept extraction using transformers. JAMIA. 2020;27(12):1935\u201342.","journal-title":"JAMIA"},{"key":"5035_CR31","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.jbi.2013.12.006","volume":"47","author":"RI Do\u011fan","year":"2014","unstructured":"Do\u011fan RI, Leaman R, Lu Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform. 2014;47:1\u201310.","journal-title":"J Biomed Inform"},{"key":"5035_CR32","unstructured":"ASAN MEDICAL CENTER. http:\/\/eng.amc.seoul.kr\/. Accessed 10 August 2020."},{"key":"5035_CR33","unstructured":"SymSpell. https:\/\/github.com\/wolfgarbe\/SymSpell. Accessed 20 August 2020."},{"issue":"4","key":"5035_CR34","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","volume":"36","author":"J Lee","year":"2020","unstructured":"Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234\u201340.","journal-title":"Bioinformatics"},{"key":"5035_CR35","doi-asserted-by":"crossref","unstructured":"Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323. 2019.","DOI":"10.18653\/v1\/W19-1909"},{"issue":"1","key":"5035_CR36","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41746-021-00455-y","volume":"4","author":"L Rasmy","year":"2021","unstructured":"Rasmy L, Xiang Y, Xie Z, Tao C, Zhi D. 
Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit Med. 2021;4(1):1\u201313.","journal-title":"NPJ Digit Med"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-022-05035-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-022-05035-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-022-05035-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,3,12]],"date-time":"2023-03-12T08:59:37Z","timestamp":1678611577000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-022-05035-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,16]]},"references-count":36,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2022,12]]}},"alternative-id":["5035"],"URL":"https:\/\/doi.org\/10.1186\/s12859-022-05035-9","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,11,16]]},"assertion":[{"value":"11 February 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 November 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 November 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not 
applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"486"}}