{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T11:34:22Z","timestamp":1777635262767,"version":"3.51.4"},"reference-count":54,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2019,8,20]],"date-time":"2019-08-20T00:00:00Z","timestamp":1566259200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"SIDN fonds","award":["Pioneers of 2017: Wetenschappelijke onderzoeksprojecten"],"award-info":[{"award-number":["Pioneers of 2017: Wetenschappelijke onderzoeksprojecten"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MTI"],"abstract":"<jats:p>In the medical domain, user-generated social media text is increasingly used as a valuable complementary knowledge source to scientific medical literature. The extraction of this knowledge is complicated by colloquial language use and misspellings. However, lexical normalization of such data has not been addressed effectively. This paper presents a data-driven lexical normalization pipeline with a novel spelling correction module for medical social media. Our method significantly outperforms state-of-the-art spelling correction methods and can detect mistakes with an F1 of 0.63 despite extreme imbalance in the data. 
We also present the first corpus for spelling mistake detection and correction in a medical patient forum.<\/jats:p>","DOI":"10.3390\/mti3030060","type":"journal-article","created":{"date-parts":[[2019,8,21]],"date-time":"2019-08-21T11:19:06Z","timestamp":1566386346000},"page":"60","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["Data-Driven Lexical Normalization for Medical Social Media"],"prefix":"10.3390","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4332-0296","authenticated-orcid":false,"given":"Anne","family":"Dirkson","sequence":"first","affiliation":[{"name":"Leiden Institute for Advanced Computer Science, Leiden University, 2333 CA Leiden, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9609-9505","authenticated-orcid":false,"given":"Suzan","family":"Verberne","sequence":"additional","affiliation":[{"name":"Leiden Institute for Advanced Computer Science, Leiden University, 2333 CA Leiden, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7358-544X","authenticated-orcid":false,"given":"Abeed","family":"Sarker","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, Emory University, Atlanta, GA 30322, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7797-619X","authenticated-orcid":false,"given":"Wessel","family":"Kraaij","sequence":"additional","affiliation":[{"name":"Leiden Institute for Advanced Computer Science, Leiden University, 2333 CA Leiden, The Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2019,8,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Gonzalez-Hernandez, G., Sarker, A., O\u2019Connor, K., and Savova, G. (2017). 
Capturing the Patient\u2019s Perspective: A Review of Advances in Natural Language Processing of Health-Related Text. Yearb. Med. Inform., 214\u2013217.","DOI":"10.1055\/s-0037-1606506"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"231","DOI":"10.1007\/s40264-015-0379-4","article-title":"Social Media Mining for Toxicovigilance: Automatic Monitoring of Prescription Medication Abuse from Twitter","volume":"39","author":"Sarker","year":"2016","journal-title":"Drug Saf."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"202","DOI":"10.1016\/j.jbi.2015.02.004","article-title":"Utilizing social media data for pharmacovigilance: A review","volume":"54","author":"Sarker","year":"2015","journal-title":"J. Biomed. Inform."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1016\/j.sbspro.2011.10.577","article-title":"Text Normalization in Social Media: Progress, Problems and Applications for a Pre-Processing System of Casual English","volume":"27","author":"Clark","year":"2011","journal-title":"Procedia Soc. Behav. Sci."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1007\/s13278-017-0464-z","article-title":"A customizable pipeline for social media text normalization","volume":"7","author":"Sarker","year":"2017","journal-title":"Soc. Netw. Anal. Min."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Park, A., Hartzler, A.L., Huh, J., Mcdonald, D.W., and Pratt, W. (2015). Automatically Detecting Failures in Natural Language Processing Tools for Online Community Text. J. Med. Internet Res., 17.","DOI":"10.2196\/jmir.4612"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1186\/s13326-016-0084-y","article-title":"Normalizing acronyms and abbreviations to aid patient understanding of clinical texts: ShARe\/CLEF eHealth Challenge 2013, Task 2","volume":"7","author":"Mowery","year":"2016","journal-title":"J. Biomed. 
Semant."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"188","DOI":"10.1016\/j.jbi.2015.04.008","article-title":"Automated misspelling detection and correction in clinical free-text records","volume":"55","author":"Lai","year":"2015","journal-title":"J. Biomed. Inform."},{"key":"ref_9","unstructured":"Patrick, J., Sabbagh, M., Jain, S., and Zheng, H. (2010, January 18). Spelling correction in clinical notes with emphasis on first suggestion accuracy. Proceedings of the 2nd Workshop on Building and Evaluating Resources for Biomedical Text Mining, Valletta, Malta."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"e27","DOI":"10.2196\/medinform.4211","article-title":"Context-Sensitive Spelling Correction of Consumer-Generated Content on Health Care","volume":"3","author":"Zhou","year":"2015","journal-title":"JMIR Med. Inform."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1016\/j.jbi.2015.03.010","article-title":"Cadec: A corpus of adverse drug event annotations","volume":"55","author":"Karimi","year":"2015","journal-title":"J. Biomed. Inform."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"103838","DOI":"10.1016\/j.dib.2019.103838","article-title":"The PsyTAR dataset: From patients generated narratives to a corpus of adverse drug events and effectiveness of psychiatric medications","volume":"24","author":"Zolnoori","year":"2019","journal-title":"Data Brief"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Weissenbacher, D., Sarker, A., Paul, M.J., and Gonzalez-Hernandez, G. (2018). Overview of the third Social Media Mining for Health (SMM4H) shared tasks at EMNLP 2018. 
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop and Shared Task, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W18-5904"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1274","DOI":"10.1093\/jamia\/ocy114","article-title":"Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H)-2017 shared task","volume":"25","author":"Sarker","year":"2018","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Baldwin, T., de Marneffe, M.C., Han, B., Kim, Y.B., Ritter, A., and Xu, W. (2015). Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. Proceedings of the Workshop on Noisy User-generated Text, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W15-4319"},{"key":"ref_16","unstructured":"Van der Goot, R., and van Noord, G. (2017). MoNoise: Modeling Noise Using a Modular Normalization System. CoRR."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1197\/jamia.M1761","article-title":"Exploring and developing consuming health vocabulary","volume":"13","author":"Zeng","year":"2006","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_18","unstructured":"Han, B., Cook, P., and Baldwin, T. (2012). Automatically constructing a normalisation dictionary for microblogs. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2414425.2414430","article-title":"Lexical normalization for social media text","volume":"4","author":"Han","year":"2013","journal-title":"ACM Trans. Intell. Syst. 
Technol."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Jin, N. (2015). NCSU-SAS-Ning: Candidate generation and feature engineering for supervised lexical normalization. Proceedings of the Workshop on Noisy User-generated Text, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W15-4313"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Leeman-Munk, S., Lester, J., and Cox, J. (2015). NCSU_SAS_SAM: Deep Encoding and Reconstruction for Normalization of Noisy Text. Proceedings of the Workshop on Noisy User-generated Text, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W15-4323"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Min, W., and Mott, B. (2015). NCSU_SAS_WOOKHEE: A deep contextual long-short term memory model for text normalization. Proceedings of the Workshop on Noisy User-generated Text, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W15-4317"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1239","DOI":"10.1007\/s11063-018-9873-x","article-title":"Multi-task Character-Level Attentional Networks for Medical Concept Normalization","volume":"49","author":"Niu","year":"2019","journal-title":"Neural Process. Lett."},{"key":"ref_24","unstructured":"Belinkov, Y., and Bisk, Y. (2017). Synthetic and Natural Noise Both Break Neural Machine Translation. arXiv."},{"key":"ref_25","first-page":"68","article-title":"How robust are character-based word embeddings in tagging and MT against wrod scramlbing or randdm nouse?","volume":"Volume 1","author":"Heigold","year":"2018","journal-title":"Proceedings of the 13th Conference of the Association for Machine Translation in the Americas"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Malykh, V., Logacheva, V., and Khakhulin, T. (2018). Robust word vectors: Context-informed embeddings for noisy texts. 
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W18-6108"},{"key":"ref_27","first-page":"361","article-title":"RCV1: A New Benchmark Collection for Text Categorization Research","volume":"5","author":"Lewis","year":"2004","journal-title":"J. Mach. Learn. Res."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"169","DOI":"10.1016\/S0933-3657(03)00052-6","article-title":"Using lexical disambiguation and named-entity recognition to improve spelling correction in the electronic patient record","volume":"29","author":"Ruch","year":"2003","journal-title":"Artif. Intell. Med."},{"key":"ref_29","unstructured":"Wu, Y., Tang, B., Jiang, M., Moon, S., Denny, J.C., and Xu, H. (2013, January 23\u201326). Clinical Acronym\/Abbreviation Normalization using a Hybrid Approach. Proceedings of the Working Notes for CLEF 2013 Conference, Valencia, Spain."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1002\/j.1538-7305.1948.tb01338.x","article-title":"A mathematical theory of communication","volume":"27","author":"Shannon","year":"1948","journal-title":"Bell Syst. Tech. J."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1007\/BF01889984","article-title":"Probability scoring for spelling correction","volume":"1","author":"Church","year":"1991","journal-title":"Stat. Comput."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Fivez, P., \u0160uster, S., and Daelemans, W. (2017). Unsupervised context-sensitive spelling correction of clinical free-Text with Word and Character N-Gram Embeddings. 
BioNLP 2017, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W17-2317"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"160035","DOI":"10.1038\/sdata.2016.35","article-title":"MIMIC-III, a freely accessible critical care database","volume":"3","author":"Johnson","year":"2016","journal-title":"Sci. Data"},{"key":"ref_34","unstructured":"Burnage, G., Baayen, R., Piepenbrock, R., and van Rijn, H. (1990). CELEX: A Guide for Users, Centre for Lexical Information."},{"key":"ref_35","unstructured":"National Cancer Institute (2019, August 18). NCI Dictionary of Cancer Terms, Available online: https:\/\/www.cancer.gov\/publications\/dictionaries\/cancer-terms."},{"key":"ref_36","unstructured":"National Library of Medicine (US) (2019, August 18). RxNorm, Available online: https:\/\/www.nlm.nih.gov\/research\/umls\/rxnorm\/index.html."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"671","DOI":"10.1093\/jamia\/ocu041","article-title":"Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features","volume":"22","author":"Nikfarjam","year":"2015","journal-title":"J. Am. Med. Inform. Assoc. JAMIA"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"122","DOI":"10.1016\/j.dib.2016.11.056","article-title":"A corpus for mining drug-related knowledge from Twitter chatter: Language models and their utilities","volume":"10","author":"Sarker","year":"2017","journal-title":"Data Brief"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Reynaert, M. (2005). Text-Induced Spelling Correction. [Ph.D. Thesis, Tilburg University].","DOI":"10.3115\/1706238.1706256"},{"key":"ref_40","unstructured":"Miftahutdinov, Z.S., Tutubalina, E.V., and Tropsha, A.E. (June, January 31). Identifying disease-related expressions in reviews using conditional random fields. 
Proceedings of the International Conference Dialogue 2017, Moscow, Russia."},{"key":"ref_41","first-page":"38","article-title":"The Double Metaphone Search Algorithm","volume":"18","author":"Philips","year":"2000","journal-title":"C\/C++ Users J."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Beeksma, M., Verberne, S., van den Bosch, A., Hendrickx, I., Das, E., and Groenewoud, S. (2019). Predicting life expectancy with a recurrent neural network. BMC Med. Inform. Decis. Mak., 19.","DOI":"10.1186\/s12911-019-0775-2"},{"key":"ref_43","unstructured":"Toby Segaran, J.H. (2009). Natural language corpus data. Beautiful Data: The Stories Behind Elegant Data Solutions, O\u2019Reilly Media."},{"key":"ref_44","unstructured":"Huang, X., Smith, M.C., Paul, M., Ryzhkov, D., Quinn, S., Broniatowski, D., and Dredze, M. (2017, January 4\u20135). Examining patterns of influenza vaccination in social media. Proceedings of the AAAI Joint Workshop on Health Intelligence (W3PHIAI), San Francisco, CA, USA."},{"key":"ref_45","unstructured":"Paul, M.J., and Dredze, M. (2009). A Model for Mining Public Health Topics from Twitter, Johns Hopkins University. Technical Report."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"3441","DOI":"10.1016\/j.vaccine.2016.05.008","article-title":"Zika vaccine misconceptions: A social media analysis","volume":"34","author":"Dredze","year":"2016","journal-title":"Vaccine"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"377","DOI":"10.1145\/146370.146380","article-title":"Techniques for automatically correcting words in text","volume":"24","author":"Kukich","year":"1992","journal-title":"ACM Comput. Surv."},{"key":"ref_48","first-page":"90","article-title":"Phonetic spelling filter for keyword selection in drug mention mining from social media","volume":"2014","author":"Pimpalkhute","year":"2014","journal-title":"AMIA Joint Summits Transl. Sci. Proc."},{"key":"ref_49","unstructured":"Verberne, S. (2002). 
Context-Sensitive Spell Checking Based on Word Trigram Probabilities. [Master\u2019s Thesis, Radboud University]."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Supranovich, D., and Patsepnia, V. (2015). IHS_RD: Lexical normalization for english tweets. Proceedings of the Workshop on Noisy User-generated Text, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W15-4311"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Berend, G., and Tasn\u00e1di, E. (2015). USZEGED: Correction type-sensitive normalization of english tweets using efficiently indexed n-gram statistics. Proceedings of the Workshop on Noisy User-generated Text, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W15-4318"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Beckley, R. (2015). Bekli: A simple approach to twitter text normalization. Proceedings of the Workshop on Noisy User-generated Text, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W15-4312"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Doval Mosquera, Y., Vilares, J., and G\u00f3mez-Rodr\u00edguez, C. (2015). LYSGROUP: Adapting a Spanish microtext normalization system to English. Proceedings of the Workshop on Noisy User-Generated Text, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W15-4315"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Sarker, A., and Gonzalez, G. (2017). HLP@UPenn at SemEval-2017 Task 4A: A simple, self-optimizing text classification system combining dense and sparse vectors. 
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics.","DOI":"10.18653\/v1\/S17-2105"}],"container-title":["Multimodal Technologies and Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2414-4088\/3\/3\/60\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:12:31Z","timestamp":1760188351000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2414-4088\/3\/3\/60"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,8,20]]},"references-count":54,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2019,9]]}},"alternative-id":["mti3030060"],"URL":"https:\/\/doi.org\/10.3390\/mti3030060","relation":{},"ISSN":["2414-4088"],"issn-type":[{"value":"2414-4088","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,8,20]]}}}