{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T08:29:12Z","timestamp":1765268952278,"version":"build-2065373602"},"reference-count":30,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2024,3,29]],"date-time":"2024-03-29T00:00:00Z","timestamp":1711670400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"FCT and FEDER","award":["01\/SAICT\/2016 n\u00b0 022153","5545\/2020"],"award-info":[{"award-number":["01\/SAICT\/2016 n\u00b0 022153","5545\/2020"]}]},{"name":"Polytechnic Institute of Coimbra within the scope of Regulamento de Apoio \u00e0 Publica\u00e7\u00e3o Cient\u00edfica dos Estudantes do Instituto Polit\u00e9cnico de Coimbra","award":["01\/SAICT\/2016 n\u00b0 022153","5545\/2020"],"award-info":[{"award-number":["01\/SAICT\/2016 n\u00b0 022153","5545\/2020"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>The International Classification of Diseases, 10th edition (ICD-10), has been widely used for the classification of patient diagnostic information. This classification is usually performed by dedicated physicians with specific coding training, and it is a laborious task. Automatic classification is a challenging task for the domain of natural language processing. Therefore, automatic methods have been proposed to aid the classification process. This paper proposes a method where Cosine text similarity is combined with a pretrained language model, PLM-ICD, in order to increase the number of probably useful suggestions of ICD-10 codes, based on the Medical Information Mart for Intensive Care (MIMIC)-IV dataset. The results show that a strategy of using multiple runs, and bucket category search, in the Cosine method, improves the results, providing more useful suggestions. Also, the use of a strategy composed by the Cosine method and PLM-ICD, which was called PLM-ICD-C, provides better results than just the PLM-ICD.<\/jats:p>","DOI":"10.3390\/a17040144","type":"journal-article","created":{"date-parts":[[2024,3,29]],"date-time":"2024-03-29T06:33:16Z","timestamp":1711693996000},"page":"144","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Aiding ICD-10 Encoding of Clinical Health Records Using Improved Text Cosine Similarity and PLM-ICD"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1184-2433","authenticated-orcid":false,"given":"Hugo","family":"Silva","sequence":"first","affiliation":[{"name":"Polytechnic Institute of Coimbra, Coimbra Institute of Engineering, Rua Pedro Nunes\u2013Quinta da Nora, 3030-199 Coimbra, Portugal"}]},{"given":"V\u00edtor","family":"Duque","sequence":"additional","affiliation":[{"name":"Department of Infectious Diseases, Coimbra Hospital and University Centre, 3000-075 Coimbra, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4145-7301","authenticated-orcid":false,"given":"M\u00e1rio","family":"Macedo","sequence":"additional","affiliation":[{"name":"RCM2+ Research Centre for Asset Management and Systems Engineering, ISEC\/IPC, Rua Pedro Nunes, 3030-199 Coimbra, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4313-7966","authenticated-orcid":false,"given":"Mateus","family":"Mendes","sequence":"additional","affiliation":[{"name":"Polytechnic Institute of Coimbra, Coimbra Institute of Engineering, Rua Pedro Nunes\u2013Quinta da Nora, 3030-199 Coimbra, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2024,3,29]]},"reference":[{"key":"ref_1","unstructured":"(2024, March 10). International Classification of Diseases 11th Revision. Available online: https:\/\/www.who.int\/standards\/classifications\/classification-of-diseases."},{"key":"ref_2","first-page":"28","article-title":"Health records as the basis of clinical coding: Is the quality adequate? A qualitative study of medical coders\u2019 perceptions","volume":"49","author":"Alonso","year":"2020","journal-title":"Health Inf. Manag. J."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"766","DOI":"10.1186\/s12913-017-2697-y","article-title":"Barriers to data quality resulting from the process of coding health information to administrative data: A qualitative study","volume":"17","author":"Lucyk","year":"2017","journal-title":"BMC Health Serv. Res."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"102086","DOI":"10.1016\/j.artmed.2021.102086","article-title":"Med7: A transferable clinical natural language processing model for electronic health records","volume":"118","author":"Kormilitzin","year":"2021","journal-title":"Artif. Intell. Med."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"100511","DOI":"10.1016\/j.cosrev.2022.100511","article-title":"Neural natural language processing for unstructured data in electronic health records: A review","volume":"46","author":"Li","year":"2022","journal-title":"Comput. Sci. Rev."},{"key":"ref_6","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: A pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Alsentzer, E., Murphy, J.R., Boag, W., Weng, W.H., Jin, D., Naumann, T., and McDermott, M. (2019). Publicly available clinical BERT embeddings. arXiv.","DOI":"10.18653\/v1\/W19-1909"},{"key":"ref_9","first-page":"1","article-title":"Domain-specific language model pretraining for biomedical natural language processing","volume":"3","author":"Gu","year":"2021","journal-title":"ACM Trans. Comput. Healthc. (HEALTH)"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Lewis, P., Ott, M., Du, J., and Stoyanov, V. (2020, January 19). Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art. Proceedings of the 3rd Clinical Natural Language Processing Workshop, Online.","DOI":"10.18653\/v1\/2020.clinicalnlp-1.17"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Liu, J., and Razavian, N. (2020). BERT-XML: Large scale automated ICD coding using BERT pretraining. arXiv.","DOI":"10.18653\/v1\/2020.clinicalnlp-1.3"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Huang, C.W., Tsai, S.C., and Chen, Y.N. (2022). PLM-ICD: Automatic ICD coding with pretrained language models. arXiv.","DOI":"10.18653\/v1\/2022.clinicalnlp-1.2"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Edin, J., Junge, A., Havtorn, J.D., Borgholt, L., Maistro, M., Ruotsalo, T., and Maal\u00f8e, L. (2023). Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study. arXiv.","DOI":"10.1145\/3539618.3591918"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Falter, M., Godderis, D., Scherrenberg, M., Kizilkilic, S.E., Xu, L., Mertens, M., Jansen, J., Legroux, P., Kindermans, H., and Sinnaeve, P. (2024). Using Natural Language Processing for Automated Classification of Disease and to Identify Misclassified ICD Codes in Cardiac Disease. Eur. Heart J. Digit. Health, ztae008.","DOI":"10.1093\/ehjdh\/ztae008"},{"key":"ref_15","unstructured":"Silva, A., Chaves, P., Rijo, S., Bone, J., Oliveira, T., and Novais, P. (September, January 31). Leveraging TFR-BERT for ICD Diagnoses Ranking. Proceedings of the EPIA Conference on Artificial Intelligence, Horta, Portugal."},{"key":"ref_16","unstructured":"Falter, M., Godderis, D., Scherrenberg, M., Kizilkilic, S.E., Xu, L., Mertens, M., and Dendale, P. (2022). Automatic International Classification of Diseases Coding System: Deep Contextualized Language Model with Rule-Based Approaches, JMIR Publications Inc."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Silvestri, S., Gargiulo, F., Ciampi, M., and De Pietro, G. (2020, January 7\u201310). Exploit multilingual language model at scale for ICD-10 clinical text classification. Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), Rennes, France.","DOI":"10.1109\/ISCC50000.2020.9219640"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Yu, X., Hu, W., Lu, S., Sun, X., and Yuan, Z. (2019, January 23\u201325). BioBERT based named entity recognition in electronic medical record. Proceedings of the 2019 10th International Conference on Information Technology in Medicine and Education (ITME), Qingdao, China.","DOI":"10.1109\/ITME.2019.00022"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"e23230","DOI":"10.2196\/23230","article-title":"Automatic ICD-10 coding and training system: Deep neural network based on supervised learning","volume":"9","author":"Chen","year":"2021","journal-title":"JMIR Med. Inform."},{"key":"ref_20","unstructured":"Choi, E., Bahadori, M.T., Schuetz, A., Stewart, W.F., and Sun, J. (2016, January 19\u201320). Doctor ai: Predicting clinical events via recurrent neural networks. Proceedings of the Machine Learning for Healthcare Conference (PMLR), Los Angeles, CA, USA."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Li, F., and Yu, H. (2020, January 7\u201312). ICD coding from clinical text using multi-filter residual convolutional neural network. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i05.6331"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., and Eisenstein, J. (2018). Explainable prediction of medical codes from clinical text. arXiv.","DOI":"10.18653\/v1\/N18-1100"},{"key":"ref_23","unstructured":"Shi, H., Xie, P., Hu, Z., Zhang, M., and Xing, E.P. (2017). Towards automated ICD coding using deep learning. arXiv."},{"key":"ref_24","first-page":"4357","article-title":"A review on deep neural networks for ICD coding","volume":"35","author":"Teng","year":"2022","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41597-022-01899-x","article-title":"MIMIC-IV, a freely accessible electronic health record dataset","volume":"10","author":"Johnson","year":"2023","journal-title":"Sci. Data"},{"key":"ref_26","unstructured":"Gupta, M., Gallamoza, B., Cutrona, N., Dhakal, P., Poulain, R., and Beheshti, R. (2022, January 28). An extensive data processing pipeline for mimic-iv. Proceedings of the Machine Learning for Health (PMLR), New Orleans, LA, USA."},{"key":"ref_27","unstructured":"(2024, March 10). Guide to Classification on Imbalanced Datasets. Available online: https:\/\/resources.experfy.com\/ai-ml\/imbalanced-datasets-guide-classification\/."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Vu, T., Nguyen, D.Q., and Nguyen, A. (2020). A label attention model for icd coding from clinical text. arXiv.","DOI":"10.24963\/ijcai.2020\/461"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"138","DOI":"10.1093\/pubmed\/fdr054","article-title":"Systematic review of discharge coding accuracy","volume":"34","author":"Burns","year":"2012","journal-title":"J. Public Health"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Searle, T., Ibrahim, Z., and Dobson, R.J. (2020). Experimental evaluation and development of a silver-standard for the MIMIC-III clinical coding dataset. arXiv.","DOI":"10.18653\/v1\/2020.bionlp-1.8"}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/17\/4\/144\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:20:52Z","timestamp":1760106052000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/17\/4\/144"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,29]]},"references-count":30,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,4]]}},"alternative-id":["a17040144"],"URL":"https:\/\/doi.org\/10.3390\/a17040144","relation":{},"ISSN":["1999-4893"],"issn-type":[{"type":"electronic","value":"1999-4893"}],"subject":[],"published":{"date-parts":[[2024,3,29]]}}}