{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:28:13Z","timestamp":1760059693306,"version":"build-2065373602"},"reference-count":28,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2025,7,1]],"date-time":"2025-07-01T00:00:00Z","timestamp":1751328000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Profi6 theme 6GESS","award":["346208","220106"],"award-info":[{"award-number":["346208","220106"]}]},{"DOI":"10.13039\/501100002341","name":"Research Council of Finland (former Academy of Finland) 6G Flagship Programme","doi-asserted-by":"publisher","award":["346208","220106"],"award-info":[{"award-number":["346208","220106"]}],"id":[{"id":"10.13039\/501100002341","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Jenny and Antti Wihuri Foundation, Helsinki, Finland","award":["346208","220106"],"award-info":[{"award-number":["346208","220106"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>In the era of digital healthcare, electronic health records generate vast amounts of data, much of which is unstructured, and therefore, not in a usable format for conventional machine learning and artificial intelligence applications. This study investigates how to extract meaningful insights from unstructured radiology reports written in Finnish, a minority language, using machine learning techniques for text analysis. With this approach, unstructured information could be transformed into a structured format. The results of this research show that relevant information can be effectively extracted from Finnish medical reports using classification algorithms with default parameter values. For the detection of breast tumour mentions from medical texts, classifiers achieved high accuracy, almost 90%. 
Detection of metastasis mentions, however, proved more challenging, with the best-performing models Support Vector Machine (SVM) and logistic regression achieving an F1-score of 81%. The lower performance in metastasis detection is likely due to the more complex problem, ambiguous labeling, and the smaller dataset size. The results of classical classifiers were also compared with FinBERT, a domain-adapted Finnish BERT model. However, classical classifiers outperformed FinBERT. This highlights the challenge of medical language processing when working with minority languages. Moreover, it was noted that parameter tuning based on translated English reports did not significantly improve the detection rates, likely due to linguistic differences between the datasets. This larger translated dataset used for tuning comes from a different clinical domain and employs noticeably simpler, less nuanced language than the Finnish breast cancer reports, which are written by native Finnish-speaking medical experts. 
This underscores the need for localised datasets and models, particularly for minority languages with unique grammatical structures.<\/jats:p>","DOI":"10.3390\/data10070104","type":"journal-article","created":{"date-parts":[[2025,7,1]],"date-time":"2025-07-01T04:04:22Z","timestamp":1751342662000},"page":"104","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Extracting Information from Unstructured Medical Reports Written in Minority Languages: A Case Study of Finnish"],"prefix":"10.3390","volume":"10","author":[{"given":"Elisa","family":"Myllyl\u00e4","sequence":"first","affiliation":[{"name":"Biomimetics and Intelligent Systems Group, University of Oulu, FI-90014 Oulu, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5995-5421","authenticated-orcid":false,"given":"Pekka","family":"Siirtola","sequence":"additional","affiliation":[{"name":"Biomimetics and Intelligent Systems Group, University of Oulu, FI-90014 Oulu, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5335-7535","authenticated-orcid":false,"given":"Antti","family":"Isosalo","sequence":"additional","affiliation":[{"name":"Research Unit of Health Sciences and Technology, University of Oulu, FI-90014 Oulu, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2306-3111","authenticated-orcid":false,"given":"Jarmo","family":"Reponen","sequence":"additional","affiliation":[{"name":"Research Unit of Health Sciences and Technology, University of Oulu, FI-90014 Oulu, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5215-6889","authenticated-orcid":false,"given":"Satu","family":"Tamminen","sequence":"additional","affiliation":[{"name":"Biomimetics and Intelligent Systems Group, University of Oulu, FI-90014 Oulu, 
Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0208-5178","authenticated-orcid":false,"given":"Outi","family":"Laatikainen","sequence":"additional","affiliation":[{"name":"Medical Research Center Oulu and Oulu University Hospital, FI-90014 Oulu, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,7,1]]},"reference":[{"unstructured":"Vehko, T., Ruotsalainen, S., and Hypp\u00f6nen, H. (2019). E-Health and E-Welfare of Finland: Check Point 2018, Finnish Institute for Health and Welfare. report 7.","key":"ref_1"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"101349","DOI":"10.1016\/j.imu.2023.101349","article-title":"Explainable Artificial Intelligence to predict clinical outcomes in Type 1 diabetes and relapsing-remitting multiple sclerosis adult patients","volume":"42","author":"Ihalapathirana","year":"2023","journal-title":"Inform. Med. Unlocked"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"13","DOI":"10.2147\/CLEP.S380828","article-title":"Data-Driven Identification of Long-Term Glycemia Clusters and Their Individualized Predictors in Finnish Patients with Type 2 Diabetes","volume":"15","author":"Lavikainen","year":"2023","journal-title":"Clin. Epidemiol."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1351","DOI":"10.1001\/jama.2013.393","article-title":"The inevitable application of big data to health care","volume":"309","author":"Murdoch","year":"2013","journal-title":"Jama"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1038\/s41746-024-01233-2","article-title":"Privacy-preserving large language models for structured medical information retrieval","volume":"7","author":"Wiest","year":"2024","journal-title":"NPJ Digit. 
Med."},{"doi-asserted-by":"crossref","unstructured":"Wiest, I.C., Wolf, F., Lessmann, M.E., van Treeck, M., Ferber, D., Zhu, J., Boehme, H., Bressem, K.K., Ulrich, H., and Ebert, M.P. (2024). LLM-AIx: An open source pipeline for Information Extraction from unstructured medical text based on privacy preserving Large Language Models. medRxiv.","key":"ref_6","DOI":"10.1101\/2024.09.02.24312917"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"532","DOI":"10.1192\/bjp.2024.134","article-title":"Detection of suicidality from medical text using privacy-preserving large language models","volume":"225","author":"Wiest","year":"2024","journal-title":"Br. J. Psychiatry"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1737","DOI":"10.1109\/JBHI.2021.3123192","article-title":"A deep language model for symptom extraction from clinical text and its application to extract COVID-19 symptoms from social media","volume":"26","author":"Luo","year":"2021","journal-title":"IEEE J. Biomed. Health Inform."},{"unstructured":"Shaaban, M.A., Akkasi, A., Khan, A., Komeili, M., and Yaqub, M. (2024). Fine-Tuned Large Language Models for Symptom Recognition from Spanish Clinical Text. arXiv.","key":"ref_9"},{"doi-asserted-by":"crossref","unstructured":"Ye, Y., Wagner, M.M., Cooper, G.F., Ferraro, J.P., Su, H., Gesteland, P.H., Haug, P.J., Millett, N.E., Aronis, J.M., and Nowalk, A.J. (2017). A study of the transferability of influenza case detection systems between two large healthcare systems. PLoS ONE, 12.","key":"ref_10","DOI":"10.1371\/journal.pone.0174970"},{"doi-asserted-by":"crossref","unstructured":"Klippi, A., and Launonen, K. (2008). 2. Aspects of the Structure of Finnish. Research in Logopedics, Multilingual Matters.","key":"ref_11","DOI":"10.21832\/9781847690593"},{"unstructured":"Marr, V., Aldus, V., and Brookes, I. (2008). 
The Chambers Dictionary, Chambers.","key":"ref_12"},{"unstructured":"Virtanen, A., Kanerva, J., Ilo, R., Luoma, J., Luotolahti, J., Salakoski, T., Ginter, F., and Pyysalo, S. (2019). Multilingual is not enough: BERT for Finnish. arXiv.","key":"ref_13"},{"unstructured":"Tanskanen, A., Toivanen, R., and Vehvil\u00e4inen, T. (2025, June 25). RoBERTa Large Model for Finnish. Hugging Face Model Repository, 2022. Available online: https:\/\/huggingface.co\/Finnish-NLP\/roberta-large-finnish.","key":"ref_14"},{"unstructured":"Tanskanen, A., and Toivanen, R. (2025, June 25). ConvBERT for Finnish. Hugging Face Model Repository, 2022. Available online: https:\/\/huggingface.co\/Finnish-NLP\/convbert-base-finnish.","key":"ref_15"},{"unstructured":"Eskelinen, A., Silvala, L., Ginter, F., Pyysalo, S., and Laippala, V. (2023, January 28\u201330). Toxicity detection in Finnish using machine translation. Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), Reykjavik, Iceland.","key":"ref_16"},{"unstructured":"Johnson, A., Pollard, T., Mark, R., Berkowitz, S., and Horng, S. (2025, June 25). MIMIC-CXR Database (Version 2.0.0). PhysioNet, 2019. RRID:SCR_007345. Available online: https:\/\/doi.org\/10.13026\/C2JT1Q.","key":"ref_17"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"317","DOI":"10.1038\/s41597-019-0322-0","article-title":"MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports","volume":"6","author":"Johnson","year":"2019","journal-title":"Sci. Data"},{"unstructured":"Johnson, A., Lungren, M., Peng, Y., Lu, Z., Mark, R., Berkowitz, S., and Horng, S. (2025, June 25). MIMIC-CXR-JPG: Chest Radiographs with Structured Labels (Version 2.0.0). PhysioNet, 2019. RRID:SCR_007345. Available online: https:\/\/doi.org\/10.13026\/8360-t248.","key":"ref_19"},{"unstructured":"DeepL (2025, June 25). DeepL Translator. Cologne, Germany: DeepL SE, 2024. 
Available online: https:\/\/www.deepl.com\/translator.","key":"ref_20"},{"unstructured":"Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., and Artzi, Y. (2020, January 25\u201329). BERTScore: Evaluating Text Generation with BERT. Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia.","key":"ref_21"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"381","DOI":"10.3758\/BRM.42.2.381","article-title":"MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment","volume":"42","author":"McCarthy","year":"2010","journal-title":"Behav. Res. Methods"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1016\/j.jbi.2017.11.011","article-title":"Clinical information extraction applications: A literature review","volume":"77","author":"Wang","year":"2018","journal-title":"J. Biomed. Inform."},{"unstructured":"Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python, O\u2019Reilly Media.","key":"ref_24"},{"unstructured":"Porter, M. (2025, June 25). Snowball: A Language for Stemming Algorithms. Available online: http:\/\/snowball.tartarus.org\/texts\/introduction.html.","key":"ref_25"},{"unstructured":"Scikit-learn (2025, June 25). Text Feature Extraction, Available online: https:\/\/scikit-learn.org\/stable\/modules\/feature_extraction.html.","key":"ref_26"},{"doi-asserted-by":"crossref","unstructured":"Martynov, P., Mitropolskii, N., Kukkola, K., Gretsch, M., Koivisto, V.M., Lindgren, I., Saunavaara, J., Reponen, J., and M\u00e4kynen, A. (2017). Testing of the assisting software for radiologists analysing head CT images: Lessons learned. BMC Med. 
Imaging, 17.","key":"ref_27","DOI":"10.1186\/s12880-017-0229-1"},{"key":"ref_28","first-page":"e75532","article-title":"Evaluating the Role of GPT-4 and GPT-4o in the Detectability of Chest Radiography Reports Requiring Further Assessment","volume":"16","author":"Kanzawa","year":"2024","journal-title":"Cureus"}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/7\/104\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:02:01Z","timestamp":1760032921000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/7\/104"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,1]]},"references-count":28,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2025,7]]}},"alternative-id":["data10070104"],"URL":"https:\/\/doi.org\/10.3390\/data10070104","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2025,7,1]]}}}