{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T02:20:19Z","timestamp":1760149219428,"version":"build-2065373602"},"reference-count":52,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2023,7,7]],"date-time":"2023-07-07T00:00:00Z","timestamp":1688688000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Clinical text often includes numbers of various types and formats. However, most current text classification approaches do not take advantage of these numbers. This study aims to demonstrate that using numbers as features can significantly improve the performance of text classification models. This study also demonstrates the feasibility of extracting such features from clinical text. Unsupervised learning was used to identify patterns of number usage in clinical text. These patterns were analyzed manually and converted into pattern-matching rules. Information extraction was used to incorporate numbers as features into a document representation model. We evaluated text classification models trained on such representation. Our experiments were performed with two document representation models (vector space model and word embedding model) and two classification models (support vector machines and neural networks). The results showed that even a handful of numerical features can significantly improve text classification performance. We conclude that commonly used document representations do not represent numbers in a way that machine learning algorithms can effectively utilize them as features. 
Although we demonstrated that traditional information extraction can be effective in converting numbers into features, further community-wide research is required to systematically incorporate number representation into the word embedding process.<\/jats:p>","DOI":"10.3390\/make5030040","type":"journal-article","created":{"date-parts":[[2023,7,11]],"date-time":"2023-07-11T02:05:21Z","timestamp":1689041121000},"page":"746-762","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["The Value of Numbers in Clinical Text Classification"],"prefix":"10.3390","volume":"5","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1748-8546","authenticated-orcid":false,"given":"Kristian","family":"Miok","sequence":"first","affiliation":[{"name":"Advanced Environmental Research Institute, West University of Timisoara, 300223 Timisoara, Romania"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9731-3385","authenticated-orcid":false,"given":"Padraig","family":"Corcoran","sequence":"additional","affiliation":[{"name":"School of Computer Science & Informatics, Cardiff University, Cardiff CF10 4AG, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8132-3885","authenticated-orcid":false,"given":"Irena","family":"Spasi\u0107","sequence":"additional","affiliation":[{"name":"School of Computer Science & Informatics, Cardiff University, Cardiff CF10 4AG, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,7,7]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Manning, C.D., Raghavan, P., and Sch\u00fctze, H. (2008). 
Introduction to Information Retrieval, Cambridge University Press.","DOI":"10.1017\/CBO9780511809071"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Roy, R., K\u00f6ppen, M., Ovaska, S., Furuhashi, T., and Hoffmann, F. (2002). Soft Computing and Industry, Springer.","DOI":"10.1007\/978-1-4471-0123-9"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"9979","DOI":"10.1007\/s11229-021-03233-1","article-title":"The no-free-lunch theorems of supervised learning","volume":"199","author":"Sterkenburg","year":"2021","journal-title":"Synthese"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Jackson, P., and Moulinier, I. (2002). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization, John Benjamins Publishing Company.","DOI":"10.1075\/nlp.5(1st)"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"613","DOI":"10.1145\/361219.361220","article-title":"A vector space model for automatic indexing","volume":"18","author":"Salton","year":"1975","journal-title":"Commun. ACM"},{"key":"ref_6","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013, January 5\u201310). Distributed representations of words and phrases and their compositionality. Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_7","unstructured":"Naik, A., Ravichander, A., Rose, C., and Hovy, E. (August, January 28). Exploring numeracy in word embeddings. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Sundararaman, D., Si, S., Subramanian, V., Wang, G., Hazarika, D., and Carin, L. (2020, January 16\u201320). Methods for numeracy-preserving word embeddings. 
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online.","DOI":"10.18653\/v1\/2020.emnlp-main.384"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"418","DOI":"10.1016\/j.inffus.2022.08.024","article-title":"Beyond word embeddings: A survey","volume":"89","author":"Incitti","year":"2023","journal-title":"Inf. Fusion"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"677","DOI":"10.1017\/S1351324919000512","article-title":"Twenty-five years of information extraction","volume":"25","author":"Grishman","year":"2019","journal-title":"Nat. Lang. Eng."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Thawani, A., Pujara, J., Szekely, P.A., and Ilievski, F. (2021). Representing numbers in NLP: A survey and a vision. arXiv.","DOI":"10.18653\/v1\/2021.naacl-main.53"},{"key":"ref_12","unstructured":"Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019, January 2\u20137). DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA."},{"key":"ref_13","unstructured":"Zhang, X., Ramachandran, D., Tenney, I., Elazar, Y., and Roth, D. (2020). Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Berg-Kirkpatrick, T., and Spokoyny, D. (2020, January 16\u201320). An empirical investigation of contextualized number prediction. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online.","DOI":"10.18653\/v1\/2020.emnlp-main.385"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Wallace, E., Wang, Y., Li, S., Singh, S., and Gardner, M. (2019, January 3\u20137). Do NLP models know numbers? 
Probing numeracy in embeddings. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1534"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Geva, M., Gupta, A., and Berant, J. (2020, January 5\u201310). Injecting numerical reasoning skills into language models. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online.","DOI":"10.18653\/v1\/2020.acl-main.89"},{"key":"ref_17","unstructured":"Nogueira, R., Jiang, Z., and Lin, J. (2021). Investigating the limitations of transformers with simple arithmetic tasks. arXiv."},{"key":"ref_18","unstructured":"Chen, C.-C., Huang, H.-H., Takamura, H., and Chen, H.-H. (August, January 28). Numeracy-600K: Learning numeracy for detecting exaggerated information in market comments. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Jiang, C., Nian, Z., Guo, K., Chu, S., Zhao, Y., Shen, L., and Tu, K. (2020, January 16\u201320). Learning numeral embeddings. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online.","DOI":"10.18653\/v1\/2020.findings-emnlp.235"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Spithourakis, G., and Riedel, S. (2018, January 15\u201320). Numeracy for language models: Evaluating and improving their ability to predict numbers. 
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.","DOI":"10.18653\/v1\/P18-1196"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"519","DOI":"10.1136\/jamia.2010.004200","article-title":"Community annotation experiment for ground truth generation for the i2b2 medication challenge","volume":"17","author":"Uzuner","year":"2010","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"514","DOI":"10.1136\/jamia.2010.003947","article-title":"Extracting medication information from clinical text","volume":"17","author":"Uzuner","year":"2010","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Button, K., Spasi\u0107, I., Playle, R., Owen, D., Lau, M., Hannaway, L., and Jones, S. (2020). Using routine referral data for patients with knee and hip pain to improve access to specialist care. BMC Musculoskelet. Disord., 21.","DOI":"10.1186\/s12891-020-3087-x"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"561","DOI":"10.1197\/jamia.M3115","article-title":"Recognizing obesity and comorbidities in sparse data","volume":"16","author":"Uzuner","year":"2009","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"104","DOI":"10.1016\/j.ipm.2013.08.006","article-title":"The impact of preprocessing on text classification","volume":"50","author":"Uysal","year":"2014","journal-title":"Inf. Process. Manag."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"e15980","DOI":"10.2196\/15980","article-title":"Cohort selection from longitudinal patient records: Text mining approach","volume":"7","author":"Corcoran","year":"2019","journal-title":"JMIR Med. 
Inform."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"e21252","DOI":"10.2196\/21252","article-title":"Patient triage by topic modeling of referral letters: Feasibility study","volume":"8","author":"Button","year":"2020","journal-title":"JMIR Med. Inform."},{"key":"ref_28","unstructured":"O\u2019Keeffe, A., and McCarthy, M.J. (2010). The Routledge Handbook of Corpus Linguistics, Routledge. [2nd ed.]."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Reimers, N., and Gurevych, I. (2019, January 3\u20137). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1410"},{"key":"ref_30","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"1713","DOI":"10.1016\/S0277-9536(01)00339-2","article-title":"Placing gender at the centre of health programming: Challenges and limitations","volume":"54","author":"Vlassoff","year":"2002","journal-title":"Soc. Sci. Med."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"64","DOI":"10.1162\/tacl_a_00300","article-title":"SpanBERT: Improving pre-training by representing and predicting spans","volume":"8","author":"Joshi","year":"2020","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_33","unstructured":"Yatskar, M. (2019, January 2\u20137). A qualitative comparison of CoQA, SQuAD 2.0 and QuAC. 
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"e17984","DOI":"10.2196\/17984","article-title":"Clinical text data in machine learning: Systematic review","volume":"8","year":"2020","journal-title":"JMIR Med. Inform."},{"key":"ref_35","first-page":"35","article-title":"Biomedical question answering: A survey of approaches and challenges","volume":"55","author":"Jin","year":"2022","journal-title":"ACM Comput. Surv."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1007\/s40708-016-0036-4","article-title":"An adaptive annotation approach for biomedical entity and relation recognition","volume":"3","author":"Yimam","year":"2016","journal-title":"Brain Inform."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"100729","DOI":"10.1016\/j.patter.2023.100729","article-title":"Fine-tuning large neural language models for biomedical natural language processing","volume":"4","author":"Tinn","year":"2023","journal-title":"Patterns"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"146","DOI":"10.1080\/00437956.1954.11659520","article-title":"Distributional structure","volume":"10","author":"Harris","year":"1954","journal-title":"WORD"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1108\/eb026526","article-title":"A statistical interpretation of term specificity and its application in retrieval","volume":"28","year":"1972","journal-title":"J. Doc."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1613\/jair.2934","article-title":"From frequency to meaning: Vector space models of semantics","volume":"37","author":"Turney","year":"2010","journal-title":"J. Artif. Intell. 
Res."},{"key":"ref_41","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_42","unstructured":"Beltagy, I., Peters, M.E., and Cohan, A. (2020). Longformer: The long-document transformer. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Sannigrahi, S., van Genabith, J., and Espa\u00f1a-Bonet, C. (2023, January 2\u20136). Are the best multilingual document embeddings simply based on sentence embeddings? Proceedings of the Findings of the Association for Computational Linguistics: EACL 2023, Dubrovnik, Croatia.","DOI":"10.18653\/v1\/2023.findings-eacl.174"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"494","DOI":"10.1016\/j.eswa.2018.09.034","article-title":"Clinical text classification research trends: Systematic literature review and open issues","volume":"116","author":"Mujtaba","year":"2019","journal-title":"Expert Syst. Appl."},{"key":"ref_45","unstructured":"Sprent, P., and Smeeton, N.C. (2007). Applied Nonparametric Statistical Methods, Chapman and Hall\/CRC. [4th ed.]."},{"key":"ref_46","unstructured":"de Marneffe, M.-C., Manning, C.D., and Potts, C. (2010, January 11\u201316). \u201cWas it good? It was provocative.\u201d Learning the meaning of scalar adjectives. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden."},{"key":"ref_47","unstructured":"Sharp, R., Paul, M., Nagesh, A., Bell, D., and Surdeanu, M. (2018, January 7\u201312). Grounding gradable adjectives through crowdsourcing. 
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"D267","DOI":"10.1093\/nar\/gkh061","article-title":"The Unified Medical Language System (UMLS): Integrating biomedical terminology","volume":"32","author":"Bodenreider","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"1251","DOI":"10.1038\/nbt1346","article-title":"The OBO Foundry: Coordinated evolution of ontologies to support biomedical data integration","volume":"25","author":"Smith","year":"2007","journal-title":"Nat. Biotechnol."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"W170","DOI":"10.1093\/nar\/gkp440","article-title":"BioPortal: Ontologies and integrated data resources at the click of a mouse","volume":"37","author":"Noy","year":"2009","journal-title":"Nucleic Acids Res."},{"key":"ref_51","first-page":"279","article-title":"SNOMED-CT: The advanced terminology and coding system for eHealth","volume":"121","author":"Donnelly","year":"2006","journal-title":"Stud. Health Technol. Inform."},{"key":"ref_52","first-page":"273","article-title":"LOINC: A universal catalogue of individual clinical observations and uniform representation of enumerated collections","volume":"3","author":"Vreeman","year":"2011","journal-title":"Int. J. Funct. Inform. Pers. 
Med."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/5\/3\/40\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:08:09Z","timestamp":1760126889000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/5\/3\/40"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,7]]},"references-count":52,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2023,9]]}},"alternative-id":["make5030040"],"URL":"https:\/\/doi.org\/10.3390\/make5030040","relation":{},"ISSN":["2504-4990"],"issn-type":[{"type":"electronic","value":"2504-4990"}],"subject":[],"published":{"date-parts":[[2023,7,7]]}}}