{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,20]],"date-time":"2026-06-20T00:58:02Z","timestamp":1781917082770,"version":"3.54.5"},"reference-count":73,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,10,29]],"date-time":"2025-10-29T00:00:00Z","timestamp":1761696000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,10,29]],"date-time":"2025-10-29T00:00:00Z","timestamp":1761696000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000272","name":"National Institute for Health Research","doi-asserted-by":"crossref","award":["NIHR202637"],"award-info":[{"award-number":["NIHR202637"]}],"id":[{"id":"10.13039\/501100000272","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Med Inform Decis Mak"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Background<\/jats:title>\n                    <jats:p>In clinical research, there is a strong drive to leverage big data from population cohort studies and routine electronic healthcare records to design new interventions, improve health outcomes and increase the efficiency of healthcare delivery. However, realising these potential demands requires substantial efforts in harmonising source datasets and curating study data, which currently relies on costly, time-consuming and labour-intensive methods. 
We explore and assess the use of natural language processing (NLP) and unsupervised machine learning (ML) to address the challenges of big data semantic harmonisation and curation.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Methods<\/jats:title>\n                    <jats:p>Our aim is to establish an efficient and robust technological foundation for the development of automated tools supporting data curation of large clinical datasets. We propose two AI based pipelines for automated semantic harmonisation: a pipeline for semantics-aware search for domain relevant variables and a pipeline for clustering of semantically similar variables. We evaluate pipeline performance using 94,037 textual variable descriptions from the English Longitudinal Study of Ageing (ELSA) database.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We observe high accuracy of our Semantic Search pipeline, with an AUC of 0.899 (SD\u2009=\u20090.056). Our semantic clustering pipeline achieves a V-measure of 0.237 (SD\u2009=\u20090.157), which is on par with that of leading implementations in other relevant domains. Automation can significantly accelerate the process of dataset harmonisation. Manual labelling was performed at a speed of 2.1 descriptions per minute, with our automated labelling increasing speed to 245 descriptions per minute.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Conclusions<\/jats:title>\n                    <jats:p>Our study findings underscore the potential of AI technologies, such as NLP and unsupervised ML, in automating the harmonisation and curation of big data for clinical research. 
By establishing a robust technological foundation, we pave the way for the development of automated tools that streamline the process, enabling health data scientists to leverage big data more efficiently and effectively in their studies and accelerating insights from data for clinical benefit.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1186\/s12911-025-03055-y","type":"journal-article","created":{"date-parts":[[2025,10,29]],"date-time":"2025-10-29T12:34:43Z","timestamp":1761741283000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Pretrained language models for semantics-aware data harmonisation of observational clinical studies in the era of big data"],"prefix":"10.1186","volume":"25","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6263-7339","authenticated-orcid":false,"given":"Jakub J.","family":"Dylag","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6275-5626","authenticated-orcid":false,"given":"Zlatko","family":"Zlatev","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9281-6095","authenticated-orcid":false,"given":"Michael","family":"Boniface","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,10,29]]},"reference":[{"issue":"25","key":"3055_CR1","doi-asserted-by":"publisher","first-page":"1878","DOI":"10.1056\/NEJM200006223422506","volume":"342","author":"K Benson","year":"2000","unstructured":"Benson K, Hartz AJ. A comparison of observational studies and randomized, controlled trials. N Engl J Med. 2000;342(25):1878\u201386. 
https:\/\/doi.org\/10.1056\/NEJM200006223422506.","journal-title":"N Engl J Med"},{"key":"3055_CR2","doi-asserted-by":"publisher","DOI":"10.1001\/jama.2013.393","author":"TB Murdoch","year":"2013","unstructured":"Murdoch TB, Detsky AS. The Inevitable Application of Big Data to Health Care. JAMA. 2013;309(13). https:\/\/doi.org\/10.1001\/jama.2013.393."},{"issue":"3","key":"3055_CR3","doi-asserted-by":"publisher","first-page":"241","DOI":"10.1007\/s41060-018-0095-0","volume":"6","author":"JM Kraus","year":"2018","unstructured":"Kraus JM, Lausser L, Kuhn P, Jobst F, Bock M, Halanke C, Hummel M, Heuschmann P, Kestler HA. Big data and precision medicine: challenges and strategies with healthcare data. Int J Data Sci Anal. 2018;6(3):241\u20139. https:\/\/doi.org\/10.1007\/s41060-018-0095-0.","journal-title":"Int J Data Sci Anal"},{"issue":"6","key":"3055_CR4","doi-asserted-by":"publisher","first-page":"e34405","DOI":"10.2196\/34405","volume":"11","author":"H Dambha-Miller","year":"2022","unstructured":"Dambha-Miller H, Simpson G, Akyea RK, Hounkpatin H, Morrison L, Gibson J, Stokes J, Islam N, Chapman A, Stuart B, Zaccardi F, Zlatev Z, Jones K, Roderick P, Boniface M, Santer M, Farmer A. Development and validation of population clusters for integrating health and social care: protocol for a mixed methods study in multiple Long-Term conditions (Cluster-Artificial intelligence for multiple Long-Term conditions). JMIR Res Protoc. 2022;11(6):e34405. https:\/\/doi.org\/10.2196\/34405.","journal-title":"JMIR Res Protoc"},{"key":"3055_CR5","doi-asserted-by":"publisher","unstructured":"Simpson G, Stuart B, Hijryana M, Akyea RK, Stokes J, Gibson J, Jones K, Morrison L, Santer M, Boniface M, Zlatev Z, Farmer A, Dambha-Miller H. Eliciting and prioritising determinants of improved care in multiple long term health conditions (MLTC): A modified online Delphi study. 
2023; https:\/\/doi.org\/10.1101\/2023.03.19.23287406","DOI":"10.1101\/2023.03.19.23287406"},{"key":"3055_CR6","doi-asserted-by":"publisher","unstructured":"Khan N, Chalitsios CV, Nartey Y, Simpson G, Zaccardi F, Santer M, Roderick P, Stuart B, Farmer A, Dambha-Miller H. Clustering by multiple Long-Term conditions and social care needs: A cohort study amongst 10,025 older adults in England. 2023; https:\/\/doi.org\/10.1101\/2023.05.18.23290064","DOI":"10.1101\/2023.05.18.23290064"},{"issue":"2","key":"3055_CR7","doi-asserted-by":"publisher","first-page":"e0147795","DOI":"10.1371\/journal.pone.0147795","volume":"11","author":"K Winters","year":"2016","unstructured":"Winters K, Netscher S. Proposed standards for variable harmonization Documentation and referencing: A case study using quickcharmstats 1.1. Rosenbloom JL. Editor PLoS One. 2016;11(2):e0147795. https:\/\/doi.org\/10.1371\/journal.pone.0147795.","journal-title":"Editor PLoS One"},{"key":"3055_CR8","doi-asserted-by":"publisher","unstructured":"Bosch-Capblanch X. Harmonisation of variables names prior to conducting statistical analyses with multiple datasets: an automated approach. BMC Med Inf Decis Mak. 2011;11(1). https:\/\/doi.org\/10.1186\/1472-6947-11-33.","DOI":"10.1186\/1472-6947-11-33"},{"key":"3055_CR9","doi-asserted-by":"publisher","first-page":"83","DOI":"10.1109\/ojemb.2020.2981258","volume":"1","author":"VC Pezoulas","year":"2020","unstructured":"Pezoulas VC, Kourou KD, Kalatzis F, Exarchos TP, Zampeli E, Gandolfo S, Goules A, Baldini C, Skopouli F, De Vita S, Tzioufas AG, Fotiadis DI. Overcoming the barriers that obscure the interlinking and analysis of clinical data through harmonization and incremental learning. IEEE Open J Eng Med Biol. 2020;1:83\u201390. 
https:\/\/doi.org\/10.1109\/ojemb.2020.2981258.","journal-title":"IEEE Open J Eng Med Biol"},{"issue":"1","key":"3055_CR10","doi-asserted-by":"publisher","first-page":"65","DOI":"10.1136\/amiajnl-2013-002577","volume":"22","author":"C Pang","year":"2014","unstructured":"Pang C, Hendriksen D, Dijkstra M, van der Velde KJ, Kuiper J, Hillege HL, Swertz MA. BiobankConnect: software to rapidly connect data elements for pooled analysis across biobanks using ontological and lexical indexing. J Am Med Inform Assoc. 2014;22(1):65\u201375. https:\/\/doi.org\/10.1136\/amiajnl-2013-002577.","journal-title":"J Am Med Inform Assoc"},{"key":"3055_CR11","doi-asserted-by":"publisher","first-page":"bav089","DOI":"10.1093\/database\/bav089","volume":"2015","author":"C Pang","year":"2015","unstructured":"Pang C, Sollie A, Sijtsma A, Hendriksen D, Charbon B, de Haan M, de Boer T, Kelpin F, Jetten J, van der Velde JK, Smidt N, Sijmons R, Hillege H, Swertz MA. SORTA: a system for ontology-based re-coding and technical annotation of biomedical phenotype data. Database. 2015;2015:bav089. https:\/\/doi.org\/10.1093\/database\/bav089.","journal-title":"Database"},{"issue":"5","key":"3055_CR12","doi-asserted-by":"publisher","first-page":"1383","DOI":"10.1093\/ije\/dyq139","volume":"39","author":"I Fortier","year":"2010","unstructured":"Fortier I, Burton P, Robson PJ, Ferretti V, Little J, L\u2019Heureux F, Desch\u00eanes M, Knoppers BM, Doiron D, Keers JC, Linksted P, Harris JR, Lachance G, Boileau C, Pedersen NL, Hamilton CD, Hveem K, Borugian MJ, Gallagher RP, McLaughlin JR, Parker L, Potter JD, Gallacher J, Kaaks R, Liu B, Sprosen T, Vilain A, Atkinson SJ, Rengifo A, Morton RA, Metspalu A, Wichmann H-E, Tremblay MS, Chisholm RL, Garc\u00eda-Montero AC, Hillege HL, Litton J-E, Palmer LJ, Perola M, Peltonen L, Hudson TJ. Quality, quantity and harmony: the datashaper approach to integrating data across bioclinical studies. Int J Epidemiol Oxf Univ Press. 2010;39(5):1383\u201393. 
https:\/\/doi.org\/10.1093\/ije\/dyq139.","journal-title":"Int J Epidemiol Oxf Univ Press"},{"key":"3055_CR13","doi-asserted-by":"publisher","DOI":"10.5255\/UKDA-SN-5050-25","author":"J Banks","year":"2023","unstructured":"Banks J, Batty G, David BJ, Coughlin K, Crawford R, Marmot M, Nazroo J, Oldfield Z, Steel N, Steptoe A, Wood M, Zaninotto P. English longitudinal study of ageing: waves 0\u20139, 1998\u20132019. UK Data Service. 2023. https:\/\/doi.org\/10.5255\/UKDA-SN-5050-25.","journal-title":"UK Data Service"},{"issue":"Supplement1","key":"3055_CR14","doi-asserted-by":"publisher","first-page":"S5","DOI":"10.1093\/geronb\/gbab050","volume":"76","author":"J Lee","year":"2021","unstructured":"Lee J, Phillips D, Wilkens J. Gateway to global aging data: resources for Cross-National comparisons of family, social environment, and healthy aging. Journals Gerontology: Ser B. 2021;76(Supplement1):S5\u201316. https:\/\/doi.org\/10.1093\/geronb\/gbab050.","journal-title":"Journals Gerontology: Ser B"},{"key":"3055_CR15","doi-asserted-by":"publisher","first-page":"103323","DOI":"10.1016\/j.jbi.2019.103323","volume":"101","author":"KS Kalyan","year":"2020","unstructured":"Kalyan KS, Sangeetha S. SECNLP: A survey of embeddings in clinical natural Language processing. J Biomed Inf. 2020;101:103323. https:\/\/doi.org\/10.1016\/j.jbi.2019.103323.","journal-title":"J Biomed Inf"},{"key":"3055_CR16","unstructured":"Chen Q, Du J, Kim S, Wilbur W, Lu Z. Combining rich features and deep learning for finding similar sentences in electronic medical records. Proceedings of the BioCreative\/OHNLP Challenge. 2018."},{"key":"3055_CR17","unstructured":"Kiros R, Zhu Y, Salakhutdinov R, Zemel RS, Torralba A, Urtasun R, Fidler S. Skip-Thought Vectors. arXiv.org. 
2015; Available from: https:\/\/arxiv.org\/abs\/1506.06726"},{"key":"3055_CR18","doi-asserted-by":"crossref","unstructured":"Cer D, Yang Y, Kong S, Hua N, Limtiaco N, John RS, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, Sung Y-H, Strope B, Kurzweil R. Universal Sentence Encoder. arXiv:180311175 [cs]. 2018; Available from: https:\/\/arxiv.org\/abs\/1803.11175","DOI":"10.18653\/v1\/D18-2029"},{"key":"3055_CR19","unstructured":"Thakur N, Reimers N, R\u00fcckl\u00e9 A, Srivastava A, Gurevych I. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv:210408663 [cs] 2021; Available from: https:\/\/arxiv.org\/abs\/2104.08663v1"},{"key":"3055_CR20","doi-asserted-by":"publisher","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for Language Understanding. Proc 2019 Conf North. 2019;1. https:\/\/doi.org\/10.18653\/v1\/n19-1423.","DOI":"10.18653\/v1\/n19-1423"},{"key":"3055_CR21","unstructured":"Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser \u0141, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J. Google\u2019s neural machine translation system. Bridging the Gap between Human and Machine Translation; 2016."},{"key":"3055_CR22","doi-asserted-by":"crossref","unstructured":"May C, Wang A, Bordia S, Bowman S, Rudinger R. On Measuring Social Biases in Sentence Encoders. 2019;1\u201312. Available from: https:\/\/arxiv.org\/pdf\/1903.10561.pdf","DOI":"10.18653\/v1\/N19-1063"},{"key":"3055_CR23","doi-asserted-by":"publisher","unstructured":"Iyyer M, Manjunatha V, Boyd-Graber J, Daum\u00e9 IIIH. Deep Unordered Composition Rivals Syntactic Methods for Text Classification. 
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) Stroudsburg, PA, USA: Association for Computational Linguistics; 2015. pp. 1681\u20131691. https:\/\/doi.org\/10.3115\/v1\/P15-1162","DOI":"10.3115\/v1\/P15-1162"},{"key":"3055_CR24","doi-asserted-by":"crossref","unstructured":"Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv.org 2019; Available from: https:\/\/arxiv.org\/abs\/1908.10084","DOI":"10.18653\/v1\/D19-1410"},{"key":"3055_CR25","doi-asserted-by":"publisher","unstructured":"Wang W, Wei F, Dong L, Bao H, Yang N, Zhou M. MiniLM: deep Self-Attention distillation for Task-Agnostic compression of Pre-Trained Transformers. arXiv:200210957 [cs] 2020; https:\/\/doi.org\/10.48550\/arXiv.2002.10957","DOI":"10.48550\/arXiv.2002.10957"},{"key":"3055_CR26","unstructured":"Reimers N, Espejel O, Cuenca P. All-MiniLM-L6-v2. Hugging Face. 2021. Available from: https:\/\/huggingface.co\/sentence-transformers\/all-MiniLM-L6-v2 [accessed Jun 21, 2023]."},{"key":"3055_CR27","doi-asserted-by":"crossref","unstructured":"Henderson M, Budzianowski P, Casanueva I, Coope S, Gerz D, Kumar G, Mrk\u0161i\u0107 N, Spithourakis G, Su P-H, Vuli\u0107 I, Wen T-H. A Repository of Conversational Datasets. arXiv:190406472 [cs]. 2019; Available from: https:\/\/arxiv.org\/abs\/1904.06472","DOI":"10.18653\/v1\/W19-4101"},{"key":"3055_CR28","doi-asserted-by":"publisher","unstructured":"Lo K, Wang LL, Neumann M, Kinney R, Weld D. S2ORC: The Semantic Scholar Open Research Corpus. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020; https:\/\/doi.org\/10.18653\/v1\/2020.acl-main.447","DOI":"10.18653\/v1\/2020.acl-main.447"},{"key":"3055_CR29","doi-asserted-by":"publisher","unstructured":"Fader A, Zettlemoyer L, Etzioni O. 
Open question answering over curated and extracted knowledge bases. Proc 20th ACM SIGKDD Int Conf Knowl Discovery Data Min. 2014;20. https:\/\/doi.org\/10.1145\/2623330.2623677.","DOI":"10.1145\/2623330.2623677"},{"key":"3055_CR30","doi-asserted-by":"crossref","unstructured":"Lewis P, Wu Y, Liu L, Minervini P, K\u00fcttler H, Piktus A, Stenetorp P, Riedel S. PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them. arXiv:210207033 [cs] 2021; Available from: https:\/\/arxiv.org\/abs\/2102.07033","DOI":"10.1162\/tacl_a_00415"},{"key":"3055_CR31","unstructured":"Song K, Tan X, Qin T, Lu J, Liu T-Y. MPNet: Masked and Permuted Pre-training for Language Understanding. arXiv:200409297 [cs] 2020; Available from: https:\/\/arxiv.org\/abs\/2004.09297"},{"key":"3055_CR32","unstructured":"Reimers N, Espejel O, Cuenca P. All-mpnet-base-v1. Hugging Face. 2021. Available from: https:\/\/huggingface.co\/sentence-transformers\/all-mpnet-base-v1 [accessed Jun 21, 2023]."},{"key":"3055_CR33","unstructured":"Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu P. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. 2020;21:1\u201367. Available from: https:\/\/jmlr.org\/papers\/volume21\/20-074\/20-074.pdf"},{"key":"3055_CR34","unstructured":"Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: Generalized Autoregressive Pretraining for Language Understanding. Adv Neural Inf Process Syst 2019;32. Available from: https:\/\/proceedings.neurips.cc\/paper\/2019\/hash\/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html"},{"key":"3055_CR35","doi-asserted-by":"crossref","unstructured":"Ni J, \u00c1brego GH, Constant N, Ma J, Hall KB, Cer D, Yang Y. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. 
arXiv:210808877 [cs] 2021; Available from: https:\/\/arxiv.org\/abs\/2108.08877","DOI":"10.18653\/v1\/2022.findings-acl.146"},{"key":"3055_CR36","doi-asserted-by":"publisher","unstructured":"Bowman S, Angeli G, Potts C, Manning CD. A large annotated corpus for learning natural Language inference. Aclanthology Org. 2015;632\u201342. https:\/\/doi.org\/10.18653\/v1\/D15-1075.","DOI":"10.18653\/v1\/D15-1075"},{"key":"3055_CR37","doi-asserted-by":"crossref","unstructured":"Yasunaga M, Leskovec J, Liang P, LinkBERT. Pretraining Language Models with Document Links. arXiv:220315827 [cs] 2022; Available from: https:\/\/arxiv.org\/abs\/2203.15827","DOI":"10.18653\/v1\/2022.acl-long.551"},{"key":"3055_CR38","doi-asserted-by":"publisher","unstructured":"Tsatsaronis G, Balikas G, Malakasiotis P, Partalas I, Zschunke M, Alvers MR, Weissenborn D, Krithara A, Petridis S, Polychronopoulos D, Almirantis Y, Pavlopoulos J, Baskiotis N, Gallinari P, Arti\u00e9res T, Ngomo A-CN, Heino N, Gaussier E, Barrio-Alvers L, Schroeder M, Androutsopoulos I, Paliouras G. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics. 2015;16(1). https:\/\/doi.org\/10.1186\/s12859-015-0564-6.","DOI":"10.1186\/s12859-015-0564-6"},{"key":"3055_CR39","doi-asserted-by":"crossref","unstructured":"Jin D, Pan E, Oufattole N, Weng W-H, Fang H, Szolovits P. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. arXiv:200913081 [cs]. 2020; Available from: https:\/\/arxiv.org\/abs\/2009.13081","DOI":"10.20944\/preprints202105.0498.v1"},{"key":"3055_CR40","doi-asserted-by":"publisher","DOI":"10.1145\/775152.775250","author":"R Guha","year":"2003","unstructured":"Guha R, McCool R, Miller E. Semantic search. Proc Twelfth Int Conf World Wide Web - WWW \u201903. 2003. 
https:\/\/doi.org\/10.1145\/775152.775250.","journal-title":"Proc Twelfth Int Conf World Wide Web - WWW \u201903"},{"issue":"3","key":"3055_CR41","doi-asserted-by":"publisher","first-page":"733","DOI":"10.1016\/j.ipm.2018.10.015","volume":"56","author":"F Lashkari","year":"2019","unstructured":"Lashkari F, Bagheri E, Ghorbani AA. Neural embedding-based indices for semantic search. Inf Process Manag. 2019;56(3):733\u201355. https:\/\/doi.org\/10.1016\/j.ipm.2018.10.015.","journal-title":"Inf Process Manag"},{"key":"3055_CR42","unstructured":"Bex F, Villata S. Legal Knowledge and Information Systems: JURIX 2016: The Twenty-Ninth Annual Conference. Google Books. IOS Press; 2016. Available from: https:\/\/books.google.com\/books?hl=en%26lr=%26id=-MnzDQAAQBAJ%26oi=fnd%26pg=PA73%26dq=word+embedding+phrase+search%26ots=e1yyrnLXgB%26sig=V_s4_yyZdpyO5yAyn-TUQGuVr20"},{"key":"3055_CR43","doi-asserted-by":"publisher","DOI":"10.1093\/jamia\/ocu041","author":"A Nikfarjam","year":"2015","unstructured":"Nikfarjam A, Sarker A, O\u2019Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc. 2015. https:\/\/doi.org\/10.1093\/jamia\/ocu041.","journal-title":"J Am Med Inform Assoc"},{"key":"3055_CR44","unstructured":"Mikolov T, Chen K, Corrado G, Dean J. Efficient Estimation of Word Representations in Vector Space. 2013."},{"key":"3055_CR45","doi-asserted-by":"publisher","first-page":"22","DOI":"10.1016\/j.neunet.2016.12.008","volume":"88","author":"J Xu","year":"2017","unstructured":"Xu J, Xu B, Wang P, Zheng S, Tian G, Zhao J, Xu B. Self-Taught convolutional neural networks for short text clustering. Neural Netw. 2017;88:22\u201331. 
https:\/\/doi.org\/10.1016\/j.neunet.2016.12.008.","journal-title":"Neural Netw"},{"issue":"9","key":"3055_CR46","doi-asserted-by":"publisher","first-page":"144","DOI":"10.3390\/fi12090144","volume":"12","author":"SS Bodrunova","year":"2020","unstructured":"Bodrunova SS, Orekhov AV, Blekanov IS, Lyudkevich NS, Tarasov NA. Topic detection based on sentence embeddings and agglomerative clustering with Markov moment. Future Internet. 2020;12(9):144. https:\/\/doi.org\/10.3390\/fi12090144.","journal-title":"Future Internet"},{"key":"3055_CR47","doi-asserted-by":"publisher","unstructured":"An Y, Kalinowski A, Greenberg J, Clustering, and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences. 2021 Second International Conference on Intelligent Data Science TechnologiesApplications (IDSTA) IEEE; 2021. pp. 138\u2013145. https:\/\/doi.org\/10.1109\/IDSTA53674.2021.9660801","DOI":"10.1109\/IDSTA53674.2021.9660801"},{"key":"3055_CR48","unstructured":"Gupta V, Shi H, Gimpel K, Sachan M. Deep Clustering of Text Representations for Supervision-free Probing of Syntax. arXiv:201012784 [cs] 2021; Available from: https:\/\/arxiv.org\/abs\/2010.12784"},{"key":"3055_CR49","doi-asserted-by":"publisher","unstructured":"Usino W, Satria A, Hamed K, Bramantoro A, Amaldi AH. Document similarity detection using K-Means and cosine distance. Int J Adv Comput Sci Appl. 2019;10(2). https:\/\/doi.org\/10.14569\/IJACSA.2019.0100222.","DOI":"10.14569\/IJACSA.2019.0100222"},{"issue":"11","key":"3055_CR50","doi-asserted-by":"publisher","first-page":"559","DOI":"10.1080\/14786440109462720","volume":"2","author":"K Pearson","year":"1901","unstructured":"Pearson K. On lines and Planes of closest fit to systems of points in space. The london, edinburgh, and Dublin philosophical magazine and. J Sci. 1901;2(11):559\u201372. 
https:\/\/doi.org\/10.1080\/14786440109462720.","journal-title":"J Sci"},{"key":"3055_CR51","doi-asserted-by":"crossref","unstructured":"McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. 2018.","DOI":"10.21105\/joss.00861"},{"key":"3055_CR52","unstructured":"van der Maaten L, Hinton G. Visualizing Data using t-SNE. Journal of Machine Learning Research. 2008;9:2579\u20132605. Available from: https:\/\/www.jmlr.org\/papers\/volume9\/vandermaaten08a\/vandermaaten08a.pdf"},{"key":"3055_CR53","unstructured":"MacQueen J. Some methods for classification and analysis of multivariate observations. 1967. Available from: https:\/\/digitalassets.lib.berkeley.edu\/math\/ucb\/text\/math_s5_v1_article-17.pdf"},{"issue":"2","key":"3055_CR54","doi-asserted-by":"publisher","first-page":"129","DOI":"10.1109\/tit.1982.1056489","volume":"28","author":"S Lloyd","year":"1982","unstructured":"Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28(2):129\u201337. https:\/\/doi.org\/10.1109\/tit.1982.1056489.","journal-title":"IEEE Trans Inf Theory"},{"issue":"1","key":"3055_CR55","doi-asserted-by":"publisher","first-page":"142","DOI":"10.1109\/tvcg.2017.2745085","volume":"24","author":"BC Kwon","year":"2018","unstructured":"Kwon BC, Eysenbach B, Verma J, Ng K, De Filippi C, Stewart WF, Perer A. Clustervision: visual supervision of unsupervised clustering. IEEE Trans Vis Comput Graph. 2018;24(1):142\u201351. https:\/\/doi.org\/10.1109\/tvcg.2017.2745085.","journal-title":"IEEE Trans Vis Comput Graph"},{"key":"3055_CR56","unstructured":"M\u00fcllner D. Modern hierarchical, agglomerative clustering algorithms. arXiv:11092378 [cs, stat]. 2011; Available from: https:\/\/arxiv.org\/abs\/1109.2378"},{"key":"3055_CR57","doi-asserted-by":"publisher","unstructured":"Campello RJGB, Moulavi D, Sander J. Density-Based clustering based on hierarchical density estimates. Adv Knowl Discovery Data Min. 
2013;160\u201372. https:\/\/doi.org\/10.1007\/978-3-642-37456-2_14.","DOI":"10.1007\/978-3-642-37456-2_14"},{"key":"3055_CR58","doi-asserted-by":"publisher","unstructured":"Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large Spatial databases with noise. AAAI Press. https:\/\/doi.org\/10.5555\/3001460.3001507","DOI":"10.5555\/3001460.3001507"},{"issue":"2","key":"3055_CR59","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1145\/304181.304187","volume":"28","author":"M Ankerst","year":"1999","unstructured":"Ankerst M, Breunig MM, Kriegel H-P, Sander J. OPTICS: ordering points to identify the clustering structure. ACM SIGMOD Record. 1999;28(2):49\u201360. https:\/\/doi.org\/10.1145\/304181.304187.","journal-title":"ACM SIGMOD Record"},{"key":"3055_CR60","unstructured":"Jain AK, Dubes RC. Algorithms for clustering data. Prentice-Hall, Inc., Division of Simon and Schuster, One Lake Street, Upper Saddle; 1988. Available from: https:\/\/homepages.inf.ed.ac.uk\/rbf\/BOOKS\/JAIN\/Clustering_Jain_Dubes.pdf ISBN:978-0-13-022278-7."},{"issue":"12","key":"3055_CR61","doi-asserted-by":"publisher","first-page":"1650","DOI":"10.1109\/tpami.2002.1114856","volume":"24","author":"U Maulik","year":"2002","unstructured":"Maulik U, Bandyopadhyay S. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell. 2002;24(12):1650\u20134. https:\/\/doi.org\/10.1109\/tpami.2002.1114856.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"3055_CR62","doi-asserted-by":"publisher","unstructured":"Liu Y, Li Z, Xiong H, Gao X, Wu J. Understanding of Internal Clustering Validation Measures. 
2010 IEEE International Conference on Data Mining 2010; https:\/\/doi.org\/10.1109\/icdm.2010.35","DOI":"10.1109\/icdm.2010.35"},{"issue":"4","key":"3055_CR63","doi-asserted-by":"publisher","first-page":"731","DOI":"10.1111\/1467-9868.00095","volume":"59","author":"PJ Richardson Sylvia, Green","year":"1997","unstructured":"Richardson Sylvia, Green PJ. On bayesian analysis of mixtures with an unknown number of components (with discussion). J R Stat Soc Ser B Stat Methodol. 1997;59(4):731\u201392. https:\/\/doi.org\/10.1111\/1467-9868.00095.","journal-title":"J R Stat Soc Ser B Stat Methodol"},{"key":"3055_CR64","doi-asserted-by":"publisher","unstructured":"Halkidi M, Vazirgiannis M. Clustering validity assessment: finding the optimal partitioning of a data set. IEEE Xplore. 2001;187\u201394. https:\/\/doi.org\/10.1109\/ICDM.2001.989517.","DOI":"10.1109\/ICDM.2001.989517"},{"key":"3055_CR65","doi-asserted-by":"publisher","unstructured":"Nisha, Kaur PJ. Cluster quality based performance evaluation of hierarchical clustering method. IEEE Xplore. 2015;649\u201353. https:\/\/doi.org\/10.1109\/NGCT.2015.7375201.","DOI":"10.1109\/NGCT.2015.7375201"},{"issue":"7","key":"3055_CR66","doi-asserted-by":"publisher","first-page":"1145","DOI":"10.1016\/s0031-3203(96)00142-2","volume":"30","author":"AP Bradley","year":"1997","unstructured":"Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997;30(7):1145\u201359. https:\/\/doi.org\/10.1016\/s0031-3203(96)00142-2.","journal-title":"Pattern Recognit"},{"key":"3055_CR67","unstructured":"Rosenberg A, Hirschberg J. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. Association for Computational Linguistics. 2007;410\u2013420. Available from: https:\/\/aclanthology.org\/D07-1043 [accessed May 31, 2023]."},{"key":"3055_CR68","doi-asserted-by":"crossref","unstructured":"Boltu\u017ei\u0107 F, \u0160najder J. 
Identifying Prominent Arguments in Online Debates Using Semantic Textual Similarity. Proceedings of the 2nd Workshop on Argumentation Mining Association for Computational Linguistics; 2015. pp. 110\u2013115. Available from: https:\/\/aclanthology.org\/W15-0514.pdf","DOI":"10.3115\/v1\/W15-0514"},{"key":"3055_CR69","unstructured":"Dom BE. An Information-Theoretic External Cluster-Validity Measure. arXiv:13010565 [cs, stat]. 2012; Available from: https:\/\/arxiv.org\/abs\/1301.0565"},{"key":"3055_CR70","doi-asserted-by":"publisher","unstructured":"Meil\u0103 M, Heckerman D. An experimental comparison of Model-Based clustering methods. 2001;9\u201329. https:\/\/doi.org\/10.1023\/a:1007648401407","DOI":"10.1023\/a:1007648401407"},{"issue":"12","key":"3055_CR71","doi-asserted-by":"publisher","first-page":"3799","DOI":"10.3390\/cancers12123799","volume":"12","author":"F Valle","year":"2020","unstructured":"Valle F, Osella M, Caselle M. A topic modeling analysis of TCGA breast and lung Cancer transcriptomic data. Cancers (Basel). 2020;12(12):3799. https:\/\/doi.org\/10.3390\/cancers12123799.","journal-title":"Cancers (Basel)"},{"key":"3055_CR72","doi-asserted-by":"publisher","unstructured":"Sui X, Wang W, Zhang J. Text Mining Drug-Protein Interactions using an Ensemble of BERT, Sentence BERT and T5 models. 2021; https:\/\/doi.org\/10.1101\/2021.10.26.465944","DOI":"10.1101\/2021.10.26.465944"},{"key":"3055_CR73","doi-asserted-by":"publisher","unstructured":"Landthaler J, Waltl B, Holl P, Matthes F. Extending full text search for legal document collections using word embeddings. Legal Knowl Inform Syst. 2016;2673\u201382. 
https:\/\/doi.org\/10.3233\/978-1-61499-726-9-73.","DOI":"10.3233\/978-1-61499-726-9-73"}],"container-title":["BMC Medical Informatics and Decision Making"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12911-025-03055-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12911-025-03055-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12911-025-03055-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,29]],"date-time":"2025-10-29T12:34:47Z","timestamp":1761741287000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcmedinformdecismak.biomedcentral.com\/articles\/10.1186\/s12911-025-03055-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,29]]},"references-count":73,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["3055"],"URL":"https:\/\/doi.org\/10.1186\/s12911-025-03055-y","relation":{"references":[{"id-type":"doi","id":"10.1001\/jama.2013.393","asserted-by":"subject"}],"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-4829846\/v1","asserted-by":"object"}]},"ISSN":["1472-6947"],"issn-type":[{"value":"1472-6947","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,29]]},"assertion":[{"value":"30 July 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 June 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 October 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Ethical approval was granted by the University of Southampton Faculty of Medicine Research Committee (67953). Informed consent to participate was obtained from all the participants in the study. Procedures complied with the Helsinki Declaration of 1975 as revised in 2000.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"400"}}