{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,3,26]],"date-time":"2025-03-26T23:36:52Z","timestamp":1743032212365,"version":"3.40.3"},"publisher-location":"Cham","reference-count":25,"publisher":"Springer Nature Switzerland","isbn-type":[{"type":"print","value":"9783031434266"},{"type":"electronic","value":"9783031434273"}],"license":[{"start":{"date-parts":[[2023,1,1]],"date-time":"2023-01-01T00:00:00Z","timestamp":1672531200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,9,17]],"date-time":"2023-09-17T00:00:00Z","timestamp":1694908800000},"content-version":"vor","delay-in-days":259,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>One of the central tasks of medical text analysis is to extract and structure meaningful information from plain-text clinical documents. Named Entity Recognition (NER) is a sub-task of information extraction that involves identifying predefined entities from unstructured free text. Notably, NER models require large amounts of human-labeled data to train, but human annotation is costly and laborious and often requires medical training. Here, we aim to overcome the shortage of manually annotated data by introducing a training scheme for NER models that uses an existing medical ontology to assign weak labels to entities and provides enhanced domain-specific model adaptation with in-domain continual pretraining. Due to limited human annotation resources, we develop a specific module to collect a more representative test dataset from the data lake than a random selection. To validate our framework, we invite clinicians to annotate the test set. 
In this way, we construct two Finnish medical NER datasets based on clinical records retrieved from a hospital\u2019s data lake and evaluate the effectiveness of the proposed methods. The code is available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/VRCMF\/HAM-net.git\">https:\/\/github.com\/VRCMF\/HAM-net.git<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/978-3-031-43427-3_27","type":"book-chapter","created":{"date-parts":[[2023,9,16]],"date-time":"2023-09-16T21:01:41Z","timestamp":1694898101000},"page":"444-459","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Weak Supervision and\u00a0Clustering-Based Sample Selection for\u00a0Clinical Named Entity Recognition"],"prefix":"10.1007","author":[{"given":"Wei","family":"Sun","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3281-8002","authenticated-orcid":false,"given":"Shaoxiong","family":"Ji","sequence":"additional","affiliation":[]},{"given":"Tuulia","family":"Denti","sequence":"additional","affiliation":[]},{"given":"Hans","family":"Moen","sequence":"additional","affiliation":[]},{"given":"Oleg","family":"Kerro","sequence":"additional","affiliation":[]},{"given":"Antti","family":"Rannikko","sequence":"additional","affiliation":[]},{"given":"Pekka","family":"Marttinen","sequence":"additional","affiliation":[]},{"given":"Miika","family":"Koskinen","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,9,17]]},"reference":[{"issue":"4","key":"27_CR1","doi-asserted-by":"publisher","first-page":"433","DOI":"10.1002\/wics.101","volume":"2","author":"H Abdi","year":"2010","unstructured":"Abdi, H., Williams, L.J.: Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2(4), 433\u2013459 (2010)","journal-title":"Wiley Interdiscip. Rev. Comput. 
Stat."},{"unstructured":"Arora, S., Ge, R., Neyshabur, B., Zhang, Y.: Stronger generalization bounds for deep nets via a compression approach. In: International Conference on Machine Learning, pp. 254\u2013263. PMLR (2018)","key":"27_CR2"},{"unstructured":"Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150 (2020)","key":"27_CR3"},{"key":"27_CR4","doi-asserted-by":"publisher","first-page":"224","DOI":"10.1109\/TPAMI.1979.4766909","volume":"2","author":"DL Davies","year":"1979","unstructured":"Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 2, 224\u2013227 (1979)","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"unstructured":"Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)","key":"27_CR5"},{"unstructured":"Ferreira, M.D., Malyska, M., Sahar, N., Miotto, R., Paulovich, F., Milios, E.: Active learning for medical code assignment. arXiv preprint arXiv:2104.05741 (2021)","key":"27_CR6"},{"doi-asserted-by":"crossref","unstructured":"Gururangan, S., et al.: Don\u2019t stop pretraining: adapt language models to domains and tasks. In: ACL (2020)","key":"27_CR7","DOI":"10.18653\/v1\/2020.acl-main.740"},{"unstructured":"Jain, S., et al.: Radgraph: extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463 (2021)","key":"27_CR8"},{"doi-asserted-by":"crossref","unstructured":"Jiang, H., Zhang, D., Cao, T., Yin, B., Zhao, T.: Named entity recognition with small strongly labeled and large weakly labeled data. 
arXiv preprint arXiv:2106.08977 (2021)","key":"27_CR9","DOI":"10.18653\/v1\/2021.acl-long.140"},{"issue":"2","key":"27_CR10","doi-asserted-by":"publisher","first-page":"145","DOI":"10.1016\/j.artmed.2015.05.007","volume":"65","author":"I Korkontzelos","year":"2015","unstructured":"Korkontzelos, I., Piliouras, D., Dowsey, A.W., Ananiadou, S.: Boosting drug named entity recognition using an aggregate classifier. Artif. Intell. Med. 65(2), 145\u2013153 (2015)","journal-title":"Artif. Intell. Med."},{"unstructured":"Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML 2001, pp. 282\u2013289. Morgan Kaufmann Publishers Inc., San Francisco (2001)","key":"27_CR11"},{"issue":"1","key":"27_CR12","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1109\/TKDE.2020.2981314","volume":"34","author":"J Li","year":"2020","unstructured":"Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 34(1), 50\u201370 (2020)","journal-title":"IEEE Trans. Knowl. Data Eng."},{"unstructured":"Li, Y., Wehbe, R.M., Ahmad, F.S., Wang, H., Luo, Y.: Clinical-longformer and clinical-bigbird: transformers for long clinical sequences. arXiv preprint arXiv:2201.11838 (2022)","key":"27_CR13"},{"doi-asserted-by":"crossref","unstructured":"Lim, S.K., Muis, A.O., Lu, W., Ong, C.H.: Malwaretextdb: a database for annotated malware articles. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1557\u20131567 (2017)","key":"27_CR14","DOI":"10.18653\/v1\/P17-1143"},{"key":"27_CR15","volume-title":"Foundations of Statistical Natural Language Processing","author":"C Manning","year":"1999","unstructured":"Manning, C., Schutze, H.: Foundations of Statistical Natural Language Processing. 
MIT Press, Cambridge (1999)"},{"unstructured":"Nesterov, A., Umerenkov, D.: Distantly supervised end-to-end medical entity extraction from electronic health records with human-level quality. arXiv preprint arXiv:2201.10463 (2022)","key":"27_CR16"},{"unstructured":"Rindflesch, T.C., Tanabe, L., JN, W., et al.: Extraction of drugs, genes and relations from the biomedical literature. In: Pacific Symposium on Biocomputing, vol. 5, pp. 517\u2013528 (2000)","key":"27_CR17"},{"unstructured":"Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv preprint CS\/0306050 (2003)","key":"27_CR18"},{"doi-asserted-by":"crossref","unstructured":"Settles, B.: Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA\/BioNLP), pp. 107\u2013110 (2004)","key":"27_CR19","DOI":"10.3115\/1567594.1567618"},{"doi-asserted-by":"crossref","unstructured":"Shang, J., Liu, L., Ren, X., Gu, X., Ren, T., Han, J.: Learning named entity tagger using domain-specific dictionary. arXiv preprint arXiv:1809.03599 (2018)","key":"27_CR20","DOI":"10.18653\/v1\/D18-1230"},{"doi-asserted-by":"crossref","unstructured":"Tsuruoka, Y., Tsujii, J.: Boosting precision and recall of dictionary-based protein name recognition. In: Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, vol. 13, pp. 41\u201348. Citeseer (2003)","key":"27_CR21","DOI":"10.3115\/1118958.1118964"},{"key":"27_CR22","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1162\/tacl_a_00165","volume":"2","author":"M Wang","year":"2014","unstructured":"Wang, M., Manning, C.D.: Cross-lingual projected expectation regularization for weakly supervised learning. Trans. Assoc. Comput. Linguist. 2, 55\u201366 (2014)","journal-title":"Trans. Assoc. Comput. 
Linguist."},{"unstructured":"Wu, Y., Jiang, M., Xu, J., Zhi, D., Xu, H.: Clinical named entity recognition using deep learning models. In: AMIA Annual Symposium Proceedings, vol. 2017, p. 1812. American Medical Informatics Association (2017)","key":"27_CR23"},{"issue":"1","key":"27_CR24","doi-asserted-by":"publisher","first-page":"44","DOI":"10.1093\/nsr\/nwx106","volume":"5","author":"ZH Zhou","year":"2018","unstructured":"Zhou, Z.H.: A brief introduction to weakly supervised learning. Natl. Sci. Rev. 5(1), 44\u201353 (2018)","journal-title":"Natl. Sci. Rev."},{"doi-asserted-by":"crossref","unstructured":"Zirikly, A., Hagiwara, M.: Cross-lingual transfer of named entity recognizers without parallel corpora. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 390\u2013396 (2015)","key":"27_CR25","DOI":"10.3115\/v1\/P15-2064"}],"container-title":["Lecture Notes in Computer Science","Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-43427-3_27","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,16]],"date-time":"2023-09-16T21:07:09Z","timestamp":1694898429000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-43427-3_27"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023]]},"ISBN":["9783031434266","9783031434273"],"references-count":25,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-43427-3_27","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"type":"print","value":"0302-9743"},{"type":"electronic","value":"1611-3349"}],"subject":[],"published":{"date-parts":[[2023]]},"assertion":[{"value":"17 September 
2023","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"The study was based on approval of HUS Helsinki University Hospital (HUS\/12199\/2022). Data was analyzed on HUS Acamedic that is a certified data analytics platform and meets the requirements (General Data Protection Regulation, Finlex 552\/2019) for processing sensitive healthcare data.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical Statement"}},{"value":"ECML PKDD","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Joint European Conference on Machine Learning and Knowledge Discovery in Databases","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Turin","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Italy","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2023","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"18 September 2023","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"22 September 2023","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"23","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"ecml2023","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference 
Information"}},{"value":"https:\/\/2023.ecmlpkdd.org\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Double-blind","order":1,"name":"type","label":"Type","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"CMT","order":2,"name":"conference_management_system","label":"Conference Management System","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"829","order":3,"name":"number_of_submissions_sent_for_review","label":"Number of Submissions Sent for Review","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"196","order":4,"name":"number_of_full_papers_accepted","label":"Number of Full Papers Accepted","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"0","order":5,"name":"number_of_short_papers_accepted","label":"Number of Short Papers Accepted","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"24% - The value is computed by the equation \"Number of Full Papers Accepted \/ Number of Submissions Sent for Review * 100\" and then rounded to a whole number.","order":6,"name":"acceptance_rate_of_full_papers","label":"Acceptance Rate of Full Papers","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"3.63","order":7,"name":"average_number_of_reviews_per_paper","label":"Average Number of Reviews per Paper","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference 
organizers)"}},{"value":"4.5","order":8,"name":"average_number_of_papers_per_reviewer","label":"Average Number of Papers per Reviewer","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"Yes","order":9,"name":"external_reviewers_involved","label":"External Reviewers Involved","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"Applied Data Science Track: 239 submissions, 58 accepted papers; Demo Track: 31 submissions, 16 accepted papers.","order":10,"name":"additional_info_on_review_process","label":"Additional Info on Review Process","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}}]}}