{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,15]],"date-time":"2026-04-15T10:08:43Z","timestamp":1776247723924,"version":"3.50.1"},"reference-count":52,"publisher":"Cambridge University Press (CUP)","issue":"3","license":[{"start":{"date-parts":[[2023,6,6]],"date-time":"2023-06-06T00:00:00Z","timestamp":1686009600000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["cambridge.org"],"crossmark-restriction":true},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2024,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Named entity recognition (NER) aims to identify mentions of named entities in an unstructured text and classify them into predefined named entity classes. While deep learning-based pre-trained language models help to achieve good predictive performances in NER, many domain-specific NER applications still call for a substantial amount of labeled data. Active learning (AL), a general framework for the label acquisition problem, has been used for NER tasks to minimize the annotation cost without sacrificing model performance. However, the heavily imbalanced class distribution of tokens introduces challenges in designing effective AL querying methods for NER. We propose several AL sentence query evaluation functions that pay more attention to potential positive tokens and evaluate these proposed functions with both sentence-based and token-based cost evaluation strategies. We also propose a better data-driven normalization approach to penalize sentences that are too long or too short. Our experiments on three datasets from different domains reveal that the proposed approach reduces the number of annotated tokens while achieving better or comparable prediction performance with conventional methods.<\/jats:p>","DOI":"10.1017\/s1351324923000165","type":"journal-article","created":{"date-parts":[[2023,6,6]],"date-time":"2023-06-06T19:27:15Z","timestamp":1686079635000},"page":"602-624","update-policy":"https:\/\/doi.org\/10.1017\/policypage","source":"Crossref","is-referenced-by-count":5,"title":["Focusing on potential named entities during active label acquisition"],"prefix":"10.1017","volume":"30","author":[{"given":"Ali Osman Berk","family":"\u015eapc\u0131","sequence":"first","affiliation":[]},{"given":"Hasan","family":"Kemik","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8501-6209","authenticated-orcid":false,"given":"Reyyan","family":"Yeniterzi","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7058-5372","authenticated-orcid":false,"given":"Oznur","family":"Tastan","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2023,6,6]]},"reference":[{"key":"S1351324923000165_ref12","unstructured":"Guyon, I. , Cawley, G.C. , Dror, G. and Lemaire, V. (2011). Results of the active learning challenge. In Active Learning and Experimental Design Workshop in Conjunction with AISTATS 2010, JMLR Workshop and Conference Proceedings, pp. 19\u201345. ISSN: 1938-7228."},{"key":"S1351324923000165_ref27","unstructured":"Okazaki, N. (2007). CRFsuite: a fast implementation of Conditional Random Fields (CRFs)."},{"key":"S1351324923000165_ref2","doi-asserted-by":"publisher","DOI":"10.3115\/1564131.1564136"},{"key":"S1351324923000165_ref31","unstructured":"Reichart, R. , Tomanek, K. , Hahn, U. and Rappoport, A. (2008). Multi-task active learning for linguistic annotations. In Proceedings of ACL-08: HLT, Columbus, OH. Association for Computational Linguistics, pp. 861\u2013869."},{"key":"S1351324923000165_ref33","first-page":"2881","article-title":"Parametric UMAP embeddings for representation and semisupervised learning","volume":"33","author":"Sainburg","year":"2021","journal-title":"Neural Computation"},{"key":"S1351324923000165_ref39","doi-asserted-by":"publisher","DOI":"10.1002\/j.1538-7305.1948.tb01338.x"},{"key":"S1351324923000165_ref1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-1909"},{"key":"S1351324923000165_ref15","doi-asserted-by":"publisher","DOI":"10.3115\/1614049.1614067"},{"key":"S1351324923000165_ref24","unstructured":"Liu, Y. , Ott, M. , Goyal, N. , Du, J. , Joshi, M. , Chen, D. , Levy, O. , Lewis, M. , Zettlemoyer, L. and Stoyanov, V. (2019). RoBERTa: A robustly Optimized BERT Pretraining Approach. Number: arXiv: 1907.11692 [cs]."},{"key":"S1351324923000165_ref16","unstructured":"Lafferty, J.D. , McCallum, A. and Pereira, F.C.N. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML\u201901, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc, pp. 282\u2013289."},{"key":"S1351324923000165_ref44","unstructured":"Thompson, C.A. , Califf, M.E. and Mooney, R.J. (1999). Active learning for natural language parsing and information extraction. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML\u201999, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc, pp. 406\u2013414."},{"key":"S1351324923000165_ref52","doi-asserted-by":"publisher","DOI":"10.1109\/ICSMC.2009.5346315"},{"key":"S1351324923000165_ref22","doi-asserted-by":"publisher","DOI":"10.1007\/s11063-021-10737-x"},{"key":"S1351324923000165_ref7","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2015.09.010"},{"key":"S1351324923000165_ref21","doi-asserted-by":"publisher","DOI":"10.1007\/BF01589116"},{"key":"S1351324923000165_ref17","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btz682"},{"key":"S1351324923000165_ref4","doi-asserted-by":"publisher","DOI":"10.1145\/335191.335388"},{"key":"S1351324923000165_ref9","unstructured":"Devlin, J. , Chang, M.-W. , Lee, K. and Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA. Association for Computational Linguistics, pp. 4171\u20134186."},{"key":"S1351324923000165_ref32","doi-asserted-by":"publisher","DOI":"10.3115\/1642059.1642075"},{"key":"S1351324923000165_ref35","unstructured":"Settles, B. (2009). Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin\u2013Madison."},{"key":"S1351324923000165_ref41","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-2630"},{"key":"S1351324923000165_ref8","doi-asserted-by":"publisher","DOI":"10.21236\/ADA440382"},{"key":"S1351324923000165_ref6","doi-asserted-by":"publisher","DOI":"10.1145\/2733381"},{"key":"S1351324923000165_ref25","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1097"},{"key":"S1351324923000165_ref48","unstructured":"Tomanek, K. and Hahn, U. (2010). Annotation time stamps \u2013 temporal metadata from the linguistic annotation process. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC\u201910), Valletta, Malta. European Language Resources Association (ELRA)."},{"key":"S1351324923000165_ref42","first-page":"93","article-title":"Kernel density estimation using the fast fourier transform","volume":"31","author":"Silverman","year":"1982","journal-title":"Journal of the Royal Statistical Society. Series C (Applied Statistics)"},{"key":"S1351324923000165_ref14","doi-asserted-by":"publisher","DOI":"10.1162\/0891201041850894"},{"key":"S1351324923000165_ref3","unstructured":"Baldridge, J. and Osborne, M. (2004). Active learning and the total cost of annotation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain. Association for Computational Linguistics, pp. 9\u201316."},{"key":"S1351324923000165_ref13","doi-asserted-by":"publisher","DOI":"10.3115\/1557690.1557708"},{"key":"S1351324923000165_ref36","unstructured":"Settles, B. (2011). From theories to queries: active learning in practice. In Active Learning and Experimental Design Workshop in Conjunction with AISTATS 2010, JMLR Workshop and Conference Proceedings, pp. 1\u201318. ISSN: 1938-7228."},{"key":"S1351324923000165_ref37","doi-asserted-by":"publisher","DOI":"10.3115\/1613715.1613855"},{"key":"S1351324923000165_ref11","doi-asserted-by":"publisher","DOI":"10.3115\/981863.981905"},{"key":"S1351324923000165_ref47","unstructured":"Tomanek, K. (2010). Resource-Aware Annotation Through Active Learning. PhD Thesis, Dortmund University of Technology."},{"key":"S1351324923000165_ref46","doi-asserted-by":"publisher","DOI":"10.3115\/1119176.1119195"},{"key":"S1351324923000165_ref18","doi-asserted-by":"crossref","unstructured":"Lewis, D.D. and Catlett, J. (1994). Heterogenous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on International Conference on Machine Learning, ICML\u201994, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc, pp. 148\u2013156.","DOI":"10.1016\/B978-1-55860-335-6.50026-X"},{"key":"S1351324923000165_ref29","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4302"},{"key":"S1351324923000165_ref5","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-37456-2_14"},{"key":"S1351324923000165_ref30","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324923000165_ref23","unstructured":"Liu, Q. , Kusner, M.J. and Blunsom, P. (2020). A Survey on Contextual Embeddings. Number: arXiv: 2003.07278 [cs]."},{"key":"S1351324923000165_ref38","unstructured":"Settles, B. , Craven, M. and Friedland, L.A. (2008). Active learning with real annotation costs. In Proceedings of the NIPS. 2008 Workshop on Cost-Sensitive Learning."},{"key":"S1351324923000165_ref49","unstructured":"Torfi, A. , Shirvani, R.A. , Keneshloo, Y. , Tavaf, N. and Fox, E.A. (2021). Natural Language Processing Advancements By Deep Learning: A Survey. Number: arXiv: 2003.01200 [cs]."},{"key":"S1351324923000165_ref50","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.240"},{"key":"S1351324923000165_ref51","volume-title":"Advances in Neural Information Processing Systems","volume":"32","author":"Yang","year":"2019"},{"key":"S1351324923000165_ref10","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2013.12.006"},{"key":"S1351324923000165_ref40","doi-asserted-by":"publisher","DOI":"10.3115\/1218955.1219030"},{"key":"S1351324923000165_ref28","doi-asserted-by":"publisher","DOI":"10.1080\/14786440109462720"},{"key":"S1351324923000165_ref45","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-51310-8_2"},{"key":"S1351324923000165_ref19","doi-asserted-by":"publisher","DOI":"10.1093\/database\/baw068"},{"key":"S1351324923000165_ref20","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1033"},{"key":"S1351324923000165_ref26","unstructured":"McInnes, L. , Healy, J. and Melville, J. (2020). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Number: arXiv: 1802.03426 [cs, stat]."},{"key":"S1351324923000165_ref43","unstructured":"Souza, F. , Nogueira, R. and Lotufo, R. (2020). Portuguese Named Entity Recognition Using BERT-CRF. Number: arXiv: 1909.10649 [cs]."},{"key":"S1351324923000165_ref34","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-44816-0_31"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324923000165","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,14]],"date-time":"2024-05-14T09:03:04Z","timestamp":1715677384000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324923000165\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,6]]},"references-count":52,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,5]]}},"alternative-id":["S1351324923000165"],"URL":"https:\/\/doi.org\/10.1017\/s1351324923000165","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,6]]},"assertion":[{"value":"\u00a9 The Author(s), 2023. Published by Cambridge University Press","name":"copyright","label":"Copyright","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https:\/\/creativecommons.org\/licenses\/by\/4.0\/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.","name":"license","label":"License","group":{"name":"copyright_and_licensing","label":"Copyright and Licensing"}},{"value":"This content has been made available to all.","name":"free","label":"Free to read"}]}}