{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,30]],"date-time":"2026-05-30T03:57:49Z","timestamp":1780113469536,"version":"3.54.0"},"reference-count":31,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,3,9]],"date-time":"2021-03-09T00:00:00Z","timestamp":1615248000000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"},{"start":{"date-parts":[[2021,3,9]],"date-time":"2021-03-09T00:00:00Z","timestamp":1615248000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000015","name":"U.S. Department of Energy","doi-asserted-by":"publisher","award":["E-AC05-00OR22725"],"award-info":[{"award-number":["E-AC05-00OR22725"]}],"id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Background<\/jats:title>\n                    <jats:p>Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network as the text classification model.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity-sampling strategies outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, adding 15K additional labelled samples each iteration of active learning), there was no clear winner between the different active learning strategies. On our small dataset (1K initial labelled samples, adding 1K additional labelled samples each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that compared to random sampling, active learning strongly helps performance on rare classes by focusing on underrepresented classes.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Conclusions<\/jats:title>\n                    <jats:p>Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1186\/s12859-021-04047-1","type":"journal-article","created":{"date-parts":[[2021,3,9]],"date-time":"2021-03-09T04:03:37Z","timestamp":1615262617000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":33,"title":["Deep active learning for classifying cancer pathology reports"],"prefix":"10.1186","volume":"22","author":[{"given":"Kevin","family":"De Angeli","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1803-1457","authenticated-orcid":false,"given":"Shang","family":"Gao","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Mohammed","family":"Alawad","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Hong-Jun","family":"Yoon","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Noah","family":"Schaefferkoetter","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiao-Cheng","family":"Wu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Eric B.","family":"Durbin","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jennifer","family":"Doherty","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Antoinette","family":"Stroup","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Linda","family":"Coyle","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lynne","family":"Penberthy","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Georgia","family":"Tourassi","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2021,3,9]]},"reference":[{"issue":"1","key":"4047_CR1","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1093\/jamia\/ocz153","volume":"27","author":"M Alawad","year":"2019","unstructured":"Alawad M, Gao S, Qiu JX, Yoon HJ, Blair Christian J, Penberthy L, Mumphrey B, Wu X-C, Coyle L, Tourassi G. Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks. J Am Med Inform Assoc. 2019;27(1):89\u201398. https:\/\/doi.org\/10.1093\/jamia\/ocz153.","journal-title":"J Am Med Inform Assoc"},{"key":"4047_CR2","doi-asserted-by":"publisher","first-page":"101726","DOI":"10.1016\/j.artmed.2019.101726","volume":"101","author":"S Gao","year":"2019","unstructured":"Gao S, Qiu JX, Alawad M, Hinkle JD, Schaefferkoetter N, Yoon H-J, Christian B, Fearn PA, Penberthy L, Wu X-C, Coyle L, Tourassi G, Ramanathan A. Classifying cancer pathology reports with hierarchical self-attention networks. Artif Intell Med. 2019;101:101726. https:\/\/doi.org\/10.1016\/j.artmed.2019.101726.","journal-title":"Artif Intell Med"},{"key":"4047_CR3","doi-asserted-by":"crossref","unstructured":"Young T, Hazarika D, Poria S, Cambria E. Recent trends in deep learning based natural language processing. 2017. arXiv:1708.02709","DOI":"10.1109\/MCI.2018.2840738"},{"key":"4047_CR4","unstructured":"Settles B. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin\u2013Madison. 2009."},{"key":"4047_CR5","unstructured":"Howlader N, Noone AM, Krapcho M, Miller D, Brest A, Yu M, Ruhl J, Tatalovich Z, Mariotto A, Lewis DR. Seer cancer statistics review, 1975\u20132017. National Cancer Institute. 2020."},{"issue":"5","key":"4047_CR6","first-page":"1","volume":"15","author":"S Gao","year":"2020","unstructured":"Gao S, Alawad M, Schaefferkoetter N, Penberthy L, Wu X-C, Durbin EB, Coyle LM, Ramanathan A, Tourassi GD. Using case-level context to classify cancer pathology reports. PLoS ONE. 2020;15(5):1\u201321.","journal-title":"PLoS ONE"},{"key":"4047_CR7","doi-asserted-by":"publisher","unstructured":"Hoi SCH, Jin R, Zhu J, Lyu MR. Batch mode active learning and its application to medical image classification. In: Proceedings of the 23rd international conference on machine learning. ICML \u201906. Association for Computing Machinery, New York, NY, USA; 2006, p. 417\u201324. https:\/\/doi.org\/10.1145\/1143844.1143897.","DOI":"10.1145\/1143844.1143897"},{"key":"4047_CR8","unstructured":"Gal Y, Islam R, Ghahramani Z. Deep bayesian active learning with image data. 2017. CoRR arXiv:1703.02910."},{"issue":"4","key":"4047_CR9","doi-asserted-by":"publisher","first-page":"504","DOI":"10.1109\/TSA.2005.848882","volume":"13","author":"G Riccardi","year":"2005","unstructured":"Riccardi G, Hakkani-Tur D. Active learning: theory and applications to automatic speech recognition. IEEE Trans Speech Audio Process. 2005;13(4):504\u201311.","journal-title":"IEEE Trans Speech Audio Process"},{"key":"4047_CR10","unstructured":"Thompson CA, Califf ME, Mooney RJ. Active learning for natural language parsing and information extraction. In: Proceedings of the 16th international conference on machine learning (ICML-99), Bled, Slovenia, 1999; p. 406\u201314. http:\/\/www.cs.utexas.edu\/users\/ai-lab?thompson:ml99.\u00a0Accessed 16 Aug 2020."},{"key":"4047_CR11","unstructured":"Olsson F. A literature survey of active machine learning in the context of natural language processing. 2009."},{"key":"4047_CR12","unstructured":"Settles B. From theories to queries: active learning in practice. In: Guyon I, Cawley G, Dror G, Lemaire V, Statnikov A, editors, Active learning and experimental design workshop in conjunction with AISTATS 2010. Proceedings of machine learning research, vol. 16. PMLR, Sardinia, Italy. 2011, p. 1\u201318. http:\/\/proceedings.mlr.press\/v16\/settles11a.html.\u00a0Accessed 17 Aug 2020."},{"issue":"2","key":"4047_CR13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/1899412.1899414","volume":"2","author":"M Wang","year":"2011","unstructured":"Wang M, Hua X-S. Active learning in multimedia annotation and retrieval: a survey. ACM Trans Intell Syst Technol. 2011;2(2):1\u201321. https:\/\/doi.org\/10.1145\/1899412.1899414.","journal-title":"ACM Trans Intell Syst Technol"},{"issue":"3","key":"4047_CR14","doi-asserted-by":"publisher","first-page":"606","DOI":"10.1109\/JSTSP.2011.2139193","volume":"5","author":"D Tuia","year":"2011","unstructured":"Tuia D, Volpi M, Copa L, Kanevski M, Munoz-Mari J. A survey of active learning algorithms for supervised remote sensing image classification. IEEE J Sel Top Signal Process. 2011;5(3):606\u201317.","journal-title":"IEEE J Sel Top Signal Process"},{"key":"4047_CR15","doi-asserted-by":"crossref","unstructured":"Settles B, Craven M. An analysis of active learning strategies for sequence labeling tasks. In: Proceedings of the conference on empirical methods in natural language processing. EMNLP \u201908. Association for Computational Linguistics, USA. 2008, p. 1070\u20131079.","DOI":"10.3115\/1613715.1613855"},{"key":"4047_CR16","doi-asserted-by":"crossref","unstructured":"Shen Y, Yun H, Lipton ZC, Kronrod Y, Anandkumar A. Deep active learning for named entity recognition. 2017. CoRR arXiv:1707.05928.","DOI":"10.18653\/v1\/W17-2630"},{"key":"4047_CR17","doi-asserted-by":"crossref","unstructured":"Wang K, Zhang D, Li Y, Zhang R, Lin L. Cost-effective active learning for deep image classification. 2017. CoRR arXiv:1701.03551.","DOI":"10.1109\/TCSVT.2016.2589879"},{"key":"4047_CR18","doi-asserted-by":"crossref","unstructured":"Zhang Y, Wallace BC. Active discriminative word embedding learning. 2016. CoRR arXiv:1606.04212.","DOI":"10.1609\/aaai.v31i1.10962"},{"key":"4047_CR19","doi-asserted-by":"publisher","first-page":"265","DOI":"10.1016\/j.jbi.2011.11.003","volume":"45","author":"Y Chen","year":"2011","unstructured":"Chen Y, Mani S, Xu H. Applying active learning to assertion classification of concepts in clinical text. J Biomed Inform. 2011;45:265\u201372. https:\/\/doi.org\/10.1016\/j.jbi.2011.11.003.","journal-title":"J Biomed Inform"},{"key":"4047_CR20","doi-asserted-by":"publisher","DOI":"10.1093\/jamia\/ocv069","author":"M Kholghi","year":"2015","unstructured":"Kholghi M, Sitbon L, Zuccon G, Nguyen A. Active learning: a step towards automating medical concept extraction. J Am Med Inform Assoc. 2015;. https:\/\/doi.org\/10.1093\/jamia\/ocv069.","journal-title":"J Am Med Inform Assoc"},{"key":"4047_CR21","doi-asserted-by":"publisher","first-page":"809","DOI":"10.1136\/amiajnl-2011-000648","volume":"19","author":"R Figueroa","year":"2012","unstructured":"Figueroa R, Zeng-Treitler Q, Ngo L, Goryachev S, Wiechmann E. Active learning for clinical text classification: is it better than random sampling? J Am Med Inform Assoc. 2012;19:809\u201316. https:\/\/doi.org\/10.1136\/amiajnl-2011-000648.","journal-title":"J Am Med Inform Assoc"},{"key":"4047_CR22","doi-asserted-by":"crossref","unstructured":"Lewis DD, Gale WA. A Sequential Algorithm for Training Text Classifiers. 1994. arXiv:cmp-lg\/9407020.","DOI":"10.1007\/978-1-4471-2099-5_1"},{"key":"4047_CR23","doi-asserted-by":"publisher","first-page":"235","DOI":"10.1007\/s10994-007-5019-5","volume":"68","author":"AI Schein","year":"2007","unstructured":"Schein AI, Ungar LH. Active learning for logistic regression: an evaluation. Mach Learn. 2007;68:235\u201365. https:\/\/doi.org\/10.1007\/s10994-007-5019-5.","journal-title":"Mach Learn"},{"key":"4047_CR24","doi-asserted-by":"publisher","first-page":"379","DOI":"10.1002\/j.1538-7305.1948.tb01338.x","volume":"27","author":"CE Shannon","year":"1948","unstructured":"Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27:379\u2013423.","journal-title":"Bell Syst Tech J"},{"key":"4047_CR25","doi-asserted-by":"publisher","unstructured":"Seung HS, Opper M, Sompolinsky H. Query by committee. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. COLT \u201992. Association for Computing Machinery, New York, NY, USA. 1992, p. 287\u201394. https:\/\/doi.org\/10.1145\/130385.130417.","DOI":"10.1145\/130385.130417"},{"key":"4047_CR26","unstructured":"Argamon-Engelson S, Dagan I. Committee-based sample selection for probabilistic classifiers. 2011. CoRR arXiv:1106.0220."},{"key":"4047_CR27","doi-asserted-by":"publisher","unstructured":"Pereira F, Tishby N, Lee L. Distributional clustering of English words. In: 31st Annual meeting of the association for computational linguistics. Association for Computational Linguistics, Columbus, Ohio, USA. 1993, p. 183\u2013190. https:\/\/doi.org\/10.3115\/981574.981598. https:\/\/www.aclweb.org\/anthology\/P93-1024","DOI":"10.3115\/981574.981598"},{"key":"4047_CR28","unstructured":"Roy N, McCallum A. Toward optimal active learning through sampling estimation of error reduction. In: Proceedings of the Eighteenth International Conference on Machine Learning. ICML\u201901. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. 2001, p. 441\u20138."},{"key":"4047_CR29","doi-asserted-by":"publisher","first-page":"1419","DOI":"10.1093\/jamia\/ocy068","volume":"25","author":"C Xiao","year":"2018","unstructured":"Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc. 2018;25:1419\u201328. https:\/\/doi.org\/10.1093\/jamia\/ocy068.","journal-title":"J Am Med Inform Assoc"},{"issue":"1","key":"4047_CR30","doi-asserted-by":"publisher","first-page":"244","DOI":"10.1109\/JBHI.2017.2700722","volume":"22","author":"J Qiu","year":"2017","unstructured":"Qiu J, Yoon H-J, Fearn PA, Tourassi GD. Deep learning for automated extraction of primary sites from cancer pathology reports. IEEE J Biomed Health Inform. 2017;22(1):244\u201351. https:\/\/doi.org\/10.1109\/JBHI.2017.2700722.","journal-title":"IEEE J Biomed Health Inform"},{"issue":"3","key":"4047_CR31","doi-asserted-by":"publisher","first-page":"189","DOI":"10.1214\/ss\/1032280214","volume":"11","author":"TJ DiCiccio","year":"1996","unstructured":"DiCiccio TJ, Efron B. Bootstrap confidence intervals. Stat Sci. 1996;11(3):189\u2013212.","journal-title":"Stat Sci"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-021-04047-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s12859-021-04047-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-021-04047-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,20]],"date-time":"2022-12-20T14:45:20Z","timestamp":1671547520000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-021-04047-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,3,9]]},"references-count":31,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["4047"],"URL":"https:\/\/doi.org\/10.1186\/s12859-021-04047-1","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-70788\/v1","asserted-by":"object"}]},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,3,9]]},"assertion":[{"value":"3 September 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 February 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 March 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"No ethics approval was required for the study.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"113"}}