{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T07:20:40Z","timestamp":1775028040123,"version":"3.50.1"},"reference-count":32,"publisher":"Oxford University Press (OUP)","issue":"13","license":[{"start":{"date-parts":[[2021,1,20]],"date-time":"2021-01-20T00:00:00Z","timestamp":1611100800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000092","name":"National Library of Medicine","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,7,27]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts, which can recognize more unseen concept synonyms by automatic feature learning. 
However, most methods require large corpora of manually annotated data for model training, which is difficult to obtain due to the high cost of human annotation.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>In this article, we propose PhenoTagger, a hybrid method that combines both dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary, which is then used to automatically build a distantly supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase (n-gram from input sentence) into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. 
Experimental results on the NCBI disease corpus show that PhenoTagger, without requiring manually annotated training data, achieves competitive performance compared with state-of-the-art supervised methods.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The source code, API information and data for PhenoTagger are freely available at https:\/\/github.com\/ncbi-nlp\/PhenoTagger.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab019","type":"journal-article","created":{"date-parts":[[2021,1,13]],"date-time":"2021-01-13T05:22:32Z","timestamp":1610515352000},"page":"1884-1890","source":"Crossref","is-referenced-by-count":59,"title":["PhenoTagger: a hybrid method for phenotype concept recognition using human phenotype ontology"],"prefix":"10.1093","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5141-0259","authenticated-orcid":false,"given":"Ling","family":"Luo","sequence":"first","affiliation":[{"name":"National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , Bethesda, MD 20894, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0369-4979","authenticated-orcid":false,"given":"Shankai","family":"Yan","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , Bethesda, MD 20894, USA"}]},{"given":"Po-Ting","family":"Lai","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , Bethesda, MD 
20894, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6101-6693","authenticated-orcid":false,"given":"Daniel","family":"Veltri","sequence":"additional","affiliation":[{"name":"Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health , Bethesda, MD 209892, USA"}]},{"given":"Andrew","family":"Oler","sequence":"additional","affiliation":[{"name":"Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health , Bethesda, MD 209892, USA"}]},{"given":"Sandhya","family":"Xirasagar","sequence":"additional","affiliation":[{"name":"Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health , Bethesda, MD 209892, USA"}]},{"given":"Rajarshi","family":"Ghosh","sequence":"additional","affiliation":[{"name":"Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health , Bethesda, MD 209892, USA"}]},{"given":"Morgan","family":"Similuk","sequence":"additional","affiliation":[{"name":"Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health , Bethesda, MD 209892, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0736-9199","authenticated-orcid":false,"given":"Peter N","family":"Robinson","sequence":"additional","affiliation":[{"name":"The Jackson Laboratory for Genomic Medicine , Farmington, CT 06032, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9998-916X","authenticated-orcid":false,"given":"Zhiyong","family":"Lu","sequence":"additional","affiliation":[{"name":"National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , Bethesda, MD 20894, USA"}]}],"member":"286","published-online":{"date-parts":[[2021,1,20]]},"reference":[{"key":"2024041009354587900_btab019-B1","doi-asserted-by":"crossref","first-page":"e12596","DOI":"10.2196\/12596","article-title":"Identifying clinical terms in medical text using Ontology-Guided machine learning","volume":"7","author":"Arbabi","year":"2019","journal-title":"JMIR Med. Inf"},{"key":"2024041009354587900_btab019-B2","first-page":"17","author":"Aronson","year":"2001"},{"key":"2024041009354587900_btab019-B3","doi-asserted-by":"crossref","DOI":"10.1186\/gb-2008-9-s2-s9","article-title":"Concept recognition for extracting protein interaction relations from biomedical text","volume":"9","author":"Baumgartner","year":"2008","journal-title":"Genome Biol"},{"key":"2024041009354587900_btab019-B4","first-page":"281","article-title":"Random search for hyper-parameter optimization","volume":"13","author":"Bergstra","year":"2012","journal-title":"J. Mach. Learn. 
Res"},{"key":"2024041009354587900_btab019-B5","volume-title":"Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit","author":"Bird","year":"2009"},{"key":"2024041009354587900_btab019-B6","doi-asserted-by":"crossref","first-page":"3533","DOI":"10.1093\/bioinformatics\/btz070","article-title":"PMC text mining subset in BioC: about three million full-text articles and growing","volume":"35","author":"Comeau","year":"2019","journal-title":"Bioinformatics"},{"key":"2024041009354587900_btab019-B7","doi-asserted-by":"crossref","first-page":"bar065","DOI":"10.1093\/database\/bar065","article-title":"MEDIC: a practical disease vocabulary used at the comparative toxicogenomics database","volume":"2012","author":"Davis","year":"2012","journal-title":"Database"},{"key":"2024041009354587900_btab019-B8","first-page":"4171","author":"Devlin","year":"2019"},{"key":"2024041009354587900_btab019-B9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.jbi.2013.12.006","article-title":"NCBI disease corpus: a resource for disease name recognition and concept normalization","volume":"47","author":"Do\u011fan","year":"2014","journal-title":"J. Biomed. Inf"},{"key":"2024041009354587900_btab019-B10","doi-asserted-by":"crossref","first-page":"490","DOI":"10.1145\/367390.367400","article-title":"TRIE memory","volume":"3","author":"Fredkin","year":"1960","journal-title":"Commun. 
ACM"},{"key":"2024041009354587900_btab019-B11","doi-asserted-by":"crossref","first-page":"bav005","DOI":"10.1093\/database\/bav005","article-title":"Automatic concept recognition using the human phenotype ontology reference and test suite corpora","volume":"2015","author":"Groza","year":"2015","journal-title":"Database"},{"key":"2024041009354587900_btab019-B12","first-page":"56","author":"Jonquet","year":"2009"},{"key":"2024041009354587900_btab019-B13","first-page":"D1077","author":"Kapushesky","year":"2012"},{"key":"2024041009354587900_btab019-B14","first-page":"1","author":"Kingma","year":"2015"},{"key":"2024041009354587900_btab019-B15","doi-asserted-by":"crossref","first-page":"D1018","DOI":"10.1093\/nar\/gky1105","article-title":"Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources","volume":"47","author":"K\u00f6hler","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2024041009354587900_btab019-B16","first-page":"652","article-title":"BANNER: an executable survey of advances in biomedical named entity recognition","author":"Leaman","year":"2008"},{"key":"2024041009354587900_btab019-B17","doi-asserted-by":"crossref","first-page":"2909","DOI":"10.1093\/bioinformatics\/btt474","article-title":"DNorm: disease name normalization with pairwise learning to rank","volume":"29","author":"Leaman","year":"2013","journal-title":"Bioinformatics"},{"key":"2024041009354587900_btab019-B18","doi-asserted-by":"crossref","first-page":"2839","DOI":"10.1093\/bioinformatics\/btw343","article-title":"TaggerOne: joint named entity recognition and normalization with semi-Markov Models","volume":"32","author":"Leaman","year":"2016","journal-title":"Bioinformatics"},{"key":"2024041009354587900_btab019-B19","doi-asserted-by":"crossref","first-page":"S1","DOI":"10.1186\/1758-2946-7-S1-S3","article-title":"tmChem: a high performance approach for chemical named entity recognition and 
normalization","volume":"7","author":"Leaman","year":"2015","journal-title":"J. Cheminf"},{"key":"2024041009354587900_btab019-B20","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"2024041009354587900_btab019-B21","doi-asserted-by":"crossref","first-page":"W566","DOI":"10.1093\/nar\/gkz386","article-title":"Doc2Hpo: a web application for efficient and accurate HPO concept curation","volume":"47","author":"Liu","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2024041009354587900_btab019-B22","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1155\/2017\/8565739","article-title":"Identifying human phenotype terms by combining machine learning and validation rules","volume":"2017","author":"Lobo","year":"2017","journal-title":"BioMed Res. Int"},{"key":"2024041009354587900_btab019-B23","doi-asserted-by":"crossref","first-page":"bav089","DOI":"10.1093\/database\/bav089","article-title":"SORTA: a system for ontology-based re-coding and technical annotation of biomedical phenotype data","volume":"2015","author":"Pang","year":"2015","journal-title":"Database"},{"key":"2024041009354587900_btab019-B24","first-page":"58","author":"Peng","year":"2019"},{"key":"2024041009354587900_btab019-B25","doi-asserted-by":"crossref","first-page":"761","DOI":"10.1016\/S0893-6080(98)00010-0","article-title":"Automatic early stopping using cross validation: quantifying the criteria","volume":"11","author":"Prechelt","year":"1998","journal-title":"Neural Netw"},{"key":"2024041009354587900_btab019-B26","first-page":"451","volume-title":"Pacific Symposium on Biocomputing","author":"Schwartz","year":"2003"},{"key":"2024041009354587900_btab019-B27","doi-asserted-by":"crossref","first-page":"D704","DOI":"10.1093\/nar\/gkz997","article-title":"The Monarch 
Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species","volume":"48","author":"Shefchek","year":"2020","journal-title":"Nucleic Acids Res"},{"key":"2024041009354587900_btab019-B18192218","doi-asserted-by":"crossref","first-page":"103246","DOI":"10.1016\/j.jbi.2019.103246","article-title":"HPO2Vec+: Leveraging heterogeneous knowledge resources to enrich node embeddings for the Human Phenotype Ontology","volume":"96","author":"Shen","year":"2019","journal-title":"Journal of Biomedical Informatics"},{"key":"2024041009354587900_btab019-B28","doi-asserted-by":"crossref","first-page":"bau045","DOI":"10.1093\/database\/bau045","article-title":"Automated semantic annotation of rare disease cases: a case study","volume":"2014","author":"Taboada","year":"2014","journal-title":"Database"},{"key":"2024041009354587900_btab019-B29","first-page":"5998","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2024041009354587900_btab019-B30","first-page":"1","article-title":"GNormPlus: an integrative approach for tagging genes, gene families, and protein domains","volume":"2015","author":"Wei","year":"2015","journal-title":"BioMed Res. 
Int"},{"key":"2024041009354587900_btab019-B31","article-title":"Google's neural machine translation system: bridging the gap between human and machine translation","author":"Wu","year":"2016","journal-title":"arXiv Preprint arXiv:1609.08144"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab019\/36158831\/btab019.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/13\/1884\/57196261\/btab019.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/13\/1884\/57196261\/btab019.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,4,10]],"date-time":"2024-04-10T09:40:30Z","timestamp":1712742030000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/13\/1884\/6104813"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,1,20]]},"references-count":32,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2021,7,27]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab019","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,7,1]]},"published":{"date-parts":[[2021,1,20]]}}}