{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,15]],"date-time":"2026-02-15T03:24:38Z","timestamp":1771125878579,"version":"3.50.1"},"reference-count":23,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2024,6,24]],"date-time":"2024-06-24T00:00:00Z","timestamp":1719187200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"European Union\u2019s Horizon 2020 research and innovation program"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Human Phenotype Ontology (HPO)-based phenotype concept recognition (CR) underpins a faster and more effective mechanism to create patient phenotype profiles or to document novel phenotype-centred knowledge statements. While the increasing adoption of large language models (LLMs) for natural language understanding has led to several LLM-based solutions, we argue that their intrinsic resource-intensive nature is not suitable for realistic management of the phenotype CR lifecycle. 
Consequently, we propose to go back to the basics and adopt a dictionary-based approach that enables both an immediate refresh of the ontological concepts and efficient re-analysis of past data.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We developed a dictionary-based approach that uses a large pre-built collection of clusters of morphologically equivalent tokens to address lexical variability, together with a more effective CR step that restricts entity boundary detection strictly to candidates consisting of tokens belonging to ontology concepts. Our method achieves state-of-the-art results (0.76 F1 on the GSC+ corpus) and a processing efficiency of 10\u2009000 publication abstracts in 5\u2009s.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>FastHPOCR is available as a Python package installable via pip. The source code is available at https:\/\/github.com\/tudorgroza\/fast_hpo_cr. A Java implementation of FastHPOCR will be made available as part of the Fenominal Java library available at https:\/\/github.com\/monarch-initiative\/fenominal. 
The up-to-date GSC-2024 corpus is available at https:\/\/github.com\/tudorgroza\/code-for-papers\/tree\/main\/gsc-2024.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae406","type":"journal-article","created":{"date-parts":[[2024,6,24]],"date-time":"2024-06-24T18:50:44Z","timestamp":1719255044000},"source":"Crossref","is-referenced-by-count":12,"title":["FastHPOCR: pragmatic, fast, and accurate concept recognition using the human phenotype ontology"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2267-8333","authenticated-orcid":false,"given":"Tudor","family":"Groza","sequence":"first","affiliation":[{"name":"Rare Care Centre, Perth Children\u2019s Hospital , Nedlands, WA 6009, Australia"},{"name":"Telethon Kids Institute , Nedlands, WA 6009, Australia"},{"name":"School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University , Bentley, WA 6102, Australia"},{"name":"SingHealth Duke-NUS Institute of Precision Medicine , Singapore 169609, Singapore"}]},{"given":"Dylan","family":"Gration","sequence":"additional","affiliation":[{"name":"Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital , Subiaco, WA 6008, Australia"}]},{"given":"Gareth","family":"Baynam","sequence":"additional","affiliation":[{"name":"Rare Care Centre, Perth Children\u2019s Hospital , Nedlands, WA 6009, Australia"},{"name":"Telethon Kids Institute , Nedlands, WA 6009, Australia"},{"name":"Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital , Subiaco, WA 6008, Australia"},{"name":"Faculty of Health and Medical Sciences, University of Western Australia , Crawley, WA 6009, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0736-9199","authenticated-orcid":false,"given":"Peter N","family":"Robinson","sequence":"additional","affiliation":[{"name":"Berlin Institute of Health at Charit\u00e9 \u2013 Universit\u00e4tsmedizin Berlin, Charit\u00e9platz 
1 , 10117 Berlin, Germany"},{"name":"The Jackson Laboratory for Genomic Medicine , Farmington, CT 06032, United States"}]}],"member":"286","published-online":{"date-parts":[[2024,6,24]]},"reference":[{"key":"2024070622011609700_btae406-B1","doi-asserted-by":"crossref","first-page":"e12596","DOI":"10.2196\/12596","article-title":"Identifying clinical terms in medical text using ontology-guided machine learning","volume":"7","author":"Arbabi","year":"2019","journal-title":"JMIR Med Inform"},{"key":"2024070622011609700_btae406-B2","first-page":"659","article-title":"Seven years since the launch of the matchmaker exchange: the evolution of genomic matchmaking","volume":"43","author":"Boycott","year":"2022","journal-title":"Hum Mutat"},{"key":"2024070622011609700_btae406-B3","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1038\/s41525-018-0053-8","article-title":"Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases","volume":"3","author":"Clark","year":"2018","journal-title":"NPJ Genom Med"},{"key":"2024070622011609700_btae406-B4","doi-asserted-by":"crossref","first-page":"1585","DOI":"10.1038\/s41436-018-0381-1","article-title":"ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis","volume":"21","author":"Deisseroth","year":"2019","journal-title":"Genet Med"},{"key":"2024070622011609700_btae406-B5","doi-asserted-by":"crossref","first-page":"1269","DOI":"10.1109\/TCBB.2022.3170301","article-title":"PhenoBERT: a combined deep learning method for automated recognition of human phenotype ontology","volume":"20","author":"Feng","year":"2023","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"key":"2024070622011609700_btae406-B6","doi-asserted-by":"crossref","first-page":"bav005","DOI":"10.1093\/database\/bav005","article-title":"Automatic concept recognition using the human phenotype 
ontology reference and test suite corpora","volume":"2015","author":"Groza","year":"2015","journal-title":"Database"},{"key":"2024070622011609700_btae406-B7","doi-asserted-by":"crossref","first-page":"btad716","DOI":"10.1093\/bioinformatics\/btad716","article-title":"Term-BLAST-like alignment tool for concept recognition in noisy clinical texts","volume":"39","author":"Groza","year":"2023","journal-title":"Bioinformatics"},{"key":"2024070622011609700_btae406-B8","doi-asserted-by":"crossref","first-page":"817","DOI":"10.1038\/s41587-022-01357-4","article-title":"The GA4GH phenopacket schema defines a computable representation of clinical data","volume":"40","author":"Jacobsen","year":"2022","journal-title":"Nat Biotechnol"},{"key":"2024070622011609700_btae406-B9","first-page":"56","author":"Jonquet","year":"2009"},{"key":"2024070622011609700_btae406-B10","doi-asserted-by":"crossref","first-page":"D1018","DOI":"10.1093\/nar\/gky1105","article-title":"Expansion of the human phenotype ontology (HPO) knowledge base and resources","volume":"47","author":"K\u00f6hler","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2024070622011609700_btae406-B11","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2019","journal-title":"Bioinformatics"},{"key":"2024070622011609700_btae406-B12","doi-asserted-by":"crossref","first-page":"W566","DOI":"10.1093\/nar\/gkz386","article-title":"Doc2Hpo: a web application for efficient and accurate HPO concept curation","volume":"47","author":"Liu","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2024070622011609700_btae406-B13","doi-asserted-by":"crossref","first-page":"8565739","DOI":"10.1155\/2017\/8565739","article-title":"Identifying human phenotype terms by combining machine learning and validation 
rules","volume":"2017","author":"Lobo","year":"2017","journal-title":"Biomed Res Int"},{"key":"2024070622011609700_btae406-B14","doi-asserted-by":"crossref","first-page":"1884","DOI":"10.1093\/bioinformatics\/btab019","article-title":"PhenoTagger: a hybrid method for phenotype concept recognition using human phenotype ontology","volume":"37","author":"Luo","year":"2021","journal-title":"Bioinformatics"},{"key":"2024070622011609700_btae406-B15","doi-asserted-by":"crossref","first-page":"bav089","DOI":"10.1093\/database\/bav089","article-title":"SORTA: a system for ontology-based re-coding and technical annotation of biomedical phenotype data","volume":"2015","author":"Pang","year":"2015","journal-title":"Database"},{"key":"2024070622011609700_btae406-B16","doi-asserted-by":"crossref","first-page":"610","DOI":"10.1016\/j.ajhg.2008.09.017","article-title":"The human phenotype ontology: a tool for annotating and analyzing human hereditary disease","volume":"83","author":"Robinson","year":"2008","journal-title":"Am J Hum Genet"},{"key":"2024070622011609700_btae406-B17","doi-asserted-by":"crossref","first-page":"D704","DOI":"10.1093\/nar\/gkz997","article-title":"The monarch initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species","volume":"48","author":"Shefchek","year":"2020","journal-title":"Nucleic Acids Res"},{"key":"2024070622011609700_btae406-B18","doi-asserted-by":"crossref","first-page":"595","DOI":"10.1016\/j.ajhg.2016.07.005","article-title":"A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease","volume":"99","author":"Smedley","year":"2016","journal-title":"Am J Hum Genet"},{"key":"2024070622011609700_btae406-B19","doi-asserted-by":"crossref","first-page":"58","DOI":"10.1016\/j.ajhg.2018.05.010","article-title":"Deep phenotyping on electronic health records facilitates genetic diagnosis by clinical 
exomes","volume":"103","author":"Son","year":"2018","journal-title":"Am J Hum Genet"},{"key":"2024070622011609700_btae406-B20","doi-asserted-by":"crossref","first-page":"bau045","DOI":"10.1093\/database\/bau045","article-title":"Automated semantic annotation of rare disease cases: a case study","volume":"2014","author":"Taboada","year":"2014","journal-title":"Database"},{"key":"2024070622011609700_btae406-B21","doi-asserted-by":"crossref","first-page":"223","DOI":"10.1016\/j.ymgme.2015.11.003","article-title":"Undiagnosed diseases network international (UDNI): white paper for global actions to meet patient needs","volume":"116","author":"Taruscio","year":"2015","journal-title":"Mol Genet Metab"},{"key":"2024070622011609700_btae406-B22","article-title":"PheNorm, a language model normalizer of physical examinations from genetics clinical notes","author":"Weissenbacher","year":"2023"},{"key":"2024070622011609700_btae406-B23","doi-asserted-by":"crossref","first-page":"100887","DOI":"10.1016\/j.patter.2023.100887","article-title":"Enhancing phenotype recognition in clinical notes using large language models: phenoBCBERT and 
PhenoGPT","volume":"5","author":"Yang","year":"2024","journal-title":"Patterns"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae406\/58319602\/btae406.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/7\/btae406\/58463767\/btae406.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/7\/btae406\/58463767\/btae406.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,6]],"date-time":"2024-07-06T22:02:58Z","timestamp":1720303378000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae406\/7698025"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,6,24]]},"references-count":23,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2024,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae406","relation":{},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,7]]},"published":{"date-parts":[[2024,6,24]]},"article-number":"btae406"}}