{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T00:19:21Z","timestamp":1758845961273,"version":"3.44.0"},"reference-count":69,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T00:00:00Z","timestamp":1758758400000},"content-version":"vor","delay-in-days":267,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,1,18]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Electronic health record (EHR) phenotyping is a high-demand task because most phenotypes are not usually readily defined. The objective of this study is to develop an effective text-mining approach that automatically extracts clinical phenotype definitions-related sentences from biomedical literature. Abstract-level and full-text sentence-level classifiers were developed for clinical phenotype discovery from PubMed. We compared the performance of the abstract-level classifier on machine learning algorithms: support vector machine (SVM), logistic regression (LR), na\u00efve Bayes, and decision tree. SVM classifier showed the best performance (F-measure\u00a0=\u00a098%) in identifying clinical phenotype-relevant abstracts. It predicted 459\u2009406 clinical phenotype-related abstracts. For the full-text sentence-level classifier, we compared the performance of SVM, LR, na\u00efve Bayes, decision trees, convolutional neural networks, Bidirectional Encoder Representations from Transformers (BERT), and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT). BioBERT model was the best performer among the full-text sentence-level classifiers (F-measure\u00a0=\u00a091%). We used these two optimal classifiers for large-scale screening of the PubMed database, starting with abstract retrieval and followed by predicting clinical phenotype-related sentences from full texts. The large-scale screening predicted over two million clinical phenotype-related sentences. Lastly, we developed a knowledgebase using positively predicted sentences, allowing users to query clinical phenotype-related sentences with a phenotype term of interest. The Clinical Phenotype Knowledgebase (CliPheKB) enables users to search for clinical phenotype terms and retrieve sentences related to a specific clinical phenotype of interest (https:\/\/cliphekb.shinyapps.io\/phenotype-main\/). Building upon prior methods, we developed a text mining pipeline to automatically extract clinical phenotype definition-related sentences from the literature. This high-throughput phenotyping approach is generalizable and scalable, and it is complementary to existing EHR phenotyping methods.<\/jats:p>","DOI":"10.1093\/database\/baaf047","type":"journal-article","created":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T14:26:57Z","timestamp":1758810417000},"source":"Crossref","is-referenced-by-count":0,"title":["Biomedical literature-based clinical phenotype definition discovery using large language models"],"prefix":"10.1093","volume":"2025","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0400-823X","authenticated-orcid":false,"given":"Samar","family":"Binkheder","sequence":"first","affiliation":[{"name":"King Saud University Medical Informatics Unit, Department of Medical Education, College of Medicine, , Riyadh 12372 ,","place":["Saudi Arabia"]},{"name":"Ohio State University Department of Biomedical Informatics,, College of Medicine, , 1800 Cannon Drive, Columbus, OH 43210 ,","place":["United States"]}]},{"given":"Xiaofu","family":"Liu","sequence":"additional","affiliation":[{"name":"Ohio State University Department of Biomedical Informatics,, College of Medicine, , 1800 Cannon Drive, Columbus, OH 43210 ,","place":["United States"]}]},{"given":"Michael","family":"Wu","sequence":"additional","affiliation":[{"name":"gRED Computational Sciences, Computational Biology & Translation, Genentech, Inc , 1 DNA Way, South San Francisco, CA 94080 ,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1904-1737","authenticated-orcid":false,"given":"Lei","family":"Wang","sequence":"additional","affiliation":[{"name":"Ohio State University Department of Biomedical Informatics,, College of Medicine, , 1800 Cannon Drive, Columbus, OH 43210 ,","place":["United States"]}]},{"given":"Aditi","family":"Shendre","sequence":"additional","affiliation":[{"name":"Ohio State University Department of Biomedical Informatics,, College of Medicine, , 1800 Cannon Drive, Columbus, OH 43210 ,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6554-0695","authenticated-orcid":false,"given":"Sara K","family":"Quinney","sequence":"additional","affiliation":[{"name":"Indiana University Department of Obstetrics and Gynecology, School of Medicine, , 950 W Walnut Street, Indianapolis, IN 46202 ,","place":["United States"]}]},{"given":"Wei-Qi","family":"Wei","sequence":"additional","affiliation":[{"name":"Vanderbilt University Medical Center Department of Biomedical Informatics, , 2525 West End Ave, Nashville, TN 37203 ,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0746-1809","authenticated-orcid":false,"given":"Lang","family":"Li","sequence":"additional","affiliation":[{"name":"Ohio State University Department of Biomedical Informatics,, College of Medicine, , 1800 Cannon Drive, Columbus, OH 43210 ,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2025,9,24]]},"reference":[{"key":"2025092510265335800_bib1","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3127881","article-title":"Mining electronic health records (EHRs): a survey","volume":"50","author":"Yadav","year":"2018","journal-title":"ACM Comput Surv"},{"key":"2025092510265335800_bib2","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1093\/jamia\/ocac247","article-title":"Advancing phenotyping through informatics innovation","volume":"30","author":"Bakken","year":"2023","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib3","doi-asserted-by":"publisher","first-page":"e226","DOI":"10.1136\/amiajnl-2013-001926","article-title":"Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory","volume":"20","author":"Richesson","year":"2013","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib4","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1146\/annurev-biodatasci-080917-013315","article-title":"Advances in electronic phenotyping: from rule-based definitions to machine learning models","volume":"1","author":"Banda","year":"2018","journal-title":"Annu Rev Biomed Data Sci"},{"key":"2025092510265335800_bib5","first-page":"189","article-title":"Naive electronic health record phenotype identification for Rheumatoid arthritis","author":"Carroll","year":"2011","journal-title":"AMIA Annu Symp Proc"},{"key":"2025092510265335800_bib6","doi-asserted-by":"publisher","first-page":"103746","DOI":"10.1016\/j.jbi.2021.103746","article-title":"Automatic phenotyping of electronical health record: pheVis algorithm","volume":"117","author":"Ferte","year":"2021","journal-title":"J Biomed Inform"},{"key":"2025092510265335800_bib7","doi-asserted-by":"publisher","first-page":"2626","DOI":"10.1093\/jamia\/ocab202","article-title":"Enhancing the use of EHR systems for pragmatic embedded research: lessons from the NIH Health Care Systems Research Collaboratory","volume":"28","author":"Richesson","year":"2021","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib8","doi-asserted-by":"publisher","first-page":"e147","DOI":"10.1136\/amiajnl-2012-000896","article-title":"Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network","volume":"20","author":"Newton","year":"2013","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib9","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1186\/s13073-015-0166-y","article-title":"Extracting research-quality phenotypes from electronic health records to support precision medicine","volume":"7","author":"Wei","year":"2015","journal-title":"Genome Med"},{"key":"2025092510265335800_bib10","unstructured":"BHF Data Science Centre . (2023). Ensuring Phenotyping Algorithms Using National Electronic Health Records Are FAIR: Meeting the Needs of the Cardiometabolic Research Community. Zenodo. 10.5281\/zenodo.10209724"},{"key":"2025092510265335800_bib11","doi-asserted-by":"publisher","first-page":"212","DOI":"10.1136\/amiajnl-2011-000439","article-title":"Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study","volume":"19","author":"Kho","year":"2012","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib12","doi-asserted-by":"publisher","first-page":"1839","DOI":"10.1161\/CIRCULATIONAHA.117.031356","article-title":"LPA variants are associated with residual cardiovascular risk in patients receiving statins","volume":"138","author":"Wei","year":"2018","journal-title":"Circulation"},{"key":"2025092510265335800_bib13","first-page":"112","article-title":"Creation and validation of an EMR-based algorithm for identifying major adverse cardiac events while on statins","volume":"2014","author":"Wei","year":"2014","journal-title":"AMIA Jt Summits Transl Sci Proc"},{"key":"2025092510265335800_bib14","doi-asserted-by":"publisher","first-page":"717","DOI":"10.1038\/s41598-018-36745-x","article-title":"Learning from longitudinal data in electronic health record and genetic data to improve cardiovascular event prediction","volume":"9","author":"Zhao","year":"2019","journal-title":"Sci Rep"},{"key":"2025092510265335800_bib15","doi-asserted-by":"publisher","first-page":"103270","DOI":"10.1016\/j.jbi.2019.103270","article-title":"Detecting time-evolving phenotypic topics via tensor factorization on electronic health records: cardiovascular disease case study","volume":"98","author":"Zhao","year":"2019","journal-title":"J Biomed Inform"},{"key":"2025092510265335800_bib16","doi-asserted-by":"publisher","first-page":"1359","DOI":"10.1093\/jamia\/ocy056","article-title":"PheProb: probabilistic phenotyping using diagnosis codes to improve power for genetic association studies","volume":"25","author":"Sinnott","year":"2018","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib17","doi-asserted-by":"publisher","first-page":"1305","DOI":"10.1093\/jamia\/ocad077","article-title":"De-black-boxing health AI: demonstrating reproducible machine learning computable phenotypes using the N3C-RECOVER Long COVID model in the All of Us data repository","volume":"30","author":"Pfaff","year":"2023","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib18","doi-asserted-by":"publisher","first-page":"e532","DOI":"10.1016\/S2589-7500(22)00048-6","article-title":"Identifying who has long COVID in the USA: a machine learning approach using N3C data","volume":"4","author":"Pfaff","year":"2022","journal-title":"Lancet Digit Health"},{"key":"2025092510265335800_bib19","doi-asserted-by":"publisher","first-page":"367","DOI":"10.1093\/jamia\/ocac216","article-title":"Machine learning approaches for electronic health records phenotyping: a methodical review","volume":"30","author":"Yang","year":"2023","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib20","doi-asserted-by":"publisher","first-page":"e0175508","DOI":"10.1371\/journal.pone.0175508","article-title":"Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record","volume":"12","author":"Wei","year":"2017","journal-title":"PLoS One"},{"key":"2025092510265335800_bib21","doi-asserted-by":"publisher","first-page":"e14325","DOI":"10.2196\/14325","article-title":"Mapping ICD-10 and ICD-10-CM codes to phecodes: workflow development and initial evaluation","volume":"7","author":"Wu","year":"2019","journal-title":"JMIR Med Inform"},{"key":"2025092510265335800_bib22","doi-asserted-by":"publisher","first-page":"456","DOI":"10.1093\/jamia\/ocac234","article-title":"Evaluating resources composing the PheMAP knowledge base to enhance high-throughput phenotyping","volume":"30","author":"Wan","year":"2023","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib23","doi-asserted-by":"publisher","first-page":"1675","DOI":"10.1093\/jamia\/ocaa104","article-title":"PheMap: a multi-resource knowledge base for high-throughput phenotyping within electronic health records","volume":"27","author":"Zheng","year":"2020","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib24","doi-asserted-by":"publisher","first-page":"e20","DOI":"10.1093\/jamia\/ocv130","article-title":"Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance","volume":"23","author":"Wei","year":"2016","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib25","doi-asserted-by":"publisher","first-page":"3426","DOI":"10.1038\/s41596-019-0227-6","article-title":"High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP)","volume":"14","author":"Zhang","year":"2019","journal-title":"Nat Protoc"},{"key":"2025092510265335800_bib26","first-page":"682","article-title":"Using linked data for mining drug\u2013drug interactions in electronic health records","volume":"192","author":"Pathak","year":"2013","journal-title":"Stud Health Technol"},{"key":"2025092510265335800_bib27","doi-asserted-by":"publisher","first-page":"103122","DOI":"10.1016\/j.jbi.2019.103122","article-title":"Feature extraction for phenotyping from semantic and knowledge resources","volume":"91","author":"Ning","year":"2019","journal-title":"J Biomed Inform"},{"key":"2025092510265335800_bib28","doi-asserted-by":"publisher","first-page":"993","DOI":"10.1093\/jamia\/ocv034","article-title":"Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources","volume":"22","author":"Yu","year":"2015","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib29","doi-asserted-by":"publisher","first-page":"e143","DOI":"10.1093\/jamia\/ocw135","article-title":"Surrogate-assisted feature extraction for high-throughput phenotyping","volume":"24","author":"Yu","year":"2016","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib30","doi-asserted-by":"crossref","first-page":"515","DOI":"10.4338\/ACI-2013-04-RA-0028","article-title":"Automating case definitions using literature-based reasoning","volume":"4","author":"Botsis","year":"2013","journal-title":"Appl Clin Inform"},{"key":"2025092510265335800_bib31","doi-asserted-by":"publisher","first-page":"e1002614","DOI":"10.1371\/journal.pcbi.1002614","article-title":"Literature based drug interaction prediction with clinical assessment using electronic medical records: novel myopathy associated drug interactions","volume":"8","author":"Duke","year":"2012","journal-title":"PLoS Comput Biol"},{"key":"2025092510265335800_bib32","first-page":"S91","article-title":"Translational drug interaction evidence gap discovery using text mining","volume":"101","author":"Wu","year":"2017","journal-title":"Clin Pharmacol Ther"},{"key":"2025092510265335800_bib33","first-page":"376","article-title":"Evidence-based medicine, the essential role of systematic reviews, and the need for automated text mining tools","volume-title":"Proceedings of the 1st ACM International Health Informatics Symposium,\u00a0Arlington, VA","author":"Cohen"},{"key":"2025092510265335800_bib34","doi-asserted-by":"publisher","first-page":"17","DOI":"10.1186\/s13326-022-00272-6","article-title":"PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature","volume":"13","author":"Binkheder","year":"2022","journal-title":"J Biomed Semantics"},{"key":"2025092510265335800_bib35","doi-asserted-by":"crossref","DOI":"10.1109\/ICHI.2018.00061","article-title":"Analyzing patterns of literature-based phenotyping definitions for text mining applications","volume-title":"IEEE International Conference on Healthcare Informatics (ICHI)","author":"Binkheder","year":"2018"},{"key":"2025092510265335800_bib36","doi-asserted-by":"publisher","first-page":"1046","DOI":"10.1093\/jamia\/ocv202","article-title":"PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability","volume":"23","author":"Kirby","year":"2016","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib37","doi-asserted-by":"publisher","first-page":"221","DOI":"10.1136\/amiajnl-2013-001935","article-title":"A review of approaches to identifying patient phenotype cohorts using electronic health records","volume":"21","author":"Shivade","year":"2014","journal-title":"J Am Med Inform Assoc"},{"key":"2025092510265335800_bib38","first-page":"1269","article-title":"Weka: a machine learning workbench for data mining","author":"Frank","year":"2010","journal-title":"Data Mining and Knowledge Discovery Handbook"},{"key":"2025092510265335800_bib39","first-page":"185","article-title":"Fast training of support vector machines using sequential minimal optimization","volume-title":"Advances in kernel methods","author":"Platt","year":"1999"},{"key":"2025092510265335800_bib40","volume-title":"C4. 5: Programs for Machine Learning","author":"Quinlan","year":"2014"},{"key":"2025092510265335800_bib41","first-page":"338","article-title":"Estimating continuous distributions in Bayesian classifiers","volume-title":"Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, Canada","author":"John","year":"1995"},{"key":"2025092510265335800_bib42","first-page":"191","article-title":"Ridge estimators in logistic-regression","volume":"41","author":"Lecessie","year":"1992","journal-title":"J R Stat Soc Ser C"},{"key":"2025092510265335800_bib43","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511810114","volume-title":"Data Mining and Analysis: Fundamental Concepts and Algorithms","author":"Zaki","year":"2014"},{"key":"2025092510265335800_bib44","doi-asserted-by":"publisher","first-page":"65","DOI":"10.1007\/BF03256752","article-title":"MedDRA","volume":"23","author":"Mozzicato","year":"2009","journal-title":"Pharm Med"},{"key":"2025092510265335800_bib45","doi-asserted-by":"publisher","first-page":"bar065","DOI":"10.1093\/database\/bar065","article-title":"MEDIC: a practical disease vocabulary used at the Comparative Toxicogenomics Database","volume":"2012","author":"Davis","year":"2012","journal-title":"Database"},{"key":"2025092510265335800_bib46","first-page":"4171","article-title":"BERT: pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the Conference of NAACL HLT 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Devlin","year":"2018"},{"key":"2025092510265335800_bib47","doi-asserted-by":"publisher","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"2025092510265335800_bib48","article-title":"Xpdf and XpdfReader 1995","author":"Noonburg"},{"key":"2025092510265335800_bib49"},{"key":"2025092510265335800_bib50","doi-asserted-by":"publisher","first-page":"i180","DOI":"10.1093\/bioinformatics\/btg1023","article-title":"GENIA corpus\u2014semantically annotated corpus for bio-textmining","volume":"19","author":"Kim","year":"2003","journal-title":"Bioinformatics"},{"key":"2025092510265335800_bib51","doi-asserted-by":"crossref","first-page":"3174","DOI":"10.1093\/bioinformatics\/btp548","article-title":"Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion","volume":"25","author":"Agarwal","year":"2009","journal-title":"Bioinformatics"},{"key":"2025092510265335800_bib52","author":"The R Project for Statistical Computing"},{"key":"2025092510265335800_bib53"},{"key":"2025092510265335800_bib54"},{"key":"2025092510265335800_bib55"},{"key":"2025092510265335800_bib56"},{"key":"2025092510265335800_bib57"},{"key":"2025092510265335800_bib58"},{"key":"2025092510265335800_bib59"},{"key":"2025092510265335800_bib60"},{"key":"2025092510265335800_bib61"},{"key":"2025092510265335800_bib62"},{"key":"2025092510265335800_bib63","doi-asserted-by":"publisher","first-page":"216","DOI":"10.1093\/bioinformatics\/btg393","article-title":"GAPSCORE: finding gene and protein names one word at a time","volume":"20","author":"Chang","year":"2004","journal-title":"Bioinformatics"},{"key":"2025092510265335800_bib64","doi-asserted-by":"publisher","first-page":"512","DOI":"10.1016\/j.jbi.2004.08.004","article-title":"Term identification in the biomedical literature","volume":"37","author":"Krauthammer","year":"2004","journal-title":"J Biomed Inform"},{"key":"2025092510265335800_bib65","doi-asserted-by":"publisher","first-page":"373","DOI":"10.1007\/s13042-015-0426-6","article-title":"A comparative study for biomedical named entity recognition","volume":"9","author":"Wang","year":"2018","journal-title":"Int J Mach Learn Cyb"},{"key":"2025092510265335800_bib66","doi-asserted-by":"publisher","first-page":"203","DOI":"10.1038\/s41586-018-0579-z","article-title":"The UK Biobank resource with deep phenotyping and genomic data","volume":"562","author":"Bycroft","year":"2018","journal-title":"Nature"},{"key":"2025092510265335800_bib67","doi-asserted-by":"publisher","first-page":"668","DOI":"10.1056\/NEJMsr1809937","article-title":"The \u2018All of Us\u2019 Research Program","volume":"381","author":"The All of Us\u00a0Research Program Investigators","year":"2019","journal-title":"N Engl J Med"},{"key":"2025092510265335800_bib68","doi-asserted-by":"publisher","first-page":"363","DOI":"10.1176\/appi.ajp.2014.14030423","article-title":"Validation of electronic health record phenotyping of bipolar disorder cases and controls","volume":"172","author":"Castro","year":"2015","journal-title":"Am J Psychiatry"},{"key":"2025092510265335800_bib69","doi-asserted-by":"publisher","first-page":"299","DOI":"10.1136\/amiajnl-2012-001506","article-title":"A sense inventory for clinical abbreviations and acronyms created using clinical notes and medical dictionary resources","volume":"21","author":"Moon","year":"2014","journal-title":"J Am Med Inform Assoc"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaf047\/64394823\/baaf047.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaf047\/64394823\/baaf047.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T14:27:02Z","timestamp":1758810422000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baaf047\/8263859"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":69,"URL":"https:\/\/doi.org\/10.1093\/database\/baaf047","relation":{},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]},"article-number":"baaf047"}}