{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T20:59:17Z","timestamp":1773089957221,"version":"3.50.1"},"reference-count":48,"publisher":"Oxford University Press (OUP)","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2016,11,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Objective Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of utilizing semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record.<\/jats:p><jats:p>Methods We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1 penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard.<\/jats:p><jats:p>Results Our models for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.90, 0.89, and 0.86, 0.89, respectively. Local implementations of the previously validated rule-based definitions for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.96, 0.92 and 0.84, 0.87, respectively.<\/jats:p><jats:p>We have demonstrated feasibility of learning phenotype models using imperfectly labeled data for a chronic and acute phenotype. 
Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach.<\/jats:p><jats:p>Conclusions Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.<\/jats:p>","DOI":"10.1093\/jamia\/ocw028","type":"journal-article","created":{"date-parts":[[2016,5,13]],"date-time":"2016-05-13T01:29:08Z","timestamp":1463102948000},"page":"1166-1173","source":"Crossref","is-referenced-by-count":120,"title":["Learning statistical models of phenotypes using noisy labeled training data"],"prefix":"10.1093","volume":"23","author":[{"given":"Vibhu","family":"Agarwal","sequence":"first","affiliation":[{"name":"Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA"}]},{"given":"Tanya","family":"Podchiyska","sequence":"additional","affiliation":[{"name":"Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA"}]},{"given":"Juan M","family":"Banda","sequence":"additional","affiliation":[{"name":"Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA 94305-5479, USA"}]},{"given":"Veena","family":"Goel","sequence":"additional","affiliation":[{"name":"Department of Pediatrics, Stanford University School of Medicine, Stanford CA 94305-5208, USA"},{"name":"Department of Clinical Informatics, Stanford Children\u2019s Health, Stanford CA 94305-5474, USA"}]},{"given":"Tiffany I","family":"Leung","sequence":"additional","affiliation":[{"name":"Division of General Medical Disciplines, Stanford University, Stanford CA 94305, USA"}]},{"given":"Evan P","family":"Minty","sequence":"additional","affiliation":[{"name":"Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, 
USA"},{"name":"Faculty of Medicine, University of Calgary, Calgary Alberta, T2N 4N1, Canada"}]},{"given":"Timothy E","family":"Sweeney","sequence":"additional","affiliation":[{"name":"Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA"},{"name":"Department of Surgery, Stanford Hospital & Clinics, Stanford CA 94305-2200, USA"}]},{"given":"Elsie","family":"Gyang","sequence":"additional","affiliation":[{"name":"Division of Vascular Surgery, Stanford Hospital & Clinics, Stanford CA 94305-5642, USA"}]},{"given":"Nigam H","family":"Shah","sequence":"additional","affiliation":[{"name":"Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA 94305-5479, USA"}]}],"member":"286","published-online":{"date-parts":[[2016,5,12]]},"reference":[{"issue":"7","key":"2020110612372696500_ocw028-B1","doi-asserted-by":"crossref","first-page":"1229","DOI":"10.1377\/hlthaff.2014.0099","article-title":"A \u2018green button' for using aggregate patient data at the point of care","volume":"33","author":"Longhurst","year":"2014","journal-title":"Health Aff (Millwood)."},{"issue":"e2","key":"2020110612372696500_ocw028-B2","doi-asserted-by":"crossref","first-page":"e206","DOI":"10.1136\/amiajnl-2013-002428","article-title":"Electronic health records-driven phenotyping: challenges, recent advances, and perspectives","volume":"20","author":"Pathak","year":"2013","journal-title":"J Am Med Inform Assoc."},{"issue":"12","key":"2020110612372696500_ocw028-B3","doi-asserted-by":"crossref","first-page":"1095","DOI":"10.1038\/nbt.2757","article-title":"Mining the ultimate phenome repository","volume":"31","author":"Shah","year":"2013","journal-title":"Nat Biotechnol."},{"issue":"2","key":"2020110612372696500_ocw028-B4","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1136\/amiajnl-2013-001935","article-title":"A review of approaches to identifying patient phenotype cohorts using electronic health 
records","volume":"21","author":"Shivade","year":"2014","journal-title":"J Am Med Inform Assoc."},{"issue":"10","key":"2020110612372696500_ocw028-B5","doi-asserted-by":"crossref","first-page":"761","DOI":"10.1038\/gim.2013.72","article-title":"The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future","volume":"15","author":"Gottesman","year":"2013","journal-title":"Genet Med."},{"key":"2020110612372696500_ocw028-B6","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1186\/1755-8794-4-13","article-title":"The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies","volume":"4","author":"McCarty","year":"2011","journal-title":"BMC Med Genomics."},{"issue":"1","key":"2020110612372696500_ocw028-B7","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1186\/s13073-015-0166-y","article-title":"Extracting research-quality phenotypes from electronic health records to support precision medicine","volume":"7","author":"Wei","year":"2015","journal-title":"Genome Med."},{"issue":"12","key":"2020110612372696500_ocw028-B8","doi-asserted-by":"crossref","first-page":"1102","DOI":"10.1038\/nbt.2749","article-title":"Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data","volume":"31","author":"Denny","year":"2013","journal-title":"Nat Biotechnol."},{"issue":"3","key":"2020110612372696500_ocw028-B9","doi-asserted-by":"crossref","first-page":"329","DOI":"10.1586\/1744666X.2015.1009895","article-title":"Intelligent use and clinical benefits of electronic health records in rheumatoid arthritis","volume":"11","author":"Carroll","year":"2015","journal-title":"Expert Rev Clin Immunol."},{"key":"2020110612372696500_ocw028-B10","first-page":"1281","article-title":"Electronic medical records and genomics (eMERGE) network exploration in cataract: several new potential susceptibility 
loci","volume":"20","author":"Ritchie","year":"2014","journal-title":"Mol Vis."},{"key":"2020110612372696500_ocw028-B11","article-title":"Automatic identification of methotrexate-induced liver toxicity in patients with rheumatoid arthritis from the electronic medical record","author":"Lin","year":"2014","journal-title":"J Am Med Inform Assoc."},{"issue":"5","key":"2020110612372696500_ocw028-B12","doi-asserted-by":"crossref","first-page":"601","DOI":"10.1136\/amiajnl-2011-000163","article-title":"A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries","volume":"18","author":"Jiang","year":"2011","journal-title":"J Am Med Inform Assoc."},{"issue":"6","key":"2020110612372696500_ocw028-B13","doi-asserted-by":"crossref","first-page":"859","DOI":"10.1136\/amiajnl-2011-000121","article-title":"A method and knowledge base for automated inference of patient problems from structured data in an electronic medical record","volume":"18","author":"Wright","year":"2011","journal-title":"J Am Med Inform Assoc."},{"issue":"e1","key":"2020110612372696500_ocw028-B14","doi-asserted-by":"crossref","first-page":"e147","DOI":"10.1136\/amiajnl-2012-000896","article-title":"Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network","volume":"20","author":"Newton","year":"2013","journal-title":"J Am Med Inform Assoc."},{"issue":"e2","key":"2020110612372696500_ocw028-B15","doi-asserted-by":"crossref","first-page":"e243","DOI":"10.1136\/amiajnl-2013-001930","article-title":"A collaborative approach to developing an electronic health record phenotyping algorithm for drug-induced liver injury","volume":"20","author":"Overby","year":"2013","journal-title":"J Am Med Inform Assoc."},{"issue":"8 Suppl 3","key":"2020110612372696500_ocw028-B16","doi-asserted-by":"crossref","first-page":"S80","DOI":"10.1097\/MLR.0b013e31829b1d48","article-title":"Challenges in using 
electronic health record data for CER: experience of 4 learning organizations and solutions applied","volume":"51","author":"Bayley","year":"2013","journal-title":"Med Care."},{"issue":"1","key":"2020110612372696500_ocw028-B17","first-page":"37","article-title":"Hospitals scramble to meet deadlines for adopting electronic health records: pharmacy systems will be updated slowly but surely","volume":"36","author":"Barlas","year":"2011","journal-title":"P T."},{"key":"2020110612372696500_ocw028-B18","volume-title":"Phenotype KnowledgeBase","author":"PheKB"},{"issue":"8","key":"2020110612372696500_ocw028-B19","doi-asserted-by":"crossref","first-page":"1120","DOI":"10.1002\/acr.20184","article-title":"Electronic medical records for discovery research in rheumatoid arthritis","volume":"62","author":"Liao","year":"2010","journal-title":"Arthritis Care Res (Hoboken)."},{"issue":"6","key":"2020110612372696500_ocw028-B20","doi-asserted-by":"crossref","first-page":"749","DOI":"10.1093\/aje\/kwt441","article-title":"Using natural language processing to improve efficiency of manual chart abstraction in research: the case of breast cancer recurrence","volume":"179","author":"Carrell","year":"2014","journal-title":"Am J Epidemiol."},{"issue":"e1","key":"2020110612372696500_ocw028-B21","doi-asserted-by":"crossref","first-page":"e162","DOI":"10.1136\/amiajnl-2011-000583","article-title":"Portability of an algorithm to identify rheumatoid arthritis in electronic health records","volume":"19","author":"Carroll","year":"2012","journal-title":"J Am Med Inform Assoc."},{"issue":"e2","key":"2020110612372696500_ocw028-B22","doi-asserted-by":"crossref","first-page":"e232","DOI":"10.1136\/amiajnl-2013-001932","article-title":"Defining a comprehensive verotype using electronic health records for personalized medicine","volume":"20","author":"Boland","year":"2013","journal-title":"J Am Med Inform 
Assoc."},{"issue":"e2","key":"2020110612372696500_ocw028-B23","doi-asserted-by":"crossref","first-page":"e275","DOI":"10.1136\/amiajnl-2013-001856","article-title":"Using electronic health records data to identify patients with chronic pain in a primary care setting","volume":"20","author":"Tian","year":"2013","journal-title":"J Am Med Inform Assoc."},{"issue":"6","key":"2020110612372696500_ocw028-B24","doi-asserted-by":"crossref","first-page":"759","DOI":"10.1093\/aje\/kwt443","article-title":"Invited commentary: Observational research in the age of the electronic health record","volume":"179","author":"Chute","year":"2014","journal-title":"Am J Epidemiol."},{"key":"2020110612372696500_ocw028-B25","doi-asserted-by":"crossref","first-page":"105","DOI":"10.1016\/j.jbi.2014.08.012","article-title":"Evaluation of matched control algorithms in EHR-based phenotyping studies: a case study of inflammatory bowel disease comorbidities","volume":"52","author":"Castro","year":"2014","journal-title":"J Biomed Inform."},{"issue":"e2","key":"2020110612372696500_ocw028-B26","doi-asserted-by":"crossref","first-page":"e253","DOI":"10.1136\/amiajnl-2013-001945","article-title":"Applying active learning to high-throughput phenotyping algorithms for electronic health records data","volume":"20","author":"Chen","year":"2013","journal-title":"J Am Med Inform Assoc."},{"issue":"11","key":"2020110612372696500_ocw028-B27","doi-asserted-by":"crossref","first-page":"1369","DOI":"10.1007\/s00439-014-1466-9","article-title":"Improving the power of genetic association tests with imperfect phenotype derived from electronic medical records","volume":"133","author":"Sinnott","year":"2014","journal-title":"Hum Genet."},{"key":"2020110612372696500_ocw028-B28","doi-asserted-by":"crossref","first-page":"260","DOI":"10.1016\/j.jbi.2014.07.007","article-title":"Relational machine learning for electronic health record-driven phenotyping","volume":"52","author":"Peissig","year":"2014","journal-title":"J 
Biomed Inform."},{"issue":"2","key":"2020110612372696500_ocw028-B29","doi-asserted-by":"crossref","first-page":"239","DOI":"10.1006\/jcss.1996.0019","article-title":"General bounds on the number of examples needed for learning probabilistic concepts","volume":"52","author":"Simon","year":"1996","journal-title":"J Comput Syst Sci."},{"issue":"4","key":"2020110612372696500_ocw028-B30","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1016\/0020-0190(96)00006-3","article-title":"On the sample complexity of noise-tolerant learning","volume":"57","author":"Aslam","year":"1996","journal-title":"Inform Process Lett."},{"key":"2020110612372696500_ocw028-B31","volume-title":"Health Outcomes of Interest","author":"Observational Medical Outcomes Partnership"},{"key":"2020110612372696500_ocw028-B32","doi-asserted-by":"crossref","DOI":"10.1093\/jamia\/ocw011","article-title":"Electronic Medical Record Phenotyping using the Anchor & Learn Framework","author":"Halpern","year":"2016","journal-title":"J Am Med Inform Assoc."},{"issue":"1","key":"2020110612372696500_ocw028-B33","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1136\/amiajnl-2014-002902","article-title":"Functional evaluation of out-of-the-box text-mining tools for data-mining tasks","volume":"22","author":"Jung","year":"2015","journal-title":"J Am Med Inform Assoc."},{"key":"2020110612372696500_ocw028-B34","volume-title":"Using narratives as a source to automatically learn phenotype models","author":"Agarwal","year":"2014"},{"key":"2020110612372696500_ocw028-B35","doi-asserted-by":"crossref","DOI":"10.1007\/978-0-387-84858-7","volume-title":"The Elements of Statistical Learning","author":"Hastie","year":"2009"},{"issue":"1","key":"2020110612372696500_ocw028-B36","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v033.i01","article-title":"Regularization Paths for Generalized Linear Models via Coordinate Descent","volume":"33","author":"Friedman","year":"2010","journal-title":"J Stat 
Softw."},{"key":"2020110612372696500_ocw028-B37","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J Machine Learning Res."},{"key":"2020110612372696500_ocw028-B38","doi-asserted-by":"crossref","DOI":"10.1093\/jamia\/ocv034","article-title":"Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources","author":"Yu","year":"2015","journal-title":"J Am Med Inform Assoc."},{"issue":"2","key":"2020110612372696500_ocw028-B39","doi-asserted-by":"crossref","first-page":"219","DOI":"10.1136\/amiajnl-2011-000597","article-title":"Impact of data fragmentation across healthcare centers on the accuracy of a high-throughput clinical phenotyping algorithm for specifying subjects with type 2 diabetes mellitus","volume":"19","author":"Wei","year":"2012","journal-title":"J Am Med Inform Assoc."},{"key":"2020110612372696500_ocw028-B40","first-page":"1","article-title":"The U.S. 
Food and Drug Administration's Mini-Sentinel program: status and direction","volume":"21 Suppl 1","author":"Platt","year":"2012","journal-title":"Pharmacoepidemiol Drug Saf."},{"issue":"e2","key":"2020110612372696500_ocw028-B41","doi-asserted-by":"crossref","first-page":"e341","DOI":"10.1136\/amiajnl-2013-001939","article-title":"Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium","volume":"20","author":"Pathak","year":"2013","journal-title":"J Am Med Inform Assoc."},{"key":"2020110612372696500_ocw028-B42","volume-title":"Aphrodite","author":"OHDSI","year":"2015"},{"key":"2020110612372696500_ocw028-B43","volume-title":"Observational Health Data Sciences and Informatics","author":"OHDSI"},{"issue":"12","key":"2020110612372696500_ocw028-B44","doi-asserted-by":"crossref","first-page":"e1002823","DOI":"10.1371\/journal.pcbi.1002823","article-title":"Chapter 13: Mining electronic health records in the genomics era","volume":"8","author":"Denny","year":"2012","journal-title":"PLoS Comput Biol."},{"key":"2020110612372696500_ocw028-B45","first-page":"132","article-title":"Personalized Predictive Modeling and Risk Factor Identification using Patient Similarity","volume":"2015","author":"Ng","year":"2015","journal-title":"AMIA Jt Summits Transl Sci Proc."},{"issue":"1","key":"2020110612372696500_ocw028-B46","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1260\/2040-2295.2.1.97","article-title":"A clinical database-driven approach to decision support: Predicting mortality among patients with acute kidney injury","volume":"2","author":"Celi","year":"2011","journal-title":"J Healthcare Engineering."},{"issue":"1","key":"2020110612372696500_ocw028-B47","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1136\/amiajnl-2012-001145","article-title":"Next-generation phenotyping of electronic health records","volume":"20","author":"Hripcsak","year":"2013","journal-title":"J Am Med Inform 
Assoc."},{"issue":"6","key":"2020110612372696500_ocw028-B48","doi-asserted-by":"crossref","first-page":"696","DOI":"10.1136\/jamia.2010.003228","article-title":"Biomedical negation scope detection with conditional random fields","volume":"17","author":"Agarwal","year":"2010","journal-title":"J Am Med Inform Assoc."}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/jamia\/article\/23\/6\/1166\/2399304","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/jamia\/article\/23\/6\/1166\/2399304","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,6,22]],"date-time":"2022-06-22T08:44:19Z","timestamp":1655887459000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/23\/6\/1166\/2399304"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,5,12]]},"references-count":48,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2016,5,12]]},"published-print":{"date-parts":[[2016,11,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocw028","relation":{},"ISSN":["1527-974X","1067-5027"],"issn-type":[{"value":"1527-974X","type":"electronic"},{"value":"1067-5027","type":"print"}],"subject":[],"published-other":{"date-parts":[[2016,11]]},"published":{"date-parts":[[2016,5,12]]}}}