{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T14:18:32Z","timestamp":1758637112460,"version":"3.44.0"},"reference-count":22,"publisher":"Oxford University Press (OUP)","issue":"10","license":[{"start":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T00:00:00Z","timestamp":1753315200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"funder":[{"name":"The NIDDK-CR Data Centric Challenge"},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100021952","name":"Office of Data Science Strategy","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100021952","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000062","name":"National Institute of Diabetes and Digestive and Kidney Diseases","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000062","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100021952","name":"Office of Data Science Strategy","doi-asserted-by":"publisher","award":["75N94021D00001\/75N94021DF00001"],"award-info":[{"award-number":["75N94021D00001\/75N94021DF00001"]}],"id":[{"id":"10.13039\/100021952","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,10,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objectives<\/jats:title>\n                  <jats:p>The success of artificial intelligence (AI) and machine learning (ML) approaches in biomedical research depends on the quality of the underlying data. The National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Data Centric Challenge was designed to address the challenge of making raw clinical research data AI ready, with a focus on type 1 diabetes studies available in the NIDDK Central Repository (NIDDK-CR). This paper aims to present a structured methodology for enhancing the AI readiness of clinical datasets.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>We detail a systematic approach for data aggregation and preprocessing, including binning continuous data, processing text features, managing missing values, and encoding for categorical variables while maintaining the data integrity and compatibility with ML algorithms.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We applied the proposed methodology to transform raw clinical data from type 1 diabetes studies in the NIDDK-CR into a structured, AI-ready dataset. The evaluation process validated the effectiveness of our AI-readiness enhancement steps and explored the potential use cases in type 1 diabetes research.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Discussion<\/jats:title>\n                  <jats:p>The methodology discussed in this paper will serve as guidance for preparing data for AI-driven clinical research, with the resulting AI-ready data to serve as a training tool for building and improving AI\/ML model performance.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Conclusion<\/jats:title>\n                  <jats:p>We present a generalizable framework for preparing clinical research data for AI applications. The resulting datasets lay a strong foundation for downstream AI\/ML applications, setting the stage for a new era of data-driven discoveries.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocaf114","type":"journal-article","created":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T19:10:08Z","timestamp":1753384208000},"page":"1609-1616","source":"Crossref","is-referenced-by-count":2,"title":["Preparing clinical research data for artificial intelligence readiness: insights from the National Institute of Diabetes and Digestive and Kidney Diseases data centric challenge"],"prefix":"10.1093","volume":"32","author":[{"given":"Marcin J","family":"Domagalski","sequence":"first","affiliation":[{"name":"Health Analytics, Research and Technology (HART), ICF , Rockville, MD 20850,","place":["United States"]}]},{"given":"Yin","family":"Lu","sequence":"additional","affiliation":[{"name":"Health Analytics, Research and Technology (HART), ICF , Rockville, MD 20850,","place":["United States"]}]},{"given":"Alexander","family":"Pilozzi","sequence":"additional","affiliation":[{"name":"Health Analytics, Research and Technology (HART), ICF , Rockville, MD 20850,","place":["United States"]}]},{"given":"Alicia","family":"Williamson","sequence":"additional","affiliation":[{"name":"Health Analytics, Research and Technology (HART), ICF , Rockville, MD 20850,","place":["United States"]}]},{"given":"Padmini","family":"Chilappagari","sequence":"additional","affiliation":[{"name":"Health Analytics, Research and Technology (HART), ICF , Rockville, MD 20850,","place":["United States"]}]},{"given":"Emma","family":"Luker","sequence":"additional","affiliation":[{"name":"Health and Life Sciences, Booz Allen Hamilton, Inc. , McLean, VA 22102,","place":["United States"]}]},{"given":"Courtney D","family":"Shelley","sequence":"additional","affiliation":[{"name":"Health and Life Sciences, Booz Allen Hamilton, Inc. , McLean, VA 22102,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-6627-6259","authenticated-orcid":false,"given":"Anya","family":"Dabic","sequence":"additional","affiliation":[{"name":"Health and Life Sciences, Booz Allen Hamilton, Inc. , McLean, VA 22102,","place":["United States"]}]},{"given":"Michael A","family":"Keller","sequence":"additional","affiliation":[{"name":"Health and Life Sciences, Booz Allen Hamilton, Inc. , McLean, VA 22102,","place":["United States"]}]},{"given":"Rebecca M","family":"Rodriguez","sequence":"additional","affiliation":[{"name":"National Institute of Diabetes and Digestive and Kidney Diseases, NIH , Bethesda, MD 20892,","place":["United States"]}]},{"given":"Sharon","family":"Lawlor","sequence":"additional","affiliation":[{"name":"National Institute of Diabetes and Digestive and Kidney Diseases, NIH , Bethesda, MD 20892,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6765-0401","authenticated-orcid":false,"given":"Ratna R","family":"Thangudu","sequence":"additional","affiliation":[{"name":"Health Analytics, Research and Technology (HART), ICF , Rockville, MD 20850,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2025,7,24]]},"reference":[{"key":"2025092208422048500_ocaf114-B1","doi-asserted-by":"publisher","first-page":"e2345892","DOI":"10.1001\/jamanetworkopen.2023.45892","article-title":"Perceptions of data set experts on important characteristics of health data sets ready for machine learning","volume":"6","author":"Ng","year":"2023","journal-title":"JAMA Netw Open"},{"author":"Polyzotis","key":"2025092208422048500_ocaf114-B2"},{"key":"2025092208422048500_ocaf114-B3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1196\/annals.1447.062","article-title":"The environmental determinants of diabetes in the young (TEDDY) study","volume":"1150","author":"TEDDY Study Group","year":"2008","journal-title":"Ann N Y Acad Sci"},{"key":"2025092208422048500_ocaf114-B4","doi-asserted-by":"publisher","first-page":"136","DOI":"10.1007\/s11892-018-1113-2","article-title":"The environmental determinants of diabetes in the young (TEDDY) study: 2018 update","volume":"18","author":"Rewers","year":"2018","journal-title":"Curr Diab Rep"},{"year":"2025","author":"Krischer","key":"2025092208422048500_ocaf114-B5"},{"year":"2016","author":"Rubinsteyn","key":"2025092208422048500_ocaf114-B6"},{"key":"2025092208422048500_ocaf114-B7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v045.i03","article-title":"MICE: multivariate imputation by chained equations in R","volume":"45","author":"van Buuren","year":"2011","journal-title":"J Stat Softw"},{"key":"2025092208422048500_ocaf114-B8","doi-asserted-by":"publisher","first-page":"1355","DOI":"10.1056\/NEJMsr1203730","article-title":"The prevention and treatment of missing data in clinical trials","volume":"367","author":"Little","year":"2012","journal-title":"N Engl J Med"},{"key":"2025092208422048500_ocaf114-B9","doi-asserted-by":"publisher","first-page":"1798","DOI":"10.1109\/TPAMI.2013.50","article-title":"Representation learning: a review and new perspectives","volume":"35","author":"Bengio","year":"2013","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2025092208422048500_ocaf114-B10","doi-asserted-by":"publisher","first-page":"850611","DOI":"10.3389\/fdata.2022.850611","article-title":"A survey of data quality measurement and monitoring tools","volume":"5","author":"Ehrlinger","year":"2022","journal-title":"Front Big Data"},{"year":"2021","author":"R Core Team","key":"2025092208422048500_ocaf114-B11"},{"key":"2025092208422048500_ocaf114-B12","doi-asserted-by":"publisher","first-page":"102587","DOI":"10.1016\/j.artmed.2023.102587","article-title":"Handling missing values in healthcare data: a systematic review of deep learning-based imputation techniques","volume":"142","author":"Liu","year":"2023","journal-title":"Artif Intell Med"},{"key":"2025092208422048500_ocaf114-B13","doi-asserted-by":"publisher","first-page":"e59587","DOI":"10.2196\/59587","article-title":"Data preprocessing techniques for AI and machine learning readiness: scoping review of wearable sensor data in cancer care","volume":"12","author":"Ortiz","year":"2024","journal-title":"JMIR Mhealth Uhealth"},{"first-page":"1","year":"2024","author":"Hiniduma","key":"2025092208422048500_ocaf114-B14"},{"author":"Gon\u00e7alves","key":"2025092208422048500_ocaf114-B15","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-21348-0_10"},{"key":"2025092208422048500_ocaf114-B16","doi-asserted-by":"publisher","first-page":"160018","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR Guiding Principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci Data"},{"author":"Clark","key":"2025092208422048500_ocaf114-B17","doi-asserted-by":"publisher","DOI":"10.1101\/2024.10.23.619844"},{"key":"2025092208422048500_ocaf114-B18","doi-asserted-by":"publisher","first-page":"1347","DOI":"10.1056\/NEJMra1814259","article-title":"Machine learning in medicine","volume":"380","author":"Rajkomar","year":"2019","journal-title":"N Engl J Med"},{"key":"2025092208422048500_ocaf114-B19","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1038\/s41591-018-0316-z","article-title":"A guide to deep learning in healthcare","volume":"25","author":"Esteva","year":"2019","journal-title":"Nat Med"},{"key":"2025092208422048500_ocaf114-B20","doi-asserted-by":"publisher","first-page":"34","DOI":"10.1016\/j.jbi.2017.11.011","article-title":"Clinical information extraction applications: a literature review","volume":"77","author":"Wang","year":"2018","journal-title":"J Biomed Inform"},{"key":"2025092208422048500_ocaf114-B21","doi-asserted-by":"publisher","first-page":"104654","DOI":"10.1016\/j.jbi.2024.104654","article-title":"A roadmap to artificial intelligence (AI): methods for designing and building AI ready data to promote fairness","volume":"154","author":"Kidwai-Khan","year":"2024","journal-title":"J Biomed Inform"},{"key":"2025092208422048500_ocaf114-B22","doi-asserted-by":"publisher","first-page":"88","DOI":"10.1186\/s12880-025-01614-3","article-title":"AI-ready rectal cancer MR imaging: a workflow for tumor detection and segmentation","volume":"25","author":"Selby","year":"2025","journal-title":"BMC Med Imaging"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/32\/10\/1609\/63842340\/ocaf114.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/32\/10\/1609\/63842340\/ocaf114.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,22]],"date-time":"2025-09-22T12:42:29Z","timestamp":1758544949000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/32\/10\/1609\/8211965"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,24]]},"references-count":22,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2025,7,24]]},"published-print":{"date-parts":[[2025,10,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocaf114","relation":{},"ISSN":["1067-5027","1527-974X"],"issn-type":[{"type":"print","value":"1067-5027"},{"type":"electronic","value":"1527-974X"}],"subject":[],"published-other":{"date-parts":[[2025,10]]},"published":{"date-parts":[[2025,7,24]]}}}