{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T17:20:27Z","timestamp":1781284827429,"version":"3.54.1"},"reference-count":12,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2021,10,4]],"date-time":"2021-10-04T00:00:00Z","timestamp":1633305600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,10,4]],"date-time":"2021-10-04T00:00:00Z","timestamp":1633305600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Digit Imaging"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>With vast interest in machine learning applications, more investigators are proposing to assemble large datasets for machine learning applications. We aim to delineate multiple possible roadblocks to exam retrieval that may present themselves and lead to significant time delays. This HIPAA-compliant, institutional review board\u2013approved, retrospective clinical study required identification and retrieval of all outpatient and emergency patients undergoing abdominal and pelvic computed tomography (CT) at three affiliated hospitals in the year 2012. If a patient had multiple abdominal CT exams, the first exam was selected for retrieval (<jats:italic>n<\/jats:italic>=23,186). Our experience in attempting to retrieve 23,186 abdominal CT exams yielded 22,852 valid CT abdomen\/pelvis exams and identified four major categories of challenges when retrieving large datasets: cohort selection and processing, retrieving DICOM exam files from PACS, data storage, and non-recoverable failures. 
The retrieval took 3 months of project time and at minimum 300 person-hours of time between the primary investigator (a radiologist), a data scientist, and a software engineer. Exam selection and retrieval may take significantly longer than planned. We share our experience so that other investigators can anticipate and plan for these challenges. We also hope to help institutions better understand the demands that may be placed on their infrastructure by large-scale medical imaging machine learning projects.<\/jats:p>","DOI":"10.1007\/s10278-021-00505-7","type":"journal-article","created":{"date-parts":[[2021,10,5]],"date-time":"2021-10-05T06:49:47Z","timestamp":1633416587000},"page":"1424-1429","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":19,"title":["The Trials and Tribulations of Assembling Large Medical Imaging Datasets for Machine Learning Applications"],"prefix":"10.1007","volume":"34","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7037-433X","authenticated-orcid":false,"given":"Kirti","family":"Magudia","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Christopher P.","family":"Bridge","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Katherine P.","family":"Andriole","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Michael H.","family":"Rosenthal","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2021,10,4]]},"reference":[{"key":"505_CR1","doi-asserted-by":"publisher","unstructured":"Soffer S, Ben-Cohen A, Shimon O, Amitai MM, Greenspan H, Klang E. Convolutional neural networks for radiologic images: a radiologist\u2019s guide. Radiology. NLM (Medline); 2019;290(3):590\u2013606. https:\/\/doi.org\/10.1148\/radiol.2018180547. 
Accessed June 29,\u00a02020.","DOI":"10.1148\/radiol.2018180547"},{"key":"505_CR2","doi-asserted-by":"crossref","unstructured":"Saba L, Biswas M, Kuppili V, et al. The present and future of deep learning in radiology. Eur. J. Radiol. Elsevier Ireland Ltd; 2019. p. 14\u201324.","DOI":"10.1016\/j.ejrad.2019.02.038"},{"key":"505_CR3","doi-asserted-by":"publisher","unstructured":"Willemink MJ, Koszek WA, Hardell C, et al. Preparing medical imaging data for machine learning. Radiology. Radiological Society of North America Inc.; 2020;295(1):4\u201315.\u00a0https:\/\/doi.org\/10.1148\/radiol.2020192224. Accessed June 26,\u00a02020.","DOI":"10.1148\/radiol.2020192224"},{"key":"505_CR4","doi-asserted-by":"crossref","unstructured":"Armato SG, Huisman H, Drukker K, et al. PROSTATEx Challenges for computerized classification of prostate lesions from multiparametric magnetic resonance images. J Med Imaging. International Society for Optics and Photonics; 2018;5(04):1.\u00a0https:\/\/www.spiedigitallibrary.org\/journals\/journal-of-medical-imaging\/volume-5\/issue-04\/044501\/PROSTATEx-Challenges-for-computerized-classification-of-prostate-lesions-from-multiparametric\/10.1117\/1.JMI.5.4.044501.full. Accessed November 19,\u00a02018.","DOI":"10.1117\/1.JMI.5.4.044501"},{"key":"505_CR5","doi-asserted-by":"publisher","unstructured":"Flanders AE, Prevedello LM, Shih G, et al. Construction of a machine learning dataset through collaboration: the RSNA 2019 Brain CT Hemorrhage Challenge. Radiol Artif Intell. Radiological Society of North America (RSNA); 2020;2(3):e190211.\u00a0https:\/\/doi.org\/10.1148\/ryai.2020190211. Accessed July 3,\u00a02020.","DOI":"10.1148\/ryai.2020190211"},{"key":"505_CR6","doi-asserted-by":"publisher","unstructured":"Shih G, Wu CC, Halabi SS, et al. Augmenting the National Institutes of Health Chest Radiograph Dataset with expert annotations of possible pneumonia. Radiol Artif Intell. 
Radiological Society of North America (RSNA); 2019;1(1):e180041.\u00a0https:\/\/doi.org\/10.1148\/ryai.2019180041. Accessed July 3,\u00a02020.","DOI":"10.1148\/ryai.2019180041"},{"key":"505_CR7","unstructured":"Kaggle. Find Open Datasets and Machine Learning Projects. https:\/\/www.kaggle.com\/datasets. Accessed April 29,\u00a02021."},{"key":"505_CR8","unstructured":"The Cancer Imaging Archive. Welcome to The Cancer Imaging Archive. 2021.\u00a0https:\/\/www.cancerimagingarchive.net\/. Accessed April 29,\u00a02021."},{"key":"505_CR9","doi-asserted-by":"publisher","unstructured":"Langlotz CP, Allen B, Erickson BJ, et al. A roadmap for foundational research on artificial intelligence in medical imaging: from the 2018 NIH\/RSNA\/ACR\/The Academy workshop. Radiology. Radiological Society of North America Inc.; 2019;291(3):781\u2013791.\u00a0https:\/\/doi.org\/10.1148\/radiol.2019190613. Accessed April 29,\u00a02021.","DOI":"10.1148\/radiol.2019190613"},{"key":"505_CR10","doi-asserted-by":"crossref","unstructured":"Khan SM, Liu X, Nath S, et al. A global review of publicly available datasets for ophthalmological imaging: barriers to access, usability, and generalisability. Lancet Digit. Heal. Elsevier Ltd; 2021. p. e51\u2013e66.\u00a0www.thelancet.com\/digital-health. Accessed April 29,\u00a02021.","DOI":"10.1016\/S2589-7500(20)30240-5"},{"key":"505_CR11","doi-asserted-by":"crossref","unstructured":"Moreno-Torres JG, Raeder T, Alaiz-Rodr\u00edguez R, Chawla N V., Herrera F. A unifying view on dataset shift in classification. Pattern Recognit. Elsevier Ltd; 2012;45(1):521\u2013530.","DOI":"10.1016\/j.patcog.2011.06.019"},{"key":"505_CR12","doi-asserted-by":"publisher","unstructured":"Yu AC, Eng J. One algorithm may not fit all: how selection bias affects machine learning performance. RadioGraphics. Radiological Society of North America Inc.; 2020;40(7):1932\u20131937.\u00a0https:\/\/doi.org\/10.1148\/rg.2020200040. 
Accessed April 29,\u00a02021.","DOI":"10.1148\/rg.2020200040"}],"container-title":["Journal of Digital Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10278-021-00505-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10278-021-00505-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10278-021-00505-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,12,13]],"date-time":"2021-12-13T17:48:35Z","timestamp":1639417715000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10278-021-00505-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,4]]},"references-count":12,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["505"],"URL":"https:\/\/doi.org\/10.1007\/s10278-021-00505-7","relation":{},"ISSN":["0897-1889","1618-727X"],"issn-type":[{"value":"0897-1889","type":"print"},{"value":"1618-727X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,10,4]]},"assertion":[{"value":"18 December 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 April 2021","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 August 2021","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 October 2021","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that they had full access to this manuscript and take complete 
responsibility for the integrity and the accuracy of the submitted manuscript.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of Interest"}}]}}