{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,21]],"date-time":"2026-03-21T19:31:16Z","timestamp":1774121476208,"version":"3.50.1"},"reference-count":59,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2024,3,2]],"date-time":"2024-03-02T00:00:00Z","timestamp":1709337600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["2234836"],"award-info":[{"award-number":["2234836"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["2234468"],"award-info":[{"award-number":["2234468"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["1903466"],"award-info":[{"award-number":["1903466"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The advancements in data acquisition, storage, and processing techniques have resulted in the rapid growth of heterogeneous medical data. Integrating radiological scans, histopathology images, and molecular information with clinical data is essential for developing a holistic understanding of the disease and optimizing treatment. The need for integrating data from multiple sources is further pronounced in complex diseases such as cancer for enabling precision medicine and personalized treatments. This work proposes Multimodal Integration of Oncology Data System (MINDS)\u2014a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources such as the Cancer Research Data Commons (CRDC) into an interconnected, patient-centric framework. MINDS consolidates over 41,000 cases from across repositories while achieving a high compression ratio relative to the 3.78 PB source data size. It offers sub-5-s query response times for interactive exploration. MINDS offers an interface for exploring relationships across data types and building cohorts for developing large-scale multimodal machine learning models. By harmonizing multimodal data, MINDS aims to potentially empower researchers with greater analytical ability to uncover diagnostic and prognostic insights and enable evidence-based personalized care. MINDS tracks granular end-to-end data provenance, ensuring reproducibility and transparency. The cloud-native architecture of MINDS can handle exponential data growth in a secure, cost-optimized manner while ensuring substantial storage optimization, replication avoidance, and dynamic access capabilities. Auto-scaling, access controls, and other mechanisms guarantee pipelines\u2019 scalability and security. MINDS overcomes the limitations of existing biomedical data silos via an interoperable metadata-driven approach that represents a pivotal step toward the future of oncology data integration.<\/jats:p>","DOI":"10.3390\/s24051634","type":"journal-article","created":{"date-parts":[[2024,3,4]],"date-time":"2024-03-04T04:36:21Z","timestamp":1709526981000},"page":"1634","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":27,"title":["Building Flexible, Scalable, and Machine Learning-Ready Multimodal Oncology Datasets"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7231-0487","authenticated-orcid":false,"given":"Aakash","family":"Tripathi","sequence":"first","affiliation":[{"name":"Department of Machine Learning, Moffitt Cancer Center & Research Institute, Tampa, FL 33612, USA"},{"name":"Department of Electrical Engineering, University of South Florida, Tampa, FL 33620, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6834-4710","authenticated-orcid":false,"given":"Asim","family":"Waqas","sequence":"additional","affiliation":[{"name":"Department of Machine Learning, Moffitt Cancer Center & Research Institute, Tampa, FL 33612, USA"},{"name":"Department of Electrical Engineering, University of South Florida, Tampa, FL 33620, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-9069-5500","authenticated-orcid":false,"given":"Kavya","family":"Venkatesan","sequence":"additional","affiliation":[{"name":"Department of Machine Learning, Moffitt Cancer Center & Research Institute, Tampa, FL 33612, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2014-3060","authenticated-orcid":false,"given":"Yasin","family":"Yilmaz","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, University of South Florida, Tampa, FL 33620, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8551-0090","authenticated-orcid":false,"given":"Ghulam","family":"Rasool","sequence":"additional","affiliation":[{"name":"Department of Machine Learning, Moffitt Cancer Center & Research Institute, Tampa, FL 33612, USA"},{"name":"Department of Electrical Engineering, University of South Florida, Tampa, FL 33620, USA"},{"name":"Department of Neuro-Oncology, Moffitt Cancer Center & Research Institute, Tampa, FL 33612, USA"},{"name":"Department of Oncologic Sciences, University of South Florida, Tampa, FL 33612, USA"}]}],"member":"1968","published-online":{"date-parts":[[2024,3,2]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"114","DOI":"10.1038\/s41568-021-00408-3","article-title":"Harnessing multimodal data integration to advance precision oncology","volume":"22","author":"Boehm","year":"2021","journal-title":"Nat. Rev. Cancer"},{"key":"ref_2","unstructured":"Waqas, A., Dera, D., Rasool, G., Bouaynaya, N.C., and Fathallah-Shaykh, H.M. (2021). Deep Learning for Biomedical Data Analysis, Springer."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"340","DOI":"10.1038\/s42256-023-00624-6","article-title":"Multimodal learning with graphs","volume":"5","author":"Ektefaie","year":"2023","journal-title":"Nat. Mach. Intell."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1095","DOI":"10.1016\/j.ccell.2022.09.012","article-title":"Artificial intelligence for multimodal data integration in oncology","volume":"40","author":"Lipkova","year":"2022","journal-title":"Cancer Cell"},{"key":"ref_5","unstructured":"Waqas, A., Tripathi, A., Ramachandran, R.P., Stewart, P., and Rasool, G. (2023). Multimodal Data Integration for Oncology in the Era of Deep Neural Networks: A Review. arXiv, Available online: https:\/\/arxiv.org\/abs\/2303.06471."},{"key":"ref_6","first-page":"5","article-title":"Moffitt Cancer Center: Why we are building the first machine learning department in oncology","volume":"47","author":"Rollison","year":"2021","journal-title":"Cancer Lett."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1193","DOI":"10.1109\/JBHI.2015.2450362","article-title":"Big Data for Health","volume":"19","author":"Poon","year":"2015","journal-title":"IEEE J. Biomed. Health Inform."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"168","DOI":"10.1016\/j.soncn.2018.03.008","article-title":"The Rise of Big Data in Oncology","volume":"34","author":"Fessele","year":"2018","journal-title":"Semin. Oncol. Nurs."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Xu, P., Zhu, X., and Clifton, D.A. (2023). Multimodal Learning with Transformers: A Survey. arXiv.","DOI":"10.1109\/TPAMI.2023.3275156"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"100255","DOI":"10.1016\/j.labinv.2023.100255","article-title":"Revolutionizing Digital Pathology with the Power of Generative Artificial Intelligence and Foundation Models","volume":"103","author":"Waqas","year":"2023","journal-title":"Lab. Investig."},{"key":"ref_11","unstructured":"(2023, September 18). Common Crawl. Available online: https:\/\/commoncrawl.org\/."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Bote-Curiel, L., Mu\u00f1oz-Romero, S., Gerrero-Curieses, A., and Rojo-\u00c1lvarez, J.L. (2019). Deep Learning and Big Data in Healthcare: A Double Review for Critical Beginners. Appl. Sci., 9.","DOI":"10.3390\/app9112331"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Khan, M.A., Karim, M.R., and Kim, Y. (2018). A Two-Stage Big Data Analytics Framework with Real World Applications Using Spark Machine Learning and Long Short-Term Memory Network. Symmetry, 10.","DOI":"10.3390\/sym10100485"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"919046","DOI":"10.3389\/fmedt.2022.919046","article-title":"Failure detection in deep neural networks for medical imaging","volume":"4","author":"Ahmed","year":"2022","journal-title":"Front. Med. Technol."},{"key":"ref_15","first-page":"882","article-title":"TRustworthy Uncertainty Propagation for Sequential Time-Series Analysis in RNNs","volume":"36","author":"Dera","year":"2023","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"46","DOI":"10.1038\/s44172-022-00043-2","article-title":"Exploring Robust Architectures for Deep Artificial Neural Networks","volume":"1","author":"Waqas","year":"2022","journal-title":"Commun. Eng."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Benedum, C.M., Sondhi, A., Fidyk, E., Cohen, A.B., Nemeth, S., Adamson, B., Est\u00e9vez, M., and Bozkurt, S. (2023). Replication of Real-World Evidence in Oncology Using Electronic Health Record Data Extracted by Machine Learning. Cancers, 15.","DOI":"10.3390\/cancers15061853"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Specht, D.S., Waqas, A., Rasool, G., Clifford, C., and Bouaynaya, N. (2021). Intelligent Helipad Detection and (Grad-Cam) Estimation Using Satellite Imagery. Transp. Res. Board, TRBAM-21-01973. Available online: https:\/\/annualmeeting.mytrb.org\/OnlineProgram\/Details\/15715.","DOI":"10.4050\/F-0077-2021-16856"},{"key":"ref_19","unstructured":"Congress, U.S. (2023, December 01). Health Insurance Portability and Accountability Act of 1996, Available online: https:\/\/www.govinfo.gov\/content\/pkg\/PLAW-104publ191\/pdf\/PLAW-104publ191.pdf."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Oh, S.R., Seo, Y.D., Lee, E., and Kim, Y.G. (2021). A comprehensive survey on security and privacy for electronic health data. Int. J. Environ. Res. Public Health, 18.","DOI":"10.3390\/ijerph18189668"},{"key":"ref_21","unstructured":"National Cancer Institute (2023, June 18). CCG\u2019s Genome Characterization Pipeline, Available online: https:\/\/www.cancer.gov\/ccg\/research\/genome-characterization-pipeline."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1109","DOI":"10.1056\/NEJMp1607591","article-title":"Toward a shared vision for cancer genomic data","volume":"375","author":"Grossman","year":"2016","journal-title":"N. Engl. J. Med."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1045","DOI":"10.1007\/s10278-013-9622-7","article-title":"The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository","volume":"26","author":"Clark","year":"2013","journal-title":"J. Digit. Imaging"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Hinkson, I.V., Davidsen, T.M., Klemm, J.D., Chandramouliswaran, I., Kerlavage, A.R., and Kibbe, W.A. (2017). A Comprehensive Infrastructure for Big Data in Cancer Research: Accelerating Cancer Research and Precision Medicine. Front. Cell Dev. Biol., 5.","DOI":"10.3389\/fcell.2017.00108"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"936","DOI":"10.1093\/bib\/bbz044","article-title":"Implementing the FAIR Data Principles in precision oncology: Review of supporting initiatives","volume":"21","author":"Vesteghem","year":"2020","journal-title":"Brief. Bioinform."},{"key":"ref_26","first-page":"330","article-title":"The cancer biomedical informatics grid (caBIG\u2122): Infrastructure and applications for a worldwide research community","volume":"1","author":"Kuhn","year":"2007","journal-title":"Medinfo"},{"key":"ref_27","first-page":"96","article-title":"tranSMART: An open source knowledge management and high content data analytics platform","volume":"2014","author":"Scheufele","year":"2014","journal-title":"AMIA Summits Transl. Sci. Proc."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"124","DOI":"10.1136\/jamia.2009.000893","article-title":"Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2)","volume":"17","author":"Murphy","year":"2010","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"4536","DOI":"10.1016\/j.csbj.2023.09.014","article-title":"Multimodal analysis and the oncology patient: Creating a hospital system for integrated diagnostics and discovery","volume":"21","author":"Messiou","year":"2023","journal-title":"Comput. Struct. Biotechnol. J."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"769582","DOI":"10.3389\/frai.2021.769582","article-title":"The ReIMAGINE multimodal warehouse: Using artificial intelligence for accurate risk stratification of prostate cancer","volume":"4","author":"Santaolalla","year":"2021","journal-title":"Front. Artif. Intell."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Fedorov, A., Longabaugh, W., Pot, D., Clunie, D., Pieper, S., Lewis, R., Aerts, H., Homeyer, A., Herrmann, M., and Wagner, U. (2021). NCI Imaging Data Commons. Int. J. Radiat. Oncol. Biol. Phys., 111.","DOI":"10.1016\/j.ijrobp.2021.07.495"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"LB-242","DOI":"10.1158\/1538-7445.AM2020-LB-242","article-title":"Abstract LB-242: Proteomic Data Commons: A resource for proteogenomic analysis","volume":"80","author":"Thangudu","year":"2020","journal-title":"Cancer Res."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"493","DOI":"10.1186\/s12967-021-03147-z","article-title":"From biobank and data silos into a data commons: Convergence to support translational medicine","volume":"19","author":"Asiimwe","year":"2021","journal-title":"J. Transl. Med."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"525","DOI":"10.1038\/s41437-020-0303-2","article-title":"Big data in digital healthcare: Lessons learnt and recommendations for general practice","volume":"124","author":"Agrawal","year":"2020","journal-title":"Heredity"},{"key":"ref_35","unstructured":"Lecaros, J.A. (2023). Handbook of Bioethical Decisions. Volume I: Decisions at the Bench, Springer."},{"key":"ref_36","unstructured":"(2023, June 15). Cancer Data Aggregator, Available online: https:\/\/datacommons.cancer.gov\/cancer-data-aggregator."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"401","DOI":"10.1158\/2159-8290.CD-12-0095","article-title":"The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data","volume":"2","author":"Cerami","year":"2012","journal-title":"Cancer Discov."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"pl1","DOI":"10.1126\/scisignal.2004088","article-title":"Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal","volume":"6","author":"Gao","year":"2013","journal-title":"Sci. Signal."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"8","DOI":"10.1016\/j.oraloncology.2019.09.003","article-title":"The potential use of big data in oncology","volume":"98","author":"Willems","year":"2019","journal-title":"Oral Oncol."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Nambiar, A., and Mundra, D. (2022). An Overview of Data Warehouse and Data Lake in Modern Enterprise Data Management. Big Data Cogn. Comput., 6.","DOI":"10.3390\/bdcc6040132"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"675","DOI":"10.1038\/s41587-020-0546-8","article-title":"Visualizing and interpreting cancer genomics data via the Xena platform","volume":"38","author":"Goldman","year":"2020","journal-title":"Nat. Biotechnol."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"552","DOI":"10.1136\/jamia.2001.0080552","article-title":"The HL7 clinical document architecture","volume":"8","author":"Dolin","year":"2001","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_43","unstructured":"(2023, December 01). HL7 FHIR. Available online: https:\/\/www.hl7.org\/fhir\/."},{"key":"ref_44","unstructured":"(2023, December 01). Clinical Data Interchange Standards Consortium. Available online: https:\/\/www.cdisc.org\/."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"115","DOI":"10.4103\/2229-3485.111779","article-title":"Clinical data interchange standards consortium: A bridge to overcome data standardisation","volume":"4","author":"Babre","year":"2013","journal-title":"Perspect. Clin. Res."},{"key":"ref_46","unstructured":"(2023, December 01). Overview of SNOMED CT. National Library of Medicine, Available online: https:\/\/www.nlm.nih.gov\/healthit\/snomedct\/snomed_overview.html."},{"key":"ref_47","unstructured":"(2023, December 01). NCI Thesaurus, Available online: https:\/\/ncit.nci.nih.gov\/ncitbrowser\/."},{"key":"ref_48","unstructured":"(2023, March 01). Amazon Web Services. Amazon QuickSight. Available online: https:\/\/aws.amazon.com\/quicksight\/."},{"key":"ref_49","unstructured":"(2023, March 01). Amazon Web Services. Amazon S3. Available online: https:\/\/aws.amazon.com\/s3\/."},{"key":"ref_50","unstructured":"(2023, March 01). Amazon Web Services. AWS Lake Formation. Available online: https:\/\/aws.amazon.com\/lake-formation\/."},{"key":"ref_51","unstructured":"(2023, March 01). Amazon Web Services. Data Catalog and Crawlers in AWS Glue. Available online: https:\/\/docs.aws.amazon.com\/glue\/latest\/dg\/catalog-and-crawler.html."},{"key":"ref_52","unstructured":"(2023, August 07). Amazon Web Services. Serverless Computing\u2014AWS Lambda\u2014Amazon Web Services. Available online: https:\/\/aws.amazon.com\/lambda\/."},{"key":"ref_53","unstructured":"Amazon Web Services (2023, March 01). AWS Glue. Available online: https:\/\/aws.amazon.com\/glue\/."},{"key":"ref_54","unstructured":"Amazon Web Services (2023, March 01). Amazon Redshift. Available online: https:\/\/aws.amazon.com\/redshift\/."},{"key":"ref_55","unstructured":"Amazon Web Services (2023, March 01). Amazon Athena. Available online: https:\/\/aws.amazon.com\/athena\/."},{"key":"ref_56","unstructured":"Amazon Web Services (2023, August 07). Encryption at Rest. Available online: https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/security-server-side-encryption.html."},{"key":"ref_57","unstructured":"Amazon Web Services (2023, August 07). Security in AWS Glue. Available online: https:\/\/docs.aws.amazon.com\/glue\/latest\/dg\/security.html."},{"key":"ref_58","unstructured":"Amazon Web Services (2023, August 07). Amazon CloudWatch. Available online: https:\/\/aws.amazon.com\/cloudwatch\/."},{"key":"ref_59","unstructured":"(2023, November 28). Medical Imaging and Data Resource Center (MIDRIC). Available online: https:\/\/www.midrc.org\/."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/5\/1634\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:08:14Z","timestamp":1760105294000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/5\/1634"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,2]]},"references-count":59,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2024,3]]}},"alternative-id":["s24051634"],"URL":"https:\/\/doi.org\/10.3390\/s24051634","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,2]]}}}