{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T15:00:58Z","timestamp":1778166058288,"version":"3.51.4"},"reference-count":80,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,5,13]],"date-time":"2025-05-13T00:00:00Z","timestamp":1747094400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,13]],"date-time":"2025-05-13T00:00:00Z","timestamp":1747094400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>This perspective proposes adapting video-text generative AI to 3D medical imaging (CT\/MRI) and medical videos (endoscopy\/laparoscopy) by treating 3D images as videos. The approach leverages modern video models to analyze multiple sequences simultaneously and provide real-time AI assistance during procedures. The paper examines medical imaging\u2019s unique characteristics (synergistic information, metadata, and world model), outlines applications in automated reporting, case retrieval, and education, and addresses challenges of limited datasets, benchmarks, and specialized training.<\/jats:p>","DOI":"10.1038\/s41746-025-01649-4","type":"journal-article","created":{"date-parts":[[2025,5,13]],"date-time":"2025-05-13T17:20:53Z","timestamp":1747156853000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":16,"title":["Multimodal generative AI for interpreting 3D medical images and videos"],"prefix":"10.1038","volume":"8","author":[{"given":"Jung-Oh","family":"Lee","sequence":"first","affiliation":[]},{"given":"Hong-Yu","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Tyler M.","family":"Berzin","sequence":"additional","affiliation":[]},{"given":"Daniel K.","family":"Sodickson","sequence":"additional","affiliation":[]},{"given":"Pranav","family":"Rajpurkar","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,5,13]]},"reference":[{"key":"1649_CR1","doi-asserted-by":"publisher","first-page":"e2117391","DOI":"10.1001\/jamanetworkopen.2021.17391","volume":"4","author":"EA Chi","year":"2021","unstructured":"Chi, E. A. et al. Development and validation of an artificial intelligence system to optimize clinician review of patient records. JAMA Netw. Open 4, e2117391 (2021).","journal-title":"JAMA Netw. Open"},{"key":"1649_CR2","doi-asserted-by":"publisher","first-page":"eaba4373","DOI":"10.1126\/scitranslmed.aba4373","volume":"13","author":"A Yala","year":"2021","unstructured":"Yala, A. et al. Toward robust mammography-based models for breast cancer risk. Sci. Transl. Med. 13, eaba4373 (2021).","journal-title":"Sci. Transl. Med."},{"key":"1649_CR3","doi-asserted-by":"publisher","first-page":"e2141096","DOI":"10.1001\/jamanetworkopen.2021.41096","volume":"4","author":"F Homayounieh","year":"2021","unstructured":"Homayounieh, F. et al. An artificial intelligence\u2013based chest X-ray model on human nodule detection accuracy from a multicenter study. JAMA Netw. Open 4, e2141096 (2021).","journal-title":"JAMA Netw. 
Open"},{"key":"1649_CR4","doi-asserted-by":"publisher","first-page":"1813","DOI":"10.1136\/gutjnl-2018-317500","volume":"68","author":"P Wang","year":"2019","unstructured":"Wang, P. et al. Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study. Gut 68, 1813\u20131819 (2019).","journal-title":"Gut"},{"key":"1649_CR5","unstructured":"Bommasani, R. et al. On the Opportunities and Risks of Foundation Models. Preprint at http:\/\/arxiv.org\/abs\/2108.07258 (2022)."},{"key":"1649_CR6","doi-asserted-by":"publisher","first-page":"13293","DOI":"10.1007\/s10462-023-10414-6","volume":"56","author":"G Rafiq","year":"2023","unstructured":"Rafiq, G., Rafiq, M. & Choi, G. S. Video description: A comprehensive survey of deep learning approaches. Artif. Intell. Rev. 56, 13293\u201313372 (2023).","journal-title":"Artif. Intell. Rev."},{"key":"1649_CR7","doi-asserted-by":"publisher","unstructured":"Antol, S. et al. VQA: Visual Question Answering. In 2015 IEEE International Conference on Computer Vision (ICCV) 2425\u20132433 (IEEE, Santiago, Chile, 2015). https:\/\/doi.org\/10.1109\/ICCV.2015.279.","DOI":"10.1109\/ICCV.2015.279"},{"key":"1649_CR8","doi-asserted-by":"publisher","unstructured":"Bain, M., Nagrani, A., Varol, G. & Zisserman, A. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. In 2021 IEEE\/CVF International Conference on Computer Vision (ICCV) 1708\u20131718 (IEEE, Montreal, QC, Canada, 2021). https:\/\/doi.org\/10.1109\/ICCV48922.2021.00175.","DOI":"10.1109\/ICCV48922.2021.00175"},{"key":"1649_CR9","unstructured":"Singer, U. et al. Make-A-Video: Text-to-Video Generation without Text-Video Data. In The Eleventh International Conference on Learning Representations (2023)."},{"key":"1649_CR10","unstructured":"Nguyen, P., Quach, K. G., Kitani, K. M. & Luu, K. Type-to-Track: Retrieve Any Object via Prompt-based Tracking. In Thirty-seventh Conference on Neural Information Processing Systems (2023)."},{"key":"1649_CR11","unstructured":"Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision. In Proc. 38th International Conference on Machine Learning (eds. Meila, M. & Zhang, T.) vol. 139 8748\u20138763 (PMLR, 2021)."},{"key":"1649_CR12","doi-asserted-by":"publisher","first-page":"293","DOI":"10.1016\/j.neucom.2022.07.028","volume":"508","author":"H Luo","year":"2022","unstructured":"Luo, H. et al. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing 508, 293\u2013304 (2022).","journal-title":"Neurocomputing"},{"key":"1649_CR13","doi-asserted-by":"publisher","unstructured":"Xu, H. et al. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. In Proc. 2021 Conference on Empirical Methods in Natural Language Processing 6787\u20136800 (Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021). https:\/\/doi.org\/10.18653\/v1\/2021.emnlp-main.544.","DOI":"10.18653\/v1\/2021.emnlp-main.544"},{"key":"1649_CR14","unstructured":"Zhu, B. et al. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. In The Twelfth International Conference on Learning Representations (2024)."},{"key":"1649_CR15","unstructured":"Chen, S. et al. VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset."},{"key":"1649_CR16","unstructured":"Liu, H., Yan, W., Zaharia, M. & Abbeel, P. 
World Model on Million-Length Video And Language With Blockwise RingAttention. In The Thirteenth International Conference on Learning Representations (2025)."},{"key":"1649_CR17","doi-asserted-by":"publisher","unstructured":"Zhang, H., Li, X. & Bing, L. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations 543\u2013553 (Association for Computational Linguistics, Singapore, 2023). https:\/\/doi.org\/10.18653\/v1\/2023.emnlp-demo.49.","DOI":"10.18653\/v1\/2023.emnlp-demo.49"},{"key":"1649_CR18","doi-asserted-by":"publisher","unstructured":"Lin, B. et al. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. In Proc. of the 2024 Conference on Empirical Methods in Natural Language Processing 5971\u20135984 (Association for Computational Linguistics, Miami, Florida, USA, 2024). https:\/\/doi.org\/10.18653\/v1\/2024.emnlp-main.342.","DOI":"10.18653\/v1\/2024.emnlp-main.342"},{"key":"1649_CR19","unstructured":"OpenAI. GPT-4o https:\/\/openai.com\/index\/hello-gpt-4o\/ (2024)."},{"key":"1649_CR20","unstructured":"Gemini Team et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Preprint at http:\/\/arxiv.org\/abs\/2403.05530 (2024)."},{"key":"1649_CR21","doi-asserted-by":"publisher","unstructured":"Wang, W. et al. Image as a Foreign Language: BEIT Pretraining for Vision and Vision-Language Tasks. In 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 19175\u201319186 (IEEE, Vancouver, BC, Canada, 2023). https:\/\/doi.org\/10.1109\/CVPR52729.2023.01838.","DOI":"10.1109\/CVPR52729.2023.01838"},{"key":"1649_CR22","unstructured":"Li, J., Li, D., Savarese, S. & Hoi, S. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. 40th International Conference on Machine Learning vol. 202 19730\u201319742 (JMLR.org, Honolulu, Hawaii, USA, 2023)."},{"key":"1649_CR23","doi-asserted-by":"publisher","unstructured":"He, K. et al. Masked Autoencoders Are Scalable Vision Learners. In 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 15979\u201315988 (IEEE, New Orleans, LA, USA, 2022). https:\/\/doi.org\/10.1109\/CVPR52688.2022.01553.","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"1649_CR24","unstructured":"Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proc. of the 37th International Conference on Machine Learning vol. 119 1597\u20131607 (JMLR.org, 2020)."},{"key":"1649_CR25","unstructured":"Wang, Y. et al. InternVideo: General Video Foundation Models via Generative and Discriminative Learning. Preprint at http:\/\/arxiv.org\/abs\/2212.03191 (2022)."},{"key":"1649_CR26","doi-asserted-by":"publisher","unstructured":"Zhao, Y., Misra, I., Kr\u00e4henb\u00fchl, P. & Girdhar, R. Learning Video Representations from Large Language Models. In 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 6586\u20136597 (IEEE, Vancouver, BC, Canada, 2023). https:\/\/doi.org\/10.1109\/CVPR52729.2023.00637.","DOI":"10.1109\/CVPR52729.2023.00637"},{"key":"1649_CR27","doi-asserted-by":"crossref","unstructured":"Yu, E. et al. Merlin: Empowering Multimodal LLMs with Foresight Minds. In Computer Vision \u2013 ECCV 2024 (eds. Leonardis, A. et al.) vol. 
15062 425\u2013443 (Springer Nature Switzerland, Cham, 2025).","DOI":"10.1007\/978-3-031-73235-5_24"},{"key":"1649_CR28","unstructured":"Liu, C. et al. T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training. Preprint at http:\/\/arxiv.org\/abs\/2312.01529 (2025)."},{"key":"1649_CR29","unstructured":"Yuan, K. et al. Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures. Preprint at http:\/\/arxiv.org\/abs\/2307.15220 (2024)."},{"key":"1649_CR30","unstructured":"National Electrical Manufacturers Association. Digital Imaging and Communications in Medicine (DICOM) Part 5: Data Structures and Encoding. (2024)."},{"key":"1649_CR31","doi-asserted-by":"crossref","unstructured":"Masoudi, S. et al. Quick guide on radiology image pre-processing for deep learning applications in prostate cancer research. J. Med. Imaging 8, (2021).","DOI":"10.1117\/1.JMI.8.1.010901"},{"key":"1649_CR32","doi-asserted-by":"publisher","first-page":"568","DOI":"10.1117\/1.1695563","volume":"9","author":"K Gono","year":"2004","unstructured":"Gono, K. et al. Appearance of enhanced tissue features in narrow-band endoscopic imaging. J. Biomed. Opt. 9, 568 (2004).","journal-title":"J. Biomed. Opt."},{"key":"1649_CR33","doi-asserted-by":"publisher","first-page":"8843","DOI":"10.1007\/s00464-022-09313-8","volume":"36","author":"Y Takei","year":"2022","unstructured":"Takei, Y. et al. New diagnostic strategy using narrow-band imaging (NBI) during laparoscopic surgery for patients with colorectal cancer. Surg. Endosc. 36, 8843\u20138855 (2022).","journal-title":"Surg. Endosc."},{"key":"1649_CR34","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-023-35564-z","volume":"13","author":"K Oka","year":"2023","unstructured":"Oka, K. et al. Red dichromatic imaging improves visibility of bleeding during gastric endoscopic submucosal dissection. Sci. Rep. 13, 8560 (2023).","journal-title":"Sci. Rep."},{"key":"1649_CR35","doi-asserted-by":"crossref","unstructured":"Barua, I. et al. Real-time artificial intelligence\u2013based optical diagnosis of neoplastic polyps during colonoscopy. NEJM Evid. 1, (2022).","DOI":"10.1056\/EVIDoa2200003"},{"key":"1649_CR36","doi-asserted-by":"publisher","first-page":"513","DOI":"10.1148\/rg.262055063","volume":"26","author":"R Bitar","year":"2006","unstructured":"Bitar, R. et al. MR pulse sequences: what every radiologist wants to know but is afraid to ask. RadioGraphics 26, 513\u2013537 (2006).","journal-title":"RadioGraphics"},{"key":"1649_CR37","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1007\/s00330-003-2097-z","volume":"13","author":"D Fleischmann","year":"2003","unstructured":"Fleischmann, D. Use of high-concentration contrast media in multiple-detector-row CT: principles and rationale. Eur. Radiol. 13, 14\u201320 (2003).","journal-title":"Eur. Radiol."},{"key":"1649_CR38","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1055\/a-1717-1391","volume":"54","author":"SW Van Der Merwe","year":"2022","unstructured":"Van Der Merwe, S. W. et al. Therapeutic endoscopic ultrasound: European Society of Gastrointestinal Endoscopy (ESGE) Guideline. Endoscopy 54, 185\u2013205 (2022).","journal-title":"Endoscopy"},{"key":"1649_CR39","doi-asserted-by":"publisher","first-page":"795","DOI":"10.1016\/j.gie.2014.11.019","volume":"81","author":"KV Chathadi","year":"2015","unstructured":"Chathadi, K. V. et al. The role of ERCP in benign diseases of the biliary tract. Gastrointest. Endosc. 81, 795\u2013803 (2015).","journal-title":"Gastrointest. 
Endosc."},{"key":"1649_CR40","doi-asserted-by":"publisher","first-page":"207","DOI":"10.2214\/ajr.175.1.1750207","volume":"175","author":"JR Petrella","year":"2000","unstructured":"Petrella, J. R. & Provenzale, J. M. MR Perfusion Imaging of the Brain: Techniques and Applications. Am. J. Roentgenol. 175, 207\u2013219 (2000).","journal-title":"Am. J. Roentgenol."},{"key":"1649_CR41","doi-asserted-by":"publisher","first-page":"128","DOI":"10.1016\/j.gie.2011.03.003","volume":"74","author":"RH Lee","year":"2011","unstructured":"Lee, R. H. et al. Quality of colonoscopy withdrawal technique and variability in adenoma detection rates (with videos). Gastrointest. Endosc. 74, 128\u2013134 (2011).","journal-title":"Gastrointest. Endosc."},{"key":"1649_CR42","first-page":"CD003677","volume":"2015","author":"JWM Aarts","year":"2015","unstructured":"Aarts, J. W. M. et al. Surgical approach to hysterectomy for benign gynaecological disease. Cochrane Database Syst. Rev. 2015, CD003677 (2015).","journal-title":"Cochrane Database Syst. Rev."},{"key":"1649_CR43","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1007\/s13244-012-0201-0","volume":"4","author":"S Juanpere","year":"2013","unstructured":"Juanpere, S. et al. A diagnostic approach to the mediastinal masses. Insights Imaging 4, 29\u201352 (2013).","journal-title":"Insights Imaging"},{"key":"1649_CR44","doi-asserted-by":"publisher","first-page":"596","DOI":"10.5009\/gnl19181","volume":"13","author":"RJ Huang","year":"2019","unstructured":"Huang, R. J., Choi, A. Y., Truong, C. D., Yeh, M. M. & Hwang, J. H. Diagnosis and Management of Gastric Intestinal Metaplasia: Current Status and Future Directions. Gut Liver 13, 596\u2013603 (2019).","journal-title":"Gut Liver"},{"key":"1649_CR45","doi-asserted-by":"publisher","unstructured":"Ha, D. & Schmidhuber, J. World Models. https:\/\/doi.org\/10.5281\/zenodo.1207631 (2018) .","DOI":"10.5281\/zenodo.1207631"},{"key":"1649_CR46","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-020-61055-6","volume":"10","author":"P Rajpurkar","year":"2020","unstructured":"Rajpurkar, P. et al. AppendiXNet: Deep Learning for Diagnosis of Appendicitis from A Small Dataset of CT Exams Using Video Pretraining. Sci. Rep. 10, 3958 (2020).","journal-title":"Sci. Rep."},{"key":"1649_CR47","unstructured":"Ke, A. et al. Video pretraining advances 3D deep learning on chest CT tasks. In Medical Imaging with Deep Learning (2023)."},{"key":"1649_CR48","unstructured":"Zunair, H., Rahman, A. & Mohammed, N. ViPTT-Net: Video pretraining of spatio-temporal model for tuberculosis type classification from chest CT scans. In Proc. Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st - to - 24th, 2021 (eds. Faggioli, G., Ferro, N., Joly, A., Maistro, M. & Piroi, F.) vol. 2936 1412\u20131421 (CEUR-WS.org, 2021)."},{"key":"1649_CR49","doi-asserted-by":"publisher","first-page":"e232178","DOI":"10.1148\/radiol.232178","volume":"311","author":"C Dai","year":"2024","unstructured":"Dai, C. et al. Deep Learning Assessment of Small Renal Masses at Contrast-enhanced Multiphase CT. Radiology 311, e232178 (2024).","journal-title":"Radiology"},{"key":"1649_CR50","unstructured":"Xu, H. et al. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video. In Proc. 40th International Conference on Machine Learning vol. 202 38728\u201338748 (JMLR.org, Honolulu, Hawaii, USA, 2023)."},{"key":"1649_CR51","unstructured":"Saab, K. et al. Capabilities of Gemini Models in Medicine. 
Preprint at http:\/\/arxiv.org\/abs\/2404.18416 (2024)."},{"key":"1649_CR52","doi-asserted-by":"publisher","first-page":"354","DOI":"10.1038\/s42256-024-00807-9","volume":"6","author":"S Pai","year":"2024","unstructured":"Pai, S. et al. Foundation model for cancer imaging biomarkers. Nat. Mach. Intell. 6, 354\u2013367 (2024).","journal-title":"Nat. Mach. Intell."},{"key":"1649_CR53","doi-asserted-by":"publisher","unstructured":"Zhu, W. et al. 3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2502.02779 (2025).","DOI":"10.48550\/arXiv.2502.02779"},{"key":"1649_CR54","doi-asserted-by":"crossref","unstructured":"Blankemeier, L. et al. Merlin: A Vision Language Foundation Model for 3D Computed Tomography. Preprint at http:\/\/arxiv.org\/abs\/2406.06512 (2024).","DOI":"10.21203\/rs.3.rs-4546309\/v1"},{"key":"1649_CR55","doi-asserted-by":"publisher","first-page":"e230024","DOI":"10.1148\/ryai.230024","volume":"5","author":"J Wasserthal","year":"2023","unstructured":"Wasserthal, J. et al. TotalSegmentator: Robust Segmentation of 104 Anatomic Structures in CT Images. Radiol. Artif. Intell. 5, e230024 (2023).","journal-title":"Radiol. Artif. Intell."},{"key":"1649_CR56","doi-asserted-by":"crossref","unstructured":"Hamamci, I. E., Er, S. & Menze, B. CT2Rep: Automated Radiology Report Generation for 3D Medical Imaging. In Proc. Medical Image Computing and Computer Assisted Intervention \u2013 MICCAI 2024 vol. LNCS 15012 (Springer Nature Switzerland, 2024).","DOI":"10.1007\/978-3-031-72390-2_45"},{"key":"1649_CR57","doi-asserted-by":"publisher","unstructured":"Krishna, R., Hata, K., Ren, F., Fei-Fei, L. & Niebles, J. C. Dense-Captioning Events in Videos. In 2017 IEEE International Conference on Computer Vision (ICCV) 706\u2013715 (IEEE, Venice, 2017). https:\/\/doi.org\/10.1109\/ICCV.2017.83.","DOI":"10.1109\/ICCV.2017.83"},{"key":"1649_CR58","first-page":"753","volume":"33","author":"M Matsubara","year":"2021","unstructured":"Matsubara, M. et al. Clinical significance of esophagogastroduodenoscopy in patients with esophageal motility disorders. Dig. Endosc. J. Jpn. Gastroenterol. Endosc. Soc. 33, 753\u2013760 (2021).","journal-title":"Dig. Endosc. J. Jpn. Gastroenterol. Endosc. Soc."},{"key":"1649_CR59","doi-asserted-by":"publisher","first-page":"1113","DOI":"10.3390\/diagnostics11061113","volume":"11","author":"Y-H Hsieh","year":"2021","unstructured":"Hsieh, Y.-H., Tang, C.-P., Tseng, C.-W., Lin, T.-L. & Leung, F. W. Computer-Aided Detection False Positives in Colonoscopy. Diagnostics 11, 1113 (2021).","journal-title":"Diagnostics"},{"key":"1649_CR60","doi-asserted-by":"publisher","first-page":"103624","DOI":"10.1016\/j.bspc.2022.103624","volume":"75","author":"N Goel","year":"2022","unstructured":"Goel, N., Kaur, S., Gunjan, D. & Mahapatra, S. J. Investigating the significance of color space for abnormality detection in wireless capsule endoscopy images. Biomed. Signal Process. Control 75, 103624 (2022).","journal-title":"Biomed. Signal Process. Control"},{"key":"1649_CR61","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1155\/2021\/5518948","volume":"2021","author":"T Sato","year":"2021","unstructured":"Sato, T. TXI: texture and color enhancement imaging for endoscopic image enhancement. J. Healthc. Eng. 2021, 1\u201311 (2021).","journal-title":"J. Healthc. 
Eng."},{"key":"1649_CR62","doi-asserted-by":"publisher","first-page":"974","DOI":"10.3390\/s23020974","volume":"23","author":"C Nie","year":"2023","unstructured":"Nie, C., Xu, C., Li, Z., Chu, L. & Hu, Y. Specular Reflections Detection and Removal for Endoscopic Images Based on Brightness Classification. Sensors 23, 974 (2023).","journal-title":"Sensors"},{"key":"1649_CR63","doi-asserted-by":"publisher","unstructured":"Yang, A. et al. Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning. In 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10714\u201310726 (IEEE, Vancouver, BC, Canada, 2023). https:\/\/doi.org\/10.1109\/CVPR52729.2023.01032.","DOI":"10.1109\/CVPR52729.2023.01032"},{"key":"1649_CR64","doi-asserted-by":"publisher","first-page":"181","DOI":"10.1016\/j.gie.2023.02.025","volume":"98","author":"L Zhang","year":"2023","unstructured":"Zhang, L. et al. Effect of a deep learning\u2013based automatic upper GI endoscopic reporting system: a randomized crossover study (with video). Gastrointest. Endosc. 98, 181\u2013190.e10 (2023).","journal-title":"Gastrointest. Endosc."},{"key":"1649_CR65","doi-asserted-by":"publisher","first-page":"957","DOI":"10.1001\/jamasurg.2024.1510","volume":"159","author":"E Yanik","year":"2024","unstructured":"Yanik, E., Schwaitzberg, S. & De, S. Deep learning for video-based assessment in surgery. JAMA Surg. 159, 957 (2024).","journal-title":"JAMA Surg."},{"key":"1649_CR66","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-021-00557-3","volume":"11","author":"Y Kumazu","year":"2021","unstructured":"Kumazu, Y. et al. Automated segmentation by deep learning of loose connective tissue fibers to define safe dissection planes in robot-assisted gastrectomy. Sci. Rep. 11, 21198 (2021).","journal-title":"Sci. Rep."},{"key":"1649_CR67","unstructured":"Peebles, B. et al. Video generation models as world simulators. https:\/\/openai.com\/research\/video-generation-models-as-world-simulators (2024)."},{"key":"1649_CR68","unstructured":"Google Deepmind. Veo. https:\/\/deepmind.google\/technologies\/veo\/ (2024)."},{"key":"1649_CR69","unstructured":"Snell, C. V., Lee, J., Xu, K. & Kumar, A. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning. In The Thirteenth International Conference on Learning Representations (2025)."},{"key":"1649_CR70","doi-asserted-by":"publisher","unstructured":"Xu, G. et al. LLaVA-CoT: Let Vision Language Models Reason Step-by-Step. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2411.10440 (2025).","DOI":"10.48550\/arXiv.2411.10440"},{"key":"1649_CR71","doi-asserted-by":"publisher","unstructured":"Sun, G. et al. video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2502.11775 (2025).","DOI":"10.48550\/arXiv.2502.11775"},{"key":"1649_CR72","unstructured":"Bai, F., Du, Y., Huang, T., Meng, M. Q.-H. & Zhao, B. M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models. Preprint at http:\/\/arxiv.org\/abs\/2404.00578 (2024)."},{"key":"1649_CR73","doi-asserted-by":"publisher","first-page":"960","DOI":"10.1016\/j.gie.2020.07.060","volume":"93","author":"M Misawa","year":"2021","unstructured":"Misawa, M. et al. Development of a computer-aided detection system for colonoscopy and a publicly accessible large colonoscopy video database (with video). Gastrointest. Endosc. 93, 960\u2013967.e3 (2021).","journal-title":"Gastrointest. 
Endosc."},{"key":"1649_CR74","unstructured":"Wang, Y. et al. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. In The Twelfth International Conference on Learning Representations (2024)."},{"key":"1649_CR75","doi-asserted-by":"publisher","unstructured":"Sun, W. et al. Bora: Biomedical Generalist Video Generation Model. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2407.08944 (2024).","DOI":"10.48550\/arXiv.2407.08944"},{"key":"1649_CR76","doi-asserted-by":"publisher","first-page":"3129","DOI":"10.1038\/s41591-024-03185-2","volume":"30","author":"K Zhang","year":"2024","unstructured":"Zhang, K. et al. A generalist vision\u2013language foundation model for diverse biomedical tasks. Nat. Med. 30, 3129\u20133141 (2024).","journal-title":"Nat. Med."},{"key":"1649_CR77","doi-asserted-by":"publisher","first-page":"172","DOI":"10.1038\/s41586-023-06291-2","volume":"620","author":"K Singhal","year":"2023","unstructured":"Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172\u2013180 (2023).","journal-title":"Nature"},{"key":"1649_CR78","doi-asserted-by":"crossref","unstructured":"Li, C. et al. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. In Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2023).","DOI":"10.32388\/VLXB6M"},{"key":"1649_CR79","unstructured":"Zhou, H.-Y., Adithan, S., Acosta, J. N., Topol, E. J. & Rajpurkar, P. A Generalist Learner for Multifaceted Medical Image Interpretation."},{"key":"1649_CR80","doi-asserted-by":"publisher","unstructured":"Johnson, A. et al. MIMIC-IV. PhysioNet https:\/\/doi.org\/10.13026\/KPB9-MT58.","DOI":"10.13026\/KPB9-MT58"}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01649-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01649-4","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01649-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,13]],"date-time":"2025-05-13T17:21:21Z","timestamp":1747156881000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01649-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,13]]},"references-count":80,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1649"],"URL":"https:\/\/doi.org\/10.1038\/s41746-025-01649-4","relation":{},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,13]]},"assertion":[{"value":"1 November 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 April 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 May 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"T.M.B. is a consultant for Boston Scientific, Medtronic, and Magentiq Eye. 
All the other authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"273"}}
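The record above is a work message from the public Crossref REST API. As a minimal sketch, the snippet below retrieves the same record by DOI from `https://api.crossref.org/works/{doi}` and reads the fields shown above; it assumes network access, and the `mailto` contact address (which opts into Crossref's polite pool) is a placeholder.

```python
import json
import urllib.request

# Public Crossref REST API endpoint for a single work, keyed by DOI.
# The mailto address is a placeholder; substitute your own contact.
DOI = "10.1038/s41746-025-01649-4"
url = f"https://api.crossref.org/works/{DOI}?mailto=you@example.org"

with urllib.request.urlopen(url) as resp:
    work = json.load(resp)["message"]  # unwrap the status/message-type/message envelope

# Read out the fields present in the record above.
title = work["title"][0]
authors = ", ".join(f'{a["given"]} {a["family"]}' for a in work["author"])
print(title)
print(authors)
print("journal:", work["container-title"][0])
print("published:", "-".join(map(str, work["published"]["date-parts"][0])))
print("references:", work["references-count"])
```

Note that dates arrive as nested `date-parts` arrays (e.g. `[[2025, 5, 13]]`) rather than ISO strings, so they need to be joined or parsed before use.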
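The abstract's central move is to treat a 3D scan as a video so that off-the-shelf video-text models apply. As an illustrative sketch only (not code from the paper): `volume_to_clip` is a hypothetical helper that windows a CT-like volume with a common soft-tissue window and stacks its axial slices as RGB frames, the layout most video encoders expect; a synthetic array stands in for a real DICOM series.

```python
import numpy as np

def volume_to_clip(volume: np.ndarray, wl: float = 40.0, ww: float = 400.0) -> np.ndarray:
    """Treat a 3D scan as a video clip: one axial slice per frame.

    volume: (depth, height, width) intensities (e.g., Hounsfield units).
    wl/ww:  window level/width (40/400 is a common soft-tissue window).
    Returns a (frames, height, width, 3) uint8 clip.
    """
    lo, hi = wl - ww / 2, wl + ww / 2
    frames = np.clip((volume - lo) / (hi - lo), 0.0, 1.0)  # window intensities to [0, 1]
    frames = (frames * 255).astype(np.uint8)               # quantize to 8-bit frames
    return np.repeat(frames[..., None], 3, axis=-1)        # grayscale -> RGB channels

# Synthetic 64-slice "CT" stands in for a real series here.
ct = np.random.uniform(-1000, 1000, size=(64, 256, 256))
clip = volume_to_clip(ct)
print(clip.shape, clip.dtype)  # (64, 256, 256, 3) uint8, ready for a video encoder
```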