{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,2]],"date-time":"2026-03-02T19:53:49Z","timestamp":1772481229537,"version":"3.50.1"},"reference-count":79,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,5,28]],"date-time":"2025-05-28T00:00:00Z","timestamp":1748390400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,28]],"date-time":"2025-05-28T00:00:00Z","timestamp":1748390400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Vis. Intell."],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Existing X-ray image based pre-trained vision models are typically trained on a relatively small-scale dataset (less than 500,000 samples) with limited resolution (e.g., <jats:inline-formula>\n              <jats:alternatives>\n                <jats:tex-math>$224 \\times 224$<\/jats:tex-math>\n                <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mn>224<\/mml:mn>\n                  <mml:mo>\u00d7<\/mml:mo>\n                  <mml:mn>224<\/mml:mn>\n                <\/mml:math>\n              <\/jats:alternatives>\n            <\/jats:inline-formula>). However, the key to the success of self-supervised pre-training of large models lies in massive training data, and the maintenance of high-resolution X-ray images contributes to effective solutions for some challenging diseases. 
In this paper, we propose a high-resolution (<jats:inline-formula>\n              <jats:alternatives>\n                <jats:tex-math>$1280 \\times 1280$<\/jats:tex-math>\n                <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mn>1280<\/mml:mn>\n                  <mml:mo>\u00d7<\/mml:mo>\n                  <mml:mn>1280<\/mml:mn>\n                <\/mml:math>\n              <\/jats:alternatives>\n            <\/jats:inline-formula>) X-ray image based pre-trained baseline model on our newly collected large-scale dataset containing more than 1 million X-ray images. Our model employs the masked auto-encoder framework, wherein the tokens remaining after masking at a high ratio are used as input, and the masked image patches are reconstructed by a Transformer encoder-decoder network. More importantly, we introduce a novel context-aware masking strategy, which utilizes the breast contour as a boundary for adaptive masking operations. We validate the effectiveness of our model through its application in two downstream tasks, namely X-ray report generation and disease detection. 
Extensive experiments demonstrate that our pre-trained medical baseline model can achieve performance comparable to, or even exceeding, that of current state-of-the-art models on downstream benchmark datasets.<\/jats:p>","DOI":"10.1007\/s44267-025-00080-3","type":"journal-article","created":{"date-parts":[[2025,5,28]],"date-time":"2025-05-28T07:18:22Z","timestamp":1748416702000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["Pre-training on high-resolution X-ray images: an experimental study"],"prefix":"10.1007","volume":"3","author":[{"given":"Xiao","family":"Wang","sequence":"first","affiliation":[]},{"given":"Yuehang","family":"Li","sequence":"additional","affiliation":[]},{"given":"Wentao","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Jiandong","family":"Jin","sequence":"additional","affiliation":[]},{"given":"Yao","family":"Rong","sequence":"additional","affiliation":[]},{"given":"Bo","family":"Jiang","sequence":"additional","affiliation":[]},{"given":"Chuanfu","family":"Li","sequence":"additional","affiliation":[]},{"given":"Jin","family":"Tang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,5,28]]},"reference":[{"key":"80_CR1","first-page":"234","volume-title":"Proceedings of the 18th international conference on medical image computing and computer-assisted intervention","author":"O. Ronneberger","year":"2015","unstructured":"Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. Wells, & A. Frangi (Eds.), Proceedings of the 18th international conference on medical image computing and computer-assisted intervention (pp. 234\u2013241). Cham: Springer."},{"key":"80_CR2","unstructured":"Yan, K., Wang, X., Lu, L., & Summers, R. M. (2017). 
DeepLesion: automated deep mining, categorization and detection of significant radiology image findings using large-scale clinical lesion annotations. arXiv preprint. arXiv:1710.01766."},{"key":"80_CR3","unstructured":"Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C., Shpanskaya, K., et\u00a0al. (2017). CheXNet: radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint. arXiv:1711.05225."},{"key":"80_CR4","first-page":"2048","volume-title":"Proceedings of the 32nd international conference on machine learning","author":"K. Xu","year":"2015","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., & Bengio, Y. (2015). Show, attend and tell: neural image caption generation with visual attention. In F. R. Bach & D. M. Blei (Eds.), Proceedings of the 32nd international conference on machine learning (pp. 2048\u20132057). Stroudsburg: International Machine Learning Society."},{"key":"80_CR5","first-page":"248","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"J. Deng","year":"2009","unstructured":"Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F.-F. (2009). ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 248\u2013255). Piscataway: IEEE."},{"key":"80_CR6","volume-title":"Proceedings of the 3rd international conference on learning representations","author":"K. Simonyan","year":"2015","unstructured":"Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Y. Bengio & Y. LeCun (Eds.), Proceedings of the 3rd international conference on learning representations. 
Retrieved April 9, 2025, from https:\/\/openreview.net\/forum?id=Rvk1qgjk4Ee."},{"key":"80_CR7","first-page":"770","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"K. He","year":"2015","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770\u2013778). Piscataway: IEEE."},{"key":"80_CR8","first-page":"5998","volume-title":"Proceedings of the 31st international conference on neural information processing systems","author":"A. Vaswani","year":"2017","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Proceedings of the 31st international conference on neural information processing systems (pp. 5998\u20136008). Red Hook: Curran Associates."},{"key":"80_CR9","doi-asserted-by":"publisher","DOI":"10.1007\/s44267-024-00038-x","volume":"2","author":"L. Tang","year":"2024","unstructured":"Tang, L., Yin, Z., Su, H., Lyu, W., & Luo, B. (2024). WFSS: weighted fusion of spectral transformer and spatial self-attention for robust hyperspectral image classification against adversarial attacks. Visual Intelligence, 2, 5.","journal-title":"Visual Intelligence"},{"key":"80_CR10","doi-asserted-by":"publisher","DOI":"10.1007\/s44267-023-00025-8","volume":"1","author":"P. Yan","year":"2023","unstructured":"Yan, P., Liu, X., Zhang, P., & Lu, H. (2023). Learning convolutional multi-level transformers for image-based person re-identification. Visual Intelligence, 1, 24.","journal-title":"Visual Intelligence"},{"key":"80_CR11","first-page":"1597","volume-title":"Proceedings of the 37th international conference on machine learning","author":"T. 
Chen","year":"2020","unstructured":"Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. E. (2020). A simple framework for contrastive learning of visual representations. In D. H. III & A. Singh (Eds.), Proceedings of the 37th international conference on machine learning (pp. 1597\u20131607). Stroudsburg: International Machine Learning Society."},{"key":"80_CR12","volume-title":"Proceedings of the 9th international conference on pattern recognition","author":"A. Dosovitskiy","year":"2021","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.. (2021). An image is worth 16 x 16 words: transformers for image recognition at scale. In Proceedings of the 9th international conference on pattern recognition. Retrieved May 7, 2021, from https:\/\/openreview.net\/forum?id=YicbFdNTTy."},{"key":"80_CR13","first-page":"9992","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Z. Liu","year":"2021","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 9992\u201310002). Piscataway: IEEE."},{"key":"80_CR14","first-page":"15979","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"K. He","year":"2022","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., & Girshick, R. B. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 15979\u201315988). Piscataway: IEEE."},{"key":"80_CR15","first-page":"8748","volume-title":"Proceedings of the 38th international conference on machine learning","author":"A. Radford","year":"2021","unstructured":"Radford, A., Kim, J. 
W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th international conference on machine learning (pp. 8748\u20138763). Stroudsburg: International Machine Learning Society."},{"key":"80_CR16","first-page":"21315","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"C. Wu","year":"2023","unstructured":"Wu, C., Zhang, X., Zhang, Y., Wang, Y., & Xie, W. (2023). MedKLIP: medical knowledge enhanced language-image pre-training for X-ray diagnosis. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 21315\u201321326). Piscataway: IEEE."},{"issue":"1","key":"80_CR17","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-019-0322-0","volume":"6","author":"A. E. Johnson","year":"2019","unstructured":"Johnson, A. E., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., Mark, R. G., & Horng, S. (2019). MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1), 317.","journal-title":"Scientific Data"},{"key":"80_CR18","first-page":"23346","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Z. Chen","year":"2023","unstructured":"Chen, Z., Diao, S., Wang, B., Li, G., & Wan, X. (2023). Towards unifying medical vision-and-language pre-training via soft prompts. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 23346\u201323356). Piscataway: IEEE."},{"key":"80_CR19","first-page":"679","volume-title":"Proceedings of the 25th international conference on medical image computing and computer-assisted intervention","author":"Z. Chen","year":"2022","unstructured":"Chen, Z., Du, Y., Hu, J., Liu, Y., Li, G., Wan, X., & Chang, T. (2022). 
Multi-modal masked autoencoders for medical vision-and-language pre-training. In L. Wang, Q. Dou, P. T. Fletcher, S. Speidel, & S. Li (Eds.), Proceedings of the 25th international conference on medical image computing and computer-assisted intervention (pp. 679\u2013689). Cham: Springer."},{"key":"80_CR20","first-page":"3577","volume-title":"Proceedings of IEEE\/CVF winter conference on applications of computer vision","author":"J. Xiao","year":"2023","unstructured":"Xiao, J., Bai, Y., Yuille, A. L., & Zhou, Z. (2023). Delving into masked autoencoders for multi-label thorax disease classification. In Proceedings of IEEE\/CVF winter conference on applications of computer vision (pp. 3577\u20133589). Piscataway: IEEE."},{"issue":"2","key":"80_CR21","doi-asserted-by":"publisher","first-page":"304","DOI":"10.1093\/jamia\/ocv080","volume":"23","author":"D. Demner-Fushman","year":"2016","unstructured":"Demner-Fushman, D., Kohli, M. D., Rosenman, M. B., Shooshan, S. E., Rodriguez, L., Antani, S., Thoma, G. R., & McDonald, C. J. (2016). Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2), 304\u2013310.","journal-title":"Journal of the American Medical Informatics Association"},{"key":"80_CR22","doi-asserted-by":"publisher","first-page":"447","DOI":"10.1007\/s11633-022-1410-8","volume":"20","author":"X. Wang","year":"2023","unstructured":"Wang, X., Chen, G., Qian, G., Gao, P., Wei, X., Wang, Y., Tian, Y., & Gao, W. (2023). Large-scale multi-modal pre-trained models: a comprehensive survey. Machine Intelligence Research, 20, 447\u2013482.","journal-title":"Machine Intelligence Research"},{"key":"80_CR23","unstructured":"Shrestha, P., Amgain, S., Khanal, B., Linte, C. A., & Bhattarai, B. (2023). Medical vision language pretraining: a survey. arXiv preprint. 
arXiv:2312.06224."},{"key":"80_CR24","unstructured":"Azad, B., Azad, R., Eskandari, S., Bozorgpour, A., Kazerouni, A., Rekik, I., & Merhof, D. (2023). Foundational models in medical imaging: a comprehensive survey and future vision. arXiv preprint. arXiv:2310.18689."},{"key":"80_CR25","unstructured":"Zhao, Z., Liu, Y., Wu, H., Li, Y., Wang, S., Teng, L., Liu, D., Li, X., Cui, Z., Wang, Q., et\u00a0al. (2023). CLIP in medical imaging: a comprehensive survey. arXiv preprint. arXiv:2312.07353."},{"issue":"8","key":"80_CR26","doi-asserted-by":"publisher","first-page":"1930","DOI":"10.1038\/s41591-023-02448-8","volume":"29","author":"A. J. Thirunavukarasu","year":"2023","unstructured":"Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., Gutierrez, L., Tan, T. F., & Ting, D. S. W. (2023). Large language models in medicine. Nature Medicine, 29(8), 1930\u20131940.","journal-title":"Nature Medicine"},{"key":"80_CR27","doi-asserted-by":"publisher","DOI":"10.1007\/s44267-023-00005-y","volume":"1","author":"Y. Yang","year":"2023","unstructured":"Yang, Y., Cui, Z., Xu, J., Zhong, C., Zheng, W. S., & Wang, R. (2023). Continual learning with Bayesian model based on a fixed pre-trained feature extractor. Visual Intelligence, 1, 5.","journal-title":"Visual Intelligence"},{"key":"80_CR28","first-page":"3942","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"S. C. Huang","year":"2021","unstructured":"Huang, S. C., Shen, L., Lungren, M. P., & Yeung, S. (2021). Gloria: a multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 3942\u20133951). Piscataway: IEEE."},{"key":"80_CR29","first-page":"14751","volume-title":"Proceedings of the 38th international conference on neural information processing systems","author":"C. 
Liu","year":"2023","unstructured":"Liu, C., Ouyang, C., Cheng, S., Shah, A., Bai, W., & Arcucci, R. (2023). G2D: from global to dense radiography representation learning via vision-language pre-training. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, & C. Zhang (Eds.), Proceedings of the 38th international conference on neural information processing systems (pp.\u00a014751\u201314773). Red Hook: Curran Associates."},{"key":"80_CR30","first-page":"9736","volume-title":"IEEE Transactions on Multimedia, 26","author":"C. Zhan","year":"2024","unstructured":"Zhan, C., Zhang, Y., Lin, Y., Wang, G., & Wang, H. (2024). UniDCP: unifying multiple medical vision-language tasks via dynamic cross-modal learnable prompts. IEEE Transactions on Multimedia, 26, 9736\u20139748."},{"key":"80_CR31","doi-asserted-by":"publisher","first-page":"15949","DOI":"10.18653\/v1\/2023.emnlp-main.989","volume-title":"Proceedings of the 2023 conference on empirical methods in natural language processing","author":"S. Wang","year":"2023","unstructured":"Wang, S., Peng, B., Liu, Y., & Peng, Q. (2023). Fine-grained medical vision-language representation learning for radiology report generation. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 15949\u201315956). Stroudsburg: ACL."},{"key":"80_CR32","first-page":"4706","volume-title":"IEEE Transactions on Multimedia, 26","author":"K. Zhang","year":"2024","unstructured":"Zhang, K., Yang, Y., Yu, J., Jiang, H., Fan, J., Huang, Q., & Han, W. (2024). Multi-task paired masking with alignment modeling for medical vision-language pre-training. IEEE Transactions on Multimedia, 26, 4706\u20134721."},{"key":"80_CR33","volume-title":"Proceedings of the 11th international conference on learning representations","author":"H. Y. Zhou","year":"2023","unstructured":"Zhou, H. Y., Lian, C., Wang, L., & Yu, Y. (2023). 
Advancing radiograph representation learning with masked record modeling. In Proceedings of the 11th international conference on learning representations. Retrieved October 14, 2024, from https:\/\/openreview.net\/forum?id=w-x7U26GM7j."},{"key":"80_CR34","doi-asserted-by":"publisher","first-page":"5152","DOI":"10.1145\/3503161.3547948","volume-title":"Proceedings of the 30th ACM international conference on multimedia","author":"Z. Chen","year":"2022","unstructured":"Chen, Z., Li, G., & Wan, X. (2022). Align, reason and learn: enhancing medical vision-and-language pre-training with knowledge. In Proceedings of the 30th ACM international conference on multimedia (pp. 5152\u20135161). Cham: Springer."},{"key":"80_CR35","unstructured":"Wang, R., Yao, Q., Lai, H., He, Z., Tao, X., Jiang, Z., & Zhou, S. K. (2023). ECAMP: entity-centered context-aware medical vision language pre-training. arXiv preprint. arXiv:2312.13316."},{"key":"80_CR36","unstructured":"Liu, C., Ouyang, C., Chen, Y., Quilodr\u00e1n-Casas, C. C., Ma, L., Fu, J., Guo, Y., Shah, A., Bai, W., & Arcucci, R. (2023). T3D: towards 3D medical image understanding through vision-language pre-training. arXiv preprint. arXiv:2312.01529."},{"key":"80_CR37","doi-asserted-by":"crossref","unstructured":"Huang, W., Zhou, H., Li, C., Yang, H., Liu, J., & Wang, S. (2023). Enhancing representation in radiography-reports foundation model: a granular alignment algorithm using masked contrastive learning. arXiv preprint. arXiv:2309.05904.","DOI":"10.1038\/s41467-024-51749-0"},{"key":"80_CR38","first-page":"101","volume-title":"Proceedings of the 26th international conference on medical image computing and computer-assisted intervention","author":"K. You","year":"2023","unstructured":"You, K., Gu, J., Ham, J., Park, B., Kim, J., Hong, E. K., Baek, W., & Roh, B. (2023). CXR-CLIP: toward large scale chest X-ray language-image pre-training. In H. Greenspan, A. Madabhushi, P. Mousavi, S. Salcudean, J. Duncan, T. 
Syeda-Mahmood, & R. Taylor (Eds.), Proceedings of the 26th international conference on medical image computing and computer-assisted intervention (pp. 101\u2013111). Cham: Springer."},{"key":"80_CR39","first-page":"23403","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Z. Chen","year":"2023","unstructured":"Chen, Z., Diao, S., Wang, B., Li, G., & Wan, X. (2023). Towards unifying medical vision-and-language pre-training via soft prompts. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 23403\u201323413). Piscataway: IEEE."},{"key":"80_CR40","unstructured":"Liu, C., Cheng, S., Shi, M., Shah, A., Bai, W., & Arcucci, R. (2023). IMITATE: clinical prior guided hierarchical vision-language pre-training. arXiv preprint. arXiv:2310.07355."},{"key":"80_CR41","unstructured":"Fan, W., Suvon, M. N. I., Zhou, S., Liu, X., Alabed, S., Osmani, V., Swift, A., Chen, C., & Lu, H. (2024). MeDSLIP: medical dual-stream language-image pre-training for fine-grained alignment. arXiv preprint. arXiv:2403.10635."},{"key":"80_CR42","first-page":"80","volume-title":"Proceedings of the 27th international conference on medical image computing and computer-assisted intervention","author":"Q. Li","year":"2024","unstructured":"Li, Q., Yan, X., Xu, J., Yuan, R., Zhang, Y., Feng, R., Shen, Q., Zhang, X., & Wang, S. (2024). Anatomical structure-guided medical vision-language pre-training. In M. G. Linguraru, Q. Dou, A. Feragen, S. Giannarou, B. Glocker, K. Lekadir, & J. A. Schnabel (Eds.), Proceedings of the 27th international conference on medical image computing and computer-assisted intervention (pp. 80\u201390). Cham: Springer."},{"key":"80_CR43","first-page":"1","volume-title":"Proceedings of the 2024 IEEE international symposium on biomedical imaging","author":"J. Liu","year":"2024","unstructured":"Liu, J., Zhou, H. Y., Li, C., Huang, W., Yang, H., Liang, Y., Shi, G., Zheng, H., & Wang, S. (2024). 
MLIP: medical language-image pre-training with masked local representation learning. In Proceedings of the 2024 IEEE international symposium on biomedical imaging (pp.\u00a01\u20135). Piscataway: IEEE."},{"key":"80_CR44","unstructured":"Hyland, S. L., Bannur, S., Bouzid, K., Castro, D. C., Ranjit, M., Schwaighofer, A., P\u00e9rez-Garc\u00eda, F., Salvatelli, V., Srivastav, S., Thieme, A., et\u00a0al. (2023). MAIRA-1: a specialised large multimodal model for radiology report generation. arXiv preprint. arXiv:2311.13668."},{"key":"80_CR45","unstructured":"Chiang, W. L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et\u00a0al. Vicuna: an open-source chatbot impressing GPT-4 with 90%* chatgpt quality. Retrieved January 27, 2024, from https:\/\/lmsys.org\/blog\/2023-03-30-vicuna\/."},{"key":"80_CR46","first-page":"6666","volume-title":"Proceedings of the 33rd AAAI conference on artificial intelligence","author":"C. Y. Li","year":"2019","unstructured":"Li, C. Y., Liang, X., Hu, Z., & Xing, E. P. (2019). Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In Proceedings of the 33rd AAAI conference on artificial intelligence (pp. 6666\u20136673). Palo Alto: AAAI Press."},{"key":"80_CR47","first-page":"8108","volume-title":"Proceedings of the 61st annual meeting of the Association for Computational Linguistics","author":"W. Hou","year":"2023","unstructured":"Hou, W., Xu, K., Cheng, Y., Li, W., & Liu, J. (2023). ORGAN: observation-guided radiology report generation via tree reasoning. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st annual meeting of the Association for Computational Linguistics (pp.\u00a08108\u20138122). Stroudsburg: ACL."},{"key":"80_CR48","first-page":"16266","volume-title":"Proceedings of the 35th international conference on neural information processing systems","author":"F. 
Liu","year":"2021","unstructured":"Liu, F., You, C., Wu, X., Ge, S., Wang, S., & Sun, X. (2021). Auto-encoding knowledge graph for unsupervised medical report generation. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, & J. W. Vaughan (Eds.), Proceedings of the 35th international conference on neural information processing systems (pp.\u00a016266\u201316279). Red Hook: Curran Associates."},{"key":"80_CR49","first-page":"563","volume-title":"Proceedings of the 17th European conference on computer vision","author":"J. Wang","year":"2022","unstructured":"Wang, J., Bhalerao, A., & He, Y. (2022). Cross-modal prototype driven network for radiology report generation. In S. Avidan, G. J. Brostow, M. Ciss\u00e9, G. M. Farinella, & T. Hassner (Eds.), Proceedings of the 17th European conference on computer vision (pp. 563\u2013579). Cham: Springer."},{"key":"80_CR50","first-page":"904","volume-title":"IEEE Transactions on Multimedia, 26","author":"K. Zhang","year":"2024","unstructured":"Zhang, K., Jiang, H., Zhang, J., Huang, Q., Fan, J., Yu, J., & Han, W. (2024). Semi-supervised medical report generation via graph-guided hybrid feature consistency. IEEE Transactions on Multimedia, 26, 904\u2013915."},{"key":"80_CR51","first-page":"2607","volume-title":"Proceedings of the AAAI conference on artificial intelligence","author":"H. Jin","year":"2024","unstructured":"Jin, H., Che, H., Lin, Y., & Chen, H. (2024). PromptMRG: diagnosis-driven prompts for medical report generation. In Proceedings of the AAAI conference on artificial intelligence (pp. 2607\u20132615). Palo Alto: AAAI Press."},{"key":"80_CR52","first-page":"2863","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Y. Li","year":"2023","unstructured":"Li, Y., Yang, B., Cheng, X., Zhu, Z., Li, H., & Zou, Y. (2023). Unify, align and refine: multi-level semantic alignment for radiology report generation. 
In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 2863\u20132874). Piscataway: IEEE."},{"key":"80_CR53","first-page":"21361","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"P. Cheng","year":"2023","unstructured":"Cheng, P., Lin, L., Lyu, J., Huang, Y., Luo, W., & Tang, X. (2023). Prior: prototype representation joint learning from medical images and reports. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 21361\u201321371). Piscataway: IEEE."},{"key":"80_CR54","first-page":"21284","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"Y. Chen","year":"2023","unstructured":"Chen, Y., Liu, F., Wang, H., Wang, C., Liu, Y., Tian, Y., & Carneiro, G. (2023). BoMD: bag of multi-label descriptors for noisy chest X-ray classification. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 21284\u201321295). Piscataway: IEEE."},{"key":"80_CR55","first-page":"7433","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"T. Tanida","year":"2023","unstructured":"Tanida, T., M\u00fcller, P., Kaissis, G., & Rueckert, D. (2023). Interactive and explainable region-guided radiology report generation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 7433\u20137442). Piscataway: IEEE."},{"key":"80_CR56","first-page":"19809","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Huang","year":"2023","unstructured":"Huang, Z., Zhang, X., & Zhang, S. (2023). KiUT: knowledge-injected u-transformer for radiology report generation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 19809\u201319818). 
Piscataway: IEEE."},{"key":"80_CR57","first-page":"11558","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Wang","year":"2023","unstructured":"Wang, Z., Liu, L., Wang, L., & Zhou, L. (2023). METransformer: radiology report generation by transformer with multiple learnable expert tokens. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 11558\u201311567). Piscataway: IEEE."},{"key":"80_CR58","first-page":"1439","volume-title":"Proceedings of the 2020 conference on empirical methods in natural language processing","author":"Z. Chen","year":"2020","unstructured":"Chen, Z., Song, Y., Chang, T., & Wan, X. (2020). Generating radiology reports via memory-driven transformer. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 conference on empirical methods in natural language processing (pp. 1439\u20131449). Stroudsburg: ACL."},{"key":"80_CR59","volume-title":"IEEE Transactions on Circuits and Systems for Video Technology","author":"X. Cheng","year":"2022","unstructured":"Cheng, X., Jia, M., Wang, Q., & Zhang, J. (2022). A simple visual-textual baseline for pedestrian attribute recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(10), 6994-7004."},{"issue":"1","key":"80_CR60","volume":"1","author":"G. Shih","year":"2019","unstructured":"Shih, G., Wu, C. C., Halabi, S. S., Kohli, M. D., Prevedello, L. M., Cook, T. S., Sharma, A., Amorosa, J. K., Arteaga, V., Galperin-Aizenberg, M., et al. (2019). Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artificial Intelligence, 1(1), e180041.","journal-title":"Radiology: Artificial Intelligence"},{"key":"80_CR61","first-page":"3334","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"M. 
Li","year":"2023","unstructured":"Li, M., Lin, B., Chen, Z., Lin, H., Liang, X., & Chang, X. (2023). Dynamic graph enhanced contrastive learning for chest X-ray report generation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 3334\u20133343). Piscataway: IEEE."},{"key":"80_CR62","first-page":"4566","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"R. Vedantam","year":"2015","unstructured":"Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4566\u20134575). Piscataway: IEEE."},{"key":"80_CR63","first-page":"311","volume-title":"Proceedings of the 40th annual meeting of the Association for Computational Linguistics","author":"K. Papineni","year":"2002","unstructured":"Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311\u2013318). Stroudsburg: ACL."},{"key":"80_CR64","first-page":"74","volume-title":"Text summarization branches out","author":"C. Y. Lin","year":"2004","unstructured":"Lin, C. Y. (2004). ROUGE: a package for automatic evaluation of summaries. In Text summarization branches out (pp. 74\u201381). Barcelona: Association for Computational Linguistics."},{"key":"80_CR65","first-page":"228","volume-title":"Proceedings of the second workshop on statistical machine translation","author":"S. Banerjee","year":"2005","unstructured":"Banerjee, S., & Lavie, A. (2005). METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In J. Goldstein, A. Lavie, C.-Y. Lin, & C. Voss (Eds.), Proceedings of the second workshop on statistical machine translation (pp. 228\u2013231). 
Stroudsburg: ACL."},{"key":"80_CR66","volume-title":"Proceedings of the 7th international conference on learning representations","author":"I. Loshchilov","year":"2019","unstructured":"Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. In Proceedings of the 7th international conference on learning representations. Retrieved May 19, 2024, from https:\/\/openreview.net\/forum?id=Bkg6RiCqY7."},{"key":"80_CR67","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint. arXiv:1907.11692."},{"key":"80_CR68","doi-asserted-by":"publisher","DOI":"10.1016\/j.media.2022.102510","volume":"80","author":"S. Yang","year":"2022","unstructured":"Yang, S., Wu, X., Ge, S., Zhou, S. K., & Xiao, L. (2022). Knowledge matters: chest radiology report generation with general and specific knowledge. Medical Image Analysis, 80, 102510.","journal-title":"Medical Image Analysis"},{"key":"80_CR69","first-page":"1537","volume-title":"Proceedings of the 32nd international conference on neural information processing systems","author":"Y. Li","year":"2018","unstructured":"Li, Y., Liang, X., Hu, Z., & Xing, E. P. (2018). Hybrid retrieval-generation reinforced agent for medical image report generation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Proceedings of the 32nd international conference on neural information processing systems (pp.\u00a01537\u20131547). Red Hook: Curran Associates."},{"key":"80_CR70","first-page":"12910","volume-title":"Proceedings of the 34th AAAI conference on artificial intelligence","author":"Y. Zhang","year":"2020","unstructured":"Zhang, Y., Wang, X., Xu, Z., Yu, Q., Yuille, A. L., & Xu, D. (2020). When radiology report generation meets knowledge graph. In Proceedings of the 34th AAAI conference on artificial intelligence (pp. 12910\u201312917). 
Palo Alto: AAAI Press."},{"key":"80_CR71","first-page":"13753","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"F. Liu","year":"2021","unstructured":"Liu, F., Wu, X., Ge, S., Fan, W., & Zou, Y. (2021). Exploring and distilling posterior and prior knowledge for radiology report generation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 13753\u201313762). Piscataway: IEEE."},{"key":"80_CR72","first-page":"269","volume-title":"Proceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing","author":"F. Liu","year":"2021","unstructured":"Liu, F., Yin, C., Wu, X., Ge, S., Zhang, P., & Sun, X. (2021). Contrastive attention for automatic chest X-ray report generation. In C. Zong, F. Xia, W. Li, & R. Navigli (Eds.), Proceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing (pp. 269\u2013280). Stroudsburg: ACL."},{"key":"80_CR73","first-page":"3001","volume-title":"Proceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing","author":"F. Liu","year":"2021","unstructured":"Liu, F., Ge, S., & Wu, X. (2021). Competence-based multimodal curriculum learning for medical report generation. In C. Zong, F. Xia, W. Li, & R. Navigli (Eds.), Proceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing (pp. 3001\u20133012). Stroudsburg: ACL."},{"key":"80_CR74","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-023-40260-7","volume":"14","author":"X. Zhang","year":"2023","unstructured":"Zhang, X., Wu, C., Zhang, Y., Xie, W., & Wang, Y. (2023). 
Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications, 14, 4542.","journal-title":"Nature Communications"},{"key":"80_CR75","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1038\/s42256-021-00425-9","volume":"4","author":"H. Y. Zhou","year":"2022","unstructured":"Zhou, H. Y., Chen, X., Zhang, Y., Luo, R., Wang, L., & Yu, Y. (2022). Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nature Machine Intelligence, 4, 32\u201340.","journal-title":"Nature Machine Intelligence"},{"key":"80_CR76","first-page":"33536","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"F. Wang","year":"2022","unstructured":"Wang, F., Zhou, Y., Wang, S., Vardhanabhuti, V., & Yu, L. (2022). Multi-granularity cross-modal alignment for generalized medical visual representation learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 33536\u201333549). Red Hook: Curran Associates."},{"key":"80_CR77","first-page":"103031","volume-title":"Proceedings of the 37th international conference on neural information processing systems","author":"Y. Liu","year":"2024","unstructured":"Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., & Liu, Y. (2024). VMamba: visual state space model. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp.\u00a0103031\u2013103063). Red Hook: Curran Associates."},{"key":"80_CR78","first-page":"62429","volume-title":"Proceedings of the international conference on machine learning","author":"L. Zhu","year":"2024","unstructured":"Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., & Wang, X. (2024). 
Vision Mamba: efficient visual representation learning with bidirectional state space model. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, & J. Scarlett (Eds.), Proceedings of the international conference on machine learning (pp.\u00a062429\u201362442). Retrieved November 14, 2024, from https:\/\/proceedings.mlr.press\/v235\/zhu24f.html."},{"key":"80_CR79","unstructured":"Wang, X., Wang, S., Ding, Y., Li, Y., Wu, W., Rong, Y., Kong, W., Huang, J., Li, S., Yang, H., et\u00a0al. (2024). State space model for new-generation network alternative to transformers: a survey. arXiv preprint. arXiv:2404.09516."}],"container-title":["Visual Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00080-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44267-025-00080-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00080-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,28]],"date-time":"2025-05-28T08:03:40Z","timestamp":1748419420000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44267-025-00080-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,28]]},"references-count":79,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["80"],"URL":"https:\/\/doi.org\/10.1007\/s44267-025-00080-3","relation":{},"ISSN":["2097-3330","2731-9008"],"issn-type":[{"value":"2097-3330","type":"print"},{"value":"2731-9008","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,28]]},"assertion":[{"value":"21 June 
2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 April 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 April 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 May 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"8"}}