{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T09:11:05Z","timestamp":1765357865536,"version":"3.41.0"},"reference-count":81,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T00:00:00Z","timestamp":1751414400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T00:00:00Z","timestamp":1751414400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62476143"],"award-info":[{"award-number":["62476143"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Vis. Intell."],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Recent breakthroughs in understanding the human brain have revealed its impressive ability to efficiently process and interpret human thoughts, opening up the possibility of intervening in brain signals. In this paper, we aim to develop a straightforward framework that uses other modalities, such as natural language, to translate the original \u201cdreamland\u201d. We present DreamConnect, employing a dual-stream diffusion framework to manipulate visually stimulated brain signals. By integrating an asynchronous diffusion strategy, our framework establishes an effective interface with human \u201cdreams\u201d, and progressively refines their final image synthesis. 
Through extensive experiments, we demonstrate the efficacy of our method to accurately direct human brain signals in desired directions, ultimately enabling concept manipulation through direct manipulation of the functional magnetic resonance imaging (fMRI) signals. We hope that this work will motivate the use of brain signals in human-computer interaction applications.<\/jats:p>","DOI":"10.1007\/s44267-025-00081-2","type":"journal-article","created":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T02:45:47Z","timestamp":1751424347000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Connecting dreams with visual brainstorming instruction"],"prefix":"10.1007","volume":"3","author":[{"given":"Yasheng","family":"Sun","sequence":"first","affiliation":[]},{"given":"Bohan","family":"Li","sequence":"additional","affiliation":[]},{"given":"Mingchen","family":"Zhuge","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5245-7518","authenticated-orcid":false,"given":"Deng-Ping","family":"Fan","sequence":"additional","affiliation":[]},{"given":"Salman","family":"Khan","sequence":"additional","affiliation":[]},{"given":"Fahad Shahbaz","family":"Khan","sequence":"additional","affiliation":[]},{"given":"Hideki","family":"Koike","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,7,2]]},"reference":[{"key":"81_CR1","first-page":"10684","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"R. Rombach","year":"2022","unstructured":"Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 10684\u201310695). 
Piscataway: IEEE."},{"issue":"1","key":"81_CR2","doi-asserted-by":"publisher","first-page":"68","DOI":"10.1002\/mrm.1910140108","volume":"14","author":"S. Ogawa","year":"1990","unstructured":"Ogawa, S., Lee, T.-M., Nayak, A. S., & Glynn, P. (1990). Oxygenation-sensitive contrast in magnetic resonance image of rodent brain at high magnetic fields. Magnetic Resonance in Medicine, 14(1), 68\u201378.","journal-title":"Magnetic Resonance in Medicine"},{"issue":"10","key":"81_CR3","doi-asserted-by":"publisher","first-page":"1097","DOI":"10.1038\/s42256-023-00714-5","volume":"5","author":"A. D\u00e9fossez","year":"2023","unstructured":"D\u00e9fossez, A., Caucheteux, C., Rapin, J., Kabeli, O., & King, J.-R. (2023). Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence, 5(10), 1097\u20131107.","journal-title":"Nature Machine Intelligence"},{"key":"81_CR4","first-page":"3822","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: human language technologies","author":"X. Zhao","year":"2024","unstructured":"Zhao, X., Sun, J., Wang, S., Ye, J., Zhang, X., & Zong, C. (2024). MapGuide: A simple yet effective method to reconstruct continuous language from brain activities. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: human language technologies (pp. 3822\u20133832). Stroudsburg: ACL."},{"key":"81_CR5","first-page":"740","volume-title":"Proceedings of the 13th European conference on computer vision","author":"T.-Y. Lin","year":"2014","unstructured":"Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., & Zitnick, C. L. (2014). Microsoft COCO: common objects in context. In D. J. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Proceedings of the 13th European conference on computer vision (pp. 740\u2013755). 
Cham: Springer."},{"key":"81_CR6","first-page":"3836","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"L. Zhang","year":"2023","unstructured":"Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 3836\u20133847). Piscataway: IEEE."},{"issue":"6","key":"81_CR7","doi-asserted-by":"publisher","DOI":"10.1145\/3618342","volume":"42","author":"Y. Zhang","year":"2023","unstructured":"Zhang, Y., Dong, W., Tang, F., Huang, N., Huang, H., Ma, C., Lee, T.-Y., Deussen, O., & Xu, C. (2023). ProSpect: prompt spectrum for attribute-aware personalization of diffusion models. ACM Transactions on Graphics, 42(6), 244.","journal-title":"ACM Transactions on Graphics"},{"key":"81_CR8","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Huang, N., Tang, F., Huang, H., Ma, C., Dong, W., & Xu, C. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 10146\u201310156). Piscataway: IEEE.","DOI":"10.1109\/CVPR52729.2023.00978"},{"key":"81_CR9","first-page":"1","volume":"8","author":"J. Wei","year":"2022","unstructured":"Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. (2022). Emergent abilities of large language models. Transactions on Machine Learning Research, 8, 1\u201330.","journal-title":"Transactions on Machine Learning Research"},{"issue":"1","key":"81_CR10","doi-asserted-by":"publisher","DOI":"10.1038\/ncomms15037","volume":"8","author":"T. Horikawa","year":"2017","unstructured":"Horikawa, T., & Kamitani, Y. (2017). Generic decoding of seen and imagined objects using hierarchical visual features. 
Nature Communications, 8(1), 15037.","journal-title":"Nature Communications"},{"key":"81_CR11","first-page":"6840","volume-title":"Proceedings of the 34th international conference on neural information processing systems","author":"J. Ho","year":"2020","unstructured":"Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Proceedings of the 34th international conference on neural information processing systems (pp. 6840\u20136851). Red Hook: Curran Associates."},{"key":"81_CR12","first-page":"14453","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Y. Takagi","year":"2023","unstructured":"Takagi, Y., & Nishimoto, S. (2023). High-resolution image reconstruction with latent diffusion models from human brain activity. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 14453\u201314463). Piscataway: IEEE."},{"key":"81_CR13","first-page":"22710","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Chen","year":"2023","unstructured":"Chen, Z., Qing, J., Xiang, T., Yue, W. L., & Zhou, J. H. (2023). Seeing beyond the brain: conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 22710\u201322720). Piscataway: IEEE."},{"key":"81_CR14","first-page":"24705","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"P. Scotti","year":"2023","unstructured":"Scotti, P., Banerjee, A., Goode, J., Shabalin, S., Nguyen, A., Cohen, E., Dempster, A., Verlinde, N., Yundler, E., Weisberg, D., et al. (2023). Reconstructing the mind\u2019s eye: fMRI-to-image with contrastive learning and diffusion priors. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. 
Levine (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 24705\u201324728). Red Hook: Curran Associates."},{"key":"81_CR15","doi-asserted-by":"publisher","first-page":"5899","DOI":"10.1145\/3581783.3613832","volume-title":"Proceedings of the 31st ACM international conference on multimedia","author":"Y. Lu","year":"2023","unstructured":"Lu, Y., Du, C., Zhou, Q., Wang, D., & He, H. (2023). Minddiffuser: controlled image reconstruction from human brain activity with semantic and structural diffusion. In Proceedings of the 31st ACM international conference on multimedia (pp. 5899\u20135908). New York: ACM."},{"key":"81_CR16","first-page":"12332","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"J. Sun","year":"2023","unstructured":"Sun, J., Li, M., Chen, Z., Zhang, Y., Wang, S., & Moens, M.-F. (2023). Contrast, attend and diffuse to decode high-resolution images from brain activities. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 12332\u201312348). Red Hook: Curran Associates."},{"key":"81_CR17","first-page":"8226","volume-title":"Proceedings of the IEEE\/CVF winter conference on applications of computer vision","author":"W. Xia","year":"2024","unstructured":"Xia, W., de Charette, R., Oztireli, C., & Xue, J.-H. (2024). Dream: visual decoding from reversing human visual system. In Proceedings of the IEEE\/CVF winter conference on applications of computer vision (pp. 8226\u20138235). Piscataway: IEEE."},{"key":"81_CR18","unstructured":"Bai, Y., Wang, X., Cao, Y., Ge, Y., Yuan, C., & Shan, Y. (2023). Dreamdiffusion: generating high-quality images from brain eeg signals. arXiv preprint. 
arXiv:2306.16934."},{"key":"81_CR19","first-page":"6935","volume-title":"Proceedings of the AAAI conference on artificial intelligence","author":"B. Zeng","year":"2024","unstructured":"Zeng, B., Li, S., Liu, X., Gao, S., Jiang, X., Tang, X., Hu, Y., Liu, J., & Zhang, B. (2024). Controllable mind visual diffusion model. In Proceedings of the AAAI conference on artificial intelligence (pp. 6935\u20136943). Palo Alto: AAAI Press."},{"issue":"1","key":"81_CR20","doi-asserted-by":"publisher","DOI":"10.1038\/s42003-019-0438-y","volume":"2","author":"R. VanRullen","year":"2019","unstructured":"VanRullen, R., & Reddy, L. (2019). Reconstructing faces from fMRI patterns using deep generative neural networks. Communications Biology, 2(1), 193.","journal-title":"Communications Biology"},{"issue":"1","key":"81_CR21","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-021-03938-w","volume":"12","author":"T. Dado","year":"2022","unstructured":"Dado, T., G\u00fc\u00e7l\u00fct\u00fcrk, Y., Ambrogioni, L., Ras, G., Bosch, S., van Gerven, M., & G\u00fc\u00e7l\u00fc, U. (2022). Hyperrealistic neural decoding for reconstructing faces from fMRI activations via the GAN latent space. Scientific Reports, 12(1), 141.","journal-title":"Scientific Reports"},{"issue":"1","key":"81_CR22","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1006633","volume":"15","author":"G. Shen","year":"2019","unstructured":"Shen, G., Horikawa, T., Majima, K., & Kamitani, Y. (2019). Deep image reconstruction from human brain activity. PLoS Computational Biology, 15(1), e1006633.","journal-title":"PLoS Computational Biology"},{"key":"81_CR23","doi-asserted-by":"publisher","first-page":"775","DOI":"10.1016\/j.neuroimage.2018.07.043","volume":"181","author":"K. Seeliger","year":"2018","unstructured":"Seeliger, K., G\u00fc\u00e7l\u00fc, U., Ambrogioni, L., G\u00fc\u00e7l\u00fct\u00fcrk, Y., & van Gerven, M.A.J. (2018). Generative adversarial networks for reconstructing natural images from brain activity. 
NeuroImage, 181, 775\u2013785.","journal-title":"NeuroImage"},{"key":"81_CR24","first-page":"29624","volume-title":"Proceedings of the 35th international conference on neural information processing systems","author":"S. Lin","year":"2022","unstructured":"Lin, S., Sprague, T., & Singh, A. K. (2022). Mind reader: reconstructing complex images from brain activities. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Proceedings of the 35th international conference on neural information processing systems (pp. 29624\u201329636). Red Hook: Curran Associates."},{"key":"81_CR25","first-page":"107","volume-title":"Medical imaging with deep learning","author":"Z. Gu","year":"2024","unstructured":"Gu, Z., Jamison, K., Kuceyeski, A., & Sabuncu, M. R. (2024). Decoding natural image stimuli from fMRI data with a surface-based convolutional network. In I. Oguz, J. Noble, X. Li, M. Styner, C. Baumgartner, M. Rusu, T. Heinmann, D. Kontos, B. Landman & B. Dawant (Eds.), Medical imaging with deep learning (pp. 107\u2013118). Retrieved April 5, 2025, from https:\/\/proceedings.mlr.press\/v227\/gu24a.html."},{"issue":"S1","key":"81_CR26","doi-asserted-by":"publisher","first-page":"953","DOI":"10.1007\/s11760-024-03207-z","volume":"18","author":"Q. Liu","year":"2024","unstructured":"Liu, Q., Zhu, H., Chen, N., Huang, B., Lu, W., & Wang, Y. (2024). Mind-bridge: Reconstructing visual images based on diffusion model from human brain activity. Signal, Image and Video Processing, 18(S1), 953\u2013963.","journal-title":"Signal, Image and Video Processing"},{"issue":"11","key":"81_CR27","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1145\/3422622","volume":"63","author":"I. Goodfellow","year":"2020","unstructured":"Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. 
Communications of the ACM, 63(11), 139\u2013144.","journal-title":"Communications of the ACM"},{"key":"81_CR28","unstructured":"Li, D., Wei, C., Li, S., Zou, J., Qin, H., & Liu, Q. (2024). Visual decoding and reconstruction via eeg embeddings with guided diffusion. arXiv preprint. arXiv:2403.07721."},{"key":"81_CR29","first-page":"16000","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"K. He","year":"2022","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 16000\u201316009). Piscataway: IEEE."},{"key":"81_CR30","first-page":"2258","volume-title":"Proceedings of the 26th European conference on artificial intelligence","author":"J. Sun","year":"2023","unstructured":"Sun, J., Zhang, X., & Moens, M.-F. (2023). Tuning in to neural encoding: linking human brain and artificial supervised representations of language. In Proceedings of the 26th European conference on artificial intelligence (pp. 2258\u20132265). Amsterdam: IOS Press."},{"key":"81_CR31","first-page":"11302","volume-title":"Proceedings of the AAAI conference on artificial intelligence","author":"J. Chen","year":"2024","unstructured":"Chen, J., Qi, Y., Wang, Y., & Pan, G. (2024). Bridging the semantic latent space between brain and machine: similarity is all you need. In Proceedings of the AAAI conference on artificial intelligence (pp. 11302\u201311310). Palo Alto: AAAI Press."},{"key":"81_CR32","first-page":"24841","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"Z. Chen","year":"2023","unstructured":"Chen, Z., Qing, J., & Zhou, J. H. (2023). Cinematic mindscapes: high-quality video reconstruction from brain activity. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. 
Levine (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 24841\u201324858). Red Hook: Curran Associates."},{"key":"81_CR33","unstructured":"Mai, W., & Zhang, Z. (2023). Unibrain: unify image reconstruction and captioning all in one diffusion model from human brain activity. arXiv preprint. arXiv:2308.07428."},{"key":"81_CR34","doi-asserted-by":"publisher","DOI":"10.1016\/j.neuroimage.2022.119754","volume":"264","author":"A. T. Gifford","year":"2022","unstructured":"Gifford, A. T., Dwivedi, K., Roig, G., & Cichy, R. M. (2022). A large and rich EEG dataset for modeling human visual object recognition. NeuroImage, 264, 119754.","journal-title":"NeuroImage"},{"issue":"14","key":"81_CR35","doi-asserted-by":"publisher","first-page":"3754","DOI":"10.1364\/AO.58.003754","volume":"58","author":"T. He","year":"2019","unstructured":"He, T., Sun, Y., Qi, J., Hu, J., & Huang, H. (2019). Image deconvolution for confocal laser scanning microscopy using constrained total variation with a gradient field. Applied Optics, 58(14), 3754\u20133766.","journal-title":"Applied Optics"},{"issue":"4","key":"81_CR36","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3592450","volume":"42","author":"O. Avrahami","year":"2023","unstructured":"Avrahami, O., Fried, O., & Lischinski, D. (2023). Blended latent diffusion. ACM Transactions on Graphics, 42(4), 1\u201311.","journal-title":"ACM Transactions on Graphics"},{"key":"81_CR37","first-page":"16784","volume-title":"Proceedings of the international conference on machine learning","author":"A.Q. Nichol","year":"2022","unstructured":"Nichol, A.Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B., Sutskever, I., & Chen, M. (2022). Glide: towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the international conference on machine learning (pp. 16784\u201316804). 
Retrieved April 5, 2025, from https:\/\/proceedings.mlr.press\/v162\/nichol22a.html."},{"key":"81_CR38","first-page":"1","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"Y. Sun","year":"2023","unstructured":"Sun, Y., Yang, Y., Peng, H., Shen, Y., Yang, Y., Hu, H., Qiu, L., & Koike, H. (2023). Imagebrush: learning visual in-context instructions for exemplar-based image manipulation. In Proceedings of the 36th international conference on neural information processing systems (pp. 1\u201321). Red Hook: Curran Associates."},{"key":"81_CR39","doi-asserted-by":"publisher","first-page":"57288","DOI":"10.1109\/ACCESS.2024.3390182","volume":"12","author":"Y. Sun","year":"2024","unstructured":"Sun, Y., Chu, W., Zhou, H., Wang, K., & Koike, H. (2024). Avi-talking: learning audio-visual instructions for expressive 3D talking face generation. IEEE Access, 12, 57288\u201357301.","journal-title":"IEEE Access"},{"key":"81_CR40","volume-title":"Proceedings of the 11th international conference on learning representations","author":"A. Hertz","year":"2022","unstructured":"Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., & Cohen-or, D. (2022). Prompt-to-prompt image editing with cross-attention control. In Proceedings of the 11th international conference on learning representations. Retrieved April 5, 2025, from https:\/\/openreview.net\/pdf?id=_CDixzkzeyb."},{"key":"81_CR41","unstructured":"Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., et\u00a0al. (2022). Ediffi: text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint. arXiv:2211.01324."},{"issue":"3","key":"81_CR42","first-page":"668","volume":"39","author":"Y. Sun","year":"2019","unstructured":"Sun, Y., Jiang, Q., Hu, J., Qi, J., & Peng, Y. (2019). Attention mechanism based pedestrian trajectory prediction generation model. 
Journal of Computer Applications, 39(3), 668.","journal-title":"Journal of Computer Applications"},{"key":"81_CR43","first-page":"1921","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"N. Tumanyan","year":"2023","unstructured":"Tumanyan, N., Geyer, M., Bagon, S., & Dekel, T. (2023). Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 1921\u20131930). Piscataway: IEEE."},{"key":"81_CR44","first-page":"18392","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"T. Brooks","year":"2023","unstructured":"Brooks, T., Holynski, A., & Efros, A. A. (2023). Instructpix2pix: learning to follow image editing instructions. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 18392\u201318402). Piscataway: IEEE."},{"key":"81_CR45","first-page":"12709","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"Z. Geng","year":"2024","unstructured":"Geng, Z., Yang, B., Hang, T., Li, C., Gu, S., Zhang, T., Bao, J., Zhang, Z., Li, H., Hu, H., et al. (2024). Instructdiffusion: a generalist modeling interface for vision tasks. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 12709\u201312720). Piscataway: IEEE."},{"key":"81_CR46","first-page":"24824","volume-title":"Proceedings of the 35th international conference on neural information processing systems","author":"J. Wei","year":"2022","unstructured":"Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 35th international conference on neural information processing systems (pp. 24824\u201324837). 
Red Hook: Curran Associates."},{"key":"81_CR47","first-page":"1049","volume-title":"Proceedings of the 61st annual meeting of the Association for Computational Linguistics","author":"J. Huang","year":"2023","unstructured":"Huang, J., & Chang, K. C.-C. (2023). Towards reasoning in large language models: a survey. In Proceedings of the 61st annual meeting of the Association for Computational Linguistics (pp. 1049\u20131065). Stroudsburg: ACL."},{"key":"81_CR48","first-page":"1877","volume-title":"Proceedings of the 33rd international conference on neural information processing systems","author":"T. Brown","year":"2020","unstructured":"Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. In Proceedings of the 33rd international conference on neural information processing systems (pp. 1877\u20131901). Red Hook: Curran Associates."},{"key":"81_CR49","first-page":"1","volume-title":"Proceedings of the 41st international conference on machine learning","author":"M. Zhuge","year":"2024","unstructured":"Zhuge, M., Wang, W., Kirsch, L., Faccio, F. Khizbullin, D. & Schmidhuber, J. (2024). GPTSwarm: language agents as optimizable graphs. In Proceedings of the 41st international conference on machine learning (pp. 1\u201325). Retrieved April 5, 2025, from https:\/\/openreview.net\/forum?id=uTC9AFXIhg."},{"key":"81_CR50","first-page":"23716","volume-title":"Proceedings of the 35th international conference on neural information processing systems","author":"J.-B. Alayrac","year":"2022","unstructured":"Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: a visual language model for few-shot learning. In Proceedings of the 35th international conference on neural information processing systems (pp. 23716\u201323736). 
Red Hook: Curran Associates."},{"key":"81_CR51","doi-asserted-by":"publisher","first-page":"7370","DOI":"10.18653\/v1\/2023.findings-acl.465","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"T. Chakrabarty","year":"2023","unstructured":"Chakrabarty, T., Saakyan, A., Winn, O., Panagopoulou, A., Yang, Y., Apidianaki, M., & Muresan, S. (2023). I spy a metaphor: large language models and diffusion models co-create visual metaphors. In A. Rogers, J. L. Boyd-Graber, & N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023 (pp. 7370\u20137388). Stroudsburg: ACL."},{"key":"81_CR52","first-page":"19730","volume-title":"Proceedings of the international conference on machine learning","author":"J. Li","year":"2023","unstructured":"Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the international conference on machine learning (pp. 19730\u201319742). Retrieved April 5, 2025, from https:\/\/proceedings.mlr.press\/v202\/li23q.html."},{"key":"81_CR53","first-page":"1","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"H. Liu","year":"2023","unstructured":"Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 1\u201325). Red Hook: Curran Associates."},{"key":"81_CR54","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"D. Zhu","year":"2023","unstructured":"Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: enhancing vision-language understanding with advanced large language models. 
In Proceedings of the 12th international conference on learning representations (pp. 1\u201317). Retrieved April 5, 2025, from https:\/\/openreview.net\/forum?id=1tZbq88f27."},{"key":"81_CR55","first-page":"1","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"J. Y. Koh","year":"2023","unstructured":"Koh, J. Y., Fried, D., & Salakhutdinov, R. R. (2023). Generating images with multimodal language models. In T. Neumann, A. Oh, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 1\u201325). Red Hook: Curran Associates."},{"key":"81_CR56","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"Q. Sun","year":"2024","unstructured":"Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., & Wang, X. (2024). Emu: generative pretraining in multimodality. In Proceedings of the 12th international conference on learning representations (pp. 1\u201329). Retrieved April 5, 2025, from https:\/\/openreview.net\/forum?id=mL8Q9OOamV."},{"key":"81_CR57","first-page":"18225","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"W. Feng","year":"2023","unstructured":"Feng, W., Zhu, W., Fu, T.-J., Jampani, V., Akula, A., He, X., Basu, S., Wang, X. E., & Wang, W. Y. (2023). Layoutgpt: compositional visual planning and generation with large language models. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Proceedings of the 36th international conference on neural information processing systems (pp. 18225\u201318250). Red Hook: Curran Associates."},{"key":"81_CR58","first-page":"1","volume-title":"Proceedings of the 12th international conference on learning representations","author":"T.-J. 
Fu","year":"2023","unstructured":"Fu, T.-J., Hu, W., Du, X., Wang, W.Y., Yang, Y., & Gan, Z. (2023). Guiding instruction-based image editing via multimodal large language models. In Proceedings of the 12th international conference on learning representations (pp. 1\u201314). Retrieved April 5, 2025, from https:\/\/openreview.net\/forum?id=S1RKWSyZ2Y."},{"key":"81_CR59","first-page":"7754","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"X. Xu","year":"2023","unstructured":"Xu, X., Wang, Z., Zhang, G., Wang, K., & Shi, H. (2023). Versatile diffusion: text, images and variations all in one diffusion model. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 7754\u20137765). Piscataway: IEEE."},{"issue":"2","key":"81_CR60","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1109\/5.18626","volume":"77","author":"L.R. Rabiner","year":"1989","unstructured":"Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257\u2013286.","journal-title":"Proceedings of the IEEE"},{"key":"81_CR61","first-page":"12873","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"P. Esser","year":"2021","unstructured":"Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 12873\u201312883). Piscataway: IEEE."},{"issue":"1","key":"81_CR62","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-023-42891-8","volume":"13","author":"F. Ozcelik","year":"2023","unstructured":"Ozcelik, F., & VanRullen, R. (2023). Natural scene reconstruction from fMRI signals using generative latent diffusion. Scientific Reports, 13(1), 15666.","journal-title":"Scientific Reports"},{"key":"81_CR63","first-page":"1","volume":"1","author":"M. 
Oquab","year":"2024","unstructured":"Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. (2024). Dinov2: learning robust visual features without supervision. Transactions on Machine Learning Research Journal, 1, 1\u201332.","journal-title":"Transactions on Machine Learning Research Journal"},{"issue":"1","key":"81_CR64","doi-asserted-by":"publisher","first-page":"116","DOI":"10.1038\/s41593-021-00962-x","volume":"25","author":"E. J. Allen","year":"2022","unstructured":"Allen, E. J., St-Yves, G., Wu, Y., Breedlove, J. L., Prince, J. S., Dowdle, L. T., Nau, M., Caron, B., Pestilli, F., Charest, I., et al. (2022). A massive 7t fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1), 116\u2013126.","journal-title":"Nature Neuroscience"},{"key":"81_CR65","unstructured":"OpenAI (2023). GPT-4v(ision) system card. Retrieved January 20, 2025, from https:\/\/openai.com\/index\/gpt-4v-system-card\/."},{"key":"81_CR66","unstructured":"OpenAI (2023). Dall-e3 system card. Retrieved January 20, 2025, from https:\/\/openai.com\/index\/dall-e-3-system-card\/."},{"key":"81_CR67","first-page":"8024","volume-title":"Proceedings of the 33rd international conference on neural information processing systems","author":"A. Paszke","year":"2019","unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Proceedings of the 33rd international conference on neural information processing systems (pp. 8024\u20138035). Red Hook: Curran Associates."},{"key":"81_CR68","first-page":"1","volume-title":"International conference on learning representations","author":"C. Meng","year":"2022","unstructured":"Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., & Ermon, S. (2022). 
SDEdit: guided image synthesis and editing with stochastic differential equations. In International conference on learning representations (pp. 1\u201333). Retrieved April 5, 2025, from https:\/\/openreview.net\/pdf?id=aBsCjcPu_tE."},{"key":"81_CR69","first-page":"1","volume-title":"Proceedings of the 36th international conference on neural information processing systems","author":"K. Zhang","year":"2024","unstructured":"Zhang, K., Mo, L., Chen, W., Sun, H., & Su, Y. (2024). Magicbrush: a manually annotated dataset for instruction-guided image editing. In Proceedings of the 36th international conference on neural information processing systems (pp. 1\u201322). Red Hook: Curran Associates."},{"issue":"4","key":"81_CR70","doi-asserted-by":"publisher","first-page":"600","DOI":"10.1109\/TIP.2003.819861","volume":"13","author":"Z. Wang","year":"2004","unstructured":"Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600\u2013612.","journal-title":"IEEE Transactions on Image Processing"},{"key":"81_CR71","first-page":"1106","volume-title":"Proceedings of the 26th International Conference on Neural Information Processing Systems","author":"A. Krizhevsky","year":"2012","unstructured":"Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Proceedings of the 26th International Conference on Neural Information Processing Systems (pp. 1106\u20131114). Red Hook: Curran Associates."},{"key":"81_CR72","first-page":"1","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"C. Szegedy","year":"2015","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). 
Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1\u20139). Piscataway: IEEE."},{"key":"81_CR73","first-page":"8748","volume-title":"Proceedings of the 38th international conference on machine learning","author":"A. Radford","year":"2021","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In Proceedings of the 38th international conference on machine learning (pp. 8748\u20138763). Retrieved April 5, 2025, from http:\/\/proceedings.mlr.press\/v139\/radford21a.html."},{"key":"81_CR74","first-page":"6105","volume-title":"Proceedings of the 36th international conference on machine learning","author":"M. Tan","year":"2019","unstructured":"Tan, M., & Le, Q. (2019). EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th international conference on machine learning (pp. 6105\u20136114). Retrieved April 5, 2025, from http:\/\/proceedings.mlr.press\/v97\/tan19a.html."},{"key":"81_CR75","first-page":"9912","volume-title":"Proceedings of the 34th international conference on neural information processing systems","author":"M. Caron","year":"2020","unstructured":"Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. In Proceedings of the 34th international conference on neural information processing systems (pp. 9912\u20139924). Red Hook: Curran Associates."},{"issue":"4","key":"81_CR76","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1145\/3528223.3530164","volume":"41","author":"R. Gal","year":"2022","unstructured":"Gal, R., Patashnik, O., Maron, H., Bermano, A. H., Chechik, G., & Cohen-Or, D. (2022). StyleGAN-NADA: clip-guided domain adaptation of image generators. 
ACM Transactions on Graphics, 41(4), 14.","journal-title":"ACM Transactions on Graphics"},{"key":"81_CR77","first-page":"22500","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"N. Ruiz","year":"2023","unstructured":"Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2023). Dreambooth: fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 22500\u201322510). Piscataway: IEEE."},{"key":"81_CR78","first-page":"6038","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"R. Mokady","year":"2023","unstructured":"Mokady, R., Hertz, A., Aberman, K., Pritch, Y., & Cohen-Or, D. (2023). Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 6038\u20136047). Piscataway: IEEE."},{"key":"81_CR79","first-page":"2085","volume-title":"Proceedings of the IEEE\/CVF international conference on computer vision","author":"O. Patashnik","year":"2021","unstructured":"Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., & Lischinski, D. (2021). StyleCLIP: text-driven manipulation of stylegan imagery. In Proceedings of the IEEE\/CVF international conference on computer vision (pp. 2085\u20132094). Piscataway: IEEE."},{"key":"81_CR80","first-page":"8153","volume-title":"Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition","author":"L. Hu","year":"2024","unstructured":"Hu, L. (2024). Animate anyone: consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 8153\u20138163). 
Piscataway: IEEE."},{"key":"81_CR81","first-page":"770","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"K. He","year":"2016","unstructured":"He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770\u2013778). Piscataway: IEEE."}],"container-title":["Visual Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00081-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44267-025-00081-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44267-025-00081-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T06:18:52Z","timestamp":1751437132000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44267-025-00081-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,2]]},"references-count":81,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["81"],"URL":"https:\/\/doi.org\/10.1007\/s44267-025-00081-2","relation":{},"ISSN":["2097-3330","2731-9008"],"issn-type":[{"type":"print","value":"2097-3330"},{"type":"electronic","value":"2731-9008"}],"subject":[],"published":{"date-parts":[[2025,7,2]]},"assertion":[{"value":"15 August 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 April 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 April 
2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 July 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Deng-Ping Fan is an Associate Editor at Visual Intelligence and was not involved in the editorial review of this article or the decision to publish it. The authors declare that they have no other competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"12"}}