{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T06:28:20Z","timestamp":1773124100544,"version":"3.50.1"},"reference-count":100,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T00:00:00Z","timestamp":1773014400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T00:00:00Z","timestamp":1773014400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"IITP","award":["RS-2019-II191082"],"award-info":[{"award-number":["RS-2019-II191082"]}]},{"name":"IITP","award":["2022-II220156"],"award-info":[{"award-number":["2022-II220156"]}]},{"name":"IITP","award":["RS-2021-II211343"],"award-info":[{"award-number":["RS-2021-II211343"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2026,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address a novel problem of generating first-order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we develop comprehensive evaluation metrics that capture both standard video-to-audio generation quality and spatial coherence among multiple channels. We introduce YT-Ambigen, a dataset comprising 102K YouTube video clips paired with first-order ambisonics tailored for audio generation, and its expanded version YT-Ambigen+ containing 3x more clips with a rigorously validated high-quality test subset of 19.3K clips. Furthermore, we present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework that generates first-order ambisonics from silent videos by leveraging CLIP features, patchwise energy maps, and neural audio codecs with rotation augmentation. To address efficiency challenges, we propose a variant coined ViSAGe-SC (Single Codebook), which replaces complex residual codebooks with an optimized single codebook approach, achieving 4x faster training and 5x faster inference while maintaining superior performance. ViSAGe-SC incorporates heterogeneous codec chaining for postprocessing and candidate reranking for inference-time refinement. 
Experimental results demonstrate that our approach outperforms several V2A models across spatial metrics and displays competitive performance in semantic quality, generating high-quality spatial audio from video input.<\/jats:p>","DOI":"10.1007\/s11263-025-02610-4","type":"journal-article","created":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T17:32:37Z","timestamp":1773077557000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Towards Scene-Aware Video-to-Spatial Audio Generation"],"prefix":"10.1007","volume":"134","author":[{"given":"Jaeyeon","family":"Kim","sequence":"first","affiliation":[]},{"given":"Heeseung","family":"Yun","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9543-7453","authenticated-orcid":false,"given":"Gunhee","family":"Kim","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,3,9]]},"reference":[{"key":"2610_CR1","unstructured":"Agostinelli, A., Denk, T. I., Borsos, Z., et al. (2023). Musiclm: Generating music from text. arXiv:2301.11325."},{"key":"2610_CR2","doi-asserted-by":"publisher","DOI":"10.4324\/9780203766880","volume-title":"The Foley grail: The art of performing sound for film, games, and animation","author":"VT Ament","year":"2014","unstructured":"Ament, V. T. (2014). The Foley grail: The art of performing sound for film, games, and animation. Routledge."},{"key":"2610_CR3","first-page":"2523","volume":"31","author":"Z Borsos","year":"2023","unstructured":"Borsos, Z., Marinier, R., Vincent, D., et al. (2023). Audiolm: A language modeling approach to audio generation. IEEE\/ACM TASLP, 31, 2523\u20132533.","journal-title":"IEEE\/ACM TASLP"},{"key":"2610_CR4","doi-asserted-by":"crossref","unstructured":"Brandfonbrener, D., Zhang, H., Kirsch, A., et al. (2024). Color-filter: Conditional loss reduction filtering for targeted language model pre-training. In: NeurIPS, pp 97618\u201397649.","DOI":"10.52202\/079017-3097"},{"key":"2610_CR5","unstructured":"Brooks, T., Peebles, B., Holmes, C., et al. (2024). Video generation models as world simulators. https:\/\/openai.com\/research\/video-generation-models-as-world-simulators."},{"issue":"3","key":"2610_CR6","doi-asserted-by":"publisher","first-page":"740","DOI":"10.1109\/TPAMI.2018.2815601","volume":"41","author":"Z Bylinskii","year":"2018","unstructured":"Bylinskii, Z., Judd, T., Oliva, A., et al. (2018). What do different evaluation metrics tell us about saliency models? IEEE TPAMI, 41(3), 740\u2013757.","journal-title":"IEEE TPAMI"},{"key":"2610_CR7","doi-asserted-by":"crossref","unstructured":"Chang, H., Zhang, H., Jiang, L., et al. (2022). Maskgit: Masked generative image transformer. In: CVPR, pp 11315\u201311325.","DOI":"10.1109\/CVPR52688.2022.01103"},{"key":"2610_CR8","doi-asserted-by":"crossref","unstructured":"Chen, H., Xie, W., Vedaldi, A., et al. (2020). Vggsound: A large-scale audio-visual dataset. In: ICASSP, pp 721\u2013725.","DOI":"10.1109\/ICASSP40776.2020.9053174"},{"key":"2610_CR9","first-page":"8292","volume":"29","author":"P Chen","year":"2020","unstructured":"Chen, P., Zhang, Y., Tan, M., et al. (2020). Generating visually aligned sound from videos. IEEE TIP, 29, 8292\u20138302.","journal-title":"IEEE TIP"},{"key":"2610_CR10","doi-asserted-by":"crossref","unstructured":"Chen, Z., Seetharaman, P., Russell, B., et al. (2025). 
Video-guided foley sound generation with multimodal controls. In: CVPR, pp 18770\u201318781.","DOI":"10.1109\/CVPR52734.2025.01749"},{"key":"2610_CR11","doi-asserted-by":"crossref","unstructured":"Cheng, H. K., Ishii, M., Hayakawa, A., et al. (2025). Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In: CVPR, pp 28901\u201328911.","DOI":"10.1109\/CVPR52734.2025.02691"},{"key":"2610_CR12","doi-asserted-by":"crossref","unstructured":"Cheng, H. T., Chao, C. H., Dong, J. D., et al. (2018). Cube padding for weakly-supervised saliency prediction in 360 videos. CVPR, pp 1420\u20131429.","DOI":"10.1109\/CVPR.2018.00154"},{"key":"2610_CR13","unstructured":"Cheng, X., Zhang, Z., Wang, Z., et al. (2024). Avset-10m: An open large-scale audio-visual dataset with high correspondence."},{"key":"2610_CR14","doi-asserted-by":"crossref","unstructured":"Chung, Y., Lee, J., & Nam, J. (2024). T-foley: A controllable waveform-domain diffusion model for temporal-event-guided foley sound synthesis. ICASSP, pp. 6820\u20136824.","DOI":"10.1109\/ICASSP48485.2024.10447380"},{"key":"2610_CR15","doi-asserted-by":"crossref","unstructured":"Comunit\u00e0, M., Gramaccioni, R. F., Postolache, E., et al. (2024). Syncfusion: Multimodal onset-synchronized video-to-audio foley synthesis. ICASSP, pp. 936\u2013940.","DOI":"10.1109\/ICASSP48485.2024.10447063"},{"key":"2610_CR16","doi-asserted-by":"crossref","unstructured":"Copet, J., Kreuk, F., Gat, I., et al. (2023). Simple and controllable music generation. NeurIPS, pp 47704\u201347720.","DOI":"10.52202\/075280-2066"},{"key":"2610_CR17","unstructured":"Courville, D., & Studio, A. (1994). Proc\u00e9d\u00e9s et syst\u00e8mes d\u2019enregistrement et de reproduction sonores en trois dimensions. Universit\u00e9 du Qu\u00e9bec \u00e0 Montr\u00e9al."},{"key":"2610_CR18","doi-asserted-by":"crossref","unstructured":"Dai, D., Deng, C., Zhao, C., et al. (2024). Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. ACL, pp 1280\u20131297.","DOI":"10.18653\/v1\/2024.acl-long.70"},{"key":"2610_CR19","unstructured":"Dao, T. (2024). Flashattention-2: Faster attention with better parallelism and work partitioning. In: ICLR."},{"key":"2610_CR20","unstructured":"D\u00e9fossez, A., Copet, J., Synnaeve, G., et al. (2023). High fidelity neural audio compression. TMLR."},{"key":"2610_CR21","doi-asserted-by":"crossref","unstructured":"Deshmukh, S., Alharthi, D., Elizalde, B., et al. (2024). Pam: Prompting audio-language models for audio quality assessment. Interspeech, pp 3320\u20133324.","DOI":"10.21437\/Interspeech.2024-325"},{"key":"2610_CR22","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR."},{"key":"2610_CR23","doi-asserted-by":"crossref","unstructured":"Du, Y., Chen, Z., Salamon, J., et al. (2023). Conditional generation of audio from video via foley analogies. CVPR, pp 2426\u20132436.","DOI":"10.1109\/CVPR52729.2023.00240"},{"issue":"8","key":"2610_CR24","first-page":"698","volume":"11","author":"S Dubnov","year":"2004","unstructured":"Dubnov, S. (2004). Generalization of spectral flatness measure for non-gaussian linear processes. IEEE SPL, 11(8), 698\u2013701.","journal-title":"IEEE SPL"},{"key":"2610_CR25","doi-asserted-by":"crossref","unstructured":"Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image synthesis. 
CVPR, pp 12873\u201312883.","DOI":"10.1109\/CVPR46437.2021.01268"},{"key":"2610_CR26","doi-asserted-by":"crossref","unstructured":"Gao, R., & Grauman, K. (2019). 2.5d visual sound. CVPR, pp. 324\u2013333.","DOI":"10.1109\/CVPR.2019.00041"},{"issue":"10","key":"2610_CR27","doi-asserted-by":"publisher","first-page":"2723","DOI":"10.1007\/s11263-023-01816-8","volume":"131","author":"R Garg","year":"2023","unstructured":"Garg, R., Gao, R., & Grauman, K. (2023). Visually-guided audio spatialization in video with geometry-aware multi-task learning. IJCV, 131(10), 2723\u20132737.","journal-title":"IJCV"},{"key":"2610_CR28","doi-asserted-by":"crossref","unstructured":"Gemmeke, J. F., Ellis, D. P., Freedman, D., et al. (2017). Audio set: An ontology and human-labeled dataset for audio events. ICASSP, pp. 776\u2013780.","DOI":"10.1109\/ICASSP.2017.7952261"},{"key":"2610_CR29","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., et al. (2016). Deep residual learning for image recognition. CVPR, pp 770\u2013778.","DOI":"10.1109\/CVPR.2016.90"},{"key":"2610_CR30","doi-asserted-by":"crossref","unstructured":"Hershey, S., Chaudhuri, S., Ellis, D. P. W., et al. (2017). Cnn architectures for large-scale audio classification. ICASSP, pp 131\u2013135.","DOI":"10.1109\/ICASSP.2017.7952132"},{"key":"2610_CR31","doi-asserted-by":"crossref","unstructured":"Heydari, M., Souden, M., Conejo, B., et al. (2025). Immersediffusion: A generative spatial audio latent diffusion model. ICASSP, pp 1\u20135.","DOI":"10.1109\/ICASSP49660.2025.10889311"},{"key":"2610_CR32","doi-asserted-by":"crossref","unstructured":"Hirway, A., Qiao, Y., & Murray, N. (2022). Spatial audio in $$360^{\\circ }$$ videos: does it influence visual attention? ACM MMSys, pp 39\u201351.","DOI":"10.1145\/3524273.3528179"},{"key":"2610_CR33","doi-asserted-by":"crossref","unstructured":"Hirway, A., Qiao, Y., & Murray, N. (2024). Evaluating visual attention and qoe for $$360^{\\circ }$$ videos with non-spatial and spatial audio. MMSys, pp 532\u2013535.","DOI":"10.1145\/3625468.3652916"},{"key":"2610_CR34","unstructured":"Ho, J., & Salimans, T. (2022). Classifier-free diffusion guidance. arXiv:2207.12598."},{"key":"2610_CR35","doi-asserted-by":"crossref","unstructured":"Ho, J., Chan, W., Saharia, C., et al. (2022). Imagen video: High definition video generation with diffusion models. arXiv:2210.02303.","DOI":"10.52202\/068431-0628"},{"key":"2610_CR36","doi-asserted-by":"crossref","unstructured":"Ho, J., Salimans, T., Gritsenko, A., et al. (2022). Video diffusion models. In: NeurIPS, pp 8633\u20138646.","DOI":"10.52202\/068431-0628"},{"key":"2610_CR37","doi-asserted-by":"crossref","unstructured":"Holm, J., V\u00e4\u00e4n\u00e4nen, K., & Battah, A. (2020). User experience of stereo and spatial audio in $$360^{\\circ }$$ live music videos. AcademicMindtrek, pp 134\u2013141.","DOI":"10.1145\/3377290.3377291"},{"key":"2610_CR38","unstructured":"Huang, J., Ren, Y., Huang, R., et al. (2023). Make-an-audio 2: Temporal-enhanced text-to-audio generation. arXiv:2305.18474."},{"key":"2610_CR39","doi-asserted-by":"crossref","unstructured":"Iashin, V., & Rahtu, E. (2021). Taming visually guided sound generation. In: BMVC.","DOI":"10.5244\/C.35.336"},{"key":"2610_CR40","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.5143773","author":"G Ilharco","year":"2021","unstructured":"Ilharco, G., Wortsman, M., Wightman, R., et al. (2021). Openclip. 
https:\/\/doi.org\/10.5281\/zenodo.5143773","journal-title":"Openclip."},{"key":"2610_CR41","doi-asserted-by":"crossref","unstructured":"Jang, W., Lim, D., Yoon, J., et al. (2021). Univnet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. Interspeech, pp. 2207\u20132211.","DOI":"10.21437\/Interspeech.2021-1016"},{"key":"2610_CR42","doi-asserted-by":"crossref","unstructured":"Jiang, Y., Chen, Q., Ji, S., et al. (2025). Unicodec: Unified audio codec with single domain-adaptive codebook. ACL, pp 19112\u201319124.","DOI":"10.18653\/v1\/2025.acl-long.937"},{"key":"2610_CR43","doi-asserted-by":"crossref","unstructured":"Karaev, N., Rocco, I., Graham, B., et al. (2024). Cotracker: It is better to track together. ECCV, pp. 18\u201335.","DOI":"10.1007\/978-3-031-73033-7_2"},{"key":"2610_CR44","doi-asserted-by":"crossref","unstructured":"Kim, C. D., Moon, J., Moon, S., et al. (2025). Respec: Relevance and specificity grounded online filtering for learning on video-text data streams. CVPR, pp 29040\u201329049.","DOI":"10.1109\/CVPR52734.2025.02704"},{"key":"2610_CR45","unstructured":"Kim, J., Yun, H., & Kim, G. (2025). Visage: Video-to-spatial audio generation. In: ICLR."},{"key":"2610_CR46","doi-asserted-by":"crossref","unstructured":"Koizumi, Y., Zen, H., Karita, S., et al. (2023). Libritts-r: A restored multi-speaker text-to-speech corpus. Interspeech, pp. 5496\u20135500.","DOI":"10.21437\/Interspeech.2023-1584"},{"key":"2610_CR47","unstructured":"Kong, J., Kim, J., & Bae, J. (2020). Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. NeurIPS, pp 17022\u201317033."},{"key":"2610_CR48","doi-asserted-by":"crossref","unstructured":"Koutini, K., Schl\u00fcter, J., Eghbal-zadeh, H., et al. (2022). Efficient training of audio transformers with patchout. Interspeech, pp 2753\u20132757.","DOI":"10.21437\/Interspeech.2022-227"},{"key":"2610_CR49","unstructured":"Kreuk, F., Synnaeve, G., Polyak, A., et al. (2023). Audiogen: Textually guided audio generation. In: ICLR."},{"key":"2610_CR50","doi-asserted-by":"crossref","unstructured":"Kumar, R., Seetharaman, P., Luebs, A., et al. (2024). High-fidelity audio compression with improved rvqgan. NeurIPS, pp 27980\u201327993.","DOI":"10.52202\/075280-1214"},{"key":"2610_CR51","doi-asserted-by":"crossref","unstructured":"Kushwaha, S. S., Ma, J., Thomas, M. R., et al. (2025). Diff-sage: End-to-end spatial audio generation using diffusion models. ICASSP, pp 1\u20135.","DOI":"10.1109\/ICASSP49660.2025.10888882"},{"key":"2610_CR52","unstructured":"Lee, S. G., Ping, W., Ginsburg, B., et al. (2023). Bigvgan: A universal neural vocoder with large-scale training. In: ICLR."},{"key":"2610_CR53","doi-asserted-by":"crossref","unstructured":"Lee, Y., Yeon, I., Nam, J., et al. (2024). Voiceldm: Text-to-speech with environmental context. In: ICASSP, pp. 12566\u201312571.","DOI":"10.1109\/ICASSP48485.2024.10448268"},{"key":"2610_CR54","doi-asserted-by":"crossref","unstructured":"Li, X., Zhuo, F., Luo, D., et al. (2024). Generating stereophonic music with single-stage language models. ICASSP, pp. 1471\u20131475.","DOI":"10.1109\/ICASSP48485.2024.10446643"},{"key":"2610_CR55","doi-asserted-by":"crossref","unstructured":"Li, Z., Zhao, B., & Yuan, Y. (2024). Cyclic learning for binaural audio generation and localization. CVPR, pp 26669\u201326678.","DOI":"10.1109\/CVPR52733.2024.02518"},{"key":"2610_CR56","doi-asserted-by":"crossref","unstructured":"Lim, W., & Nam, J. (2024). 
Enhancing spatial audio generation with source separation and channel panning loss. ICASSP, pp 8321\u20138325.","DOI":"10.1109\/ICASSP48485.2024.10447970"},{"key":"2610_CR57","doi-asserted-by":"crossref","unstructured":"Lin, T. Y., Maire, M., Belongie, S., et al. (2014). Microsoft coco: Common objects in context. ECCV, pp 740\u2013755.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"2610_CR58","doi-asserted-by":"crossref","unstructured":"Lin, T. Y., Doll\u00e1r, P., Girshick, R., et al. (2017). Feature pyramid networks for object detection. CVPR, pp 2117\u20132125.","DOI":"10.1109\/CVPR.2017.106"},{"key":"2610_CR59","unstructured":"Liu, H., Chen, Z., Yuan, Y., et al. (2023). AudioLDM: Text-to-audio generation with latent diffusion models. ICML, pp 21450\u201321474."},{"key":"2610_CR60","doi-asserted-by":"crossref","unstructured":"Liu, H., Chen, K., Tian, Q., et al. (2024). Audiosr: Versatile audio super-resolution at scale. ICASSP, pp 1076\u20131080.","DOI":"10.1109\/ICASSP48485.2024.10447246"},{"key":"2610_CR61","doi-asserted-by":"crossref","unstructured":"Liu, M., Wang, J., Qian, X., et al. (2024). Visually guided binaural audio generation with cross-modal consistency. ICASSP, pp. 7980\u20137984.","DOI":"10.1109\/ICASSP48485.2024.10446399"},{"key":"2610_CR62","unstructured":"Liu, Y., Ott, M., Goyal, N., et al. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692."},{"key":"2610_CR63","doi-asserted-by":"crossref","unstructured":"Luo, S., Yan, C., Hu, C., et al. (2023). Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. NeurIPS, pp 48855\u201348876.","DOI":"10.52202\/075280-2121"},{"key":"2610_CR64","doi-asserted-by":"crossref","unstructured":"Mei, X., Nagaraja, V., Lan, G. L., et al. (2023). Foleygen: Visually-guided audio generation. arXiv:2309.10537.","DOI":"10.1109\/MLSP58920.2024.10734721"},{"key":"2610_CR65","unstructured":"Morgado, P., Vasconcelos, N., Langlois, T., et al. (2018). Self-supervised generation of spatial audio for 360 video. In: Bengio S, Wallach H, Larochelle H, et\u00a0al (eds) NeurIPS, pp 360\u2013370."},{"key":"2610_CR66","unstructured":"Morgado, P., Li, Y., & Vasconcelos, N. (2020). Learning representations from audio-visual spatial alignment. NeurIPS, pp 4733\u20134744."},{"key":"2610_CR67","doi-asserted-by":"crossref","unstructured":"Nguyen, H., & Willson, M. (2023). Spatial audio in youtube vr videos and its impacts on audience engagement. In: I3DA, pp. 1\u20135.","DOI":"10.1109\/I3DA57090.2023.10289616"},{"key":"2610_CR68","doi-asserted-by":"crossref","unstructured":"Owens, A., Isola, P., McDermott, J., et al. (2016). Visually indicated sounds. In: CVPR, pp 2405\u20132413.","DOI":"10.1109\/CVPR.2016.264"},{"key":"2610_CR69","doi-asserted-by":"crossref","unstructured":"Pascual, S., Yeh, C., Tsiamas, I., et al. (2024). Masked generative video-to-audio transformers with enhanced synchronicity. ECCV, pp 247\u2013264.","DOI":"10.1007\/978-3-031-73021-4_15"},{"key":"2610_CR70","doi-asserted-by":"crossref","unstructured":"Poeschl, S., Wall, K., & Doering, N. (2013). Integration of spatial sound in immersive virtual environments: an experimental study on effects of spatial sound on presence. IEEE VR, pp 129\u2013130.","DOI":"10.1109\/VR.2013.6549396"},{"key":"2610_CR71","unstructured":"Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. 
ICML, pp 8748\u20138763."},{"key":"2610_CR72","doi-asserted-by":"crossref","unstructured":"Rana, A., Ozcinar, C., & Smolic, A. (2019). Towards generating ambisonics using audio-visual cue for virtual reality. ICASSP, pp 2012\u20132016.","DOI":"10.1109\/ICASSP.2019.8683318"},{"key":"2610_CR73","unstructured":"Razavi, A., van den Oord, A., & Vinyals, O. (2019). Generating diverse high-fidelity images with vq-vae-2. NeurIPS, pp 14866\u201314876."},{"key":"2610_CR74","doi-asserted-by":"crossref","unstructured":"Ren, Y., Li, C., Xu, M., et al. (2025). Sta-v2a: Video-to-audio generation with semantic and temporal alignment. ICASSP, pp 1\u20135.","DOI":"10.1109\/ICASSP49660.2025.10890132"},{"key":"2610_CR75","doi-asserted-by":"crossref","unstructured":"Roblek, D., Kilgour, K., Sharifi, M., et al. (2019). Fr\u00e9chet audio distance: A reference-free metric for evaluating music enhancement algorithms. Interspeech, pp 2350\u20132354.","DOI":"10.21437\/Interspeech.2019-2219"},{"key":"2610_CR76","doi-asserted-by":"crossref","unstructured":"Sheffer, R., & Adi, Y. (2023). I hear your true colors: Image guided audio generation. ICASSP, pp 1\u20135.","DOI":"10.1109\/ICASSP49357.2023.10096023"},{"key":"2610_CR77","doi-asserted-by":"crossref","unstructured":"Shimada, K., Politis, A., Sudarsanam, P., et al. (2023). Starss23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. NeurIPS, pp 72931\u201372957.","DOI":"10.52202\/075280-3189"},{"key":"2610_CR78","unstructured":"Singer, U., Polyak, A., Hayes, T., et al. (2023). Make-a-video: Text-to-video generation without text-video data. ICLR."},{"key":"2610_CR79","doi-asserted-by":"crossref","unstructured":"Singh, M., Gustafson, L., Adcock, A., et al. (2022). Revisiting weakly supervised pre-training of visual perception models. CVPR, pp 804\u2013814.","DOI":"10.1109\/CVPR52688.2022.00088"},{"key":"2610_CR80","unstructured":"Sun, P., Cheng, S., Li, X., et al. (2025). Both ears wide open: Towards language-driven spatial audio generation. ICLR."},{"key":"2610_CR81","doi-asserted-by":"crossref","unstructured":"Vasudevan, A. B., Dai, D., & Van Gool, L. (2020). Semantic object prediction and spatial sound super-resolution with binaural sounds. ECCV, pp 638\u2013655.","DOI":"10.1007\/978-3-030-58548-8_37"},{"key":"2610_CR82","unstructured":"Wang, C., Chen, S., Wu, Y., et al. (2023). Neural codec language models are zero-shot text to speech synthesizers. arXiv:2301.02111."},{"key":"2610_CR83","doi-asserted-by":"crossref","unstructured":"Wang, H., Ma, J., Pascual, S., et al. (2024). V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models. AAAI, pp 48855\u201348876.","DOI":"10.1609\/aaai.v38i14.29475"},{"key":"2610_CR84","doi-asserted-by":"crossref","unstructured":"Wang, Y., Chen, Y., Yan, W., et al. (2024). Cliploss and norm-based data selection methods for multimodal contrastive learning. NeurIPS, pp 15028\u201315069.","DOI":"10.52202\/079017-0480"},{"key":"2610_CR85","doi-asserted-by":"crossref","unstructured":"Wu, H., Chung, H. L., Lin, Y. C., et al. (2024). Codec-superb: An in-depth analysis of sound codec models. arXiv:2402.13071.","DOI":"10.18653\/v1\/2024.findings-acl.616"},{"key":"2610_CR86","doi-asserted-by":"crossref","unstructured":"Wu, Y., Chen, K., Zhang, T., et al. (2023). Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. 
ICASSP, pp 1\u20135.","DOI":"10.1109\/ICASSP49357.2023.10095969"},{"key":"2610_CR87","doi-asserted-by":"crossref","unstructured":"Xu, H., Xie, S., Huang, P. Y., et al. (2023). Cit: Curation in training for effective vision-language data. ICCV, pp 15180\u201315189.","DOI":"10.1109\/ICCV51070.2023.01393"},{"key":"2610_CR88","doi-asserted-by":"crossref","unstructured":"Xu, X., Zhou, H., Liu, Z., et al. (2021). Visually informed binaural audio generation without binaural audios. CVPR, pp 15485\u201315494.","DOI":"10.1109\/CVPR46437.2021.01523"},{"key":"2610_CR89","doi-asserted-by":"crossref","unstructured":"Yang, J., Lee, J., Choi, H. S., et al. (2024). Dualspeech: Enhancing speaker-fidelity and text-intelligibility through dual classifier-free guidance. Interspeech, pp 4423\u20134427.","DOI":"10.21437\/Interspeech.2024-2005"},{"key":"2610_CR90","doi-asserted-by":"crossref","unstructured":"You, Y., Wu, X., & Qu, T. (2025). Ta-v2a: Textually assisted video-to-audio generation. ICASSP, pp 1\u20135.","DOI":"10.1109\/ICASSP49660.2025.10887573"},{"key":"2610_CR91","doi-asserted-by":"crossref","unstructured":"Yun, H., Yu, Y., Yang, W., et al. (2021). Pano-avqa: Grounded audio-visual question answering on 360deg videos. ICCV, pp 2031\u20132041.","DOI":"10.1109\/ICCV48922.2021.00204"},{"key":"2610_CR92","doi-asserted-by":"crossref","unstructured":"Yun, H., Lee, S., & Kim, G. (2022). Panoramic vision transformer for saliency detection in 360$$\\circ $$ videos. ECCV, pp 422\u2013439.","DOI":"10.1007\/978-3-031-19833-5_25"},{"key":"2610_CR93","first-page":"495","volume":"30","author":"N Zeghidour","year":"2021","unstructured":"Zeghidour, N., Luebs, A., Omran, A., et al. (2021). Soundstream: An end-to-end neural audio codec. IEEE TASLP, 30, 495\u2013507.","journal-title":"IEEE TASLP"},{"key":"2610_CR94","doi-asserted-by":"crossref","unstructured":"Zen, H., Dang, V., Clark, R., et al. (2019). Libritts: A corpus derived from librispeech for text-to-speech. Interspeech, pp 1526\u20131530.","DOI":"10.21437\/Interspeech.2019-2441"},{"key":"2610_CR95","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Xu, X., & Wu, M. (2025). Smooth-foley: Creating continuous sound for video-to-audio generation under semantic guidance. ICASSP, pp 1\u20135.","DOI":"10.1109\/ICASSP49660.2025.10890403"},{"key":"2610_CR96","doi-asserted-by":"crossref","unstructured":"Zhou, H., Xu, X., Lin, D., et al. (2020). Sep-stereo: Visually guided stereophonic audio generation by associating source separation. ECCV, pp 52\u201369.","DOI":"10.1007\/978-3-030-58610-2_4"},{"key":"2610_CR97","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Wang, Z., Fang, C., et al. (2018). Visual to sound: Generating natural sound for videos in the wild. CVPR, pp 3550\u20133558.","DOI":"10.1109\/CVPR.2018.00374"},{"key":"2610_CR98","unstructured":"Zhu, B., Lin, B., Ning, M., et al. (2024). Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. ICLR."},{"key":"2610_CR99","unstructured":"Ziv, A., Gat, I., Le Lan, G., et al. (2024). Masked audio generation using a single non-autoregressive transformer. ICLR."},{"key":"2610_CR100","doi-asserted-by":"crossref","unstructured":"Zotter, F., & Frank, M. (2019). Ambisonics: A practical 3D audio theory for recording, studio production, sound reinforcement, and virtual reality. 
Springer Nature.","DOI":"10.1007\/978-3-030-17207-7"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02610-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-025-02610-4","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02610-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,9]],"date-time":"2026-03-09T17:32:59Z","timestamp":1773077579000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-025-02610-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,9]]},"references-count":100,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2026,4]]}},"alternative-id":["2610"],"URL":"https:\/\/doi.org\/10.1007\/s11263-025-02610-4","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,9]]},"assertion":[{"value":"15 April 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 November 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 March 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"All resources are available via the links provided in the article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Data, materials, and code availability"}},{"value":"Not applicable","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"All authors agreed with the content and gave explicit consent to submit this article.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors have no competing interests to declare that are relevant to this article.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"184"}}