{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,10]],"date-time":"2026-06-10T10:02:08Z","timestamp":1781085728962,"version":"3.54.1"},"reference-count":132,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T00:00:00Z","timestamp":1769558400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T00:00:00Z","timestamp":1769558400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100007863","name":"Rice University","doi-asserted-by":"publisher","award":["0"],"award-info":[{"award-number":["0"]}],"id":[{"id":"10.13039\/100007863","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000145","name":"Division of Information and Intelligent Systems","doi-asserted-by":"publisher","award":["2201710"],"award-info":[{"award-number":["2201710"]}],"id":[{"id":"10.13039\/100000145","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2026,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    The scalability of ambient sound generators is hindered by data scarcity, insufficient caption quality, and limited scalability in model architecture. This work addresses these challenges by advancing both data and model scaling. 
First, we propose an efficient and scalable dataset collection pipeline tailored for\n                    <jats:italic>ambient<\/jats:italic>\n                    audio generation, resulting in AutoReCap-XL, the largest ambient audio-text dataset with over\n                    <jats:italic>47 million<\/jats:italic>\n                    clips. To provide high-quality textual annotations, we propose AutoCap, a\n                    <jats:italic>high-quality<\/jats:italic>\n                    automatic audio captioning model. By adopting a Q-Former module and leveraging audio metadata, AutoCap substantially enhances caption quality, reaching a CIDEr score of 83.2, a\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$3.2\\%$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:mrow>\n                            <mml:mn>3.2<\/mml:mn>\n                            <mml:mo>%<\/mml:mo>\n                          <\/mml:mrow>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    improvement over previous captioning models. Finally, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters. We demonstrate its benefits from data scaling with synthetic captions as well as model size scaling. 
When compared to baseline audio generators\n                    <jats:italic>trained at similar size and data scale<\/jats:italic>\n                    , GenAu obtains significant improvements of\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$4.7\\%$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:mrow>\n                            <mml:mn>4.7<\/mml:mn>\n                            <mml:mo>%<\/mml:mo>\n                          <\/mml:mrow>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    in FAD score,\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$22.65\\%$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:mrow>\n                            <mml:mn>22.65<\/mml:mn>\n                            <mml:mo>%<\/mml:mo>\n                          <\/mml:mrow>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    in IS, and\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$13.5\\%$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:mrow>\n                            <mml:mn>13.5<\/mml:mn>\n                            <mml:mo>%<\/mml:mo>\n                          <\/mml:mrow>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    in CLAP score. 
Our code, model checkpoints, and dataset are\n                    <jats:italic>publicly available<\/jats:italic>\n                    .\n                  <\/jats:p>","DOI":"10.1007\/s11263-025-02632-y","type":"journal-article","created":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T13:39:00Z","timestamp":1769607540000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Taming Data and Transformers for Audio Generation"],"prefix":"10.1007","volume":"134","author":[{"given":"Moayed","family":"Haji-Ali","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Willi","family":"Menapace","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Aliaksandr","family":"Siarohin","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Guha","family":"Balakrishnan","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-0279-5275","authenticated-orcid":false,"given":"Vicente","family":"Ordonez","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2026,1,28]]},"reference":[{"key":"2632_CR1","unstructured":"Achiam, J., Adler, S., & Agarwal, S. (2023). arXiv:2303.08774 GPT-4 Technical Report."},{"key":"2632_CR2","unstructured":"BBC Sound Effects (2024) Bbc sound effects archive. https:\/\/sound-effects.bbcrewind.co.uk\/, accessed: 2024-10-01"},{"key":"2632_CR3","unstructured":"Chen, G., Wang, G., & Huang, X., et\u00a0al, (2024a). Semantically consistent video-to-audio generation using multimodal language large model. arXiv:2404.16305"},{"key":"2632_CR4","doi-asserted-by":"crossref","unstructured":"Chen, H., Xie, W., & Vedaldi, A. (2020). Vggsound: A large-scale audio-visual dataset. 
IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP.","DOI":"10.1109\/ICASSP40776.2020.9053174"},{"key":"2632_CR5","doi-asserted-by":"crossref","unstructured":"Chen, K., Du, X., & Zhu, B. (2022). Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection. IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP.","DOI":"10.31219\/osf.io\/d264y"},{"key":"2632_CR6","unstructured":"Chen, T., & Li, L. (2023). Fit: Far-reaching interleaved transformers arXiv:2305.12689."},{"key":"2632_CR7","doi-asserted-by":"crossref","unstructured":"Chen, T.S., Siarohin, A., & Menapace, W., et\u00a0al, (2024b). Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In: International Conference on Computer Vision and Pattern Recognition (CVPR).","DOI":"10.1109\/CVPR52733.2024.01265"},{"key":"2632_CR8","doi-asserted-by":"crossref","unstructured":"Chen, W., Ma, Z., Li, X., et\u00a0al. (2025). Slam-aac: Enhancing audio captioning with paraphrasing augmentation and clap-refine through llms. ICASSP 2025\u20132025 IEEE International Conference on Acoustics (pp. 1\u20135). Speech and Signal Processing (ICASSP): IEEE.","DOI":"10.1109\/ICASSP49660.2025.10889071"},{"key":"2632_CR9","doi-asserted-by":"crossref","unstructured":"Cheng, Y. C., Lee, H. Y., & Tulyakov, S. (2023). SDFusion: Multimodal 3d shape completion, reconstruction, and generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 4456\u20134465)","DOI":"10.1109\/CVPR52729.2023.00433"},{"key":"2632_CR10","unstructured":"Cheng, Z., Leng, S., & Zhang, H. (2024). Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms arXiv:2406.07476."},{"issue":"70","key":"2632_CR11","first-page":"1","volume":"25","author":"HW Chung","year":"2024","unstructured":"Chung, H. W., Hou, L., Longpre, S., et al. (2024). Scaling instruction-finetuned language models. 
Journal of Machine Learning Research, 25(70), 1\u201353.","journal-title":"Journal of Machine Learning Research"},{"key":"2632_CR12","unstructured":"Cousin, M., Labb\u00e9, E., & Pellegrini, T. (2023). Multilingual audio captioning using machine translated data. https:\/\/hal.science\/hal-04220315, HAL Id: hal-04220315"},{"key":"2632_CR13","doi-asserted-by":"crossref","unstructured":"Deshmukh, S., Elizalde, B., & Singh, R., et\u00a0al, (2023a). Pengi: An audio language model for audio tasks. In: Thirty-seventh Conference on Neural Information Processing Systems, https:\/\/openreview.net\/forum?id=gJLAfO4KUq","DOI":"10.52202\/075280-0795"},{"key":"2632_CR14","doi-asserted-by":"publisher","unstructured":"Deshmukh, S., Elizalde, B., & Wang, H. (2023b). Audio retrieval with wavtext5k and clap training. In: Interspeech 2023, pp 2948\u20132952, https:\/\/doi.org\/10.21437\/Interspeech.2023-1136","DOI":"10.21437\/Interspeech.2023-1136"},{"key":"2632_CR15","doi-asserted-by":"crossref","unstructured":"Deshmukh, S., Elizalde, B., & Emmanouilidou, D., et\u00a0al. (2024). Training audio captioning models without audio. ICASSP 2024\u20132024 IEEE International Conference on Acoustics (pp. 371\u2013375). Speech and Signal Processing (ICASSP): IEEE.","DOI":"10.1109\/ICASSP48485.2024.10448115"},{"key":"2632_CR16","doi-asserted-by":"publisher","unstructured":"Deshmukh, S., Singh, R., & Raj, B. (2024b). Domain adaptation for contrastive audio-language models. In: Interspeech 2024, pp 1680\u20131684, https:\/\/doi.org\/10.21437\/Interspeech.2024-41","DOI":"10.21437\/Interspeech.2024-41"},{"key":"2632_CR17","doi-asserted-by":"crossref","unstructured":"Drossos, K., Lipping, S., & Virtanen, T. (2020). Clotho: an audio captioning dataset. IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP.","DOI":"10.1109\/ICASSP40776.2020.9052990"},{"key":"2632_CR18","doi-asserted-by":"crossref","unstructured":"Elizalde, B., Deshmukh, S., & Al Ismail, M. (2023). 
Clap learning audio concepts from natural language supervision. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.","DOI":"10.1109\/ICASSP49357.2023.10095889"},{"key":"2632_CR19","doi-asserted-by":"crossref","unstructured":"Elizalde, B., Deshmukh, S., & Wang, H. (2024). Natural language supervision for general-purpose audio representations. ICASSP 2024\u20132024 IEEE International Conference on Acoustics (pp. 336\u2013340). Speech and Signal Processing (ICASSP): IEEE.","DOI":"10.1109\/ICASSP48485.2024.10448504"},{"key":"2632_CR20","doi-asserted-by":"crossref","unstructured":"Esser, P., Rombach, R., & Ommer, B. (2021). Taming transformers for high-resolution image synthesis. Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (pp. 12873\u201312883)","DOI":"10.1109\/CVPR46437.2021.01268"},{"key":"2632_CR21","unstructured":"Esser, P., Kulal, S., & Blattmann, A. (2024). Scaling rectified flow transformers for high-resolution image synthesis. Forty-first international conference on machine learning"},{"key":"2632_CR22","unstructured":"Evans, Z., Carr, C., & Taylor, J., et\u00a0al, (2024a). Fast timing-conditioned latent audio diffusion. In: International Conference on Machine Learning (ICML)."},{"key":"2632_CR23","unstructured":"Evans, Z., Parker, J.D.,& Carr, C., et\u00a0al, (2024b). Stable audio open. arXiv:2407.14358"},{"key":"2632_CR24","doi-asserted-by":"publisher","unstructured":"Font, F., Roma, G.,& Serra, X. (2013). Freesound technical demo. In: Proceedings of the 21st ACM International Conference on Multimedia. Association for Computing Machinery, New York, NY, USA, MM \u201913, p 411-412, https:\/\/doi.org\/10.1145\/2502081.2502245,","DOI":"10.1145\/2502081.2502245"},{"key":"2632_CR25","doi-asserted-by":"crossref","unstructured":"Gemmeke, J. F., Ellis, D. P. W., & Freedman, D. (2017). Audio set: An ontology and human-labeled dataset for audio events. 
IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP.","DOI":"10.1109\/ICASSP.2017.7952261"},{"key":"2632_CR26","doi-asserted-by":"crossref","unstructured":"Ghosal, D., Majumder, N., & Mehrish, A. (2023). Text-to-audio generation using instruction guided latent diffusion model. Proceedings of the 31st ACM International Conference on Multimedia (pp. 3590\u20133598)","DOI":"10.1145\/3581783.3612348"},{"key":"2632_CR27","doi-asserted-by":"crossref","unstructured":"Ghosh, S., Kumar, S., Evuru, C. K. R., et\u00a0al. (2024). Recap: Retrieval-augmented audio captioning. ICASSP 2024\u20132024 IEEE International Conference on Acoustics (pp. 1161\u20131165). Speech and Signal Processing (ICASSP): IEEE.","DOI":"10.1109\/ICASSP48485.2024.10448030"},{"key":"2632_CR28","doi-asserted-by":"crossref","unstructured":"Ghosh, S., Kumar, S., & Seth, A., et\u00a0al, (2024b). GAMA: A large audio-language model with advanced audio understanding and complex reasoning abilities. In: Al-Onaizan Y, Bansal M, Chen YN (eds) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Miami, Florida, USA, pp 6288\u20136313, https:\/\/aclanthology.org\/2024.emnlp-main.361","DOI":"10.18653\/v1\/2024.emnlp-main.361"},{"key":"2632_CR29","unstructured":"Gong, Y., Luo, H., & Liu, A.H., et\u00a0al, (2024). Listen, think, and understand. In: The Twelfth International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=nBZBPXdJlC"},{"key":"2632_CR30","unstructured":"Gontier, F., Serizel, R.,& Cerisara, C. (2021). Automated audio captioning by fine-tuning bart with audioset tags. In: DCASE 2021 - 6th Workshop on Detection and Classification of Acoustic Scenes and Events."},{"key":"2632_CR31","doi-asserted-by":"publisher","unstructured":"Guan, W., Wang, K., Zhou, W., et\u00a0al. (2024). Lafma: A latent flow matching model for text-to-audio generation. Interspeech,2024, 4813\u20134817. 
https:\/\/doi.org\/10.21437\/Interspeech.2024-1848","DOI":"10.21437\/Interspeech.2024-1848"},{"key":"2632_CR32","doi-asserted-by":"crossref","unstructured":"Guo, Z., Mao, J., & Tao, R. (2024). Audio generation with multiple conditional diffusion model. Proceedings of the AAAI Conference on Artificial Intelligence (pp. 18153\u201318161)","DOI":"10.1609\/aaai.v38i16.29773"},{"key":"2632_CR33","doi-asserted-by":"crossref","unstructured":"Gupta, A., Yu, L., & Sohn, K. (2024). Photorealistic video generation with diffusion models. European Conference on Computer Vision (pp. 393\u2013411). Springer.","DOI":"10.1007\/978-3-031-72986-7_23"},{"key":"2632_CR34","doi-asserted-by":"crossref","unstructured":"Hai, J., Xu, Y., & Zhang, H. (2024). Ezaudio: Enhancing text-to-audio generation with efficient diffusion transformer arXiv:2409.10819.","DOI":"10.21437\/Interspeech.2025-1137"},{"key":"2632_CR35","doi-asserted-by":"crossref","unstructured":"Haji-Ali, M., Balakrishnan, G., & Ordonez, V. (2024a). Elasticdiffusion: Training-free arbitrary size image generation through global-local content separation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 6603\u20136612.","DOI":"10.1109\/CVPR52733.2024.00631"},{"key":"2632_CR36","doi-asserted-by":"crossref","unstructured":"Haji-Ali, M., Menapace, W.,& Siarohin, A., et\u00a0al, (2024b). Av-link: Temporally-aligned diffusion features for cross-modal audio-video generation. arXiv:2412.15191","DOI":"10.1109\/ICCV51701.2025.01801"},{"key":"2632_CR37","unstructured":"Hayakawa, A., Ishii, M., & Shibuya, T., et\u00a0al, (2025). MMDisco: Multi-modal discriminator-guided cooperative diffusion for joint audio and video generation. In: The Thirteenth International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=agbiPPuSeQ"},{"key":"2632_CR38","doi-asserted-by":"crossref","unstructured":"Hershey, S., Chaudhuri, S., & Ellis, D. P. W. (2017). 
Cnn architectures for large-scale audio classification. IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP.","DOI":"10.1109\/ICASSP.2017.7952132"},{"key":"2632_CR39","first-page":"6840","volume":"33","author":"J Ho","year":"2020","unstructured":"Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840\u20136851.","journal-title":"Advances in neural information processing systems"},{"key":"2632_CR40","doi-asserted-by":"crossref","unstructured":"Ho, J., Chan, W., & Saharia, C. (2022). Imagen video: High definition video generation with diffusion models arXiv:2210.02303.","DOI":"10.52202\/068431-0628"},{"key":"2632_CR41","unstructured":"Huang, J., Ren, Y., & Huang, R., et\u00a0al, (2023a). Make-an-audio 2: Temporal-enhanced text-to-audio generation. arXiv:2305.18474"},{"key":"2632_CR42","unstructured":"Huang, R., Huang, J., & Yang, D., et\u00a0al, (2023b). Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. Proceedings of the 40th International Conference on Machine Learning (ICML)."},{"key":"2632_CR43","unstructured":"Kadl\u010d\u00edk, M., H\u00e1jek, A., & Kieslich, J. (2023). A whisper transformer for audio captioning trained with synthetic captions and transfer learning arXiv:2305.09690."},{"key":"2632_CR44","doi-asserted-by":"publisher","unstructured":"Kilgour, K., Zuluaga, M., Roblek, D., et\u00a0al. (2019). Fr\u00e9chet audio distance: A reference-free metric for evaluating music enhancement algorithms. Interspeech,2019, 2350\u20132354. https:\/\/doi.org\/10.21437\/Interspeech.2019-2219","DOI":"10.21437\/Interspeech.2019-2219"},{"key":"2632_CR45","unstructured":"Kim, C.D., Kim, B., & Lee, H., et\u00a0al, (2019). Audiocaps: Generating captions for audios in the wild. In: NAACL-HLT."},{"key":"2632_CR46","unstructured":"Kim, E., Kim, J., & Oh, Y., et\u00a0al, (2022). 
Exploring train and test-time augmentations for audio-language learning. arXiv:2210.17143"},{"key":"2632_CR47","unstructured":"Kim, G., Martinez, A., & Su, Y.C., et\u00a0al, (2024a). A versatile diffusion transformer with mixture of noise levels for audiovisual generation. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems, https:\/\/openreview.net\/forum?id=cs1HISJkLU"},{"key":"2632_CR48","unstructured":"Kim, J., Jeon, M., & Jung, J., et\u00a0al, (2024b). Enclap++: Analyzing the enclap framework for optimizing automated audio captioning performance. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024), Tokyo, Japan, pp 61\u201365."},{"key":"2632_CR49","doi-asserted-by":"crossref","unstructured":"Kim, J., Jung, J., Lee, J., et\u00a0al, (2024c). Enclap: Combining neural audio codec and audio-text joint embedding for automated audio captioning. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).","DOI":"10.1109\/ICASSP48485.2024.10446672"},{"key":"2632_CR50","doi-asserted-by":"crossref","unstructured":"Kong, Q., Cao, Y., & Iqbal, T., et\u00a0al, (2019). Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE\/ACM Transactions on Audio, Speech, and Language Processing.","DOI":"10.1109\/TASLP.2020.3030497"},{"key":"2632_CR51","unstructured":"Kong, Z., Goel, A., & Badlani, R. (2024). Audio flamingo: a novel audio language model with few-shot learning and dialogue abilities. Proceedings of the 41st International Conference on Machine Learning. JMLR.org, ICML\u201924"},{"key":"2632_CR52","unstructured":"Kreuk, F., Synnaeve, G., & Polyak, A. (2023). Audiogen: Textually guided audio generation. The Eleventh International Conference on Learning Representations"},{"key":"2632_CR53","doi-asserted-by":"crossref","unstructured":"Labb\u00e9, E., Pellegrini, T., Pinquier, J., et\u00a0al, (2024). 
Conette: An efficient audio captioning system leveraging multiple datasets with task embedding. IEEE\/ACM Transactions on Audio, Speech, and Language Processing.","DOI":"10.1109\/TASLP.2024.3430813"},{"key":"2632_CR54","unstructured":"Labb\u00e9, E., Pellegrini, T., & Pinquier, J. (2023). Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval? arXiv:2308.15090."},{"key":"2632_CR55","doi-asserted-by":"crossref","unstructured":"Lavie, A., & Agarwal, A. (2007). Meteor: an automatic metric for mt evaluation with high levels of correlation with human judgments. Proceedings of the Second Workshop on Statistical Machine Translation","DOI":"10.3115\/1626355.1626389"},{"key":"2632_CR56","doi-asserted-by":"crossref","unstructured":"Lee, S., Chung, J., & Yu, Y. (2021). Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning. Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 10274\u201310284)","DOI":"10.1109\/ICCV48922.2021.01011"},{"key":"2632_CR57","unstructured":"Lee, S. G., Ping, W., & Ginsburg, B., et\u00a0al, (2023). BigVGAN: A universal neural vocoder with large-scale training. In: The Eleventh International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=iTtGCMDEzS_"},{"key":"2632_CR58","doi-asserted-by":"crossref","unstructured":"Lewis, M., Liu, Y., & Goyal, N. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"2632_CR59","unstructured":"Li, J., Li, D., & Savarese, S., et\u00a0al, (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. 
In: International Conference on Machine Learning (ICML)."},{"key":"2632_CR60","doi-asserted-by":"crossref","unstructured":"Li, Y., Wang, X., & Liu, H. (2024). Audio-free prompt tuning for language-audio models. ICASSP 2024\u20132024 IEEE International Conference on Acoustics (pp. 491\u2013495). Speech and Signal Processing (ICASSP): IEEE.","DOI":"10.1109\/ICASSP48485.2024.10446472"},{"key":"2632_CR61","unstructured":"Liang, J., Zhang, H., & Liu, H., et\u00a0al, (2024). Wavcraft: Audio editing and generation with large language models. arXiv:2403.09527"},{"key":"2632_CR62","unstructured":"Lin, C. Y. (2004). Rouge: A package for automatic evaluation of summaries. Text summarization branches out (pp. 74\u201381)"},{"key":"2632_CR63","unstructured":"Liu, H., Chen, Z., & Yuan, Y., et\u00a0al, (2023a). Audioldm: Text-to-audio generation with latent diffusion models. Proceedings of the 40th International Conference on Machine Learning (ICML)."},{"key":"2632_CR64","doi-asserted-by":"crossref","unstructured":"Liu, H., Chen, K., & Tian, Q., et\u00a0al. (2024). Audiosr: Versatile audio super-resolution at scale. ICASSP 2024\u20132024 IEEE International Conference on Acoustics (pp. 1076\u20131080). Speech and Signal Processing (ICASSP): IEEE.","DOI":"10.1109\/ICASSP48485.2024.10447246"},{"key":"2632_CR65","doi-asserted-by":"publisher","unstructured":"Liu, H., Huang, R., & Liu, Y., et\u00a0al, (2024b). Audiolcm: Efficient and high-quality text-to-audio generation with minimal inference steps. In: Proceedings of the 32nd ACM International Conference on Multimedia. Association for Computing Machinery, New York, NY, USA, MM \u201924, p 7008\u20137017, https:\/\/doi.org\/10.1145\/3664647.3681072,","DOI":"10.1145\/3664647.3681072"},{"issue":"2871","key":"2632_CR66","doi-asserted-by":"publisher","first-page":"2883","DOI":"10.1109\/TASLP.2024.3399607","volume":"32","author":"H Liu","year":"2024","unstructured":"Liu, H., Yuan, Y., Liu, X., et al. (2024). 
Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE\/ACM Trans Audio, Speech and Lang Proc, 32(2871), 2883. https:\/\/doi.org\/10.1109\/TASLP.2024.3399607","journal-title":"IEEE\/ACM Trans Audio, Speech and Lang Proc"},{"key":"2632_CR67","doi-asserted-by":"crossref","unstructured":"Liu, S., Zhu, Z., & Ye, N. (2017). Improved image captioning via policy gradient optimization of spider. IEEE International Conference on Computer Vision (ICCV)","DOI":"10.1109\/ICCV.2017.100"},{"key":"2632_CR68","doi-asserted-by":"crossref","unstructured":"Liu, X., Huang, Q., & Mei, X., et\u00a0al, (2023b). Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention. In: Proc. INTERSPEECH 2023","DOI":"10.21437\/Interspeech.2023-914"},{"key":"2632_CR69","doi-asserted-by":"crossref","unstructured":"Liu, X., Kong, Q., & Zhao, Y., et\u00a0al, (2024d). Separate anything you describe. IEEE\/ACM Transactions on Audio, Speech, and Language Processing.","DOI":"10.1109\/TASLP.2024.3520017"},{"key":"2632_CR70","unstructured":"Liu, Y., Ott, M., & Goyal, N. (2020). RoBERTa: A robustly optimized bert pretraining approach. International Conference on Learning Representations (ICLR)"},{"key":"2632_CR71","doi-asserted-by":"crossref","unstructured":"Mahfuz, R., Guo, Y., & Visser, E. (2023). Improving audio captioning using semantic similarity metrics. ICASSP 2023\u20132023 IEEE International Conference on Acoustics (pp. 1\u20135). Speech and Signal Processing (ICASSP): IEEE.","DOI":"10.1109\/ICASSP49357.2023.10096522"},{"key":"2632_CR72","doi-asserted-by":"crossref","unstructured":"Majumder, N., Hung, C. Y., & Ghosal, D. (2024). Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization. Proceedings of the 32nd ACM International Conference on Multimedia (pp. 564\u2013572)","DOI":"10.1145\/3664647.3681688"},{"key":"2632_CR73","doi-asserted-by":"crossref","unstructured":"Mao, Y., Shen, X., & Zhang, J. (2024). 
Tavgbench: Benchmarking text to audible-video generation. Proceedings of the 32nd ACM International Conference on Multimedia (pp. 6607\u20136616)","DOI":"10.1145\/3664647.3680612"},{"key":"2632_CR74","doi-asserted-by":"crossref","unstructured":"Mei, X., Meng, C., & Liu, H., et\u00a0al, (2024a). Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE\/ACM Transactions on Audio, Speech, and Language Processing.","DOI":"10.1109\/TASLP.2024.3419446"},{"key":"2632_CR75","doi-asserted-by":"crossref","unstructured":"Mei, X., Nagaraja, V., & Le\u00a0Lan, G., et\u00a0al, (2024b). Foleygen: Visually-guided audio generation. In: 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, pp 1\u20136.","DOI":"10.1109\/MLSP58920.2024.10734721"},{"key":"2632_CR76","doi-asserted-by":"crossref","unstructured":"Melechovsky, J., Guo, Z., & Ghosal, D., et\u00a0al, (2024). Mustango: Toward controllable text-to-music generation. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp 8286\u20138309.","DOI":"10.18653\/v1\/2024.naacl-long.459"},{"key":"2632_CR77","doi-asserted-by":"crossref","unstructured":"Menapace, W., Siarohin, A., & Skorokhodov, I. (2024). Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 7038\u20137048)","DOI":"10.1109\/CVPR52733.2024.00672"},{"key":"2632_CR78","doi-asserted-by":"crossref","unstructured":"Miech, A., Zhukov, D., & Alayrac, J. B. (2019). HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. Proceedings of the IEEE International Conference on Computer Vision (ICCV)","DOI":"10.1109\/ICCV.2019.00272"},{"key":"2632_CR79","doi-asserted-by":"crossref","unstructured":"Nagrani, A., Seo, P. H., & Seybold, B. 
(2022). Learning audio-video modalities from image captions. European Conference on Computer Vision (pp. 407\u2013426). Springer.","DOI":"10.1007\/978-3-031-19781-9_24"},{"key":"2632_CR80","doi-asserted-by":"crossref","unstructured":"Niu, X., Zhang, J.,& Walder, C., et\u00a0al. (2024). Soundlocd: An efficient conditional discrete contrastive latent diffusion model for text-to-sound generation. ICASSP 2024\u20132024 IEEE International Conference on Acoustics (pp. 261\u2013265). Speech and Signal Processing (ICASSP): IEEE.","DOI":"10.1109\/ICASSP48485.2024.10446349"},{"key":"2632_CR81","doi-asserted-by":"publisher","unstructured":"Paissan, F., Della Libera, L., Wang, Z., et\u00a0al. (2024). Audio editing with non-rigid text prompts. Interspeech,2024, 3290\u20133294. https:\/\/doi.org\/10.21437\/Interspeech.2024-636","DOI":"10.21437\/Interspeech.2024-636"},{"key":"2632_CR82","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., & Ward, T. (2002). Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics","DOI":"10.3115\/1073083.1073135"},{"key":"2632_CR83","doi-asserted-by":"crossref","unstructured":"Peebles, W., & Xie, S. (2023). Scalable diffusion models with transformers. Proceedings of the IEEE\/CVF international conference on computer vision (pp. 4195\u20134205)","DOI":"10.1109\/ICCV51070.2023.00387"},{"key":"2632_CR84","unstructured":"Podell, D., English, Z., & Lacey, K., et\u00a0al, (2024). SDXL: Improving latent diffusion models for high-resolution image synthesis. In: The Twelfth International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=di52zR8xgf"},{"key":"2632_CR85","unstructured":"Qiu, H., Xia, M.,& Zhang, Y., et\u00a0al, (2024). Freenoise: Tuning-free longer video diffusion via noise rescheduling. 
In: The Twelfth International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=ijoqFqSC7p"},{"key":"2632_CR86","unstructured":"Raffel, C., Shazeer, N., & Roberts, A. (2022). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, JMLR."},{"key":"2632_CR87","unstructured":"Ramesh, A., Dhariwal, P., & Nichol, A., et\u00a0al, (2022). Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125 1(2):3"},{"key":"2632_CR88","doi-asserted-by":"publisher","unstructured":"Rho, K., Lee, H.,& Iverson, V., et\u00a0al, (2025). Lavcap: Llm-based audio-visual captioning using optimal transport. In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 1\u20135, https:\/\/doi.org\/10.1109\/ICASSP49660.2025.10888241","DOI":"10.1109\/ICASSP49660.2025.10888241"},{"key":"2632_CR89","doi-asserted-by":"crossref","unstructured":"Rombach, R., Blattmann, A., & Lorenz, D. (2022). High-resolution image synthesis with latent diffusion models. IEEE\/CVF conference on computer vision and pattern recognition (CVPR)","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"2632_CR90","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention, MICCAI.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"2632_CR91","unstructured":"Saito, K., Kim, D., & Shibuya, T., et\u00a0al, (2024). SoundCTM: Uniting score-based and consistency models for text-to-sound generation. In: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, https:\/\/openreview.net\/forum?id=MZT5hVsMOH"},{"key":"2632_CR92","unstructured":"Shi, Y., Lan, G. L., & Nagaraja, V. (2023). 
Enhance audio generation controllability through representation similarity regularization. arXiv:2309.08773."},{"key":"2632_CR93","doi-asserted-by":"publisher","unstructured":"Shi, Z., Zhou, X., & Qiu, X., et\u00a0al. (2020). Improving image captioning with better use of caption. In: Jurafsky D, Chai J, Schluter N, et\u00a0al (eds) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pp 7454\u20137464, https:\/\/doi.org\/10.18653\/v1\/2020.acl-main.664, https:\/\/aclanthology.org\/2020.acl-main.664\/","DOI":"10.18653\/v1\/2020.acl-main.664"},{"key":"2632_CR94","unstructured":"Shu, F., Zhang, L., & Jiang, H. (2023). Audio-visual llm for video understanding. arXiv:2312.06720."},{"key":"2632_CR95","unstructured":"Song, J., Meng, C., & Ermon, S. (2021). Denoising diffusion implicit models. International Conference on Learning Representations (ICLR)"},{"key":"2632_CR96","unstructured":"SoundBible. (2024). Free sound effects. https:\/\/soundbible.com\/, accessed: 2024-10-01"},{"key":"2632_CR97","doi-asserted-by":"crossref","unstructured":"Sridhar, A. K., Guo, Y., & Visser, E., et\u00a0al. (2024). Parameter efficient audio captioning with faithful guidance using audio-text shared latent representation. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1181\u20131185). IEEE.","DOI":"10.1109\/ICASSP48485.2024.10448154"},{"key":"2632_CR98","doi-asserted-by":"crossref","unstructured":"Sun, L., Xu, X., & Wu, M. (2024). Auto-acd: A large-scale dataset for audio-language representation learning. Proceedings of the 32nd ACM International Conference on Multimedia (pp. 5025\u20135034)","DOI":"10.1145\/3664647.3681472"},{"key":"2632_CR99","unstructured":"Tang, C., Yu, W., & Sun, G., et\u00a0al. (2024a). SALMONN: Towards generic hearing abilities for large language models. 
In: The Twelfth International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=14rn7HpKVk"},{"key":"2632_CR100","first-page":"16083","volume":"36","author":"Z Tang","year":"2023","unstructured":"Tang, Z., Yang, Z., Zhu, C., et al. (2023). Any-to-any generation via composable diffusion. Advances in Neural Information Processing Systems, 36, 16083\u201316099.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2632_CR101","doi-asserted-by":"crossref","unstructured":"Tang, Z., Yang, Z.,& Khademi, M., et\u00a0al, (2024b). Codi-2: In-context interleaved and interactive any-to-any generation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 27425\u201327434.","DOI":"10.1109\/CVPR52733.2024.02589"},{"key":"2632_CR102","doi-asserted-by":"crossref","unstructured":"Tian, Z., Liu, Z., & Yuan, R. (2025). Vidmuse: A simple video-to-music generation framework with long-short-term modeling. Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 18782\u201318793)","DOI":"10.1109\/CVPR52734.2025.01750"},{"key":"2632_CR103","doi-asserted-by":"crossref","unstructured":"Vahdati, D. S., Nguyen, T. D., & Azizpour, A. (2024). Beyond deepfake images: Detecting ai-generated videos. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 4397\u20134408)","DOI":"10.1109\/CVPRW63382.2024.00443"},{"key":"2632_CR104","unstructured":"Vandchali, M. A., & Kyrillidis, A. (2025). One rank at a time: Cascading error dynamics in sequential learning arXiv:2505.22602."},{"key":"2632_CR105","unstructured":"Vaswani, A., Shazeer, N., & Parmar, N. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS)"},{"key":"2632_CR106","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Zitnick, C., & Parikh, D. (2015). Cider: Consensus-based image description evaluation. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"2632_CR107","unstructured":"Villegas, R., Babaeizadeh, M., & Kindermans, P. J. (2022). Phenaki: Variable length video generation from open domain textual descriptions. International Conference on Learning Representations"},{"key":"2632_CR108","unstructured":"Vyas, A., Shi, B., L., & M. (2023). Audiobox: Unified audio generation with natural language prompts arXiv:2312.15821 arXiv preprint."},{"key":"2632_CR109","doi-asserted-by":"crossref","unstructured":"Wang, H., Ma, J.,& Pascual, S., et\u00a0al, (2024a). V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 15492\u201315501.","DOI":"10.1609\/aaai.v38i14.29475"},{"key":"2632_CR110","unstructured":"Wang, K., Deng, S.,& Shi, J., et\u00a0al, (2024b). AV-dit: Efficient audio-visual diffusion transformer for joint audio and video generation. In: Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, https:\/\/openreview.net\/forum?id=FE6zflN5G5"},{"key":"2632_CR111","doi-asserted-by":"crossref","unstructured":"Wang, W., Lv, Q., & Yu, W., et\u00a0al, (2024c). Cogvlm: Visual expert for pretrained language models. In: Globerson A, Mackey L, Belgrave D, et\u00a0al (eds) Advances in Neural Information Processing Systems, vol\u00a037. Curran Associates, Inc., pp 121475\u2013121499, https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2024\/file\/dc06d4d2792265fb5454a6092bfd5c6a-Paper-Conference.pdf","DOI":"10.52202\/079017-3860"},{"key":"2632_CR112","unstructured":"Wang, Y., Chen, X., & Ma, X., et\u00a0al, (2024d). Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision pp 1\u201320."},{"key":"2632_CR113","unstructured":"Wang, Y., Guo, W., &Huang, R., et\u00a0al .(2024e). 
Frieren: Efficient video-to-audio generation network with rectified flow matching. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems, https:\/\/openreview.net\/forum?id=prXfM5X2Db"},{"key":"2632_CR114","doi-asserted-by":"crossref","unstructured":"Wu, S. L., Chang, X., Wichern, G., et\u00a0al. (2024). Improving audio captioning models with fine-grained audio features, text embedding supervision, and llm mix-up augmentation. ICASSP 2024\u20132024 IEEE International Conference on Acoustics (pp. 316\u2013320). Speech and Signal Processing (ICASSP): IEEE.","DOI":"10.1109\/ICASSP48485.2024.10447215"},{"key":"2632_CR115","doi-asserted-by":"crossref","unstructured":"Wu, Y., Chen, K., & Zhang, T., et\u00a0al, (2023a). Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).","DOI":"10.1109\/ICASSP49357.2023.10095969"},{"key":"2632_CR116","doi-asserted-by":"crossref","unstructured":"Wu, Y., Chen, K.,& Zhang, T., et\u00a0al, (2023b). Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).","DOI":"10.1109\/ICASSP49357.2023.10095969"},{"key":"2632_CR117","doi-asserted-by":"crossref","unstructured":"Xing, Y., He, Y., & Tian, Z. (2024). Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 7151\u20137161)","DOI":"10.1109\/CVPR52733.2024.00683"},{"key":"2632_CR118","unstructured":"Xu, M., Li, C., & Zhang, D., et\u00a0al, (2024a). Prompt-guided precise audio editing with diffusion models. In: Proceedings of the 41st International Conference on Machine Learning. 
JMLR.org, ICML\u201924."},{"key":"2632_CR119","doi-asserted-by":"crossref","unstructured":"Xu, Y., Chen, H., & Yu, J., et\u00a0al. (2024b). Secap: Speech emotion captioning with large language model. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 19323\u201319331.","DOI":"10.1609\/aaai.v38i17.29902"},{"key":"2632_CR120","doi-asserted-by":"crossref","unstructured":"Xue, H., Hang, T., & Zeng, Y. (2022). Advancing high-resolution video-language representation with large-scale video transcriptions. International Conference on Computer Vision and Pattern Recognition (CVPR)","DOI":"10.1109\/CVPR52688.2022.00498"},{"key":"2632_CR121","doi-asserted-by":"crossref","unstructured":"Xue, J., Deng, Y., & Gao, Y., et\u00a0al. (2024). Auffusion: Leveraging the power of diffusion and large language models for text-to-audio generation. IEEE\/ACM Transactions on Audio, Speech, and Language Processing.","DOI":"10.1109\/TASLP.2024.3485485"},{"key":"2632_CR122","unstructured":"Yang, D., Tian, J., & Tan, X., et\u00a0al. (2023a). Uniaudio: An audio foundation model toward universal audio generation. arXiv:2310.00704"},{"key":"2632_CR123","doi-asserted-by":"crossref","unstructured":"Yang, D., Yu, J., & Wang, H., et\u00a0al. (2023b). Diffsound: Discrete diffusion model for text-to-sound generation. IEEE\/ACM Transactions on Audio, Speech, and Language Processing.","DOI":"10.1109\/TASLP.2023.3268730"},{"key":"2632_CR124","doi-asserted-by":"crossref","unstructured":"Ye, Z., Wang, Y., & Wang, H., et\u00a0al. (2022). Featurecut: An adaptive data augmentation for automated audio captioning. In: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, pp 313\u2013318","DOI":"10.23919\/APSIPAASC55919.2022.9980325"},{"key":"2632_CR125","unstructured":"You, Y., Li, J., & Reddi, S., et\u00a0al. (2020). Large batch optimization for deep learning: Training bert in 76 minutes. 
In: International Conference on Learning Representations, https:\/\/openreview.net\/forum?id=Syx4wnEtvH"},{"key":"2632_CR126","doi-asserted-by":"crossref","unstructured":"Yuan, Y., Jia, D., & Zhuang, X., et\u00a0al. (2025). Sound-vecaps: Improving audio generation with visually enhanced captions. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1\u20135). IEEE.","DOI":"10.1109\/ICASSP49660.2025.10889473"},{"key":"2632_CR127","doi-asserted-by":"crossref","unstructured":"Zellers, R., Lu, J., & Lu, X., et\u00a0al. (2022). Merlot reserve: Neural script knowledge through vision and language and sound. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 16375\u201316387.","DOI":"10.1109\/CVPR52688.2022.01589"},{"key":"2632_CR128","doi-asserted-by":"publisher","unstructured":"Zhang, H., Li, X., & Bing, L. (2023a). Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In: Feng Y, Lefever E (eds) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Singapore, pp 543\u2013553, https:\/\/doi.org\/10.18653\/v1\/2023.emnlp-demo.49, https:\/\/aclanthology.org\/2023.emnlp-demo.49\/","DOI":"10.18653\/v1\/2023.emnlp-demo.49"},{"key":"2632_CR129","unstructured":"Zhang, Y., Maezawa, A., & Xia, G., et\u00a0al. (2023b). Loop copilot: Conducting ai ensembles for music generation and iterative editing. arXiv:2310.12404"},{"key":"2632_CR130","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Xu, X., & Du, R. (2024). Zero-shot audio captioning using soft and hard prompts. arXiv preprint arXiv:2406.06295.","DOI":"10.1109\/TASLPRO.2025.3567770"},{"key":"2632_CR131","doi-asserted-by":"publisher","first-page":"2045","DOI":"10.1109\/TASLPRO.2025.3567770","volume":"33","author":"Y Zhang","year":"2025","unstructured":"Zhang, Y., Xu, X., Du, R., et al. (2025). 
Zero-shot audio captioning using soft and hard prompts. IEEE Transactions on Audio, Speech and Language Processing, 33, 2045\u20132058. https:\/\/doi.org\/10.1109\/TASLPRO.2025.3567770","journal-title":"IEEE Transactions on Audio, Speech and Language Processing"},{"key":"2632_CR132","doi-asserted-by":"crossref","unstructured":"Zhu, G., Darefsky, J., & Duan, Z. (2024). Cacophony: An improved contrastive audio-text model. IEEE\/ACM Transactions on Audio, Speech, and Language Processing.","DOI":"10.1109\/TASLP.2024.3485170"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02632-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-025-02632-y","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-025-02632-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,6,10]],"date-time":"2026-06-10T09:33:46Z","timestamp":1781084026000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-025-02632-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,28]]},"references-count":132,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,3]]}},"alternative-id":["2632"],"URL":"https:\/\/doi.org\/10.1007\/s11263-025-02632-y","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,28]]},"assertion":[{"value":"15 April 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 October 
2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 January 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 June 2026","order":5,"name":"change_date","label":"Change Date","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Update","order":6,"name":"change_type","label":"Change Type","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The original version of the article is revised due to retrospective open access order.","order":7,"name":"change_details","label":"Change Details","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Authors Moayed Haji-Ali, Guha Balakrishnan, and Vicente Ordonez are affiliated with Rice University, and authors Willi Menapace and Aliaksandr Siarohin are affiliated with Snap Inc., where they are supervised by Sergey Tulyakov. The authors declare they have no financial interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}},{"value":"All participants in the user studies gave consent and received financial compensation for their participation.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"All authors consent to the publication of this work.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"All code used in this study will be publicly available upon publication.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Code availability"}}],"article-number":"87"}}