{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:46:12Z","timestamp":1765309572099,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":40,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["JP24K02942,JP23K21676, JP23K11211, JP23KJ0046"],"award-info":[{"award-number":["JP24K02942,JP23K21676, JP23K11211, JP23KJ0046"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755794","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T06:54:17Z","timestamp":1761375257000},"page":"2235-2243","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Context-aware Image-to-Music Generation via Bridging Modalities through Musical Captions"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-6624-407X","authenticated-orcid":false,"given":"Shilin","family":"Liu","sequence":"first","affiliation":[{"name":"Hokkaido University, Sapporo, Hokkaido, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7054-7920","authenticated-orcid":false,"given":"Kyohei","family":"Kamikawa","sequence":"additional","affiliation":[{"name":"Hokkaido University, Sapporo, Hokkaido, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8039-3462","authenticated-orcid":false,"given":"Keisuke","family":"Maeda","sequence":"additional","affiliation":[{"name":"Hokkaido University, Sapporo, Hokkaido, 
Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5332-8112","authenticated-orcid":false,"given":"Takahiro","family":"Ogawa","sequence":"additional","affiliation":[{"name":"Hokkaido University, Sapporo, Hokkaido, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1496-1761","authenticated-orcid":false,"given":"Miki","family":"Haseyama","sequence":"additional","affiliation":[{"name":"Hokkaido University, Sapporo, Hokkaido, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Gunjan Aggarwal and Devi Parikh. 2021. Dance2Music: Automatic Dance-driven Music Generation. arXiv:2107.06252 [cs.SD]"},{"key":"e_1_3_2_1_2_1","volume-title":"Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325","author":"Agostinelli Andrea","year":"2023","unstructured":"Andrea Agostinelli, Timo I Denk, Zal\u00e1n Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. 2023. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325 (2023)."},{"key":"e_1_3_2_1_3_1","first-page":"1877","article-title":"Language Models are Few-Shot Learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. 
In Proceedings of Advances in Neural Information Processing Systems, Vol. 33. 1877-1901.","journal-title":"Proceedings of Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1186\/s13636-025-00397-3"},{"key":"e_1_3_2_1_5_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 24185-24198","author":"Chen Zhe","year":"2024","unstructured":"Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. 2024. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 24185-24198."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02533"},{"key":"e_1_3_2_1_7_1","first-page":"47704","article-title":"Simple and Controllable Music Generation","volume":"36","author":"Copet Jade","year":"2023","unstructured":"Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Defossez. 2023. Simple and Controllable Music Generation. In Proceedings of Advances in Neural Information Processing Systems, Vol. 36. 47704-47720.","journal-title":"Proceedings of Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3672554"},{"key":"e_1_3_2_1_9_1","volume-title":"FMA: A Dataset For Music Analysis. In arXiv:1612.01840.","author":"Defferrard Micha\u00ebl","year":"2017","unstructured":"Micha\u00ebl Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. 2017. FMA: A Dataset For Music Analysis. 
In arXiv:1612.01840."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475195"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00291"},{"key":"e_1_3_2_1_12_1","first-page":"35959","volume-title":"Oh (Eds.)","volume":"35","author":"Gao Yuting","year":"2022","unstructured":"Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Rongrong Ji, and Chunhua Shen. 2022. PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. 35959-35970."},{"key":"e_1_3_2_1_13_1","article-title":"Deep Learning Approaches on Image Captioning","volume":"56","author":"Ghandi Taraneh","year":"2023","unstructured":"Taraneh Ghandi, Hamidreza Pourreza, and Hamidreza Mahyar. 2023. Deep Learning Approaches on Image Captioning: A Review. Proceedings of ACM Computing Surveys 56, 3, 39 pages.","journal-title":"A Review. Proceedings of ACM Computing Surveys"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0283103"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.681"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00277"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Kevin Kilgour Mauricio Zuluaga Dominik Roblek and Matthew Sharifi. 2019. Fr\u00e9chet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms. 
In arXiv:1812.08466.","DOI":"10.21437\/Interspeech.2019-2219"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2024.3360695"},{"key":"e_1_3_2_1_20_1","first-page":"34892","article-title":"Visual Instruction Tuning","volume":"36","author":"Liu Haotian","year":"2023","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. In Proceedings of Advances in Neural Information Processing Systems, Vol. 36. 34892-34916.","journal-title":"Proceedings of Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_21_1","volume-title":"Semantic-Conditional Diffusion Networks for Image Captioning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 23359-23368","author":"Luo Jianjie","year":"2023","unstructured":"Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Jianlin Feng, Hongyang Chao, and Tao Mei. 2023. Semantic-Conditional Diffusion Networks for Image Captioning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 23359-23368."},{"key":"e_1_3_2_1_22_1","volume-title":"Mustango: Toward Controllable Text-to-Music Generation. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 8293-8316","author":"Melechovsky Jan","year":"2024","unstructured":"Jan Melechovsky, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, and Soujanya Poria. 2024. Mustango: Toward Controllable Text-to-Music Generation. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 8293-8316."},{"key":"e_1_3_2_1_23_1","unstructured":"Jan Melechovsky Abhinaba Roy and Dorien Herremans. 2024. MidiCaps: A Large-scale MIDI Dataset with Text Captions. 
In arXiv:2406.02255."},{"key":"e_1_3_2_1_24_1","volume-title":"Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever.","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In arXiv:2103.00020."},{"key":"e_1_3_2_1_25_1","volume-title":"Proceedings of International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"4373","author":"Roberts Adam","year":"2018","unstructured":"Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. 2018. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. In Proceedings of International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80). 4364-4373."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.3390\/app142311470"},{"key":"e_1_3_2_1_27_1","volume-title":"Proceedings of Advances in Neural Information Processing Systems","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proceedings of Advances in Neural Information Processing Systems, Vol. 30."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.23919\/DAFx51585.2021.9768298"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3690640"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3338089"},{"key":"e_1_3_2_1_31_1","volume-title":"MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit. 
arXiv preprint arXiv:2410.13419","author":"Wang Yutian","year":"2024","unstructured":"Yutian Wang, Wanyin Yang, Zhenrong Dai, Yilong Zhang, Kun Zhao, and Hui Wang. 2024. MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit. arXiv preprint arXiv:2410.13419 (2024)."},{"key":"e_1_3_2_1_32_1","volume-title":"Victor Shea-Jay Huang, and Yue Liao","author":"Wang Zhaokai","year":"2025","unstructured":"Zhaokai Wang, Chenxi Bao, Le Zhuo, Jingrui Han, Yang Yue, Yihong Tang, Victor Shea-Jay Huang, and Yue Liao. 2025. Vision-to-Music Generation: A Survey. arXiv:2503.21254 [cs.CV]"},{"key":"e_1_3_2_1_33_1","volume-title":"Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus.","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. In Transactions on Machine Learning Research."},{"key":"e_1_3_2_1_34_1","unstructured":"Yusong Wu Ke Chen Tianyu Zhang Yuchen Hui Taylor Berg-Kirkpatrick and Shlomo Dubnov. 2023. Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. In Proceedings of IEEE International Conference on Acoustics Speech and Signal Processing."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISM55400.2022.00051"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01337"},{"key":"e_1_3_2_1_37_1","volume-title":"Proceedings of European Conference on Computer Vision. 310-325","author":"Zhang Beichen","year":"2024","unstructured":"Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. 2024. Long-CLIP: Unlocking the Long-Text Capability of CLIP. In Proceedings of European Conference on Computer Vision. 
310-325."},{"key":"e_1_3_2_1_38_1","volume-title":"Zhifeng Li, Wei Liu, and Li Yuan.","author":"Zhu Bin","year":"2023","unstructured":"Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Wang HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Cai Wan Zhang, Zhifeng Li, Wei Liu, and Li Yuan. 2023. Language Bind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv:2310.01852 [cs.CV]"},{"key":"e_1_3_2_1_39_1","unstructured":"Jinlong Zhu Keigo Sakurai Ren Togo Takahiro Ogawa and Miki Haseyama. 2024. MMT-BERT: Chord-aware Symbolic Music Generation Based on Multitrack Music Transformer and MusicBERT. arXiv:2409.00919 [cs.SD]"},{"key":"e_1_3_2_1_40_1","volume-title":"Proceedings of The Annual Conference on Neural Information Processing Systems.","author":"Zhuang Chenyi","year":"2024","unstructured":"Chenyi Zhuang, Ying Hu, and Pan Gao. 2024. Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function. 
In Proceedings of The Annual Conference on Neural Information Processing Systems."}],"event":{"name":"MM '25: The 33rd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Dublin Ireland","acronym":"MM '25"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3755794","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:42:32Z","timestamp":1765309352000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3755794"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":40,"alternative-id":["10.1145\/3746027.3755794","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3755794","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}