{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:16:27Z","timestamp":1765307787264,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":45,"publisher":"ACM","funder":[{"name":"the National Key Research and Development Program of China","award":["2023YFF0904900"],"award-info":[{"award-number":["2023YFF0904900"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755523","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T05:44:48Z","timestamp":1761371088000},"page":"10278-10286","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Spatial-Temporal Decomposition and Alignment in Controllable Video-to-Music Generation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9625-5547","authenticated-orcid":false,"given":"Weitao","family":"You","sequence":"first","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-7999-2317","authenticated-orcid":false,"given":"Heda","family":"Zuo","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8447-2121","authenticated-orcid":false,"given":"Junxian","family":"Wu","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6307-7692","authenticated-orcid":false,"given":"Dengming","family":"Zhang","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9545-3763","authenticated-orcid":false,"given":"Zhibin","family":"Zhou","sequence":"additional","affiliation":[{"name":"Hong Kong Polytechnic University, Hong Kong, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5561-0493","authenticated-orcid":false,"given":"Lingyun","family":"Sun","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"volume-title":"Visual Communications and Image Processing'92","author":"Akutsu Akihito","key":"e_1_3_2_1_1_1","unstructured":"Akihito Akutsu, Yoshinobu Tonomura, Hideo Hashimoto, and Yuji Ohba. 1992. Video indexing using motion vectors. In Visual Communications and Image Processing'92, Vol. 1818. SPIE, 1522-1530."},{"key":"e_1_3_2_1_2_1","unstructured":"Youssef Bendraou. 2017. Video shot boundary detection and key-frame extraction using mathematical models. Ph.D. Dissertation. Universit\u00e9 du Littoral C\u00f4te d'Opale; Universit\u00e9 Mohammed V (Rabat). Facult\u00e9 \u2026."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/s13735-022-00251-8"},{"key":"e_1_3_2_1_4_1","volume-title":"Advances in Neural Information Processing Systems","volume":"36","author":"Copet Jade","year":"2024","unstructured":"Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre D\u00e9fossez. 2024. Simple and controllable music generation. Advances in Neural Information Processing Systems, Vol. 36 (2024)."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475195"},{"key":"e_1_3_2_1_6_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929 [cs.CV]"},{"key":"e_1_3_2_1_7_1","volume-title":"High Fidelity Neural Audio Compression. arXiv preprint arXiv:2210.13438","author":"D\u00e9fossez Alexandre","year":"2022","unstructured":"Alexandre D\u00e9fossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. 2022. High Fidelity Neural Audio Compression. arXiv preprint arXiv:2210.13438 (2022)."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10095889"},{"key":"e_1_3_2_1_9_1","volume-title":"Video-Music Retrieval with Fine-Grained Cross-Modal Alignment. In 2023 IEEE International Conference on Image Processing (ICIP). IEEE","author":"Era Yuki","year":"2023","unstructured":"Yuki Era, Ren Togo, Keisuke Maeda, Takahiro Ogawa, and Miki Haseyama. 2023. Video-Music Retrieval with Fine-Grained Cross-Modal Alignment. In 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2005-2009."},{"key":"e_1_3_2_1_10_1","volume-title":"Armand Joulin, and Ishan Misra.","author":"Girdhar Rohit","year":"2023","unstructured":"Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. ImageBind: One Embedding Space To Bind Them All. In CVPR."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-03844-5_19"},{"key":"e_1_3_2_1_12_1","volume-title":"Multi-modal Music Understanding and Generation with the Power of Large Language Models. arXiv preprint arXiv:2311.11255","author":"Hussain Atin Sakkeer","year":"2023","unstructured":"Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun, and Ying Shan. 2023. M^2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models. arXiv preprint arXiv:2311.11255 (2023)."},{"key":"e_1_3_2_1_13_1","volume-title":"A Comprehensive Survey on Generative AI for Video-to-Music Generation. arXiv preprint arXiv:2502.12489","author":"Ji Shu","year":"2025","unstructured":"Shu lei Ji, Songruoyao Wu, Zihao Wang, Shuyu Li, and Kejun Zhang. 2025. A Comprehensive Survey on Generative AI for Video-to-Music Generation. arXiv preprint arXiv:2502.12489 (2025)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2011.941851"},{"key":"e_1_3_2_1_15_1","volume-title":"Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model. Expert Systems with Applications","author":"Kang Jaeyong","year":"2024","unstructured":"Jaeyong Kang, Soujanya Poria, and Dorien Herremans. 2024. Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model. Expert Systems with Applications (2024), 123640."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-024-10742-1"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Kevin Kilgour Mauricio Zuluaga Dominik Roblek and Matthew Sharifi. 2019. Fr\u00e9chet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms. arXiv:1812.08466 [eess.AS]","DOI":"10.21437\/Interspeech.2019-2219"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_3_2_1_19_1","volume-title":"MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization. arXiv preprint arXiv:2410.12957","author":"Li Ruiqi","year":"2024","unstructured":"Ruiqi Li, Siqi Zheng, Xize Cheng, Ziang Zhang, Shengpeng Ji, and Zhou Zhao. 2024c. MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization. arXiv preprint arXiv:2410.12957 (2024)."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02582"},{"key":"e_1_3_2_1_21_1","volume-title":"VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features. arXiv preprint arXiv:2412.06296","author":"Li Sifei","year":"2024","unstructured":"Sifei Li, Binxin Yang, Chunji Yin, Chong Sun, Yuxin Zhang, Weiming Dong, and Chen Li. 2024b. VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features. arXiv preprint arXiv:2412.06296 (2024)."},{"key":"e_1_3_2_1_22_1","volume-title":"VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos. arXiv preprint arXiv:2409.07450","author":"Lin Yan-Bo","year":"2024","unstructured":"Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, and Heng Wang. 2024. VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos. arXiv preprint arXiv:2409.07450 (2024)."},{"key":"e_1_3_2_1_23_1","volume-title":"Heli Ben-Hamu, Maximilian Nickel, and Matt Le.","author":"Lipman Yaron","year":"2022","unstructured":"Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. 2022. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)."},{"key":"e_1_3_2_1_24_1","volume-title":"Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503","author":"Liu Haohe","year":"2023","unstructured":"Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. 2023. Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503 (2023)."},{"key":"e_1_3_2_1_25_1","volume-title":"Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577","author":"Liu Qiang","year":"2022","unstructured":"Qiang Liu. 2022. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577 (2022)."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2010.5583863"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3059968"},{"key":"e_1_3_2_1_28_1","volume-title":"Youngjung Uh, Yunjey Choi, and Jaejun Yoo.","author":"Naeem Muhammad Ferjad","year":"2020","unstructured":"Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. 2020. Reliable Fidelity and Diversity Metrics for Generative Models. (2020)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3152598"},{"key":"e_1_3_2_1_30_1","unstructured":"Nikhila Ravi Valentin Gabeur Yuan-Ting Hu Ronghang Hu Chaitanya Ryali Tengyu Ma Haitham Khedr Roman R\u00e4dle Chloe Rolland Laura Gustafson et al. 2024. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3685517"},{"key":"e_1_3_2_1_32_1","volume-title":"Transnet: A deep network for fast detection of common shot transitions. arXiv preprint arXiv:1906.03363","author":"Sou\u010dek Tom\u00e1\u0161","year":"2019","unstructured":"Tom\u00e1\u0161 Sou\u010dek, Jaroslav Moravec, and Jakub Loko\u010d. 2019. Transnet: A deep network for fast detection of common shot transitions. arXiv preprint arXiv:1906.03363 (2019)."},{"key":"e_1_3_2_1_33_1","volume-title":"Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, et al.","author":"Su Kun","year":"2023","unstructured":"Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, et al., 2023. V2Meow: Meowing to the Visual Beat via Music Generation. arXiv preprint arXiv:2305.06594 (2023)."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01031"},{"key":"e_1_3_2_1_35_1","volume-title":"Vidmuse: A simple video-to-music generation framework with long-short-term modeling. arXiv preprint arXiv:2406.04321","author":"Tian Zeyue","year":"2024","unstructured":"Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, and Yike Guo. 2024. Vidmuse: A simple video-to-music generation framework with long-short-term modeling. arXiv preprint arXiv:2406.04321 (2024)."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1037\/0096-3445.123.4.394"},{"key":"e_1_3_2_1_37_1","volume-title":"Attention is all you need. Advances in neural information processing systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, Vol. 30 (2017)."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093345"},{"key":"e_1_3_2_1_39_1","volume-title":"Seggpt: Segmenting everything in context. arXiv preprint arXiv:2304.03284","author":"Wang Xinlong","year":"2023","unstructured":"Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. 2023. Seggpt: Segmenting everything in context. arXiv preprint arXiv:2304.03284 (2023)."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/1076034.1076097"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.3390\/electronics12051199"},{"key":"e_1_3_2_1_42_1","volume-title":"HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization. arXiv preprint arXiv:2503.01725","author":"Zhou Zitang","year":"2025","unstructured":"Zitang Zhou, Ke Mei, Yu Lu, Tianyi Wang, and Fengyun Rao. 2025. HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization. arXiv preprint arXiv:2503.01725 (2025)."},{"key":"e_1_3_2_1_43_1","first-page":"49859","article-title":"Darksam: Fooling segment anything model to segment nothing","volume":"37","author":"Zhou Ziqi","year":"2024","unstructured":"Ziqi Zhou, Yufei Song, Minghui Li, Shengshan Hu, Xianlong Wang, Leo Yu Zhang, Dezhong Yao, and Hai Jin. 2024. Darksam: Fooling segment anything model to segment nothing. Advances in Neural Information Processing Systems, Vol. 37 (2024), 49859-49880.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01433"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v39i21.34474"}],"event":{"name":"MM '25: The 33rd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Dublin Ireland","acronym":"MM '25"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3755523","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:14:44Z","timestamp":1765307684000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3755523"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":45,"alternative-id":["10.1145\/3746027.3755523","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3755523","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}