{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T20:12:39Z","timestamp":1774642359005,"version":"3.50.1"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,2,16]],"date-time":"2022-02-16T00:00:00Z","timestamp":1644969600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2022,5,31]]},"abstract":"<jats:p>Synthesizing human motions from music (i.e., music to dance) is appealing and has attracted much research interest in recent years. It is challenging because dance requires realistic and complex human motions; more importantly, the synthesized motions must be consistent with the style, rhythm, and melody of the music. In this article, we propose a novel autoregressive generative model, DanceNet, which takes the style, rhythm, and melody of music as control signals to generate 3D dance motions with high realism and diversity. Due to the high long-term spatio-temporal complexity of dance, we employ dilated convolutions to enlarge the receptive field, and adopt the gated activation unit as well as separable convolutions to enhance the fusion of motion features and control signals. To boost the performance of our proposed model, we capture several synchronized music-dance pairs performed by professional dancers and build a high-quality music-dance pair dataset. 
Experiments demonstrate that the proposed method achieves state-of-the-art results.<\/jats:p>","DOI":"10.1145\/3485664","type":"journal-article","created":{"date-parts":[[2022,2,16]],"date-time":"2022-02-16T17:56:32Z","timestamp":1645034192000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":94,"title":["Music2Dance: DanceNet for Music-Driven Dance Generation"],"prefix":"10.1145","volume":"18","author":[{"given":"Wenlin","family":"Zhuang","sequence":"first","affiliation":[{"name":"Southeast University and the Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, Nanjing, China"}]},{"given":"Congyi","family":"Wang","sequence":"additional","affiliation":[{"name":"Xmov, Shanghai, China"}]},{"given":"Jinxiang","family":"Chai","sequence":"additional","affiliation":[{"name":"Xmov, Texas A&amp;M University, Shanghai, China"}]},{"given":"Yangang","family":"Wang","sequence":"additional","affiliation":[{"name":"Southeast University, Nanjing, China"}]},{"given":"Ming","family":"Shao","sequence":"additional","affiliation":[{"name":"University of Massachusetts Dartmouth, Dartmouth, MA, USA"}]},{"given":"Siyu","family":"Xia","sequence":"additional","affiliation":[{"name":"Southeast University, Nanjing, China"}]}],"member":"320","published-online":{"date-parts":[[2022,2,16]]},"reference":[{"key":"e_1_3_1_2_2","article-title":"Adobe Mixamo Dataset","year":"2017","unstructured":"Adobe. 2017. Adobe Mixamo Dataset. Retrieved November 11, 2021 from https:\/\/www.mixamo.com.","journal-title":"https:\/\/www.mixamo.com."},{"key":"e_1_3_1_3_2","volume-title":"Proceedings of the 15th International Conference on Digital Audio Effects (DAFx\u201912).","author":"B\u00f6ck Sebastian","year":"2012","unstructured":"Sebastian B\u00f6ck, Andreas Arzt, Florian Krebs, and Markus Schedl. 2012. Online real-time onset detection with recurrent neural networks. 
In Proceedings of the 15th International Conference on Digital Audio Effects (DAFx\u201912)."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2973795"},{"key":"e_1_3_1_5_2","first-page":"255","volume-title":"Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR\u201916)","author":"B\u00f6ck Sebastian","year":"2016","unstructured":"Sebastian B\u00f6ck, Florian Krebs, and Gerhard Widmer. 2016. Joint beat and downbeat tracking with recurrent neural networks. In Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR\u201916). 255\u2013261."},{"key":"e_1_3_1_6_2","volume-title":"Proceedings of the IEEE Workshop on Human Modeling, Analysis, and Synthesis","volume":"2000","author":"Bowden Richard","year":"2000","unstructured":"Richard Bowden. 2000. Learning statistical models of human motion. In Proceedings of the IEEE Workshop on Human Modeling, Analysis, and Synthesis, Vol. 2000."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/344779.344865"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.5555\/787261.787780"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/1186822.1073248"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/1276377.1276387"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952585"},{"key":"e_1_3_1_12_2","article-title":"Carnegie-Mellon Motion Capture Database","year":"2010","unstructured":"CMU. 2010. Carnegie-Mellon Motion Capture Database. 
Retrieved November 11, 2021 from http:\/\/mocap.cs.cmu.edu.","journal-title":"http:\/\/mocap.cs.cmu.edu."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00655"},{"key":"e_1_3_1_14_2","first-page":"589","volume-title":"Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR\u201910)","author":"Eyben Florian","year":"2010","unstructured":"Florian Eyben, Sebastian B\u00f6ck, Bj\u00f6rn Schuller, and Alex Graves. 2010. Universal onset detection with bidirectional long-short term memory neural networks. In Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR\u201910). 589\u2013594."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.5555\/2919332.2919834"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00361"},{"key":"e_1_3_1_17_2","first-page":"249","volume-title":"Proceedings of the 13th International Conference on Artificial Intelligence and Statistics","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. 249\u2013256."},{"key":"e_1_3_1_18_2","unstructured":"Emilia G\u00f3mez. 2006. Tonal description of music audio signals. Ph.D. Dissertation. Universitat Pompeu Fabra Barcelona Spain."},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01239"},{"key":"e_1_3_1_20_2","article-title":"Onsets and frames: Dual-objective piano transcription","author":"Hawthorne Curtis","year":"2017","unstructured":"Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, and Douglas Eck. 2017. Onsets and frames: Dual-objective piano transcription. 
arXiv preprint arXiv:1710.11153 (2017).","journal-title":"arXiv preprint arXiv:1710.11153"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295408"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925975"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.243"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.573"},{"key":"e_1_3_1_26_2","article-title":"Feature learning for chord recognition: The deep chroma extractor","author":"Korzeniowski Filip","year":"2016","unstructured":"Filip Korzeniowski and Gerhard Widmer. 2016. Feature learning for chord recognition: The deep chroma extractor. arXiv preprint arXiv:1612.05065 (2016).","journal-title":"arXiv preprint arXiv:1612.05065"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/1401132.1401202"},{"key":"e_1_3_1_28_2","first-page":"129","volume-title":"Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR\u201916)","author":"Krebs Florian","year":"2016","unstructured":"Florian Krebs, Sebastian B\u00f6ck, Matthias Dorfer, and Gerhard Widmer. 2016. Downbeat tracking using beat synchronous features with recurrent neural networks. In Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR\u201916). 129\u2013135."},{"key":"e_1_3_1_29_2","first-page":"227","volume-title":"Proceedings of the Annual Conference of the International Society for Music Information Retrieval (ISMIR\u201913)","author":"Krebs Florian","year":"2013","unstructured":"Florian Krebs, Sebastian B\u00f6ck, and Gerhard Widmer. 2013. Rhythmic pattern modeling for beat and downbeat tracking in musical audio. In Proceedings of the Annual Conference of the International Society for Music Information Retrieval (ISMIR\u201913). 
227\u2013232."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999257"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/1661412.1618517"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature14539"},{"key":"e_1_3_1_33_2","first-page":"3581","volume-title":"Advances in Neural Information Processing Systems","author":"Lee Hsin-Ying","year":"2019","unstructured":"Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, and Jan Kautz. 2019. Dancing to music. In Advances in Neural Information Processing Systems. 3581\u20133591."},{"key":"e_1_3_1_34_2","article-title":"Listen to dance: Music-driven choreography generation using autoregressive encoder-decoder network","author":"Lee Juheon","year":"2018","unstructured":"Juheon Lee, Seohyun Kim, and Kyogu Lee. 2018. Listen to dance: Music-driven choreography generation using autoregressive encoder-decoder network. arXiv preprint arXiv:1811.00818 (2018).","journal-title":"arXiv preprint arXiv:1811.00818"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3272127.3275071"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00548"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/566654.566604"},{"key":"e_1_3_1_38_2","article-title":"Auto-conditioned LSTM network for extended complex human motion synthesis","volume":"3","author":"Li Zimo","year":"2017","unstructured":"Zimo Li, Yi Zhou, Shuangjiu Xiao, Chong He, and Hao Li. 2017. Auto-conditioned LSTM network for extended complex human motion synthesis. 
arXiv preprint arXiv:1707.05363 3 (2017).","journal-title":"arXiv preprint arXiv:1707.05363"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.497"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/2366145.2366172"},{"key":"e_1_3_1_41_2","volume-title":"Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR\u201911)","author":"M\u00fcller Meinard","year":"2011","unstructured":"Meinard M\u00fcller and Sebastian Ewert. 2011. Chroma toolbox: MATLAB implementations for extracting variants of chroma-based audio features. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR\u201911)."},{"key":"e_1_3_1_42_2","article-title":"WaveNet: A generative model for raw audio","author":"Oord Aaron van den","year":"2016","unstructured":"Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).","journal-title":"arXiv preprint arXiv:1609.03499"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00794"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.5555\/3008751.3008888"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/1276377.1276510"},{"key":"e_1_3_1_46_2","article-title":"SFU Motion Capture Database","year":"2017","unstructured":"SFU. 2017. SFU Motion Capture Database. Retrieved November 11, 2021 from http:\/\/mocap.cs.sfu.ca.","journal-title":"http:\/\/mocap.cs.sfu.ca."},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240526"},{"issue":"2","key":"e_1_3_1_48_2","first-page":"26","article-title":"Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude","volume":"4","author":"Tieleman Tijmen","year":"2012","unstructured":"Tijmen Tieleman and Geoffrey Hinton. 2012. 
Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 2 (2012), 26\u201331.","journal-title":"COURSERA: Neural Networks for Machine Learning"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.5555\/3157382.3157633"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/2766999"},{"key":"e_1_3_1_51_2","article-title":"Weakly supervised deep recurrent neural networks for basic dance step generation","author":"Yalta Nelson","year":"2018","unstructured":"Nelson Yalta, Shinji Watanabe, Kazuhiro Nakadai, and Tetsuya Ogata. 2018. Weakly supervised deep recurrent neural networks for basic dance step generation. arXiv preprint arXiv:1807.01126 (2018).","journal-title":"arXiv preprint arXiv:1807.01126"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00449"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2012.6239234"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3485664","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3485664","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:18:40Z","timestamp":1750191520000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3485664"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,16]]},"references-count":52,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,5,31]]}},"alternative-id":["10.1145\/3485664"],"URL":"https:\/\/doi.org\/10.1145\/3485664","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,2,16]]},"assertion":[{"value":"2021-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-09-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-02-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}