{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T19:30:26Z","timestamp":1775590226092,"version":"3.50.1"},"reference-count":54,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T00:00:00Z","timestamp":1721347200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2022YFF0902204"],"award-info":[{"award-number":["2022YFF0902204"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"NSFC","doi-asserted-by":"publisher","award":["62171255"],"award-info":[{"award-number":["62171255"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2024,7,19]]},"abstract":"<jats:p>In this paper, we touch on the problem of markerless multi-modal human motion capture especially for string performance capture which involves inherently subtle hand-string contacts and intricate movements. To fulfill this goal, we first collect a dataset, named String Performance Dataset (SPD), featuring cello and violin performances. The dataset includes videos captured from up to 23 different views, audio signals, and detailed 3D motion annotations of the body, hands, instrument, and bow. Moreover, to acquire the detailed motion annotations, we propose an audio-guided multi-modal motion capture framework that explicitly incorporates hand-string contacts detected from the audio signals for solving detailed hand poses. This framework serves as a baseline for string performance capture in a completely markerless manner without imposing any external devices on performers, eliminating the potential of introducing distortion in such delicate movements. We argue that the movements of performers, particularly the sound-producing gestures, contain subtle information often elusive to visual methods but can be inferred and retrieved from audio cues. Consequently, we refine the vision-based motion capture results through our innovative audio-guided approach, simultaneously clarifying the contact relationship between the performer and the instrument, as deduced from the audio. We validate the proposed framework and conduct ablation studies to demonstrate its efficacy. Our results outperform current state-of-the-art vision-based algorithms, underscoring the feasibility of augmenting visual motion capture with audio modality. To the best of our knowledge, SPD is the first dataset for musical instrument performance, covering fine-grained hand motion details in a multi-modal, large-scale collection. It holds significant implications and guidance for string instrument pedagogy, animation, and virtual concerts, as well as for both musical performance analysis and generation. Our code and SPD dataset are available at https:\/\/github.com\/Yitongishere\/string_performance.<\/jats:p>","DOI":"10.1145\/3658235","type":"journal-article","created":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T14:47:57Z","timestamp":1721400477000},"page":"1-10","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Audio Matters Too! Enhancing Markerless Motion Capture with Audio Signals for String Performance Capture"],"prefix":"10.1145","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-3979-5878","authenticated-orcid":false,"given":"Yitong","family":"Jin","sequence":"first","affiliation":[{"name":"Central Conservatory of Music, Beijing, China"},{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-8663-1955","authenticated-orcid":false,"given":"Zhiping","family":"Qiu","sequence":"additional","affiliation":[{"name":"Central Conservatory of Music, Beijing, China"},{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7500-192X","authenticated-orcid":false,"given":"Yi","family":"Shi","sequence":"additional","affiliation":[{"name":"Central Conservatory of Music, Beijing, China"},{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-7988-7328","authenticated-orcid":false,"given":"Shuangpeng","family":"Sun","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-1384-941X","authenticated-orcid":false,"given":"Chongwu","family":"Wang","sequence":"additional","affiliation":[{"name":"Central Conservatory of Music, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-3863-286X","authenticated-orcid":false,"given":"Donghao","family":"Pan","sequence":"additional","affiliation":[{"name":"Central Conservatory of Music, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2864-9229","authenticated-orcid":false,"given":"Jiachen","family":"Zhao","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-7467-4469","authenticated-orcid":false,"given":"Zhenghao","family":"Liang","sequence":"additional","affiliation":[{"name":"Weilan Tech, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-8604-6243","authenticated-orcid":false,"given":"Yuan","family":"Wang","sequence":"additional","affiliation":[{"name":"Central Conservatory of Music, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0113-824X","authenticated-orcid":false,"given":"Xiaobing","family":"Li","sequence":"additional","affiliation":[{"name":"Central Conservatory of Music, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-0607-7315","authenticated-orcid":false,"given":"Feng","family":"Yu","sequence":"additional","affiliation":[{"name":"Central Conservatory of Music, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3818-5069","authenticated-orcid":false,"given":"Tao","family":"Yu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7043-3061","authenticated-orcid":false,"given":"Qionghai","family":"Dai","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,7,19]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Cynthia CS Liem, and Alan Hanjalic.","author":"Bazzica Alessio","year":"2017","unstructured":"Alessio Bazzica, JC Van Gemert, Cynthia CS Liem, and Alan Hanjalic. 2017. Vision-based detection of acoustic timed events: a case study on clarinet note onsets. arXiv preprint arXiv:1706.09556 (2017)."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.143"},{"key":"e_1_2_2_3_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3450626.3459792","article-title":"Capturing detailed deformations of moving human bodies","volume":"40","author":"Chen He","year":"2021","unstructured":"He Chen, Hyojoon Park, Kutay Macit, and Ladislav Kavan. 2021. Capturing detailed deformations of moving human bodies. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1--18.","journal-title":"ACM Transactions on Graphics (TOG)"},{"key":"e_1_2_2_4_1","volume-title":"TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement. ICCV","author":"Doersch Carl","year":"2023","unstructured":"Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. 2023. TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement. ICCV (2023)."},{"key":"e_1_2_2_5_1","volume-title":"Proceedings of the 2003 ACM SIGGRAPH\/Eurographics symposium on Computer animation. 110--119","author":"ElKoura George","year":"2003","unstructured":"George ElKoura and Karan Singh. 2003. Handrix: animating the human hand. In Proceedings of the 2003 ACM SIGGRAPH\/Eurographics symposium on Computer animation. 110--119."},{"key":"e_1_2_2_6_1","first-page":"37055","article-title":"DART: Articulated hand model with diverse accessories and rich textures","volume":"35","author":"Gao Daiheng","year":"2022","unstructured":"Daiheng Gao, Yuliang Xiu, Kailin Li, Lixin Yang, Feng Wang, Peng Zhang, Bang Zhang, Cewu Lu, and Ping Tan. 2022. DART: Articulated hand model with diverse accessories and rich textures. Advances in Neural Information Processing Systems 35 (2022), 37055--37067.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_7_1","volume-title":"International Society for Music Information Retrieval Conference (ISMIR).","author":"Gillet Olivier","year":"2006","unstructured":"Olivier Gillet and Ga\u00ebl Richard. 2006. Enst-drums: an extensive audio-visual database for drum signals processing. In International Society for Music Information Retrieval Conference (ISMIR)."},{"key":"e_1_2_2_8_1","volume-title":"Johannes Lunde Hatfield, and Rolf Inge God\u00f8y","author":"Gonzalez-Sanchez Victor","year":"2019","unstructured":"Victor Gonzalez-Sanchez, Sofia Dahl, Johannes Lunde Hatfield, and Rolf Inge God\u00f8y. 2019. Characterizing movement fluency in musical performance: Toward a generic measure for technology enhanced learning. Frontiers in psychology 10 (2019), 84."},{"key":"e_1_2_2_9_1","volume-title":"Conference on New Interfaces for Musical Expression. 106--110","author":"Hadjakos Aristotelis","year":"2013","unstructured":"Aristotelis Hadjakos, Tobias Gro\u00dfhauser, and Werner Goebl. 2013. Motion analysis of music ensembles with the Kinect. In Conference on New Interfaces for Musical Expression. 106--110."},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3550469.3555378"},{"key":"e_1_2_2_11_1","volume-title":"Multi-layer adaptation of group coordination in musical ensembles. Scientific reports 9, 1","author":"Hilt Pauline M","year":"2019","unstructured":"Pauline M Hilt, Leonardo Badino, Alessandro D'Ausilio, Gualtiero Volpe, Ser\u00e2 Tokay, Luciano Fadiga, and Antonio Camurri. 2019. Multi-layer adaptation of group coordination in musical ensembles. Scientific reports 9, 1 (2019), 5854."},{"key":"e_1_2_2_12_1","volume-title":"Audio-Driven Violin Performance Animation with Clear Fingering and Bowing. In ACM SIGGRAPH 2022 Posters. 1--2.","author":"Hirata Asuka","year":"2022","unstructured":"Asuka Hirata, Keitaro Tanaka, Masatoshi Hamanaka, and Shigeo Morishima. 2022. Audio-Driven Violin Performance Animation with Clear Fingering and Bowing. In ACM SIGGRAPH 2022 Posters. 1--2."},{"key":"e_1_2_2_13_1","volume-title":"Bowing-Net: Motion Generation for String Instruments Based on Bowing Information. In ACM SIGGRAPH 2021 Posters. 1--2.","author":"Hirata Asuka","year":"2021","unstructured":"Asuka Hirata, Keitaro Tanaka, Ryo Shimamura, and Shigeo Morishima. 2021. Bowing-Net: Motion Generation for String Instruments Based on Bowing Information. In ACM SIGGRAPH 2021 Posters. 1--2."},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.3389\/fdigh.2017.00009"},{"key":"e_1_2_2_15_1","unstructured":"Glenn Jocher Ayush Chaurasia and Jing Qiu. 2023. YOLO by Ultralytics. https:\/\/github.com\/ultralytics\/ultralytics"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413848"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGI.2000.852318"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461329"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1599301.1599304"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2856090"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2019.04.026"},{"key":"e_1_2_2_22_1","volume-title":"Juhyun Lee, et al.","author":"Lugaresi Camillo","year":"2019","unstructured":"Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. 2019. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)."},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2017.3"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1080\/09298215.2014.922999"},{"key":"e_1_2_2_25_1","volume-title":"Proceedings, Part XX 16","author":"Moon Gyeongsik","year":"2020","unstructured":"Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. 2020. Interhand2. 6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XX 16. Springer, 548--564."},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00061"},{"key":"e_1_2_2_27_1","volume-title":"VOCAL: Vowel and Consonant Layering for Expressive Animator-Centric Singing Animation. In SIGGRAPH Asia 2022 Conference Papers. 1--9.","author":"Pan Yifang","year":"2022","unstructured":"Yifang Pan, Chris Landreth, Eugene Fiume, and Karan Singh. 2022. VOCAL: Vowel and Consonant Layering for Expressive Animator-Centric Singing Animation. In SIGGRAPH Asia 2022 Conference Papers. 1--9."},{"key":"e_1_2_2_28_1","unstructured":"Panagiotis Papiotis et al. 2016. A computational approach to studying interdependence in string quartet performance. Ph. D. Dissertation. Universitat Pompeu Fabra."},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/HAVE.2014.6954339"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.3389\/fcomp.2019.00008"},{"key":"e_1_2_2_31_1","volume-title":"Music, Mind, and Embodiment: 11th International Symposium, CMMR","author":"Perez-Carrillo Alfonso","year":"2015","unstructured":"Alfonso Perez-Carrillo, Josep-Lluis Arcos, and Marcelo Wanderley. 2016. Estimation of guitar fingering and plucking controls based on multimodal analysis of motion, audio and musical score. In Music, Mind, and Embodiment: 11th International Symposium, CMMR 2015, Plymouth, UK, June 16--19, 2015, Revised Selected Papers 11. Springer, 71--87."},{"key":"e_1_2_2_32_1","volume-title":"Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610","author":"Romero Javier","year":"2022","unstructured":"Javier Romero, Dimitrios Tzionas, and Michael J Black. 2022. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610 (2022)."},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2018.2856178"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1121\/1.3227640"},{"key":"e_1_2_2_35_1","volume-title":"Adam Zukerman, Chethan M Parameshwara, and Yiannis Aloimonos.","author":"Shrestha Snehesh","year":"2022","unstructured":"Snehesh Shrestha, Cornelia Ferm\u00fcller, Tianyu Huang, Pyone Thant Win, Adam Zukerman, Chethan M Parameshwara, and Yiannis Aloimonos. 2022. AIMusicGuru: Music Assisted Human Pose Correction. arXiv preprint arXiv:2203.12829 (2022)."},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.494"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1080\/09298215.2014.925939"},{"key":"e_1_2_2_38_1","volume-title":"Interpersonal Coordination in Dyadic Performance","author":"Thompson Marc R","unstructured":"Marc R Thompson, Georgios Diapoulis, Tommi Himberg, and Petri Toiviainen. 2017. Interpersonal Coordination in Dyadic Performance. In The Routledge Companion to Embodied Music Interaction. Routledge, 186--194."},{"key":"e_1_2_2_39_1","unstructured":"Micka\u00ebl Tits Jo\u00eblle Tilmanne Nicolas D'alessandro and Marcelo M Wanderley. 2015. Feature extraction and expertise analysis of pianists' Motion-Captured Finger Gestures. In ICMC."},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/1073368.1073414"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3125571.3125588"},{"key":"e_1_2_2_42_1","volume-title":"The Oxford Handbook of Music Performance","author":"Wanderley Marcelo M","unstructured":"Marcelo M Wanderley. 2022. The Oxford Handbook of Music Performance. Vol. 2. Oxford University Press. 465--494 pages."},{"key":"e_1_2_2_43_1","volume-title":"Computer Graphics Forum","author":"Wheatland Nkenge","unstructured":"Nkenge Wheatland, Yingying Wang, Huaguang Song, Michael Neff, Victor Zordan, and Sophie J\u00f6rg. 2015. State of the art in hand and finger modeling and animation. In Computer Graphics Forum, Vol. 34. Wiley Online Library, 735--760."},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW60793.2023.00455"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/1279740.1279818"},{"key":"e_1_2_2_46_1","volume-title":"Mediapipe hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214","author":"Zhang Fan","year":"2020","unstructured":"Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. 2020. Mediapipe hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214 (2020)."},{"key":"e_1_2_2_47_1","volume-title":"Hand Pose Estimation with Mems-Ultrasonic Sensors. In SIGGRAPH Asia 2023 Conference Papers. 1--11","author":"Zhang Qiang","year":"2023","unstructured":"Qiang Zhang, Yuanqiao Lin, Yubin Lin, and Szymon Rusinkiewicz. 2023a. Hand Pose Estimation with Mems-Ultrasonic Sensors. In SIGGRAPH Asia 2023 Conference Papers. 1--11."},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01387"},{"key":"e_1_2_2_49_1","volume-title":"CCOM-HuQin: an Annotated Multimodal Chinese Fiddle Performance Dataset. arXiv preprint arXiv:2209.06496","author":"Zhang Yu","year":"2022","unstructured":"Yu Zhang, Ziya Zhou, Xiaobing Li, Feng Yu, and Maosong Sun. 2022. CCOM-HuQin: an Annotated Multimodal Chinese Fiddle Performance Dataset. arXiv preprint arXiv:2209.06496 (2022)."},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3603618"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00589"},{"key":"e_1_2_2_52_1","volume-title":"Algorithm 778: LBFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on mathematical software (TOMS) 23, 4","author":"Zhu Ciyou","year":"1997","unstructured":"Ciyou Zhu, Richard H Byrd, Peihuang Lu, and Jorge Nocedal. 1997. Algorithm 778: LBFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on mathematical software (TOMS) 23, 4 (1997), 550--560."},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1002\/cav.1477"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-92659-5_16"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3658235","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3658235","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3658235","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:04:16Z","timestamp":1750291456000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3658235"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,19]]},"references-count":54,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,7,19]]}},"alternative-id":["10.1145\/3658235"],"URL":"https:\/\/doi.org\/10.1145\/3658235","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,19]]},"assertion":[{"value":"2024-07-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}