{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T23:35:16Z","timestamp":1761176116897,"version":"build-2065373602"},"reference-count":0,"publisher":"IOS Press","isbn-type":[{"value":"9781643686318","type":"electronic"}],"license":[{"start":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T00:00:00Z","timestamp":1761004800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,10,21]]},"abstract":"<jats:p>Co-speech video generation focuses on improving the authenticity of virtual characters by aligning their gestures and facial expressions with spoken audio. Despite recent advancements, existing methods often struggle with speech-gesture misalignment and unnatural hand motions. To address the issues, we propose a novel audio-driven gesture generation framework. This framework integrates a hierarchical diffusion model with multimodal feature disentanglement and dynamic fusion strategies. The core of our approach is the Cross-modal Spatial-Temporal Attention mechanism (CSTA), which ensures high-fidelity synchronization between audio and human motion while capturing fine-grained dynamics of hand and facial. By effectively disentangling different motion modalities, CSTA enhances the alignment between body part movements and the audio signal, leading to more natural and coherent video synthesis. Furthermore, to improve the physical plausibility and diversity of generated gestures, we introduce a Hand Memory Module (HMM). This module leverages a Vector Quantization-Variational Autoencoder (VQ-VAE) to learn a discrete gesture prior. By embedding these learned priors during the generation process, our method not only enhances temporal consistency but also preserves intricate details, mitigating common issues in prior work such as motion blur and detail loss. Experiments on the PATS and BEAT2 datasets demonstrate that CSTA surpasses existing methods in generating co-speech videos with more synchronized and natural hand motions, achieving state-of-the-art performance in both qualitative and quantitative evaluations. Project Page: CSTA-HMM<\/jats:p>","DOI":"10.3233\/faia250794","type":"book-chapter","created":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:42:27Z","timestamp":1761126147000},"source":"Crossref","is-referenced-by-count":0,"title":["Toward Realistic Co-Speech Motion via Cross-Modal Spatial-Temporal Attention and Hand Memory Module"],"prefix":"10.3233","author":[{"given":"Jiye","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Information and Communication Engineering, Communication University of China, Beijing, China"},{"name":"Pengcheng Laboratory, Shenzhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dingwei","family":"Liu","sequence":"additional","affiliation":[{"name":"Pengcheng Laboratory, Shenzhen, China"},{"name":"School of Future Technology, South China University of Technology, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guibiao","family":"Liao","sequence":"additional","affiliation":[{"name":"School of Electronic and Computer Engineering, Peking University, Shenzhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiuhua","family":"Jiang","sequence":"additional","affiliation":[{"name":"School of Information and Communication Engineering, Communication University of China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jiangbo","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Information and Communication Engineering, Communication University of China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"7437","container-title":["Frontiers in Artificial Intelligence and Applications","ECAI 2025"],"original-title":[],"link":[{"URL":"https:\/\/ebooks.iospress.nl\/pdf\/doi\/10.3233\/FAIA250794","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T09:42:27Z","timestamp":1761126147000},"score":1,"resource":{"primary":{"URL":"https:\/\/ebooks.iospress.nl\/doi\/10.3233\/FAIA250794"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,21]]},"ISBN":["9781643686318"],"references-count":0,"URL":"https:\/\/doi.org\/10.3233\/faia250794","relation":{},"ISSN":["0922-6389","1879-8314"],"issn-type":[{"value":"0922-6389","type":"print"},{"value":"1879-8314","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,21]]}}}