Research article

Audio2DiffuGesture: Generating a diverse co-speech gesture based on a diffusion model

  • Received: 23 June 2024 · Revised: 11 September 2024 · Accepted: 14 September 2024 · Published: 26 September 2024
  • People use a combination of language and gestures to convey intentions, making the generation of natural co-speech gestures a challenging task. In audio-driven gesture generation, relying solely on features extracted from raw audio waveforms limits the model's ability to fully learn the joint distribution between audio and gestures. To address this limitation, we integrated key features from both raw audio waveforms and Mel-spectrograms. Specifically, we employed cascaded 1D convolutions to extract features from the audio waveform and a two-stage attention mechanism to capture features from the Mel-spectrogram. The fused features were then fed into a Transformer with cross-dimension attention for sequence modeling, which mitigated accumulated non-autoregressive errors and reduced redundant information. We developed a diffusion model-based Audio to Diffusion Gesture (A2DG) generation pipeline capable of producing high-quality and diverse gestures. Our method demonstrated superior performance in extensive experiments compared to established baselines. On the TED Gesture and TED Expressive datasets, the Fréchet Gesture Distance (FGD) improved by 16.8% and 56%, respectively. Additionally, a user study validated that the co-speech gestures generated by our method are more vivid and realistic.

    Citation: Hongze Yao, Yingting Xu, Weitao Wu, Huabin He, Wen Ren, Zhiming Cai. Audio2DiffuGesture: Generating a diverse co-speech gesture based on a diffusion model[J]. Electronic Research Archive, 2024, 32(9): 5392-5408. doi: 10.3934/era.2024250
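
The abstract above outlines the audio-conditioning path of A2DG: cascaded 1D convolutions over the raw waveform, a two-stage attention mechanism over the Mel-spectrogram, fusion of the two feature streams, and Transformer-based sequence modeling. Below is a minimal, hypothetical PyTorch sketch of that pipeline, not the authors' released code: the layer sizes, the torchaudio MelSpectrogram settings, the 34-frame clip length, the two stacked self-attention layers standing in for the paper's two-stage attention, and the plain nn.TransformerEncoder standing in for its cross-dimension attention are all illustrative assumptions.

```python
# Hypothetical sketch, not the authors' released code: a minimal PyTorch
# outline of the audio-conditioning path described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio


class WaveEncoder(nn.Module):
    """Cascaded strided 1D convolutions over the raw waveform."""

    def __init__(self, dim=128):
        super().__init__()
        chans = [1, 32, 64, dim]  # channel progression is an assumption
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(chans[i], chans[i + 1], kernel_size=15, stride=4, padding=7),
                nn.GELU(),
            )
            for i in range(len(chans) - 1)
        ])

    def forward(self, wav):        # wav: (B, 1, samples)
        return self.convs(wav)     # (B, dim, T_wave)


class MelEncoder(nn.Module):
    """Mel-spectrogram branch; two stacked self-attention layers stand in
    for the paper's two-stage attention."""

    def __init__(self, n_mels=80, dim=128, sample_rate=16000):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=n_mels)
        self.proj = nn.Linear(n_mels, dim)
        self.attn1 = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, wav):                      # wav: (B, 1, samples)
        m = self.mel(wav.squeeze(1))             # (B, n_mels, T_mel)
        x = self.proj(m.transpose(1, 2))         # (B, T_mel, dim)
        x = x + self.attn1(x, x, x)[0]           # attention stage 1
        x = x + self.attn2(x, x, x)[0]           # attention stage 2
        return x.transpose(1, 2)                 # (B, dim, T_mel)


class AudioFeatureFusion(nn.Module):
    """Fuse both branches at the gesture frame rate, then run a Transformer
    encoder (a stand-in for the paper's cross-dimension attention)."""

    def __init__(self, dim=128, num_frames=34):
        super().__init__()
        self.wave, self.mel = WaveEncoder(dim), MelEncoder(dim=dim)
        self.num_frames = num_frames             # gesture poses per clip (assumption)
        self.fuse = nn.Linear(2 * dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, wav):                                  # wav: (B, 1, samples)
        # Resample both feature streams to the gesture frame rate before fusing.
        w = F.adaptive_avg_pool1d(self.wave(wav), self.num_frames)
        m = F.adaptive_avg_pool1d(self.mel(wav), self.num_frames)
        fused = self.fuse(torch.cat([w, m], dim=1).transpose(1, 2))
        return self.encoder(fused)                           # (B, num_frames, dim)


if __name__ == "__main__":
    audio = torch.randn(2, 1, 32000)             # 2 s of 16 kHz audio
    print(AudioFeatureFusion()(audio).shape)     # torch.Size([2, 34, 128])
```

In the full A2DG pipeline, audio features like these would condition the diffusion-based gesture generator; that denoising component is omitted from this sketch.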



  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)