Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation
Huan Yang*1, Jiahui Chen*1, 2, Chaofan Ding1, Runhua Shi1, Siyu Xiong1, Qingqi Hong2, Xiaoqi Mo1, Xinhan Di1
1 Giant Network AI Lab, 2 Xiamen University
*Denotes Equal Contribution
Accepted as an ECCV 2024 workshop paper
Gestures are pivotal in enhancing co-speech communication. While recent works have mostly focused on point-level motion transformation or fully supervised motion representations learned through data-driven approaches, we explore the representation of gestures in co-speech video, with a focus on self-supervised representation and pixel-level motion deviation, utilizing a diffusion model that incorporates latent motion features. Our approach leverages self-supervised deviation in latent representation to facilitate hand gesture generation, which is crucial for producing realistic gesture videos. Results of our first experiment demonstrate that our method enhances the quality of generated videos, with improvements of 2.7% to 4.5% for FGD, DIV, and FVD, 8.1% for PSNR, and 2.5% for SSIM over current state-of-the-art methods.
The co-speech gesture video generation pipeline of our proposed method consists of three main components: 1) the latent deviation extractor (orange), 2) the latent deviation decoder (blue), and 3) the latent motion diffusion (green).
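As a rough illustration of how these three components could be wired together, the PyTorch sketch below instantiates a latent deviation extractor, a latent motion diffusion denoiser, and a latent deviation decoder. All module names, layer choices, dimensions, and the single-step denoising call are assumptions made for exposition only; they do not reflect the released implementation.

import torch
import torch.nn as nn

class LatentDeviationExtractor(nn.Module):
    # Hypothetical encoder (orange): maps a frame to a latent code.
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, frame):
        return self.net(frame)

class LatentMotionDiffusion(nn.Module):
    # Hypothetical denoiser (green) over latent motion, conditioned on audio
    # features; a real model would iterate over diffusion timesteps.
    def __init__(self, latent_dim=256, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, noisy_latent, audio_feat, t):
        t = t.float().unsqueeze(-1)  # deliberately simple timestep conditioning
        return self.net(torch.cat([noisy_latent, audio_feat, t], dim=-1))

class LatentDeviationDecoder(nn.Module):
    # Hypothetical decoder (blue): renders a frame from the reference latent
    # plus the predicted motion deviation.
    def __init__(self, latent_dim=256, size=64):
        super().__init__()
        self.size = size
        self.net = nn.Sequential(nn.Linear(latent_dim, 3 * size * size), nn.Tanh())

    def forward(self, latent, deviation):
        return self.net(latent + deviation).view(-1, 3, self.size, self.size)

extractor, diffusion, decoder = LatentDeviationExtractor(), LatentMotionDiffusion(), LatentDeviationDecoder()
ref = torch.randn(1, 3, 64, 64)   # reference speaker frame
audio = torch.randn(1, 128)       # per-frame audio feature (assumed shape)
z_ref = extractor(ref)
noise = torch.randn_like(z_ref)   # latent motion initialised from noise
deviation = diffusion(noise, audio, torch.tensor([10]))
frame = decoder(z_ref, deviation)
print(frame.shape)                # torch.Size([1, 3, 64, 64])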
We propose a novel method that utilizes self-supervised full-scene deviation to generate co-speech gesture videos.
We structured the training process into two stages. In the first stage, a driving image
chemistry_1.mp4
chemistry_2.mp4
oliver_1.mp4
oliver_2.mp4
seth_1.mp4
seth_2.mp4
Training code:
Coming soon...
conda create -n cospeech python=3.8
conda activate cospeech
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
pip install -r requirements.txt
python test.py