Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation
Huan Yang*1, Jiahui Chen*1, 2, Chaofan Ding1, Runhua Shi1, Siyu Xiong1, Qingqi Hong2, Xiaoqi Mo1, Xinhan Di1
1 Giant Network AI Lab, 2 Xiamen University
*Denotes Equal Contribution
Accepted as an ECCV 2024 workshop paper
Gestures are pivotal in enhancing co-speech communication. While recent works have mostly focused on point-level motion transformation or fully supervised motion representations learned through data-driven approaches, we explore the representation of gestures in co-speech video, with a focus on self-supervised representation and pixel-level motion deviation, utilizing a diffusion model that incorporates latent motion features. Our approach leverages self-supervised deviation in latent representation to facilitate hand gesture generation, which is crucial for producing realistic gesture videos. Results of our first experiment demonstrate that our method enhances the quality of generated videos, with improvements of 2.7% to 4.5% for FGD, DIV, and FVD, 8.1% for PSNR, and 2.5% for SSIM over current state-of-the-art methods.
The co-speech gesture video generation pipeline of our proposed method consists of three main components: 1) the latent deviation extractor (orange), 2) the latent deviation decoder (blue), and 3) the latent motion diffusion (green).
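As a rough illustration of how these three components could be wired together, the PyTorch sketch below instantiates a latent deviation extractor, a latent motion diffusion denoiser, and a latent deviation decoder. All module names, layer choices, dimensions, and the single-step denoising call are assumptions made for exposition only; they do not reflect the released implementation.

import torch
import torch.nn as nn

class LatentDeviationExtractor(nn.Module):
    # Hypothetical encoder (orange): maps a frame to a latent code.
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, frame):
        return self.net(frame)

class LatentMotionDiffusion(nn.Module):
    # Hypothetical denoiser (green) over latent motion, conditioned on audio
    # features; a real model would iterate over diffusion timesteps.
    def __init__(self, latent_dim=256, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, noisy_latent, audio_feat, t):
        t = t.float().unsqueeze(-1)  # deliberately simple timestep conditioning
        return self.net(torch.cat([noisy_latent, audio_feat, t], dim=-1))

class LatentDeviationDecoder(nn.Module):
    # Hypothetical decoder (blue): renders a frame from the reference latent
    # plus the predicted motion deviation.
    def __init__(self, latent_dim=256, size=64):
        super().__init__()
        self.size = size
        self.net = nn.Sequential(nn.Linear(latent_dim, 3 * size * size), nn.Tanh())

    def forward(self, latent, deviation):
        return self.net(latent + deviation).view(-1, 3, self.size, self.size)

extractor, diffusion, decoder = LatentDeviationExtractor(), LatentMotionDiffusion(), LatentDeviationDecoder()
ref = torch.randn(1, 3, 64, 64)   # reference speaker frame
audio = torch.randn(1, 128)       # per-frame audio feature (assumed shape)
z_ref = extractor(ref)
noise = torch.randn_like(z_ref)   # latent motion initialised from noise
deviation = diffusion(noise, audio, torch.tensor([10]))
frame = decoder(z_ref, deviation)
print(frame.shape)                # torch.Size([1, 3, 64, 64])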
We propose a novel method that utilizes self-supervised full-scene deviation to generate co-speech gesture videos.
We structured the training process into two stages. In the first stage, a driving image
chemistry_1.mp4
chemistry_2.mp4
oliver_1.mp4
oliver_2.mp4
seth_1.mp4
seth_2.mp4
Training code:
Coming soon...
conda create -n cospeech python=3.8
conda activate cospeech
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
pip install -r requirements.txt
python test.py