
Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation

Huan Yang*1, Jiahui Chen*1, 2, Chaofan Ding1, Runhua Shi1, Siyu Xiong1, Qingqi Hong2, Xiaoqi Mo1, Xinhan Di1

1 Giant Network AI Lab, 2 Xiamen University

*Denotes Equal Contribution

ECCV 2024 Workshop Paper


Abstract

Gestures are pivotal in enhancing co-speech communication. While recent works have mostly focused on point-level motion transformation or fully supervised motion representations through data-driven approaches, we explore the representation of gestures in co-speech, with a focus on self-supervised representation and pixel-level motion deviation, utilizing a diffusion model that incorporates latent motion features. Our approach leverages self-supervised deviation in latent representation to facilitate hand gesture generation, which is crucial for generating realistic gesture videos. Results of our first experiment demonstrate that our method enhances the quality of generated videos, with improvements of 2.7% to 4.5% on FGD, DIV, and FVD, 8.1% on PSNR, and 2.5% on SSIM over current state-of-the-art methods.
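As a point of reference for the metrics above, PSNR measures per-frame reconstruction fidelity against ground truth. A minimal sketch of its computation (the function below is illustrative and not part of this repository):

```python
import numpy as np

def psnr(generated: np.ndarray, reference: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames of identical shape (illustrative)."""
    mse = np.mean((generated.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames: PSNR is unbounded
    return 10.0 * np.log10((max_val ** 2) / mse)
```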

Pipeline

The co-speech gesture video generation pipeline of our proposed method consists of three main components: 1) the latent deviation extractor (orange), 2) the latent deviation decoder (blue), and 3) the latent motion diffusion (green).

We propose a novel method for generating co-speech gesture videos that utilizes self-supervised full-scene deviation to produce a co-speech gesture video $V$ (i.e., an image sequence) exhibiting natural poses and synchronized movements. The generation process takes as input the speaker's speech audio $a$ and a source image $I_S$.
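For illustration, the generation flow described above could be organized along the following lines; every module name and interface here is a hypothetical placeholder rather than the released implementation:

```python
import torch
import torch.nn as nn

class CoSpeechGesturePipeline(nn.Module):
    """Illustrative skeleton of the three-component pipeline (not the official code)."""

    def __init__(self, extractor: nn.Module, diffusion: nn.Module, decoder: nn.Module):
        super().__init__()
        self.extractor = extractor   # latent deviation extractor (orange)
        self.diffusion = diffusion   # latent motion diffusion (green)
        self.decoder = decoder       # latent deviation decoder (blue)

    @torch.no_grad()
    def generate(self, audio: torch.Tensor, source_image: torch.Tensor) -> torch.Tensor:
        # Encode the source image into appearance and motion latents.
        appearance, source_motion = self.extractor(source_image)
        # Sample a motion-feature sequence conditioned on the speech audio.
        motion_seq = self.diffusion.sample(audio, source_motion)
        # Decode each motion feature into a frame from the source appearance.
        frames = [self.decoder(appearance, mf) for mf in motion_seq]
        return torch.stack(frames)  # video V as an image sequence
```

The three placeholder modules map onto the colored blocks of the pipeline figure: the extractor produces latents, the diffusion model drives motion from audio, and the decoder renders frames.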

We structure the training process into two stages. In the first stage, a driving image $I_D$ and a source image $I_S$ are used to train the base model: the proposed latent deviation module, consisting of the latent deviation extractor, the warping calculator, and the latent deviation decoder, is trained under self-supervision, while the other modules in the base model are trained under full supervision. In the second stage, the self-supervised motion features, consisting of $MF_i$, $\hat{MF}_{[i-4, i-1]}$, and the noise-added motion feature sequence $[MF_j]$, are used to train the latent motion diffusion model.
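As a sketch of what the second-stage objective could look like, assuming a standard DDPM-style forward process over the motion-feature sequence conditioned on the speech audio and the preceding motion features $\hat{MF}_{[i-4, i-1]}$ (the denoiser interface below is a placeholder, not this repository's API):

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, motion_seq, audio_feat, past_motion, alphas_cumprod):
    """One DDPM-style training step on a motion-feature sequence (illustrative only).

    motion_seq:  clean motion features [MF_j], shape (B, T, D)
    past_motion: conditioning features MF_hat over [i-4, i-1], shape (B, 4, D)
    """
    B = motion_seq.shape[0]
    # Sample a random diffusion timestep per batch element.
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=motion_seq.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    # Forward process: add noise to the clean motion features.
    noise = torch.randn_like(motion_seq)
    noisy_seq = a_bar.sqrt() * motion_seq + (1.0 - a_bar).sqrt() * noise
    # Predict the noise conditioned on audio and past motion features.
    pred = denoiser(noisy_seq, t, audio_feat, past_motion)
    return F.mse_loss(pred, noise)
```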


Generated video

chemistry_1.mp4
chemistry_2.mp4
oliver_1.mp4
oliver_2.mp4
seth_1.mp4
seth_2.mp4

Visual comparison

(Qualitative comparison figures.)

Generated video comparison

Code

Training code:

Coming soon...

Checkpoints

ckpts

Test data

test data

# Create and activate the environment (Python 3.8, PyTorch 1.13.1 with CUDA 11.6)
conda create -n cospeech python=3.8
conda activate cospeech
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
# Install the remaining dependencies, then run inference
pip install -r requirements.txt
python test.py
