This is unofficial training code for MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance.
An overview of the framework of MimicMotion.
- In my experiments the PoseNet was hard to control, so I ran many experiments on it. My impression is that the PoseNet alone is not good at controlling pose. However, when I trained the PoseNet together with a UNet2D, the results showed that it can control pose for SD 2.1. See my other project, Pose2Image.
- diffusers is unstable across versions: I tried several versions and got different results.
- The task needs clean data and a lot of it; it is data hungry.
- Training for many epochs made the results worse, though my dataset may simply be too poor.
- It may work better to train the PoseNet on images first, then fine-tune the UNet and PoseNet for SVD.
Python 3+ with torch 2.x is recommended; this setup was validated on an Nvidia A800 GPU. Run the commands below to install all Python dependencies:
conda env create -f environment.yaml
conda activate mimicmotion
If you experience connection issues with Hugging Face, you can utilize the mirror endpoint by setting the environment variable: export HF_ENDPOINT=
Please download weights manually as follows:
cd MimicMotions/
mkdir models
- Download DWPose pretrained model: dwpose
mkdir -p models/DWPose
wget -O models/DWPose/yolox_l.onnx
wget -O models/DWPose/dw-ll_ucoco_384.onnx
- Download the pre-trained checkpoint of MimicMotion from Hugging Face
wget -P models/
- The SVD model stabilityai/stable-video-diffusion-img2vid-xt-1-1 will be automatically downloaded.
Finally, all the weights should be organized under models as follows:
├── DWPose
│ ├── dw-ll_ucoco_384.onnx
│ └── yolox_l.onnx
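Before launching training or inference, it can help to confirm that the DWPose files landed where the tree above expects them. The following is a minimal Python sketch, assuming the working directory is the repository root. The helper name `missing_weights` is hypothetical and not part of this repository, and only the two DWPose files listed above are checked.

```python
from pathlib import Path

# Files taken from the directory tree above, relative to the models folder.
EXPECTED_WEIGHTS = (
    "DWPose/dw-ll_ucoco_384.onnx",
    "DWPose/yolox_l.onnx",
)


def missing_weights(models_dir="models", expected=EXPECTED_WEIGHTS):
    """Return the expected files that are absent or empty under models_dir.

    Hypothetical helper for illustration; it is not shipped with this repo.
    """
    root = Path(models_dir)
    missing = []
    for rel in expected:
        path = root / rel
        # A zero-byte file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_weights():
        print(f"missing or empty: models/{rel}")
```

An empty result means both files are present and non-empty; it does not verify that the downloads are the correct models.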
|-- videos
|-- pose_score
|-- ref
|-- dwpose
Run the script once per output type to extract the pose, the pose score, and the reference face image:
python ubc_data/videos dwpose
python ubc_data/videos pose_score
python ubc_data/videos ref
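After the three passes above, a quick check can show which videos still lack an output. This is a rough Python sketch under a loud assumption: each output is a file or folder whose stem equals the stem of the source video (for example `videos/a.mp4` and `pose_score/a.npy`). The repository does not document its naming scheme here, so adjust the matching if yours differs. The function name `unprocessed_videos` is hypothetical.

```python
from pathlib import Path

# Output folders named in the dataset tree above.
OUTPUT_DIRS = ("dwpose", "pose_score", "ref")


def unprocessed_videos(data_root="ubc_data"):
    """Map each output folder to the video stems that have no entry in it.

    Assumption (not confirmed by the repo): an output for videos/<name>.<ext>
    is a file or folder inside the output folder whose stem is <name>.
    Raises FileNotFoundError if <data_root>/videos does not exist.
    """
    root = Path(data_root)
    stems = sorted(p.stem for p in (root / "videos").iterdir() if p.is_file())
    report = {}
    for sub in OUTPUT_DIRS:
        folder = root / sub
        # A missing output folder means nothing has been processed for it yet.
        done = {p.stem for p in folder.iterdir()} if folder.is_dir() else set()
        report[sub] = [s for s in stems if s not in done]
    return report


if __name__ == "__main__":
    for sub, stems in unprocessed_videos().items():
        print(f"{sub}: {len(stems)} video(s) without output")
```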
accelerate launch --num_processes 1 --mixed_precision "fp16" \
--video_folder='ubc_data' \
--pretrained_model_name_or_path="stabilityai/stable-video-diffusion-img2vid-xt-1-1" \
--per_gpu_batch_size=1 \
--max_train_steps=50000 \
--width=576 \
--height=768 \
--checkpointing_steps=200 \
--learning_rate=1e-05 \
--lr_warmup_steps=0 \
--seed=123 \
A sample configuration for testing is provided as test.yaml. You can also easily modify the various configurations according to your needs.
python --inference_config configs/test.yaml
Tip: if your GPU memory is limited, try setting the environment variable PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256.
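If exporting the variable in the shell is inconvenient, the same setting can be applied from Python, as long as it happens before PyTorch initializes CUDA. Below is a minimal sketch; the function name `configure_cuda_allocator` is hypothetical, and placing the call before `import torch` is a conservative choice rather than something this repository prescribes.

```python
import os


def configure_cuda_allocator(max_split_size_mb=256):
    """Set PYTORCH_CUDA_ALLOC_CONF unless the shell already provides it.

    Hypothetical helper for illustration. Returns the value in effect.
    """
    value = f"max_split_size_mb:{int(max_split_size_mb)}"
    # setdefault keeps any value that was already exported in the shell.
    return os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", value)


# Call this before torch touches the GPU; doing it before `import torch`
# is the simplest way to be sure the allocator sees the setting.
configure_cuda_allocator()
```

Using `setdefault` means a value exported in the shell still wins, so the script does not silently override a setting chosen for a specific machine.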
For the 35s demo video, the 72-frame model requires 16GB VRAM (4060ti) and finishes in 20 minutes on a 4090 GPU.
The minimum VRAM requirement for the 16-frame U-Net model is 8GB; however, the VAE decoder demands 16GB. You have the option to run the VAE decoder on CPU.
title={MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance},
author={Yuang Zhang and Jiaxi Gu and Li-Wen Wang and Han Wang and Junqi Cheng and Yuefeng Zhu and Fangyuan Zou},
journal={arXiv preprint arXiv:2406.19680},