The source code for our paper "Deep Image Spatial Transformation for Person Image Generation" (CVPR 2020).
We propose a Global-Flow Local-Attention Model for deep image spatial transformation. Our model can be flexibly applied to tasks such as:
- Pose-Guided Person Image Generation:
Left: generated results of our model; Right: Input source images.
- Pose-Guided Person Image Animation
Leftmost: Skeleton sequences. The others: Animation results.
- Face Image Animation
Left: Input image; Right: Output results.
- View Synthesis
From left to right: Input image, results of Appearance Flow, results of ours, ground-truth images.
- 2020.4.30 Several demos are provided for quick exploration.
- 2020.4.29 Code for Pose-Guided Person Image Animation is available now!
- 2020.3.15 We uploaded the code and trained models for Face Animation and View Synthesis!
- 2020.3.3 The project website and paper are available!
- 2020.2.29 Code for PyTorch is available now!
For a quick exploration of our model, try the online Colab demo.
Requirements
- Python 3
- PyTorch (1.0.0)
- CUDA
- visdom
Conda installation
# 1. Create a conda virtual environment.
conda create -n gfla python=3.6 -y
source activate gfla
# 2. Install the dependencies.
pip install -r requirement.txt
# 3. Build the custom PyTorch CUDA extensions.
./setup.sh
Note: The current code is tested with a Tesla V100. If you use a different GPU, you may need to select the correct nvcc_args
for your GPU when you build the custom CUDA extensions. Comment or uncomment the --gencode lines
in block_extractor/setup.py, local_attn_reshape/setup.py, and resample2d_package/setup.py. Please check here for details.
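To find out which --gencode value matches your GPU, a small helper along the following lines may be useful. It is only a sketch: the function names `gencode_args` and `detect_capability` are made up for illustration, and the exact contents of nvcc_args in the three setup.py files are not reproduced here. It relies on `torch.cuda.get_device_capability`, which returns the compute capability as a `(major, minor)` tuple, and on the usual nvcc flag form `-gencode arch=compute_XY,code=sm_XY`.

```python
# Hypothetical helper, not part of this repository.

def gencode_args(major, minor):
    """Build the nvcc -gencode flag pair for one compute capability."""
    arch = "{}{}".format(major, minor)
    return ["-gencode", "arch=compute_{0},code=sm_{0}".format(arch)]


def detect_capability(device_index=0):
    """Return (major, minor) for a visible CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


if __name__ == "__main__":
    capability = detect_capability()
    if capability is None:
        print("No CUDA device visible to PyTorch; look up the capability by hand.")
    else:
        # A Tesla V100 reports (7, 0), which gives compute_70 / sm_70.
        print(" ".join(gencode_args(*capability)))
```

The printed value is the one to leave uncommented in each setup.py before running ./setup.sh again.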
We provide the pre-trained weights of our model. The resources are listed below:
- Pose-Guided Person Image Generation
- Pose-Guided Person Image Animation
  Google Drive: FashionVideo | iPER
  OneDrive: FashionVideo | iPER
- Face Image Animation
  Google Drive: Face Animation
  OneDrive: Face_Animation
- Novel View Synthesis
  Google Drive: ShapeNet Car | ShapeNet Chair
  OneDrive: ShapeNet_Car | ShapeNet_Chair
Download the pre-trained models and the demo images by running the following script:
./download.sh
The Pose-Guided Person Image Generation task is to transfer a source person image to a target pose.
Run the demo of this task:
python demo.py \
--name=pose_fashion_checkpoints \
--model=pose \
--attn_layer=2,3 \
--kernel_size=2=5,3=3 \
--gpu_id=0 \
--dataset_mode=fashion \
--dataroot=./dataset/fashion \
--results_dir=./demo_results/fashion
For more training and testing details, please see PERSON_IMAGE_GENERATION.md.
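The demo command passes `--attn_layer=2,3` together with `--kernel_size=2=5,3=3`. The value format suggests that each `layer=size` pair assigns a local attention kernel size to one of the listed layers; this reading is an assumption, not something taken from demo.py. Under that assumption, a minimal sketch of parsing and cross-checking the two options could look like this (all function names are hypothetical):

```python
# Illustrative parser; the repository's own option handling may differ.

def parse_attn_layers(text):
    """Turn '2,3' into [2, 3]."""
    return [int(item) for item in text.split(",") if item]


def parse_kernel_sizes(text):
    """Turn '2=5,3=3' into {2: 5, 3: 3}."""
    sizes = {}
    for pair in text.split(","):
        layer, size = pair.split("=")
        sizes[int(layer)] = int(size)
    return sizes


def kernel_size_per_layer(attn_layer, kernel_size):
    """Map every attention layer to its kernel size, failing on missing entries."""
    layers = parse_attn_layers(attn_layer)
    sizes = parse_kernel_sizes(kernel_size)
    missing = [layer for layer in layers if layer not in sizes]
    if missing:
        raise ValueError("no kernel size given for layers: {}".format(missing))
    return {layer: sizes[layer] for layer in layers}


if __name__ == "__main__":
    print(kernel_size_per_layer("2,3", "2=5,3=3"))
```

If you change `--attn_layer`, a check of this kind makes it easy to see whether `--kernel_size` still covers every listed layer.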
The Pose-Guided Person Image Animation task generates a video clip from a still source image according to a driving target sequence. We further model the temporal consistency for this task.
Run the demo of this task:
python demo.py \
--name=dance_fashion_checkpoints \
--model=dance \
--attn_layer=2,3 \
--kernel_size=2=5,3=3 \
--gpu_id=0 \
--dataset_mode=dance \
--sub_dataset=fashion \
--dataroot=./dataset/danceFashion \
--results_dir=./demo_results/dance_fashion \
--test_list=val_list.csv
For more training and testing details, please see PERSON_IMAGE_ANIMATION.md.
Given an input source image and a guidance video sequence depicting the structure movements, our model generates a video containing the specified movements.
Run the demo of this task:
python demo.py \
--name=face_checkpoints \
--model=face \
--attn_layer=2,3 \
--kernel_size=2=5,3=3 \
--gpu_id=0 \
--dataset_mode=face \
--dataroot=./dataset/FaceForensics \
--results_dir=./demo_results/face
We use the real videos of the FaceForensics dataset. See FACE_IMAGE_ANIMATION.md for more details.
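The three demo commands in this README share `--attn_layer`, `--kernel_size`, and `--gpu_id` and differ only in the task-specific options. If you want to run them one after another, a small wrapper such as the following sketch can assemble the command lines. The option values are copied from the commands in this README; the wrapper itself (`DEMO_OPTIONS`, `COMMON_OPTIONS`, `build_demo_command`) is hypothetical and not part of the repository.

```python
import shlex
import sys

# Options shared by all three demo commands.
COMMON_OPTIONS = {"attn_layer": "2,3", "kernel_size": "2=5,3=3", "gpu_id": "0"}

# Task-specific options, keyed by the value passed to --model.
DEMO_OPTIONS = {
    "pose": {
        "name": "pose_fashion_checkpoints",
        "dataset_mode": "fashion",
        "dataroot": "./dataset/fashion",
        "results_dir": "./demo_results/fashion",
    },
    "dance": {
        "name": "dance_fashion_checkpoints",
        "dataset_mode": "dance",
        "sub_dataset": "fashion",
        "dataroot": "./dataset/danceFashion",
        "results_dir": "./demo_results/dance_fashion",
        "test_list": "val_list.csv",
    },
    "face": {
        "name": "face_checkpoints",
        "dataset_mode": "face",
        "dataroot": "./dataset/FaceForensics",
        "results_dir": "./demo_results/face",
    },
}


def build_demo_command(model):
    """Return the argument list for one demo run of the given model."""
    options = {"model": model}
    options.update(DEMO_OPTIONS[model])
    options.update(COMMON_OPTIONS)
    flags = ["--{}={}".format(key, value) for key, value in options.items()]
    return [sys.executable, "demo.py"] + flags


if __name__ == "__main__":
    # Only print the commands; pass a list to subprocess.run to execute one.
    for model in DEMO_OPTIONS:
        print(" ".join(shlex.quote(arg) for arg in build_demo_command(model)))
```

Printing instead of executing keeps the sketch safe to run before the checkpoints and demo images have been downloaded.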
View synthesis requires generating novel views of objects or scenes based on arbitrary input views.
In this task, we use the car and chair categories of the ShapeNet dataset. See VIEW_SYNTHESIS.md for more details.
@article{ren2020deep,
title={Deep Image Spatial Transformation for Person Image Generation},
author={Ren, Yurui and Yu, Xiaoming and Chen, Junming and Li, Thomas H and Li, Ge},
journal={arXiv preprint arXiv:2003.00696},
year={2020}
}
We build our project based on Vid2Vid. Some dataset preprocessing methods are derived from Pose-Transfer.