InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions
InteractiveVideo is a user-centric framework for interactive video generation. It lets users edit generated content comprehensively through intuitive manipulation, and it achieves high-quality regional content control together with precise motion control. We introduce its features below.

InteractiveVideo supports regional content control with multimodal text instructions:
"Purple Flowers." | "Purple Flowers, bee" | "the purple flowers are shaking, a bee is flying" |
"1 Cat." | "1 Cat, butterfly" | "the small yellow butterfly is flying to the cat's face" |
"flowers." | "flowers." | "windy, the flowers are shaking in the wind" |
"1 Man." | "1 Man, rose." | "1 Man, smiling." |
InteractiveVideo can perform precise motion control.
"1 man, dark light " | "the man is turning his body" | "the man is turning his body" |
"1 beautiful girl with long black hair, and a flower on her head, clouds" | " the girl is turning gradually" | " the girl is turning gradually" |
InteractiveVideo also works smoothly with LoRA and DreamBooth checkpoints, so many potential uses of this framework are still under-explored.
"Yae Miko" (Genshin Impact) | "Dressing Up " | "Dressing Up" |
```bash
# create a conda environment
conda create -n ivideo python=3.10
conda activate ivideo
# install requirements
pip install -r requirements.txt
```
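After installation, you can quickly check that PyTorch sees a GPU. This is a minimal sanity check, assuming PyTorch is among the packages installed from requirements.txt (the diffusion and StyleGAN backends require it):

```python
# Minimal environment check (assumes PyTorch was installed via requirements.txt).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```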
You can download all checkpoints with the following script:
```bash
python scripts/download_models.py
```
Downloading everything takes a long time, so you can also selectively download checkpoints by modifying "scripts/download_models.py" and "scripts/*.json" (see the sketch after the checkpoint tables below). Please make sure that at least one checkpoint remains for each JSON file. All checkpoints are listed as follows:
- Checkpoints for image-to-image generation
Models | Types | Version | Checkpoints |
---|---|---|---|
StableDiffusion | - | v1.5 | Huggingface |
StableDiffusion | - | turbo | Huggingface |
KoHaKu | Animation | v2.1 | Huggingface |
LCM-LoRA-StableDiffusion | - | v1.5 | Huggingface |
LCM-LoRA-StableDiffusion | - | xl | Huggingface |
- Checkpoints for image-to-video generation
Models | Types | Version | Checkpoints |
---|---|---|---|
StableDiffusion | - | v1.5 | Huggingface |
PIA (UNet) | - | - | Huggingface |
Dreambooth | MagicMixRealistic | v5 | Civitai |
Dreambooth | RCNZCartoon3d | v10 | Civitai |
Dreambooth | RealisticVision | - | Huggingface |
- Checkpoints for image dragging
Models | Types | Resolution | Checkpoints |
---|---|---|---|
StyleGAN-2 | Lions | 512 x 512 | Google Storage |
StyleGAN-2 | Dogs | 1024 x 1024 | Google Storage |
StyleGAN-2 | Horses | 256 x 256 | Google Storage |
StyleGAN-2 | Elephants | 512 x 512 | Google Storage |
StyleGAN-2 | Face (FFHQ) | 512 x 512 | NGC |
StyleGAN-2 | Cat Face (AFHQ) | 512 x 512 | NGC |
StyleGAN-2 | Car | 512 x 512 | CloudFront |
StyleGAN-2 | Cat | 512 x 512 | CloudFront |
StyleGAN-2 | Landmark (LHQ) | 256 x 256 | Google Drive |
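If you only need a subset of models, the sketch below shows roughly what a selective download could look like. It is not the actual logic of "scripts/download_models.py"; the repository IDs and target folders are illustrative assumptions and should be matched to the entries in "scripts/*.json":

```python
# Hypothetical selective download -- the authoritative list lives in scripts/*.json.
from huggingface_hub import snapshot_download

# Example subset: one diffusion body and one LCM-LoRA (repo ids / paths are assumptions).
WANTED = {
    "runwayml/stable-diffusion-v1-5": "checkpoints/diffusion_body/stable-diffusion-v1-5",
    "latent-consistency/lcm-lora-sdv1-5": "checkpoints/i2i/lora",
}

for repo_id, local_dir in WANTED.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```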
You can also train and use your own customized models. Put your model into the "checkpoints" folder, which is organized as follows:
```
InteractiveVideo # project
|----checkpoints
|----|----drag # Drag
|----|----|----stylegan2_elephants_512_pytorch.pkl
|----|----i2i # Image-to-Image
|----|----|----lora
|----|----|----|----lcm-lora-sdv1-5.safetensors
|----|----i2v # Image-to-Video
|----|----|----unet
|----|----|----|----pia.ckpt
|----|----|----dreambooth
|----|----|----|----realisticVisionV51_v51VAE.safetensors
|----|----diffusion_body
|----|----|----stable-diffusion-v1-5
|----|----|----kohahu-v2-1
|----|----|----sd-turbo
```
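Before launching the demo, you can sanity-check that the expected checkpoint folders are in place. This small helper simply mirrors the layout above; adjust the list if you only downloaded a subset:

```python
# Check that the checkpoint folders described above exist and are non-empty.
from pathlib import Path

expected = [
    "checkpoints/drag",
    "checkpoints/i2i/lora",
    "checkpoints/i2v/unet",
    "checkpoints/i2v/dreambooth",
    "checkpoints/diffusion_body",
]

for rel in expected:
    path = Path(rel)
    ok = path.is_dir() and any(path.iterdir())
    print(f"{rel}: {'ok' if ok else 'MISSING or empty'}")
```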
To run the local demo (recommended), use the following command:

```bash
python demo/main.py
```

You can also run our web demo locally with:

```bash
python demo/main_gradio.py
```
In the following, we provide some instructions for a quick start.
1. Input image-to-image text prompts and click the "Confirm Text" button. The generation is real-time.
2. Input image-to-video text prompts and click the "Confirm Text" button. Then click the "Generate Video" button and wait a few seconds.
3. If the generated video is not satisfactory, you can customize it with multimodal instructions. For example, draw butterflies on the image to tell the model where they should appear.
4. You can also drag images. First, choose a proper checkpoint in the "Drag Image" tab and click the "Drag Mode On" button; preparation takes a few minutes. Then draw masks, add points, and click the "start" button. Once the result is satisfactory, click the "stop" button.
If the code and paper help your research, please kindly cite:
```bibtex
@article{zhang2024interactivevideo,
  title={InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions},
  author={Zhang, Yiyuan and Kang, Yuhao and Zhang, Zhixin and Ding, Xiaohan and Zhao, Sanyuan and Yue, Xiangyu},
  year={2024},
  eprint={2402.03040},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
Our codebase builds on Stable Diffusion, StreamDiffusion, DragGAN, PTI, and PIA. Thanks to the authors for sharing their awesome codebases!
We developed this repository for RESEARCH purposes, so it may only be used for personal, research, or other non-commercial purposes.