Open-Sora: Towards Open Reproduction of Sora

Open-Sora is an open-source initiative dedicated to efficiently reproducing OpenAI's Sora. Our project aims to cover the full pipeline, including video data preprocessing, accelerated training, efficient inference, and more. Operating on a limited budget, we build on the vibrant open-source community, which provides access to text-to-image models, image captioning models, and language models. We hope to contribute to the community and make the project accessible to everyone.

📰 News

  • [2024.03.18] 🔥 We release Open-Sora 1.0, an open-source project to reproduce OpenAI's Sora. Open-Sora 1.0 supports the full pipeline of video data preprocessing, accelerated training, inference, and more. Our provided checkpoint can produce 2s 512x512 videos.

🎥 Latest Demo

[Three 2s 512x512 demo videos]

Click for the original video.

🔆 New Features/Updates

  • 📍 Open-Sora-v1 is trained on xxx. We train the model in three stages. Model weights are available here, and training details can be found here. [WIP]
  • ✅ Support training acceleration, including flash attention, accelerated T5, mixed precision, gradient checkpointing, split VAE, sequence parallelism, etc., for an XXX-fold overall speedup (see the first sketch after this list). Details are in acceleration.md. [WIP]
  • ✅ We provide video cutting and captioning tools for data preprocessing. Instructions can be found here and our data collection plan can be found at datasets.md.
  • ✅ We find that the VQ-VAE from VideoGPT has low quality, so we adopt a better VAE from Stability-AI. We also find that patching in the time dimension deteriorates quality. See our report for more discussion.
  • ✅ We investigate different architectures, including DiT, Latte, and our proposed STDiT. Our STDiT achieves a better trade-off between quality and speed (see the second sketch after this list). See our report for more discussion.
  • ✅ Support CLIP and T5 text conditioning.
  • ✅ By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet & UCF101). See command.md for more instructions.
  • ✅ Support inference with official weights from DiT, Latte, and PixArt.
  • ✅ Refactor the codebase. See structure.md to learn the project structure and how to use the config files.
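
For a flavor of what two of the listed acceleration techniques look like, here is a minimal sketch of mixed-precision training with gradient checkpointing in plain PyTorch. The model and data below are placeholders of our own; the project's actual training loop (see acceleration.md) is more involved.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Placeholder model and data, for illustration only.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    # Gradient checkpointing: recompute activations during the backward
    # pass instead of storing them, trading extra compute for memory.
    y = checkpoint(model, x, use_reentrant=False)
    loss = torch.nn.functional.mse_loss(y, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```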
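
And a second sketch, of the spatial-temporal factorization idea behind STDiT-style blocks: attention is applied over spatial tokens within each frame, then over frames at each spatial location, which is cheaper than full attention over all space-time tokens. This is our own illustration of the general idea, not the project's actual STDiT implementation.

```python
import torch
from torch import nn

class FactorizedSTBlock(nn.Module):
    """Illustrative spatial-then-temporal attention block (not STDiT itself)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, t, s, d = x.shape

        # Spatial attention: tokens within the same frame attend to each other.
        xs = self.norm1(x).reshape(b * t, s, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, s, d)

        # Temporal attention: each spatial location attends across frames.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        return x

# Cost intuition: full attention over t*s tokens is O((t*s)^2), while the
# factorized form is O(t*s^2 + s*t^2).
block = FactorizedSTBlock(dim=64)
video_tokens = torch.randn(2, 16, 64, 64)  # (B, T, S, D)
out = block(video_tokens)  # same shape; an image is simply the T == 1 case
```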

TODO list sorted by priority

  • Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity, deduplication, etc.). See datasets.md for more information. [WIP]
  • Training Video-VAE. [WIP]
  • Support image and video conditioning.
  • Evaluation pipeline.
  • Incorporate a better scheduler, e.g., rectified flow as in SD3 (see the sketch after this list).
  • Support variable aspect ratios, resolutions, durations.
  • Support SD3 when released.
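
For reference, the rectified-flow objective mentioned above trains the model to predict the constant velocity along the straight line between noise and data. A minimal sketch, where `model(x_t, t)` is a hypothetical signature for illustration:

```python
import torch

def rectified_flow_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """Sketch of the rectified-flow objective: x_t = (1 - t) * x0 + t * x1,
    with the model trained to predict the velocity x1 - x0."""
    x0 = torch.randn_like(x1)                             # noise sample
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # per-sample t in [0, 1]
    xt = (1 - t) * x0 + t * x1                            # straight-line interpolation
    v_target = x1 - x0                                    # constant velocity along the line
    v_pred = model(xt, t.flatten())
    return torch.nn.functional.mse_loss(v_pred, v_target)
```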


Installation

```bash
# create a virtual env and activate it
conda create -n opensora python=3.10
conda activate opensora

# install torch
# the command below is for CUDA 12.1; choose the install command from
# https://pytorch.org/get-started/locally/ based on your own CUDA version
pip3 install torch torchvision

# install flash attention (optional)
pip install packaging ninja
pip install flash-attn --no-build-isolation

# install apex (optional)
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git

# install xformers
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121

# install this project
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v -e .
```
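
Optionally, a quick sanity check (our own snippet, not part of the repo) that the core install works and shows which optional packages are available:

```python
# Confirm torch imports, CUDA is visible, and optional packages are present.
import torch

print(torch.__version__, "CUDA available:", torch.cuda.is_available())
for pkg in ("flash_attn", "apex", "xformers"):
    try:
        __import__(pkg)
        print(pkg, "OK")
    except ImportError:
        print(pkg, "not installed (optional)")
```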

After installation, we suggest reading structure.md to learn the project structure and how to use the config files.

Model Weights

| Model      | #Params | URL |
| ---------- | ------- | --- |
| 16x256x256 |         |     |

Inference

To run inference with our provided weights, first prepare the pretrained weights including XXX. [WIP]

Then run the following commands to generate samples. See here to customize the configuration.

```bash
# Sample 16x256x256 (~2s)
python scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path ./path/to/your/ckpt.pth
# Sample 16x512x512 (~2s)
python scripts/inference.py configs/opensora/inference/16x512x512.py --ckpt-path ./path/to/your/ckpt.pth
# Sample 64x512x512 (~5s)
python scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth
```

For inference with other models, see here for more instructions.
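
The inference configs are plain Python files. As a rough, hypothetical illustration of the format (the actual field names and values live in configs/opensora/inference/):

```python
# Hypothetical illustration of the Python-style config format; consult the
# real files under configs/opensora/inference/ for the actual fields.
num_frames = 16          # 16 frames ≈ 2 s at the sampled frame rate
image_size = (256, 256)  # spatial resolution (height, width)
```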

Data Processing

Split video into clips

We provide code to split a long video into separate clips efficiently using multiprocessing. See tools/data/scene_detect.py.
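
For a flavor of how such splitting can work, here is a standalone sketch using PySceneDetect with a multiprocessing pool. The library choice and file paths are assumptions for illustration, not necessarily what tools/data/scene_detect.py does:

```python
# Minimal clip-splitting sketch; assumes `pip install scenedetect[opencv]`.
from multiprocessing import Pool
from scenedetect import detect, ContentDetector

def find_scenes(video_path: str):
    """Return (video_path, list of (start, end) times in seconds) for one video."""
    scenes = detect(video_path, ContentDetector(threshold=27.0))
    return video_path, [(s.get_seconds(), e.get_seconds()) for s, e in scenes]

if __name__ == "__main__":
    videos = ["a.mp4", "b.mp4", "c.mp4"]  # placeholder paths
    with Pool(processes=4) as pool:
        # Detect scene boundaries for many videos in parallel.
        for path, scenes in pool.imap_unordered(find_scenes, videos):
            print(path, scenes)
```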

Generate video caption

Training

Acknowledgement

  • DiT: Scalable Diffusion Models with Transformers.
  • OpenDiT: An acceleration framework for DiT training. The OpenDiT team provided valuable suggestions on accelerating our training process.
  • PixArt: An open-source DiT-based text-to-image model.
  • Latte: An attempt to efficiently train DiT for video.
  • StabilityAI VAE: A powerful image VAE model.
  • CLIP: A powerful text-image embedding model.
  • T5: A powerful text encoder.
  • LLaVA: A powerful image captioning model based on Yi-34B.

We are grateful for their exceptional work and generous contribution to open source.

Citation

```bibtex
@software{opensora,
  author = {Zangwei Zheng and Xiangyu Peng and Shenggui Li and Yang You},
  title = {Open-Sora: Towards Open Reproduction of Sora},
  month = {March},
  year = {2024},
  url = {https://github.com/hpcaitech/Open-Sora}
}
```

Zangwei Zheng and Xiangyu Peng equally contributed to this work during their internship at HPC-AI Tech.

Star History

[Star History Chart]

TODO

Modules for releasing:

  • configs
  • opensora
  • assets
  • scripts
  • tools

  • Packages for data processing.
  • Put all outputs under ./checkpoints/, including pretrained_models, checkpoints, and samples.