
Video Captioning

Human labeling of videos is expensive and time-consuming, so we adopt powerful image captioning models to generate captions for videos. Although GPT-4V achieves better quality, its speed of 20 s/sample is too slow for us. LLaVA is the second-best open-source model on MMMU and accepts inputs of any resolution. We find the quality of the 34B model to be comparable.


LLaVA Captioning

We extract three frames from each video for captioning. With batch inference, we can achieve a 10x speedup. At approximately 720p resolution with 1 frame, the speed is 2~3 videos/s on 8 GPUs. If we resize the smaller side to 336, the speed can reach 8 videos/s. In Open-Sora v1.1, we use the 7B model to lower the cost.
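As a rough illustration of this preprocessing (a minimal sketch under our own assumptions about uniform frame sampling and PIL-based resizing, not the repository's actual code), extracting frames with decord and resizing the smaller side to 336 could look like:

# illustrative sketch only; not the repository's actual preprocessing code
import numpy as np
from decord import VideoReader
from PIL import Image

def extract_frames(video_path, num_frames=3, short_side=336):
    # uniformly sample frame indices across the video
    vr = VideoReader(video_path)
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        img = Image.fromarray(vr[int(idx)].asnumpy())  # decord returns HWC RGB arrays
        w, h = img.size
        scale = short_side / min(w, h)  # resize so the smaller side equals short_side
        img = img.resize((round(w * scale), round(h * scale)))
        frames.append(img)
    return frames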

Requirement

# create conda env
conda create -n llava python=3.10 -y
conda activate llava

# install torch
pip install torch torchvision

# clone llava
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
# CAUTION: The following line removes the pinned torch dependency in pyproject.toml, i.e.:
# "torch==2.1.2", "torchvision==0.16.2",
# It is safer to remove it manually in your local pyproject.toml.
sed -i '16d' pyproject.toml

# install llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

# install flash attention
pip install flash-attn --no-build-isolation
# install colossalai and decord
pip install colossalai decord
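Optionally, you can sanity-check the installation (this one-liner is our addition, not part of the original setup; the import names are the packages' standard module names):

python -c "import torch, llava, colossalai, decord, flash_attn; print('environment ok')"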

Usage

Prepare a CSV file for processing. The CSV file can be generated by convert_dataset.py according to its documentation.
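For reference, a minimal input file might look like the sketch below (the exact columns are determined by convert_dataset.py; the single path column shown here is an assumption for illustration):

path
/data/videos/clip_0001.mp4
/data/videos/clip_0002.mp4

With the CSV ready, run one of the following commands to generate captions for videos/images with LLaVA: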

# caption with mistral-7B
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --dp-size 8 --tp-size 1 --model-path liuhaotian/llava-v1.6-mistral-7b --prompt video

# caption with llava-34B
# NOTE: remember to enable flash attention for this model
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --dp-size 4 --tp-size 2 --model-path liuhaotian/llava-v1.6-34b --prompt image-3ex --flash-attention

# we run this on 8xH800 GPUs
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 4 --bs 16

# at least two 80G GPUs are required
torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 1 --bs 16

# can also caption images
torchrun --nproc_per_node 2 --standalone -m tools.caption.caption_llava DATA.csv --tp-size 2 --dp-size 1 --bs 16 --prompt image-3ex

Please note that you should add the --flash-attention flag when running Llama-based LLaVA models, as it provides a speedup, but turn it off for Mistral-based ones. The reasons can be found in this issue.

After running the script with dp-size=N, you will get N partial CSV files. Run the following command to merge them:

python -m tools.datasets.datautil DATA_caption_part*.csv --output DATA_caption.csv

Resume

Sometimes the process may be interrupted. We can resume it by running the following commands:

# merge generated results
python -m tools.datasets.datautil DATA_caption_part*.csv --output DATA_caption.csv

# get the remaining videos
python -m tools.datasets.datautil DATA.csv --difference DATA_caption.csv --output DATA_remaining.csv

Then rerun the captioning command on the output CSV file (DATA_remaining.csv) to resume the process.
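For example, if you started with the Mistral-7B command above, the resumed run simply points at the remaining-videos CSV:

torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava DATA_remaining.csv --dp-size 8 --tp-size 1 --model-path liuhaotian/llava-v1.6-mistral-7b --prompt video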

PLLaVA Captioning

Download the PLLaVA repo

First, make sure you are in the tools/caption/pllava_dir directory. Then,

git clone https://github.com/magic-research/PLLaVA.git

cd PLLaVA

Environment

conda create -n pllava python=3.10

conda activate pllava

pip install -r requirements.txt # change to your own torch version if necessary; torch==2.2.2, torchaudio==2.2.2, torchvision==0.17.2 worked on H100 for Tom.

Download weights

python python_scripts/hf.py # download the weights 

Usage

Since PLLaVA is not packaged for installation, we add it to PYTHONPATH so that it can be imported as a package.

cd .. # step back to pllava_dir

# run with the python from your pllava environment
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
PYTHONPATH="$PYTHONPATH:OPEN_SORA_HOME/tools/caption/pllava_dir/PLLaVA" \
nohup python caption_pllava.py \
  --pretrained_model_name_or_path PLLaVA/MODELS/pllava-13b \
  --use_lora \
  --lora_alpha 4 \
  --num_frames 4 \
  --weight_dir PLLaVA/MODELS/pllava-13b \
  --csv_path meta.csv \
  --pooling_shape 4-12-12 \
  > pllava_caption.out 2>&1 &
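Since the job runs in the background under nohup, a simple way to monitor progress (a generic shell tip, not part of the original instructions) is to tail the log file:

tail -f pllava_caption.out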

GPT-4V Captioning

Run the following command to generate captions for videos with GPT-4V:

# output: DATA_caption.csv
python -m tools.caption.caption_gpt4 DATA.csv --key $OPENAI_API_KEY

The cost is approximately $0.01 per video (3 frames per video).

Camera Motion Detection

Install the additional required packages listed in tools/caption/camera_motion/requirements.txt.
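For example, with pip (run from the repository root):

pip install -r tools/caption/camera_motion/requirements.txt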

Run the following command to classify camera motion:

# output: meta_cmotion.csv
python -m tools.caption.camera_motion.detect tools/caption/camera_motion/meta.csv

You may additionally specify --threshold to control how "sensitive" the detection should be, as shown below. For example, threshold = 0.2 means a video is only counted as tilt_up when its pixels move down by more than 20% of the video height between the starting and ending frames.

# output: meta_cmotion.csv
python -m tools.caption.camera_motion.detect tools/caption/camera_motion/meta.csv --threshold 0.2

Each video is classified into 8 categories: pan_right, pan_left, tilt_up, tilt_down, zoom_in, zoom_out, static, unclassified. The tilt, pan, and zoom categories can overlap with each other.
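To make the threshold semantics concrete, here is a minimal sketch of how such a check could work (illustrative only, not the repository's actual detector; the use of OpenCV's Farneback optical flow between the first and last frames is an assumption):

# illustrative sketch of the tilt_up check; not the actual implementation
import cv2

def is_tilt_up(first_frame, last_frame, threshold=0.2):
    # True if pixels move down by more than `threshold` * frame height
    prev = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(last_frame, cv2.COLOR_BGR2GRAY)
    # dense optical flow between the starting and ending frames
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    mean_dy = flow[..., 1].mean()  # positive values mean pixels move down
    return mean_dy > threshold * prev.shape[0]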