The official implementation of V-AURA.
[Project Page] [Google Colab Demo] [arXiv]
This code has been tested on Ubuntu 20.04 with Python 3.8.18 and PyTorch 2.2.1 using CUDA 12.1. To install the required packages, run the following commands:
conda env create -f conda_env_cuda12.1.yaml
conda activate vaura_cu12.1
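As an optional sanity check (a generic PyTorch check, not part of the official setup), you can verify that the environment sees your GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())" # should print the PyTorch version and True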
We provide a local demo notebook (demo.ipynb) and a Google Colab notebook to generate samples with the pre-trained model. For the local version, activate the environment and execute the notebook cells; on Google Colab, simply run the cells.
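For example, assuming Jupyter is installed in (or alongside) the environment, the local demo can be launched with:
conda activate vaura_cu12.1
jupyter notebook demo.ipynb # open the notebook and run the cells top to bottom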
Download the full VGGSound dataset. After downloading, you need to reencode the data; you can use the ./scripts/reencode_videos.py script to do so. Just substitute the path in the script with the path to your data.
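For example, after editing the data path inside the script, run it from the repository root:
python ./scripts/reencode_videos.py # reencodes the downloaded VGGSound videos; the data path is set inside the script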
You can run inference on VAS (training is not tested). Download the VAS dataset. VAS is encoded on the fly during inference, so there is no need to reencode it; just substitute your VAS base path into the entries of the ./data/vas/test/data.jsonl file.
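For example, the substitution can be done with sed (the original prefix /original/vas/base below is a hypothetical placeholder; check the actual prefix used in the file first):
sed -i 's|/original/vas/base|/path/to/vas|g' ./data/vas/test/data.jsonl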
To use the novel VisualSound dataset, you need to download the VGGSound dataset and reencode the data; you can use the ./scripts/reencode_videos.py script for reencoding. The entries included in VisualSound are defined in ./data/meta/visualsound/visualsound.csv and per split in ./data/splits/visualsound. For training, testing, and inference, provide the path to the full VGGSound dataset (config.dataloader.data_dir) and the path to the VisualSound splits (config.dataloader.split_dir) in the V-AURA configuration. Only the files defined in the split files are used, as shown in the example below.
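For example, a training run restricted to the VisualSound entries might combine the usual data path with the split directory (the overrides mirror the dotlist style used in the training section; the exact experiment config is up to you):
python main.py
config=./configs/experiments/vggsound/avclip/9cb-viscond-avclip-channel_concat-llama.yaml
dataloader.data_dir=/path/to/reencoded_data # full reencoded VGGSound
dataloader.split_dir=./data/splits/visualsound # use only the VisualSound splits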
We employ Segment AVCLIP by Iashin et al. for visual feature extraction. You can download the model pre-trained on VGGSound from here. After downloading, specify the path to the model either in the configuration file (./configs/modules/feature_extractors/avclip_vggsound.yaml) or on the command line (see training).
We provide a checkpoint of the model used to calculate the results in the paper. Download the checkpoint here. Extract the tar file to your log directory (e.g. ./logs) and provide the path to the experiment directory on the command line (see generation).
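For example, assuming the downloaded archive is named vaura_checkpoint.tar (a hypothetical filename):
mkdir -p ./logs
tar -xf vaura_checkpoint.tar -C ./logs # creates the experiment directory under ./logs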
To simplify things, we use PyTorch Lightning. See the PyTorch Lightning Trainer arguments in ./scripts/train.py to customize the training process. To run training with the default settings, use the following command:
python main.py
config=./configs/experiments/vggsound/avclip/9cb-viscond-avclip-channel_concat-llama.yaml
dataloader.data_dir=/path/to/reencoded_data
trainer.num_nodes=1 # number of nodes
trainer.devices=[0] # GPUs, e.g. 4 or [0, 2]
model.feature_extractor_config.params.ckpt_path=/path/to/epoch_best.pt
If you want to resume training, provide the path to the last checkpoint in the experiment directory (e.g. YY-MM-DDTHH-MM-SS, which contains the checkpoint and logging directories) and training will continue from there. Logging is appended to the existing TensorBoard logs:
python main.py
config=/path/to/experiment/config.yaml
dataloader.data_dir=/path/to/reencoded_data
trainer.num_nodes=1 # number of nodes
trainer.devices=[0] # GPUs, e.g. 4 or [0, 2]
model.feature_extractor_config.params.ckpt_path=/path/to/epoch_best.pt
trainer.ckpt_path=/path/to/experiment/last.ckpt
For distributed training, use torchrun or SLURM and set the number of nodes and devices according to your setup. Note that the number of GPUs and nodes configured in the Trainer (model config.yaml) must match the number of GPUs and nodes defined in your torchrun/SLURM configuration, as sketched below.
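As a minimal sketch (not an official launch script), a single-node run on 4 GPUs with torchrun could look like the following; adjust the Trainer overrides so they match the launcher:
torchrun --nnodes=1 --nproc_per_node=4 main.py
config=./configs/experiments/vggsound/avclip/9cb-viscond-avclip-channel_concat-llama.yaml
dataloader.data_dir=/path/to/reencoded_data
trainer.num_nodes=1 # must match --nnodes
trainer.devices=4 # must match --nproc_per_node
model.feature_extractor_config.params.ckpt_path=/path/to/epoch_best.pt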
By default, TensorBoard is used to log the progress. Start TensorBoard with the following command:
tensorboard --logdir=./logs
To generate VGGSound/VGGSound-Sparse/VisualSound/VAS samples with the model, run the following command:
python main.py
config=configs/generate_[vgg, vgg_sparse, vas, visualsound].yaml
experiment_path=/path/to/experiment # experiment directory generated during training
overridden_hparams.feature_extractor_config.params.ckpt_path=/path/to/epoch_best.pt # path to the visual feature extractor ckpt (if different from training)
duration=2.56 # duration of the generated audio in seconds (a multiple of 0.64, e.g. 2.56 = 4 * 0.64)
dataloader.data_dir=/path/to/reencoded_data
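For example, to generate 2.56-second VisualSound samples with the provided checkpoint extracted to ./logs (the experiment directory name below is illustrative):
python main.py
config=configs/generate_visualsound.yaml
experiment_path=./logs/YY-MM-DDTHH-MM-SS
duration=2.56
dataloader.data_dir=/path/to/reencoded_data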
For evaluation, use my evaluation framework.
We would like to thank the following open-source repositories for their code and documentation:
- Synchformer
- LlamaGen
- AudioCraft
- SpecVQGAN
- PyTorch
- PyTorchLightning
- NumPy, SciPy, and other Python libraries