Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens
University of Michigan, Yale University, Adobe Research
This is the official PyTorch implementation of "Conditional Generation of Audio from Video via Foley Analogies". [Project Page] [Arxiv] [Video]
To set up the environment, run
conda env create -f conda_env.yml
conda activate condfoley
To set up the SparseSync re-ranking environment, run
cd SparseSync
conda env create -f conda_env.yml
conda activate sparse_sync
For a quick demonstration that generates 6-second audio with our model, run
mkdir logs
python audio_generation.py --gh_demo --model_name 2022-05-03T11-33-05_greatesthit_transformer_with_vNet_randshift_2s_GH_vqgan_no_earlystop --target_log_dir demo_output --W_scale 3
The generated video will be located at logs/demo_output/2sec_full_generated_video_0.
You can edit audio_generation.py to change the input videos and try videos of your own.
We use the Greatest Hits dataset to train and evaluate our model both qualitatively and quantitatively. Data can be downloaded from here.
We use the Countix-AV dataset to demonstrate our method in a more realistic scenario. Data can be downloaded by following the configs from the RepetitionCounting repo.
As described in the paper, we resample the videos to 15 FPS and the audio to 22050 Hz. The videos are also resized to (640, 360) for faster loading. The audio is denoised with the noisereduce package.
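The resampling and resizing steps above can be sketched as ffmpeg invocations built from Python. This is only an illustration under stated assumptions: the helper names are hypothetical, ffmpeg is assumed to be installed, and the repository's own video_preprocess.py may perform these steps differently. Denoising with noisereduce is not shown.

```python
from typing import List

VIDEO_FPS = 15           # frame rate stated in the text
AUDIO_SR = 22050         # audio sample rate stated in the text
FRAME_SIZE = (640, 360)  # (width, height) stated in the text


def video_resample_cmd(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that resamples to 15 FPS and resizes to 640x360.

    Hypothetical helper for illustration; not part of the repository.
    """
    width, height = FRAME_SIZE
    return [
        "ffmpeg", "-y", "-i", src,
        "-r", str(VIDEO_FPS),             # output frame rate
        "-vf", f"scale={width}:{height}",  # resize filter
        dst,
    ]


def audio_resample_cmd(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that writes the audio track at 22050 Hz.

    Hypothetical helper for illustration; not part of the repository.
    """
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ar", str(AUDIO_SR), dst]
```

Either list can be passed to subprocess.run(cmd, check=True) to execute the conversion.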
For training preprocessing, use feature_extraction/video_preprocess.py, which builds the correct training data structure. See the file for more detail. The script accepts a --greatesthit flag to process data for the Greatest Hits dataset; omit the flag for the Countix-AV dataset.
For evaluation and demonstration purposes, use video_preprocess.py.
The Greatest Hits dataset should be placed under the data/ folder with the following structure:
path/to/CondFoleyGen/
    data/
        greatesthit/
            greatesthit-process-resized/
                {video_idx}/
                    audio/
                        {videoIdx}_denoised.wav
                        {videoIdx}_denoised_resampled.wav
                    frames/
                        frame000001.jpg
                        frame000002.jpg
                        ...
                    hit_record.json
                    meta.json
                ...
The meta.json and hit_record.json files can be found in data/greatest_hit_meta_info.tar.gz, which contains all the necessary information in the correct structure. You only need them if you train the Greatest Hits model with spectrograms, which is deprecated. The current training scheme only uses the .wav audio files.
Similarly, the Countix-AV dataset should be placed under the data/ folder with the following structure:
path/to/CondFoleyGen/
    data/
        ImpactSet/
            impactset-proccess-resize/
                {video_idx}/
                    audio/
                        {videoIdx}.wav
                        {videoIdx}_resampled.wav
                        {videoIdx}_resampled_denoised.wav
                    frames/
                        frame000001.jpg
                        frame000002.jpg
                        ...
                ...
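Before training, it can help to confirm that each per-video directory matches the layouts above. The following is a minimal sketch, not part of the repository: check_video_dir is a hypothetical helper, and it assumes that frames are numbered contiguously starting from frame000001.jpg, which the listings suggest but do not state.

```python
import re
from pathlib import Path
from typing import List

# Matches the frame file names shown in the layouts, e.g. frame000001.jpg
FRAME_PATTERN = re.compile(r"^frame(\d{6})\.jpg$")


def check_video_dir(video_dir: Path) -> List[str]:
    """Return a list of problems found in one {video_idx}/ directory.

    Hypothetical helper for illustration; an empty list means no problem found.
    """
    problems = []
    audio_dir = video_dir / "audio"
    frames_dir = video_dir / "frames"

    if not audio_dir.is_dir() or not any(audio_dir.glob("*.wav")):
        problems.append("audio/ is missing or contains no .wav file")

    if not frames_dir.is_dir():
        problems.append("frames/ is missing")
        return problems

    indices = []
    for path in frames_dir.iterdir():
        match = FRAME_PATTERN.match(path.name)
        if match:
            indices.append(int(match.group(1)))
    indices.sort()

    if not indices:
        problems.append("frames/ contains no frameNNNNNN.jpg files")
    elif indices != list(range(1, len(indices) + 1)):
        # Assumption: numbering starts at 1 and has no gaps
        problems.append("frame numbering is not contiguous from 000001")
    return problems
```

Running this over every sub-directory of the processed dataset folder and printing the non-empty results gives a quick sanity check before a long training run.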
We randomly split each dataset at the video level. The split files are under the data/ folder, named data/greatesthit_[train/val/test].json and data/countixAV_[train/val/test].json.
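Splitting at the video level means every clip from a given video lands in exactly one of train, val, or test. A minimal sketch of such a split is shown below; the function name, the ratios, and the seed are all assumptions for illustration, and the provided JSON files remain the splits actually used.

```python
import random
from typing import Dict, List, Sequence


def split_by_video(
    video_ids: Sequence[str],
    ratios: Sequence[float] = (0.8, 0.1, 0.1),  # hypothetical ratios
    seed: int = 0,                               # hypothetical seed
) -> Dict[str, List[str]]:
    """Randomly assign whole videos to train/val/test.

    Hypothetical helper for illustration; not the script used by the authors.
    """
    ids = sorted(set(video_ids))      # deduplicate and fix the order before shuffling
    random.Random(seed).shuffle(ids)  # seeded shuffle, so the split is reproducible
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # the remainder goes to test
    }
```

Because the unit of assignment is the video ID, no video contributes clips to more than one subset.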
To conduct a fair evaluation on the Greatest Hits dataset, we build a fixed test set composed of 2-second conditional and target video pairs cropped from the previous test split, following the description in the paper. See data/AMT_test_set.json for the detailed information. We also provide the corresponding action information in data/AMT_test_set_type_dict.json, and whether the actions in the two videos match in data/AMT_test_set_match_dict.json. The paths of the target and conditional videos are in data/AMT_test_set_path.json. The data should be placed under the logs/ folder with the following structure:
path/to/CondFoleyGen/
    logs/
        AMT_test_target_video_trimmed_15fps/
            <video_1>.mp4
            <video_2>.mp4
            ...
        AMT_test_cond_video_trimmed_15fps/
            <video_1>.mp4
            <video_2>.mp4
            ...
We also provide the pre-processed videos for download on Google Drive; you can download the archive and extract it directly into the logs/ directory.
Coming soon...
Training our model with the default configs requires 1 NVIDIA A40 40G GPU for the first stage and 4 NVIDIA A40 40G GPUs for the second stage. You can change the --gpus argument to use a different number of GPUs. You can also update the configurations under the configs/ folder to adjust the batch size.
The first step of the training process is to train the VQ-GAN codebook model.
- To train the model on the Greatest Hits dataset, run
python train.py --base configs/greatesthit_codebook.yaml -t True --gpus 0,
- To train the model on the Countix-AV dataset, run
python train.py --base configs/countixAV_codebook_denoise.yaml -t True --gpus 0,
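Note that the commands above pass the GPU indices as a comma-terminated list (0, for a single GPU). A minimal sketch of a helper that assembles the same invocation for an arbitrary set of GPUs follows; build_train_command is a hypothetical name and not part of the repository.

```python
from typing import List, Sequence


def build_train_command(config_path: str, gpu_ids: Sequence[int]) -> List[str]:
    """Assemble the train.py invocation used in this README for the given GPUs.

    Hypothetical helper for illustration; not part of the repository.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU index is required")
    # The README's commands end the GPU list with a trailing comma, e.g. "0,"
    gpus_arg = ",".join(str(i) for i in gpu_ids) + ","
    return [
        "python", "train.py",
        "--base", config_path,
        "-t", "True",
        "--gpus", gpus_arg,
    ]
```

For example, build_train_command("configs/greatesthit_codebook.yaml", [0]) reproduces the first command above, and the result can be passed to subprocess.run.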
The second step of the training process is to train the conditional transformer model.
- To train the model on the Greatest Hits dataset, first fill in the relative path of the previously trained codebook checkpoint in the config file at configs/greatesthit_transformer_with_vNet_randshift_2s_GH_vqgan_no_earlystop.yaml. The path goes under model.params.first_stage_config.params.ckpt_path. After that, train the transformer model by running
python train.py --base configs/greatesthit_transformer_with_vNet_randshift_2s_GH_vqgan_no_earlystop.yaml -t True --gpus 0,1,2,3,
- To train the model on the Countix-AV dataset, first fill in the relative path of the previously trained codebook checkpoint in the config file at configs/countixAV_transformer_denoise.yaml, then run
python train.py --base configs/countixAV_transformer_denoise.yaml -t True --gpus 0,1,2,3,
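The key model.params.first_stage_config.params.ckpt_path names a nested entry in the YAML config. As a rough illustration of where the checkpoint path goes, the sketch below sets a value in a nested dictionary by its dotted key. The helper name and the example checkpoint path are hypothetical, and editing the YAML file by hand achieves the same result.

```python
from typing import Any, Dict

# Dotted key named in this README for the first-stage (codebook) checkpoint
CKPT_KEY = "model.params.first_stage_config.params.ckpt_path"


def set_by_dotted_key(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config[a][b]...[z] = value for a dotted key 'a.b...z'.

    Hypothetical helper for illustration; missing intermediate dicts are created.
    """
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Example with a placeholder path; replace it with your own codebook checkpoint
config: Dict[str, Any] = {}
set_by_dotted_key(config, CKPT_KEY, "path/to/codebook/checkpoint.ckpt")
```

After the call, config["model"]["params"]["first_stage_config"]["params"]["ckpt_path"] holds the checkpoint path, mirroring the nesting expected in the YAML file.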
We provide a sample script, audio_generation.py, that generates audio with a pre-trained model and a pair of sample videos.
- To generate audio with the transformer model trained on the Greatest Hits dataset, run
python audio_generation.py --gh_gen --model_name <pre_trained_model_folder_name> --target_log_dir <target_output_dir_name>
You can change orig_videos and cond_videos in the script to generate audio for different videos.
- To generate audio with the transformer model trained on the Countix-AV dataset, run
python audio_generation.py --countix_av_gen --model_name <pre_trained_model_folder_name> --target_log_dir <target_output_dir_name>
- To generate audio for the Greatest Hits test set, run
python audio_generation.py --gh_testset --model_name <pre_trained_model_folder_name> --target_log_dir <target_output_dir_name>
The Greatest Hits test data should be placed following the instructions in the previous section.
- To generate multiple audio samples for re-ranking, use the --multiple argument. The output will be at logs/{target_log_dir}/{gen_cnt}_times_split_{split}_wav_dict.pt. You can then generate the re-ranking output by running
cd SparseSync
conda activate sparse_sync
python predict_best_sync.py -d 0 --dest_dir <path_to_generated_file> --tolerance 0.2 --split <split> --cnt <gen_cnt>
The output will be in the SparseSync/logs/<path_to_generated_file> folder, in the same folder as the previously generated output.
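The file name pattern and the re-ranking command above can be tied together with a small sketch. Both helpers are hypothetical and only format strings; reading -d as a device index is an assumption, and the default values mirror the command shown above.

```python
from pathlib import PurePosixPath
from typing import List, Union


def multi_gen_output_path(target_log_dir: str, gen_cnt: int, split: Union[int, str]) -> str:
    """Format the --multiple output path described in this README.

    Hypothetical helper for illustration; not part of the repository.
    """
    name = f"{gen_cnt}_times_split_{split}_wav_dict.pt"
    return str(PurePosixPath("logs") / target_log_dir / name)


def rerank_command(
    dest_dir: str,
    split: Union[int, str],
    gen_cnt: int,
    tolerance: float = 0.2,  # value used in the README's command
    device: int = 0,         # assumption: -d selects the device index
) -> List[str]:
    """Assemble the predict_best_sync.py invocation shown in this README."""
    return [
        "python", "predict_best_sync.py",
        "-d", str(device),
        "--dest_dir", dest_dir,
        "--tolerance", str(tolerance),
        "--split", str(split),
        "--cnt", str(gen_cnt),
    ]
```

The command is meant to be run from the SparseSync directory with the sparse_sync environment active, as in the steps above.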
If you find this work useful, please consider citing:
@inproceedings{
du2023conditional,
title={Conditional Generation of Audio from Video via Foley Analogies},
author={Du, Yuexi and Chen, Ziyang and Salamon, Justin and Russell, Bryan and Owens, Andrew},
booktitle={Conference on Computer Vision and Pattern Recognition 2023},
year={2023},
}
We thank Jon Gillick, Daniel Geng, and Chao Feng for the helpful discussions. Our code base builds on two amazing projects by Vladimir Iashin; check them out here (SpecVQGAN, SparseSync). This work was funded in part by DARPA Semafor and Cisco Systems, and by a gift from Adobe. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.