forked from Text-to-Audio/AudioLCM
Commit 4dfb84c (0 parents)
Showing 139 changed files with 32,083 additions and 0 deletions.
.gitignore
@@ -0,0 +1,18 @@
*.pyc
*__pycache__
useful_ckpts/bigvgan
useful_ckpts/*.ckpt
useful_ckpts/CLAP/*.ckpt
evaluation
.idea/
logs
audiocaps_gen
audioldm_eval
src
processed
run.sh
infer.sh
*.DS_Store
data_melnone16000nfft1024
data
audiocaps_mels
README.md
@@ -0,0 +1,127 @@
# AudioLCM: Text-to-Audio Generation with Latent Consistency Models

#### Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

PyTorch implementation of [AudioLCM]: an efficient and high-quality text-to-audio generation method based on latent consistency models.

<!-- [](https://arxiv.org/abs/2301.12661)
[](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio)
[](https://github.com/Text-to-Audio/Make-An-Audio) -->

We provide our implementation and pretrained models as open source in this repository.

Visit our [demo page](https://audiolcm.github.io/) for audio samples.

<!-- [Text-to-Audio HuggingFace Space](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio) | [Audio Inpainting HuggingFace Space](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio_inpaint) -->

## News
<!-- - Jan, 2023: **[Make-An-Audio](https://arxiv.org/abs/2207.06389)** submitted to arxiv. -->
- June, 2024: **[AudioLCM]** released on GitHub.

## Quick Start
We provide an example of how you can quickly generate high-fidelity samples using AudioLCM.

To try it on your own dataset, simply clone this repo to a local machine with an NVIDIA GPU and CUDA/cuDNN installed, then follow the instructions below.
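For example, a minimal setup might look like this (assuming you clone the upstream Text-to-Audio/AudioLCM repository rather than a fork):

```bash
# Clone the repository and move into it
git clone https://github.com/Text-to-Audio/AudioLCM.git
cd AudioLCM
```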

### Supported Datasets and Pretrained Models

Download the pretrained weights from [Google Drive](https://drive.google.com/drive/folders/1zZTI3-nHrUIywKFqwxlFO6PjB66JA8jI?usp=drive_link).
Download bert-base-uncased weights from [Hugging Face](https://huggingface.co/google-bert/bert-base-uncased), t5-v1_1-large weights from [Hugging Face](https://huggingface.co/google/t5-v1_1-large), and CLAP weights from [Hugging Face](https://huggingface.co/microsoft/msclap/blob/main/CLAP_weights_2022.pth).
Then place the downloaded files as follows:

```
Download:
audiolcm.ckpt and put it into ./ckpts
BigVGAN vocoder and put it into ./vocoder/logs/bigvnat16k93.5w
t5-v1_1-large and put it into ./ldm/modules/encoders/CLAP
bert-base-uncased and put it into ./ldm/modules/encoders/CLAP
CLAP_weights_2022.pth and put it into ./wav_evaluation/useful_ckpts/CLAP
```
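A rough shell sketch of this layout (assuming the downloaded files and folders sit in the current directory; the downloaded folder names are placeholders, so adjust them to what you actually downloaded):

```bash
# Create the expected directories and move the downloaded weights into place
mkdir -p ckpts vocoder/logs/bigvnat16k93.5w ldm/modules/encoders/CLAP wav_evaluation/useful_ckpts/CLAP
mv audiolcm.ckpt ckpts/
mv bigvgan_vocoder/* vocoder/logs/bigvnat16k93.5w/   # BigVGAN vocoder files (folder name is a placeholder)
mv t5-v1_1-large bert-base-uncased ldm/modules/encoders/CLAP/
mv CLAP_weights_2022.pth wav_evaluation/useful_ckpts/CLAP/
```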
<!-- The directory structure should be:
```
useful_ckpts/
├── bigvgan
│   ├── args.yml
│   └── best_netG.pt
├── CLAP
│   ├── config.yml
│   └── CLAP_weights_2022.pth
└── maa1_full.ckpt
``` -->

### Dependencies
See requirements in `requirement.txt`.
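For example, in a fresh Python environment (a virtualenv or conda env), run from the repo root:

```bash
# Install the dependencies listed in requirement.txt
pip install -r requirement.txt
```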

## Inference with the pretrained model
```bash
python scripts/txt2audio_for_lcm.py --ddim_steps 2 -b configs/audiolcm.yaml --sample_rate 16000 --vocoder-ckpt vocoder/logs/bigvnat16k93.5w --outdir results --test-dataset audiocaps -r ckpt/audiolcm.ckpt
```
# Train
## Dataset preparation
Due to copyright issues, we can't provide dataset download links, but we do provide the preprocessing code to generate mel-spectrograms.
Before training, we need to collect the dataset information into a tsv file with the following columns: name (an id for each audio clip), dataset (which dataset the clip belongs to), audio_path (the path of the .wav file), caption (the caption of the audio), and mel_path (the path of the processed mel-spectrogram file for each clip). We provide a tsv file of the AudioCaps test set, ./audiocaps_test_16000_struct.tsv, as a sample.
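For illustration only (the id, path, and caption below are made up), a starting tsv with the name/dataset/audio_path/caption columns could be assembled with pandas:

```python
import pandas as pd

# Hypothetical rows: replace with your own ids, audio paths, and captions.
# The mel_path and duration columns are added by the preprocessing steps below.
rows = [
    {
        "name": "audio_0001",                     # unique id for each audio clip
        "dataset": "audiocaps",                   # which dataset the clip belongs to
        "audio_path": "data/audiocaps/0001.wav",  # path to the .wav file
        "caption": "a dog barks while cars pass by",
    },
]
pd.DataFrame(rows).to_csv("tmp.tsv", sep="\t", index=False)
```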
### Generate the mel-spectrogram file of each audio clip
Assume you already have a tsv file linking each caption to its audio_path, i.e. the tsv file has "name", "audio_path", "dataset" and "caption" columns.
To get the mel-spectrogram of each audio clip, run the following command, which saves the mels in ./processed:
```bash
python ldm/data/preprocess/mel_spec.py --tsv_path tmp.tsv
```
Then add the duration of each clip to the tsv file:
```bash
python ldm/data/preprocess/add_duration.py
```
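Conceptually, this step measures each clip's length and writes it back into the tsv. A simplified sketch of the idea (the column name and implementation are assumptions, not the repository's actual script):

```python
import pandas as pd
import soundfile as sf

df = pd.read_csv("tmp.tsv", sep="\t")
# Duration in seconds, read from each wav file's header.
df["duration"] = [sf.info(path).duration for path in df["audio_path"]]
df.to_csv("tmp.tsv", sep="\t", index=False)
```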
## Train the variational autoencoder
Assume we have processed several datasets and saved the .tsv files in data/*.tsv. In the config file, replace **data.params.spec_dir_path** with **data** (the directory that contains the tsvs); a sketch of the edited entry is shown below, followed by the training command. If your machine has fewer than 8 GPUs, shorten the --gpus list accordingly (e.g. --gpus 0,1).
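Only the key path data.params.spec_dir_path is taken from the instruction above; the surrounding YAML layout is an assumption:

```yaml
data:
  params:
    spec_dir_path: data   # directory containing the processed .tsv files
```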
```bash
python main.py --base configs/train/vae.yaml -t --gpus 0,1,2,3,4,5,6,7
```
The training results will be saved in ./logs/.
## Train the latent diffusion model
After training the VAE, replace model.params.first_stage_config.params.ckpt_path with your trained VAE checkpoint path in the config file.
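As with the VAE config, only the key path comes from the instruction above; the surrounding layout and the checkpoint name below are assumptions:

```yaml
model:
  params:
    first_stage_config:
      params:
        ckpt_path: logs/<your_vae_run>/checkpoints/last.ckpt   # your trained VAE checkpoint
```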
Then run the following command to train the diffusion model:
```bash
python main.py --base configs/autoencoder1d.yaml -t --gpus 0,1,2,3,4,5,6,7
```
The training results will be saved in ./logs/.
# Evaluation
## Generate AudioCaps samples
```bash
python scripts/txt2audio_for_lcm.py --ddim_steps 2 -b configs/audiolcm.yaml --sample_rate 16000 --vocoder-ckpt vocoder/logs/bigvnat16k93.5w --outdir results --test-dataset audiocaps -r ckpt/audiolcm.ckpt
```

## Calculate FD, FAD, IS, and KL
Install [audioldm_eval](https://github.com/haoheliu/audioldm_eval) by running:
```bash
git clone git@github.com:haoheliu/audioldm_eval.git
```
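Depending on how the package is laid out, you may also need to install it into your environment, for example:

```bash
# Editable install so the evaluation scripts can import audioldm_eval
cd audioldm_eval
pip install -e .
```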
Then test with:
```bash
python scripts/test.py --pred_wavsdir {directory with the audios you generated} --gt_wavsdir {directory with the audiocaps test set wavs}
```
## Calculate CLAP score
```bash
python wav_evaluation/cal_clap_score.py --tsv_path {directory with the audios you generated}/result.tsv
```

## Acknowledgements
This implementation uses parts of the code from the following GitHub repos:
[Make-An-Audio](https://github.com/Text-to-Audio/Make-An-Audio),
[CLAP](https://github.com/LAION-AI/CLAP),
[Stable Diffusion](https://github.com/CompVis/stable-diffusion),
as described in our code.

<!-- ## Citations ##
If you find this code useful in your research, please consider citing:
```bibtex
@article{huang2023make,
  title={Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models},
  author={Huang, Rongjie and Huang, Jiawei and Yang, Dongchao and Ren, Yi and Liu, Luping and Li, Mingze and Ye, Zhenhui and Liu, Jinglin and Yin, Xiang and Zhao, Zhou},
  journal={arXiv preprint arXiv:2301.12661},
  year={2023}
}
``` -->

# Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.