Update (230629)
gmltmd789 committed Jun 29, 2023
1 parent 115d465 commit a3a5d80
Showing 12 changed files with 509 additions and 96 deletions.
13 changes: 10 additions & 3 deletions README.md
@@ -4,7 +4,12 @@
### [Paper](https://arxiv.org/abs/2306.16083)
### [Audio demo](https://unitspeech.github.io/)

## Updates (Updated components compared to the version of INTERSPEECH.)
## Updates
### 2023.06.29 : We updated our code and checkpoints for better pronunciation!
- **Extract reference speaker embeddings using the [WavLM](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification#pre-trained-models)-based speaker encoder.**
- **Model the mel-spectrogram normalized to the range −1 to 1 (see the sketch below).**

### 2023.06.28 : Updated components compared to the INTERSPEECH version.
- **Change in vocoder (from HiFi-GAN to BigVGAN).**
- **Support for speaker classifier-free guidance (advantageous for adapting to more unique voices).**
- **Change "training-free text classifier-free guidance" to "text classifier-free guidance" (learning an unconditional text embedding).**
@@ -13,6 +18,7 @@
- **To improve TTS (Text-to-Speech) pronunciation, an IPA-based phonemizer is used.**
- **To improve VC (Voice Conversion) pronunciation, a contentvec encoder is introduced.**
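
For reference, the new mel normalization maps each mel bin into [−1, 1] using per-bin statistics shipped in `unitspeech/parameters/`. A minimal sketch of the mapping and its inverse, mirroring the code added in this commit (the helper names are illustrative; the repo inlines these expressions):

```python
import torch

# Per-bin statistics added in this commit, reshaped to broadcast over a
# [batch, n_mels, frames] mel-spectrogram.
mel_min = torch.load("unitspeech/parameters/mel_min.pt").unsqueeze(0).unsqueeze(-1)
mel_max = torch.load("unitspeech/parameters/mel_max.pt").unsqueeze(0).unsqueeze(-1)

def normalize_mel(mel: torch.Tensor) -> torch.Tensor:
    # Map each bin from [mel_min, mel_max] to [-1, 1] before modeling.
    return (mel - mel_min) / (mel_max - mel_min) * 2 - 1

def denormalize_mel(mel: torch.Tensor) -> torch.Tensor:
    # Invert the mapping before handing the mel to the vocoder.
    return (mel + 1) / 2 * (mel_max - mel_min) + mel_min
```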


# Warning: Ethical & Legal Considerations
1. **UnitSpeech was created with the primary objective of facilitating research endeavors.**
2. **When utilizing samples generated using this model, it is crucial to clearly disclose that the samples were generated using AI technology. Additionally, it is necessary to provide the sources of the audio used in the generation process.**
@@ -38,6 +44,7 @@ conda activate unitspeech
git clone https://github.com/gmltmd789/UnitSpeech.git
cd UnitSpeech
pip install -e .
pip install --no-deps s3prl==0.4.10
```

## Pretrained Models
@@ -49,7 +56,7 @@ pip install -e .
|text_encoder.pt|Used for adaptive text-to-speech tasks.|
|duration_predictor.pt|Used for adaptive text-to-speech tasks.|
|pretrained_decoder.pt|Used for all adaptive speech synthesis tasks.|
|speaker_encoder.pt|Used for extracting speaker embeddings.|
|speaker_encoder.pt|Used for extracting [speaker embeddings](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification#pre-trained-models).|
|bigvgan.pt|[Vocoder](https://github.com/NVIDIA/BigVGAN) checkpoint.|
|bigvgan-config.json|Configuration for the vocoder.|

@@ -131,7 +138,7 @@ The code and model weights of UnitSpeech are released under the CC BY-NC-SA 4.0
* [VITS](https://github.com/jaywalnut310/vits) (for text & IPA phoneme sequence processing)
* [Grad-TTS](https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS) (for overall architecture and code)
* [denoising-diffusion-pytorch](https://github.com/rosinality/denoising-diffusion-pytorch) (for diffusion-based sampler)
* [Pytorch_Speaker_Verification](https://github.com/HarryVolek/PyTorch_Speaker_Verification) (for speaker embedding extraction)
* [WavLM](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification) (for speaker embedding extraction)

## Citation
```
32 changes: 19 additions & 13 deletions notebooks/playground.ipynb
@@ -12,7 +12,7 @@
"text": [
"/home/gmltmd789/anaconda3/envs/unitspeech/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n",
"2023-06-27 17:47:39 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX\n"
"2023-06-29 22:43:06 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX\n"
]
}
],
@@ -30,8 +30,6 @@
" path = \"../\"\n",
" os.chdir(path)\n",
" \n",
"os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\" \n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"2\"\n",
"import phonemizer\n",
"import random\n",
"from scipy.io.wavfile import write\n",
@@ -43,8 +41,7 @@
"from unitspeech.unitspeech import UnitSpeech\n",
"from unitspeech.duration_predictor import DurationPredictor\n",
"from unitspeech.encoder import Encoder\n",
"from unitspeech.speaker_encoder.speaker_encoder import SpeechEmbedder\n",
"from unitspeech.speaker_encoder.util import extract_speaker_embedding\n",
"from unitspeech.speaker_encoder.ecapa_tdnn import ECAPA_TDNN_SMALL\n",
"from unitspeech.text import cleaned_text_to_sequence, phonemize, symbols\n",
"from unitspeech.textlesslib.textless.data.speech_encoder import SpeechEncoder\n",
"from unitspeech.util import HParams, fix_len_compatibility, intersperse, process_unit, generate_path, sequence_mask\n",
@@ -136,9 +133,9 @@
"\n",
"# Speaker Encoder for extracting speaker embedding\n",
"print('Initializing Speaker Encoder...')\n",
"spk_embedder = SpeechEmbedder()\n",
"spk_embedder = ECAPA_TDNN_SMALL(feat_dim=1024, feat_type=\"wavlm_large\", config_path=None)\n",
"state_dict = torch.load(speaker_encoder_path, map_location=lambda storage, loc: storage)\n",
"spk_embedder.load_state_dict(state_dict['embedder_net'], strict=False)\n",
"spk_embedder.load_state_dict(state_dict['model'], strict=False)\n",
"_ = spk_embedder.cuda().eval()\n",
"\n",
"# Unit Extractor for extraction unit and duration, which are used for finetuning\n",
@@ -163,20 +160,25 @@
"metadata": {},
"outputs": [],
"source": [
"# Load the normalization parameters for mel-spectrogram normalization.\n",
"mel_min = torch.load(\"unitspeech/parameters/mel_min.pt\").unsqueeze(0).unsqueeze(-1)\n",
"mel_max = torch.load(\"unitspeech/parameters/mel_max.pt\").unsqueeze(0).unsqueeze(-1)\n",
"\n",
"# Preprocess the reference audio in a format suitable for fine-tuning\n",
"wav, sr = librosa.load(reference_path)\n",
"wav = torch.FloatTensor(wav)\n",
"spk_emb = extract_speaker_embedding(wav, spk_embedder, **hps_finetune.data).unsqueeze(0)\n",
"wav = wav.unsqueeze(0)\n",
"wav = torch.FloatTensor(wav).unsqueeze(0)\n",
"mel = mel_spectrogram(wav, hps_finetune.data.n_fft, hps_finetune.data.n_feats, hps_finetune.data.sampling_rate, hps_finetune.data.hop_length,\n",
" hps_finetune.data.win_length, hps_finetune.data.mel_fmin, hps_finetune.data.mel_fmax, center=False)\n",
"\n",
"with torch.no_grad():\n",
" reference_audio = vocoder.forward(mel.cuda()).cpu().squeeze().clamp(-1, 1).numpy()\n",
"\n",
" \n",
"mel = (mel - mel_min) / (mel_max - mel_min) * 2 - 1\n",
"mel = mel.cuda()\n",
"resample_fn = torchaudio.transforms.Resample(sr, 16000).cuda()\n",
"wav = resample_fn(wav.cuda())\n",
"spk_emb = spk_embedder(wav)\n",
"spk_emb = spk_emb / spk_emb.norm()\n",
"\n",
"# Extract the units and unit durations to be used for fine-tuning.\n",
"encoded = unit_extractor(wav.to(\"cuda\"))\n",
@@ -544,6 +546,8 @@
" phoneme, phoneme_lengths, spk_emb, len(hps_tts.decoder.dim_mults) - 1,\n",
" diffusion_step, text_gradient_scale, spk_gradient_scale, length_scale\n",
" )\n",
" mel_generated = ((mel_generated + 1) / 2 * (mel_max.to(mel_generated.device) - mel_min.to(mel_generated.device))\n",
" + mel_min.to(mel_generated.device))\n",
" synthesized_audio = vocoder.forward(mel_generated).cpu().squeeze().clamp(-1, 1).numpy()\n",
"\n",
"print('Reference voice to adapt to')\n",
@@ -783,8 +787,10 @@
" contentvec_encoder, unitspeech,\n",
" contentvec, contentvec_length, mel_length, spk_emb, len(hps_vc.decoder.dim_mults) - 1,\n",
" diffusion_step, text_gradient_scale, spk_gradient_scale\n",
" )\n",
"\n",
" ) \n",
" mel_generated = ((mel_generated + 1) / 2 * (mel_max.to(mel_generated.device) - mel_min.to(mel_generated.device))\n",
" + mel_min.to(mel_generated.device))\n",
" \n",
" synthesized_audio = vocoder.forward(mel_generated).cpu().squeeze().clamp(-1, 1).numpy()\n",
"\n",
"print('Reference voice to adapt to')\n",
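Each generation cell in the updated notebook now ends the same way: the diffusion decoder emits a mel in the normalized [−1, 1] range, which is mapped back to the vocoder's scale before BigVGAN runs. The same pattern recurs in `scripts/text_to_speech.py` and `scripts/voice_conversion.py` below. A minimal sketch, assuming `mel_generated`, `vocoder`, `mel_min`, and `mel_max` are already set up as in the diff:

```python
import torch

with torch.no_grad():
    # Undo the [-1, 1] normalization with the training-time statistics.
    mel_min_d = mel_min.to(mel_generated.device)
    mel_max_d = mel_max.to(mel_generated.device)
    mel_generated = (mel_generated + 1) / 2 * (mel_max_d - mel_min_d) + mel_min_d

    # BigVGAN consumes the unnormalized mel; clamp the waveform to [-1, 1].
    synthesized_audio = vocoder.forward(mel_generated).cpu().squeeze().clamp(-1, 1).numpy()
```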
18 changes: 11 additions & 7 deletions scripts/finetune.py
@@ -9,8 +9,7 @@

from unitspeech.unitspeech import UnitSpeech
from unitspeech.encoder import Encoder
from unitspeech.speaker_encoder.speaker_encoder import SpeechEmbedder
from unitspeech.speaker_encoder.util import extract_speaker_embedding
from unitspeech.speaker_encoder.ecapa_tdnn import ECAPA_TDNN_SMALL
from unitspeech.textlesslib.textless.data.speech_encoder import SpeechEncoder
from unitspeech.util import HParams, fix_len_compatibility, process_unit, generate_path, sequence_mask
from unitspeech.vocoder.env import AttrDict
@@ -77,9 +76,9 @@ def main(args, hps):
vocoder.remove_weight_norm()

print('Initializing Speaker Encoder...')
spk_embedder = SpeechEmbedder()
spk_embedder = ECAPA_TDNN_SMALL(feat_dim=1024, feat_type="wavlm_large", config_path=None)
state_dict = torch.load(args.speaker_encoder_path, map_location=lambda storage, loc: storage)
spk_embedder.load_state_dict(state_dict['embedder_net'], strict=False)
spk_embedder.load_state_dict(state_dict['model'], strict=False)
_ = spk_embedder.cuda().eval()

print('Initializing Unit Extractor...')
@@ -95,16 +94,21 @@
)
_ = unit_extractor.cuda().eval()

# Load the normalization parameters for mel-spectrogram normalization.
mel_min = torch.load("unitspeech/parameters/mel_min.pt").unsqueeze(0).unsqueeze(-1)
mel_max = torch.load("unitspeech/parameters/mel_max.pt").unsqueeze(0).unsqueeze(-1)

# Load the reference audio and extract mel-spectrogram and speaker embeddings.
wav, sr = librosa.load(args.reference_path)
wav = torch.FloatTensor(wav)
spk_emb = extract_speaker_embedding(wav, spk_embedder, **hps.data).unsqueeze(0)
wav = wav.unsqueeze(0)
wav = torch.FloatTensor(wav).unsqueeze(0)
mel = mel_spectrogram(wav, hps.data.n_fft, hps.data.n_feats, hps.data.sampling_rate, hps.data.hop_length,
hps.data.win_length, hps.data.mel_fmin, hps.data.mel_fmax, center=False)
mel = (mel - mel_min) / (mel_max - mel_min) * 2 - 1
mel = mel.cuda()
resample_fn = torchaudio.transforms.Resample(sr, 16000).cuda()
wav = resample_fn(wav.cuda())
spk_emb = spk_embedder(wav)
spk_emb = spk_emb / spk_emb.norm()

# Extract the units and unit durations to be used for fine-tuning.
encoded = unit_extractor(wav.to("cuda"))
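`finetune.py` now derives the speaker embedding from a WavLM-based ECAPA-TDNN rather than the previous GE2E-style `SpeechEmbedder`. A minimal end-to-end sketch of the new path, assuming the released `speaker_encoder.pt` stores its weights under the `'model'` key as the diff indicates; the reference-audio path is hypothetical:

```python
import librosa
import torch
import torchaudio

from unitspeech.speaker_encoder.ecapa_tdnn import ECAPA_TDNN_SMALL

# ECAPA-TDNN head on top of WavLM-large features (1024-dim).
spk_embedder = ECAPA_TDNN_SMALL(feat_dim=1024, feat_type="wavlm_large", config_path=None)
state_dict = torch.load("speaker_encoder.pt", map_location="cpu")
spk_embedder.load_state_dict(state_dict["model"], strict=False)
spk_embedder = spk_embedder.cuda().eval()

# Hypothetical reference clip; librosa loads at its default 22,050 Hz.
wav, sr = librosa.load("reference.wav")
wav = torch.FloatTensor(wav).unsqueeze(0).cuda()
wav_16k = torchaudio.transforms.Resample(sr, 16000).cuda()(wav)  # WavLM expects 16 kHz input

with torch.no_grad():
    spk_emb = spk_embedder(wav_16k)
    spk_emb = spk_emb / spk_emb.norm()  # unit-normalize, as in the diff
```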
6 changes: 6 additions & 0 deletions scripts/text_to_speech.py
@@ -52,6 +52,10 @@ def main(args, hps):
language='en-us', preserve_punctuation=True, with_stress=True, language_switch="remove-flags"
)

# Load the normalization parameters for mel-spectrogram normalization.
mel_min = torch.load("unitspeech/parameters/mel_min.pt").unsqueeze(0).unsqueeze(-1)
mel_max = torch.load("unitspeech/parameters/mel_max.pt").unsqueeze(0).unsqueeze(-1)

# Initialize & load model
text_encoder = Encoder(
n_vocab=len(symbols) + 1,
@@ -103,6 +107,8 @@
phoneme, phoneme_lengths, spk_emb, len(hps.decoder.dim_mults) - 1
)

mel_generated = ((mel_generated + 1) / 2 * (mel_max.to(mel_generated.device) - mel_min.to(mel_generated.device))
+ mel_min.to(mel_generated.device))
audio_generated = vocoder.forward(mel_generated).cpu().squeeze().clamp(-1, 1).numpy()

if "/" in args.generated_sample_path:
7 changes: 7 additions & 0 deletions scripts/voice_conversion.py
@@ -58,6 +58,10 @@ def main(args, hps):
contentvec_extractor = HubertModelWithFinalProj.from_pretrained("lengyue233/content-vec-best")
_ = contentvec_extractor.cuda().eval()

# Load the normalization parameters for mel-spectrogram normalization.
mel_min = torch.load("unitspeech/parameters/mel_min.pt").unsqueeze(0).unsqueeze(-1)
mel_max = torch.load("unitspeech/parameters/mel_max.pt").unsqueeze(0).unsqueeze(-1)

wav, sr = librosa.load(args.source_path)
wav = torch.FloatTensor(wav).unsqueeze(0)
resample_fn = torchaudio.transforms.Resample(sr, 16000).to("cuda")
@@ -107,6 +111,9 @@
contentvec, contentvec_length, mel_length, spk_emb, len(hps.decoder.dim_mults) - 1
)

mel_generated = ((mel_generated + 1) / 2 * (mel_max.to(mel_generated.device) - mel_min.to(mel_generated.device))
+ mel_min.to(mel_generated.device))

audio_generated = vocoder.forward(mel_generated).cpu().squeeze().clamp(-1, 1).numpy()

if "/" in args.generated_sample_path:
Binary file added unitspeech/parameters/mel_max.pt
Binary file not shown.
Binary file added unitspeech/parameters/mel_min.pt
Binary file not shown.