Automatically create datasets in LJ Speech format for training TTS (text-to-speech) models. LJ Speech is a common standard supported by TTS frameworks such as Tortoise and Piper.
Segments detected in the VAD step of WhisperX are used to create short samples in `.wav` format, and WhisperX's ASR is used to create the corresponding transcriptions.
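For intuition, here is a minimal sketch of this pipeline using the WhisperX Python API. The actual steps in `create_dataset.py` may differ; the input file name, output paths, and naming scheme below are illustrative assumptions.

```python
# Sketch: cut VAD/ASR segments from a long recording into LJ Speech-style samples.
# Assumes `pip install whisperx soundfile`; file names here are illustrative only.
import os
import soundfile as sf
import whisperx

SAMPLE_RATE = 16000  # WhisperX loads audio at 16 kHz

model = whisperx.load_model("base", device="cpu", compute_type="int8")
audio = whisperx.load_audio("input_audio/recording.wav")  # hypothetical input file
result = model.transcribe(audio)  # segments come from WhisperX's internal VAD

os.makedirs("output/audio", exist_ok=True)
with open("output/metadata.csv", "w", encoding="utf-8") as meta:
    for i, seg in enumerate(result["segments"], start=1):
        # Slice the segment out of the full waveform and save it as a short .wav
        clip = audio[int(seg["start"] * SAMPLE_RATE):int(seg["end"] * SAMPLE_RATE)]
        name = f"000001_{i:06d}"  # assumed naming scheme: file index _ segment index
        sf.write(f"output/audio/{name}.wav", clip, SAMPLE_RATE)
        meta.write(f"{name}|{seg['text'].strip()}\n")
```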
Install the dependencies:

```
pip install -r requirements.txt
```
- Put your (possibly long) audio files containing spoken audio into `input_audio`
- Run
  ```
  python create_dataset.py --model base --gpu 0 --input input_audio --output output
  ```
  if you have a GPU available, or on CPU:
  ```
  python create_dataset.py --model tiny --cpu --input input_audio --output output
  ```
- Output:
  - Processed audio samples are saved as `.wav` files in the `output/audio` directory
  - A `metadata.csv` file is generated, containing entries in the following format (see the loading sketch after this list):
    ```
    000001_000001|Transcribed text of the first audio sample.
    000001_000002|Transcribed text of the second audio sample.
    ...
    ```
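To consume the generated dataset in a training script, the pipe-delimited `metadata.csv` can be read like this. This loader is a sketch, not part of the repository; the paths assume the default `--output` directory shown above.

```python
# Sketch: pair each transcript in metadata.csv with its .wav file.
from pathlib import Path

def load_metadata(output_dir: str = "output"):
    """Yield (wav_path, transcript) pairs from a pipe-delimited metadata.csv."""
    base = Path(output_dir)
    with open(base / "metadata.csv", encoding="utf-8") as f:
        for line in f:
            # Split on the first "|" only, in case a transcript contains one
            file_id, text = line.rstrip("\n").split("|", 1)
            yield base / "audio" / f"{file_id}.wav", text

if __name__ == "__main__":
    for wav_path, text in load_metadata():
        print(wav_path, "->", text)
```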
- Build the Docker image:
  ```
  docker build -t whisperx4ljspeech .
  ```
- Run (a CPU-only variant is sketched after this list):
  ```
  docker run --gpus '"device=0"' -v $(pwd)/input_audio:/app/input_audio -v $(pwd)/output:/app/output whisperx4ljspeech --input input_audio/ --output output --gpu 0 --model large-v3 --language de
  ```
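On a machine without a GPU, the same image should work by dropping `--gpus` and passing the `--cpu` flag instead of `--gpu 0`, mirroring the CPU invocation shown earlier. This variant is an assumption and has not been verified against the image's entrypoint:

```
docker run -v $(pwd)/input_audio:/app/input_audio -v $(pwd)/output:/app/output whisperx4ljspeech --input input_audio/ --output output --cpu --model tiny
```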