forked from X-LANCE/SLAM-LLM
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request X-LANCE#52 from ddlBoJack/dev-mzy
update example/asr_librispeech
Showing 8 changed files with 201 additions and 35 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
# ASR_Librispeech | ||
|
||
## Performance and checkpoints | ||
We only train the linear projector in this recipe. | ||
Encoder | Projector | LLM | test-clean | test-other | ||
|---|---|---|---|--- | ||
[WavLM-large](https://drive.google.com/file/d/12-cB34qCTvByWT-QtOcZaqwwO21FLSqU/view) | [Linear]()(~18.88M) | [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) | 2.28 | 4.78 | ||
|
||
|
||
## Data preparation | ||
Prepare the data as a jsonl file with one JSON object per line, in the following format. | ||
``` | ||
{"key": "1001-134707-0000_ASR", "source": "/data/open_data/librispeech_audio/audio/librispeech_1001-134707-0000.wav", "target": "1 little recks the laborer. How near his work is holding him to God, The loving laborer through space and time, after all, not to create, only or found only."} | ||
... | ||
{"key": "1001-134707-0000_ASR", "source": "/data/open_data/librispeech_audio/audio/librispeech_1001-134707-0000.wav", "target": "1 little recks the laborer. How near his work is holding him to God, The loving laborer through space and time, after all, not to create, only or found only."} | ||
``` | ||
|
||
## Decode with checkpoints | ||
``` | ||
bash decode_wavlm_large_linear_vicuna_7b.sh | ||
``` | ||
Before running the script, set the paths `speech_encoder_path`, `llm_path`, `output_dir`, `ckpt_path`, `val_data_path` and `decode_log` in it to match your environment. | ||
|
||
## Train a new model | ||
|
||
### Use whisper as the encoder | ||
``` | ||
bash finetune_whisper_large_linear_vicuna_7b.sh | ||
``` | ||
Whisper takes mel spectrograms as input. Set the key `dataset_config.mel_size` to match the version of the Whisper model you use, since the expected value differs across the Whisper model family. | ||
|
||
### Use a self-supervised model (such as WavLM) as the encoder | ||
``` | ||
bash finetune_wavlm_large_linear_vicuna_7b.sh | ||
``` | ||
WavLM takes the raw waveform as input. Check the keys `dataset_config.normalize` and `model_config.normalize` when switching SSL models, because the correct values differ from one SSL model to another. |
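
The jsonl format shown in the README's data preparation section can be produced with a few lines of Python. The following is a minimal sketch, not the project's own tooling: the helper names `manifest_line` and `write_manifest` are hypothetical, and the `_ASR` key suffix simply mirrors the example record in the README.

```python
import json


def manifest_line(utt_id, wav_path, transcript):
    """Build one jsonl record with the keys shown in the README example.

    The "_ASR" suffix copies the example key; it is an assumption that
    every record should carry it.
    """
    record = {
        "key": f"{utt_id}_ASR",
        "source": str(wav_path),
        "target": transcript,
    }
    # ensure_ascii=False keeps non-ASCII transcripts readable in the file.
    return json.dumps(record, ensure_ascii=False)


def write_manifest(utterances, out_path):
    """Write (utt_id, wav_path, transcript) tuples, one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for utt_id, wav_path, transcript in utterances:
            f.write(manifest_line(utt_id, wav_path, transcript) + "\n")
```

A call such as `write_manifest([("1001-134707-0000", "/path/to/audio.wav", "some transcript")], "train.jsonl")` would produce a file whose path can be passed as `train_data_path` or `val_data_path`; the paths in that call are placeholders.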
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,4 @@ | ||
dataset_config: | ||
# we put the prompt here because hydra overrides in the shell script only support a small subset of characters | ||
prompt: "Transcribe speech to text. Output the transcription directly without redundant content. Ensure that the output is not duplicated. " | ||
# prompt: "Transcribe speech to text. Output the transcription directly without redundant content. Ensure that the output is not duplicated. " | ||
prompt: "Transcribe speech to text. " |
58 changes: 58 additions & 0 deletions
examples/asr_librispeech/scripts/decode_wavlm_large_linear_vicuna_7b.sh
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
#!/bin/bash | ||
#export PYTHONPATH=/root/whisper:$PYTHONPATH | ||
export CUDA_VISIBLE_DEVICES=0 | ||
export TOKENIZERS_PARALLELISM=false | ||
# export CUDA_LAUNCH_BLOCKING=1 | ||
|
||
run_dir=/root/SLAM-LLM | ||
cd $run_dir | ||
code_dir=examples/asr_librispeech | ||
|
||
speech_encoder_path=/nfs/maziyang.mzy/models/wavlm/WavLM-Large.pt | ||
llm_path=/nfs/maziyang.mzy/models/vicuna-7b-v1.5 | ||
|
||
output_dir=/root/tmp/vicuna-7b-v1.5-librispeech-linear-steplrwarmupkeep1e-4-wavlm-large-20240426 | ||
ckpt_path=$output_dir/asr_epoch_1_step_1000 | ||
split=librispeech_test_clean | ||
val_data_path=/nfs/maziyang.mzy/data/librispeech/${split}.jsonl | ||
decode_log=$ckpt_path/decode_${split}_beam4 | ||
|
||
# -m debugpy --listen 5678 --wait-for-client | ||
python $code_dir/inference_asr_batch.py \ | ||
--config-path "conf" \ | ||
--config-name "prompt.yaml" \ | ||
hydra.run.dir=$ckpt_path \ | ||
++model_config.llm_name="vicuna-7b-v1.5" \ | ||
++model_config.llm_path=$llm_path \ | ||
++model_config.llm_dim=4096 \ | ||
++model_config.encoder_name=wavlm \ | ||
++model_config.normalize=true \ | ||
++dataset_config.normalize=true \ | ||
++model_config.encoder_projector_ds_rate=5 \ | ||
++model_config.encoder_path=$speech_encoder_path \ | ||
++model_config.encoder_dim=1024 \ | ||
++model_config.encoder_projector=linear \ | ||
++dataset_config.dataset=speech_dataset \ | ||
++dataset_config.val_data_path=$val_data_path \ | ||
++dataset_config.input_type=raw \ | ||
++dataset_config.inference_mode=true \ | ||
++train_config.model_name=asr \ | ||
++train_config.freeze_encoder=true \ | ||
++train_config.freeze_llm=true \ | ||
++train_config.batching_strategy=custom \ | ||
++train_config.num_epochs=1 \ | ||
++train_config.val_batch_size=4 \ | ||
++train_config.num_workers_dataloader=2 \ | ||
++train_config.output_dir=$output_dir \ | ||
++decode_log=$decode_log \ | ||
++ckpt_path=$ckpt_path/model.pt \ | ||
# ++peft_ckpt=$ckpt_path \ | ||
# ++train_config.use_peft=true \ | ||
# ++train_config.peft_config.r=32 \ | ||
# ++dataset_config.normalize=true \ | ||
# ++model_config.encoder_projector=q-former \ | ||
# ++dataset_config.fix_length_audio=64 \ | ||
|
||
python src/slam_llm/utils/whisper_tn.py ${decode_log}_gt ${decode_log}_gt.proc | ||
python src/slam_llm/utils/whisper_tn.py ${decode_log}_pred ${decode_log}_pred.proc | ||
python src/slam_llm/utils/compute_wer.py ${decode_log}_gt.proc ${decode_log}_pred.proc ${decode_log}.proc.wer |
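
The last three commands of the decode script normalize the reference and hypothesis files and then compute the word error rate (WER). For readers unfamiliar with the metric, here is a minimal sketch of WER as word-level edit distance divided by the number of reference words. It is an illustration only and is not the code in `compute_wer.py`; the function names are hypothetical, and the inputs are assumed to be already normalized, whitespace-tokenizable strings.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (sub/ins/del cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER: total word errors over total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words
```

Summing errors over the whole corpus before dividing, rather than averaging per-utterance rates, keeps long utterances from being underweighted.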
75 changes: 75 additions & 0 deletions
examples/asr_librispeech/scripts/finetune_wavlm_large_linear_vicuna_7b.sh
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
#!/bin/bash | ||
# export PYTHONPATH=/root/whisper:$PYTHONPATH | ||
export PYTHONPATH=/root/fairseq:$PYTHONPATH | ||
export CUDA_VISIBLE_DEVICES=0,1 | ||
export TOKENIZERS_PARALLELISM=false | ||
# export CUDA_LAUNCH_BLOCKING=1 | ||
export OMP_NUM_THREADS=1 | ||
|
||
# debug setting for multiple gpus | ||
# export NCCL_DEBUG=INFO | ||
# export NCCL_DEBUG_SUBSYS=ALL | ||
# export TORCH_DISTRIBUTED_DEBUG=INFO | ||
|
||
run_dir=/root/SLAM-LLM | ||
cd $run_dir | ||
code_dir=examples/asr_librispeech | ||
|
||
speech_encoder_path=/nfs/maziyang.mzy/models/wavlm/WavLM-Large.pt | ||
llm_path=/nfs/maziyang.mzy/models/vicuna-7b-v1.5 | ||
train_data_path=/nfs/maziyang.mzy/data/librispeech/librispeech_train_960h.jsonl | ||
val_data_path=/nfs/maziyang.mzy/data/librispeech/librispeech_dev_other.jsonl | ||
|
||
output_dir=/root/tmp/vicuna-7b-v1.5-librispeech-linear-steplrwarmupkeep1e-4-wavlm-large-$(date +"%Y%m%d") | ||
|
||
hydra_args=" | ||
hydra.run.dir=$output_dir \ | ||
++model_config.llm_name=vicuna-7b-v1.5 \ | ||
++model_config.llm_path=$llm_path \ | ||
++model_config.llm_dim=4096 \ | ||
++model_config.encoder_name=wavlm \ | ||
++model_config.normalize=true \ | ||
++dataset_config.normalize=true \ | ||
++model_config.encoder_projector_ds_rate=5 \ | ||
++model_config.encoder_path=$speech_encoder_path \ | ||
++model_config.encoder_dim=1024 \ | ||
++model_config.encoder_projector=linear \ | ||
++dataset_config.dataset=speech_dataset \ | ||
++dataset_config.train_data_path=$train_data_path \ | ||
++dataset_config.val_data_path=$val_data_path \ | ||
++dataset_config.input_type=raw \ | ||
++train_config.model_name=asr \ | ||
++train_config.num_epochs=3 \ | ||
++train_config.freeze_encoder=true \ | ||
++train_config.freeze_llm=true \ | ||
++train_config.batching_strategy=custom \ | ||
++train_config.warmup_steps=1000 \ | ||
++train_config.total_steps=100000 \ | ||
++train_config.lr=1e-4 \ | ||
++train_config.validation_interval=1000 \ | ||
++train_config.batch_size_training=4 \ | ||
++train_config.val_batch_size=4 \ | ||
++train_config.num_workers_dataloader=2 \ | ||
++train_config.output_dir=$output_dir \ | ||
++metric=acc \ | ||
" | ||
|
||
# -m debugpy --listen 5678 --wait-for-client | ||
if [[ $CUDA_VISIBLE_DEVICES != *","* ]]; then | ||
python -m debugpy --listen 5678 --wait-for-client $code_dir/finetune_asr.py \ | ||
--config-path "conf" \ | ||
--config-name "prompt.yaml" \ | ||
$hydra_args | ||
else | ||
torchrun \ | ||
--nnodes 1 \ | ||
--nproc_per_node 2 \ | ||
--master_port=29503 \ | ||
$code_dir/finetune_asr.py \ | ||
--config-path "conf" \ | ||
--config-name "prompt.yaml" \ | ||
++train_config.enable_fsdp=false \ | ||
++train_config.enable_ddp=true \ | ||
++train_config.use_fp16=true \ | ||
$hydra_args | ||
fi |
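
Both scripts set `encoder_dim=1024`, `encoder_projector_ds_rate=5` and `llm_dim=4096`, and the README reports a linear projector of about 18.88M parameters. The sketch below checks that figure under two assumptions that this commit does not state: that the projector concatenates `ds_rate` consecutive encoder frames and passes them through two linear layers, and that the hidden width between those layers is 2048. The function names are hypothetical.

```python
def linear_projector_params(encoder_dim, ds_rate, hidden_dim, llm_dim):
    """Parameter count of a two-layer projector over stacked frames.

    Assumed layout (not confirmed by this commit):
    Linear(encoder_dim * ds_rate -> hidden_dim), then
    Linear(hidden_dim -> llm_dim), each with a bias.
    """
    in_dim = encoder_dim * ds_rate
    first = in_dim * hidden_dim + hidden_dim
    second = hidden_dim * llm_dim + llm_dim
    return first + second


def projected_length(num_frames, ds_rate):
    """Frames reaching the LLM, assuming leftover frames are dropped."""
    return num_frames // ds_rate


total = linear_projector_params(encoder_dim=1024, ds_rate=5,
                                hidden_dim=2048, llm_dim=4096)
print(f"{total / 1e6:.2f}M parameters")
```

Under these assumptions the count is 18,880,512, which is consistent with the figure in the README, although it does not prove that the actual projector has this layout.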