This repository contains the PyTorch implementation of the following paper:
Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
Jeong Hun Yeo, Chae Won Kim, Hyunjun Kim, Hyeongseop Rha, Seunghee Han, Wen-Huang Cheng, Yong Man Ro
[Paper](https://arxiv.org/abs/2409.00986)
We propose a novel speaker-adaptive lip reading method that adapts a pre-trained lip reading model to target speakers at both vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt the model to target speakers. In addition, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3.
```bash
conda create -n Personalized-Lip-Reading python=3.9 -y
conda activate Personalized-Lip-Reading
git clone https://github.com/JeongHun0716/Personalized-Lip-Reading
cd Personalized-Lip-Reading

# PyTorch and related packages
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
# If your pip version is greater than 24.1, first run: python3 -m pip install --upgrade pip==24.0
pip install numpy==1.23.5 scipy opencv-python
pip install editdistance python_speech_features einops soundfile sentencepiece tqdm tensorboard unidecode librosa
pip install omegaconf==2.0.6 hydra-core==1.0.7
pip install transformers peft bitsandbytes
cd fairseq
pip install --editable ./
```
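After installation, an optional sanity check (not part of the original instructions) can confirm that the CUDA build of PyTorch and the editable fairseq install are importable:

```bash
# Print the installed PyTorch version and whether CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# Confirm that the editable fairseq install is importable
python -c "import fairseq; print('fairseq imported from', fairseq.__file__)"
```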
To validate speaker-adaptive lip reading methods in real-world scenarios, we propose a new dataset named VoxLRS-SA. To download it, please refer to the VoxLRS-SA dataset repository.
- Baseline lip-reading model
To train the baseline lip-reading model, download the checkpoints from the links below and move them to the corresponding target directories (a directory-preparation sketch follows the tables).
Conformer Encoder Model | Training Datasets | Target Directory |
---|---|---|
vsr_trlrs3_base.pth | LRS3 | src/pretrained_models/conformer_encoder/pretrained_lrs3 |
Large Language Model | Target Directory |
---|---|
Meta-Llama-3-8B | src/pretrained_models/llm |
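A possible way to prepare these directories is sketched below. The source path of `vsr_trlrs3_base.pth` is a placeholder, and the Meta-Llama-3-8B weights are gated on Hugging Face, so this assumes you have accepted the license and logged in (e.g., with `huggingface-cli login`); check the repository scripts for the exact layout they expect.

```bash
# Create the expected target directories
mkdir -p src/pretrained_models/conformer_encoder/pretrained_lrs3
mkdir -p src/pretrained_models/llm

# Move the downloaded Conformer encoder checkpoint into place
# (replace the source path with wherever you saved the file)
mv /path/to/vsr_trlrs3_base.pth src/pretrained_models/conformer_encoder/pretrained_lrs3/

# Download Meta-Llama-3-8B from Hugging Face into the LLM directory
huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir src/pretrained_models/llm
```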
```bash
# Train
bash scripts/train/baseline/train.sh

# Evaluation
bash scripts/eval/baseline/eval.sh
```
The pre-trained baseline lip-reading model can be downloaded from the link below and should be placed in the listed target directory (see the example after the table).
Baseline Model | Training Datasets | WER(%) | Target Directory |
---|---|---|---|
best_ckpt.pt | VoxLRS-SA | 47.3 | src/pretrained_models/conformer_encoder/pretrained_w_llm |
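For instance (a sketch; the source path of `best_ckpt.pt` is a placeholder for wherever you saved the download):

```bash
# Place the released baseline checkpoint in its target directory
mkdir -p src/pretrained_models/conformer_encoder/pretrained_w_llm
mv /path/to/best_ckpt.pt src/pretrained_models/conformer_encoder/pretrained_w_llm/
```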
- Vision-Level Adaptation to the Target Speaker
```bash
# Train
bash scripts/train/adaptation/vision_level/train.sh

# Evaluation
bash scripts/eval/adaptation/vision_level/eval.sh
```
The pre-trained vision-adapted model can be downloaded from the link below.
Vision Adapted Model | Training Datasets | WER(%) | Target Directory |
---|---|---|---|
best_ckpts.zip | VoxLRS-SA | 41.5 | src/pretrained_models/adapted_model/vision |
- Vision & Language-Level Adaptation to the Target Speaker
```bash
# Train
bash scripts/train/adaptation/vision_language_level/train.sh

# Evaluation
bash scripts/eval/adaptation/vision_language_level/eval.sh
```
The pre-trained vision- and language-adapted model can be downloaded from the link below.
Vision & Language Adapted Model | Training Datasets | WER(%) | Target Directory |
---|---|---|---|
best_ckpts.zip | VoxLRS-SA | 40.9 | src/pretrained_models/adapted_model/vision_language |
To evaluate performance on the VoxLRS-SA dataset, the adapted pre-trained model archives must be unzipped into their respective target directories, as shown in the example below.
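For example, assuming each `best_ckpts.zip` archive has been downloaded separately (the source paths are placeholders):

```bash
# Vision-level adapted checkpoints
unzip /path/to/vision/best_ckpts.zip -d src/pretrained_models/adapted_model/vision

# Vision & language-level adapted checkpoints
unzip /path/to/vision_language/best_ckpts.zip -d src/pretrained_models/adapted_model/vision_language
```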
If you find this work useful in your research, please cite the paper:
```bibtex
@article{yeo2024personalized,
  title={Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language},
  author={Yeo, Jeong Hun and Kim, Chae Won and Kim, Hyunjun and Rha, Hyeongseop and Han, Seunghee and Cheng, Wen-Huang and Ro, Yong Man},
  journal={arXiv preprint arXiv:2409.00986},
  year={2024}
}
```
This project is based on the avhubert, espnet, and fairseq code. We would like to thank the developers of these projects for their contributions and the open-source community for making this work possible.