Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language

This repository contains the PyTorch implementation of the following paper:

Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
Jeong Hun Yeo, Chae Won Kim, Hyunjun Kim, Hyeongseop Rha, Seunghee Han, Wen-Huang Cheng, Yong Man Ro
[Paper]

Introduction

We propose a novel speaker-adaptive lip reading method that adapts a pre-trained lip reading model to target speakers at both vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt the model to target speakers. In addition, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3.
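For intuition, the sketch below illustrates the two adaptation levels in PyTorch: learnable visual prompts prepended to the features of a frozen pre-trained encoder (vision level), and LoRA adapters injected into the LLM through the peft library (language level). This is a minimal sketch, not the repository's implementation; the encoder/LLM interfaces, target module names, and hyperparameters are assumptions.

# Minimal sketch of the two adaptation levels described above, NOT the
# repository's implementation. Encoder/LLM interfaces, target_modules names,
# and all hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class VisionPromptedEncoder(nn.Module):
    """Vision-level adaptation: learnable speaker-specific prompt tokens are
    prepended to the features of a frozen pre-trained visual encoder."""
    def __init__(self, encoder: nn.Module, dim: int, num_prompts: int = 16):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                    # keep the pre-trained encoder frozen
        self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))

    def forward(self, video):                          # video: batch of lip-region frames
        feats = self.encoder(video)                    # assumed shape (B, T, dim)
        prompts = self.prompts.unsqueeze(0).expand(feats.size(0), -1, -1)
        return torch.cat([prompts, feats], dim=1)      # prompts prepended to visual features

def add_language_lora(llm: nn.Module) -> nn.Module:
    """Language-level adaptation: inject low-rank (LoRA) adapters into the
    LLM's attention projections so only a small set of weights is tuned
    per target speaker."""
    config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],           # typical Llama projection names (assumption)
    )
    return get_peft_model(llm, config)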

Environment Setup

conda create -n Personalized-Lip-Reading python=3.9 -y
conda activate Personalized-Lip-Reading
git clone https://github.com/JeongHun0716/Personalized-Lip-Reading
cd Personalized-Lip-Reading
# PyTorch and related packages
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
# If your pip version is newer than 24.1, downgrade it first: python3 -m pip install --upgrade pip==24.0
pip install numpy==1.23.5 scipy opencv-python
pip install editdistance python_speech_features einops soundfile sentencepiece tqdm tensorboard unidecode librosa
pip install omegaconf==2.0.6 hydra-core==1.0.7
pip install transformers peft bitsandbytes
cd fairseq
pip install --editable ./
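
As an optional sanity check (not part of the repository), the following Python snippet confirms that the pinned PyTorch build and the editable fairseq install are importable:

# Optional environment check, not a repository script.
import fairseq  # noqa: F401  # installed above with "pip install --editable ./"
import torch

print("torch:", torch.__version__)           # expected: 2.3.1+cu118
print("CUDA available:", torch.cuda.is_available())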

Dataset Preparation

To validate speaker-adaptive lip reading methods in real-world scenarios, we introduce a new dataset named VoxLRS-SA. To download the VoxLRS-SA dataset, please refer to this repository.

Training and Inference

  1. Baseline lip-reading model

To train the baseline lip-reading model, please download the checkpoints from the links below and place them in the listed target directories.

Conformer Encoder Model | Training Datasets | Target Directory
vsr_trlrs3_base.pth | LRS3 | src/pretrained_models/conformer_encoder/pretrained_lrs3

Large Language Model | Target Directory
Meta-Llama-3-8B | src/pretrained_models/llm

# Train
bash scripts/train/baseline/train.sh
# Evaluation
bash scripts/eval/baseline/eval.sh
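
Before launching the scripts above, a small hypothetical helper like the one below can confirm that the downloaded checkpoints sit in the expected locations; the paths come from the tables above, and the Meta-Llama-3-8B subdirectory name is an assumption about where the model is placed:

# Hypothetical pre-flight check, not a repository script.
from pathlib import Path

EXPECTED = [
    "src/pretrained_models/conformer_encoder/pretrained_lrs3/vsr_trlrs3_base.pth",
    "src/pretrained_models/llm/Meta-Llama-3-8B",  # assumed Hugging Face model directory
]

for path in EXPECTED:
    status = "ok" if Path(path).exists() else "MISSING"
    print(f"{status:7s} {path}")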

The pre-trained baseline lip-reading model is available at the link below.

Baseline Model | Training Datasets | WER (%) | Target Directory
best_ckpt.pt | VoxLRS-SA | 47.3 | src/pretrained_models/conformer_encoder/pretrained_w_llm
  2. Vision-Level Adaptation to the Target Speaker
# Train
bash scripts/train/adaptation/vision_level/train.sh
# Evaluation
bash scripts/eval/adaptation/vision_level/eval.sh

The pre-trained vision-adapted model is available at the link below.

Vision Adapted Model | Training Datasets | WER (%) | Target Directory
best_ckpts.zip | VoxLRS-SA | 41.5 | src/pretrained_models/adapted_model/vision
  3. Vision & Language-Level Adaptation to the Target Speaker
# Train
bash scripts/train/adaptation/vision_language_level/train.sh
# Evaluation
bash scripts/eval/adaptation/vision_language_level/eval.sh

The pre-trained vision- and language-adapted model is available at the link below.

Vision & Language Adapted Model | Training Datasets | WER (%) | Target Directory
best_ckpts.zip | VoxLRS-SA | 40.9 | src/pretrained_models/adapted_model/vision_language

To evaluate performance on the VoxLRS-SA dataset, unzip each adapted model archive into its target directory listed above.
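
For example, a hedged helper like the one below extracts both archives into their target directories with Python's standard zipfile module; it assumes each best_ckpts.zip has already been downloaded into its target directory:

# Hypothetical helper, not a repository script: unzip the adapted checkpoints
# into the target directories listed in the tables above.
import zipfile
from pathlib import Path

TARGETS = [
    Path("src/pretrained_models/adapted_model/vision"),
    Path("src/pretrained_models/adapted_model/vision_language"),
]

for target in TARGETS:
    archive = target / "best_ckpts.zip"      # assumes the zip was downloaded here
    if archive.exists():
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        print(f"extracted {archive} into {target}")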

Citation

If you find this work useful in your research, please cite the paper:

@article{yeo2024personalized,
  title={Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language},
  author={Yeo, Jeong Hun and Kim, Chae Won and Kim, Hyunjun and Rha, Hyeongseop and Han, Seunghee and Cheng, Wen-Huang and Ro, Yong Man},
  journal={arXiv preprint arXiv:2409.00986},
  year={2024}
}

Acknowledgement

This project is based on the avhubert, espnet, and fairseq code. We would like to thank the developers of these projects for their contributions and the open-source community for making this work possible.
