Efficient Training for Multilingual Visual Speech Recognition

This repository contains the PyTorch implementation of the following paper:

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
(ACM MM 2024)
*Minsu Kim, *Jeonghun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro
[Paper]

Environment Setup

conda create -n e-mvsr python=3.9 -y
conda activate e-mvsr
git clone https://github.com/JeongHun0716/e-mvsr
cd e-mvsr
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
(If your pip version is greater than 24.1, first run "python3 -m pip install --upgrade pip==24.0".)
pip install numpy==1.23.5 editdistance omegaconf==2.0.6 hydra-core==1.0.7
pip install python_speech_features scipy opencv-python einops soundfile
pip install sentencepiece tqdm tensorboard
cd fairseq
pip install --editable ./
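
To confirm the environment is set up correctly, a minimal sanity check along these lines can help (this only assumes the packages installed above; whether CUDA is available depends on your GPU and driver):

import torch, torchvision, fairseq

# Versions should match the pinned installs above (torch 1.13.1, torchvision 0.14.1).
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
# The cu117 wheels expect a CUDA 11.7-capable setup; this may print False on CPU-only machines.
print("cuda available:", torch.cuda.is_available())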

Dataset preparation

For inference, the Multilingual TEDx (mTEDx) and LRS3 datasets are needed.

  1. Download the mTEDx dataset from the official mTEDx website.
  2. Download the LRS3 dataset from the official LRS3 website.

If you also want to train the model, you additionally need the VoxCeleb2 and AVSpeech datasets.

  1. Download the VoxCeleb2 dataset from the official VoxCeleb2 website.
  2. Download the AVSpeech dataset from the official AVSpeech website.

Preprocessing

After downloading the datasets, you should detect the facial landmarks in all videos and crop the mouth region using these landmarks. We recommend preprocessing the videos following Visual Speech Recognition for Multiple Languages; a rough sketch of the cropping step is shown below.
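
For illustration only, the snippet below sketches mouth-region cropping under the assumption that 68-point facial landmarks (dlib-style layout, mouth points at indices 48-67) have already been detected for each frame. The actual pipeline in Visual Speech Recognition for Multiple Languages performs additional steps such as landmark smoothing and affine alignment, which are omitted here.

import cv2
import numpy as np

def crop_mouth(frame, landmarks, size=96):
    # frame: BGR image for one video frame.
    # landmarks: (68, 2) array of (x, y) points; indices 48-67 are the mouth (assumed dlib-style layout).
    mouth = landmarks[48:68]
    cx, cy = mouth.mean(axis=0).astype(int)
    half = size // 2
    h, w = frame.shape[:2]
    # Clamp the crop box to the frame boundaries before slicing.
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, w), min(cy + half, h)
    crop = frame[y0:y1, x0:x1]
    # Resize so every crop has the same resolution and convert to grayscale, as is common for VSR inputs.
    crop = cv2.resize(crop, (size, size))
    return cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)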

Inference

Download the checkpoints from the links below and move them to the src/pretrained_models directory. You can then evaluate the performance of the finetuned model using the scripts in the scripts directory. A quick way to verify that a downloaded checkpoint loads correctly is shown after the table below.

Pretrained Models

Model              | Training Datasets                          | Training data (h) | Used Languages
mavhubert.pt       | mTEDx + LRS3 + VoxCeleb2 + AVSpeech + LRS2 | 5,512             | En, Es, It, Fr, Pt, De, Ru, Ar, El
unit_pretrained.pt | mTEDx + LRS3 + VoxCeleb2 + AVSpeech        | 4,545             | En, Es, It, Fr, Pt
finetuned.pt       | mTEDx + LRS3 + VoxCeleb2 + AVSpeech        | 4,545             | En, Es, It, Fr, Pt
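
Once a checkpoint is in src/pretrained_models, a quick load can confirm the download is intact before running the evaluation scripts. This is a minimal sketch; the key names mentioned in the comment are typical of fairseq checkpoints, not guaranteed for these files.

import torch

# Load on CPU just to inspect the file; actual inference should go through the evaluation scripts.
ckpt = torch.load("src/pretrained_models/finetuned.pt", map_location="cpu")
# fairseq checkpoints are usually dicts with entries such as "model" and "cfg";
# the exact keys depend on how the checkpoint was saved.
print(type(ckpt), list(ckpt.keys()) if isinstance(ckpt, dict) else None)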

Citation

If you find this work useful in your research, please cite the paper:

@inproceedings{kim2024efficient,
  title={Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation},
  author={Kim, Minsu and Yeo, Jeonghun and Park, Se Jin and Rha, Hyeongseop and Ro, Yong Man},
  booktitle={ACM Multimedia 2024},
  year={2024}
}

Acknowledgement

This project is based on the avhubert and fairseq code. We would like to thank the developers of these projects for their contributions and the open-source community for making this work possible.
