This repository contains the PyTorch implementation of the following paper:
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
(ACM MM 2024)
*Minsu Kim, *Jeonghun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro
[Paper]
conda create -n e-mvsr python=3.9 -y
conda activate e-mvsr
git clone https://github.com/JeongHun0716/e-mvsr
cd e-mvsr
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
(If your pip version is newer than 24.1, first run "python3 -m pip install --upgrade pip==24.0")
pip install numpy==1.23.5 editdistance omegaconf==2.0.6 hydra-core==1.0.7
pip install python_speech_features scipy opencv-python einops soundfile
pip install sentencepiece tqdm tensorboard
cd fairseq
pip install --editable ./
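After installation, a quick way to confirm the environment is usable is to import the core packages. This is only a minimal sanity check; the expected version strings follow the pinned versions above.

```python
# Minimal environment sanity check (assumes the installation steps above succeeded).
import torch
import fairseq  # installed in editable mode from the local ./fairseq directory

print(torch.__version__)           # expected: 1.13.1+cu117
print(torch.cuda.is_available())   # True if the CUDA 11.7 build can see a GPU
print(fairseq.__version__)
```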
For inference, the Multilingual TEDx (mTEDx) and LRS3 datasets are needed.
- Download the mTEDx dataset from the official mTEDx website.
- Download the LRS3 dataset from the official LRS3 website.
If you want to train the model, you additionally need to prepare the VoxCeleb2 and AVSpeech datasets.
- Download the VoxCeleb2 dataset from the official VoxCeleb2 website.
- Download the AVSpeech dataset from the official AVSpeech website.
After downloading the datasets, detect the facial landmarks for all videos and crop the mouth region using these landmarks. We recommend preprocessing the videos following Visual Speech Recognition for Multiple Languages; a minimal sketch of the cropping step is shown below.
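The sketch below illustrates the mouth-cropping idea only, assuming standard 68-point facial landmarks have already been detected per frame. The crop size, landmark indexing, and lack of temporal smoothing are assumptions for illustration; follow the referenced preprocessing pipeline for the exact settings.

```python
# Illustrative mouth-region crop from pre-computed 68-point facial landmarks.
# Crop size and indexing are assumptions; see the referenced pipeline for exact values.
import cv2
import numpy as np

def crop_mouth(frame: np.ndarray, landmarks: np.ndarray, size: int = 96) -> np.ndarray:
    """Crop a size x size patch centred on the mouth (landmark points 48-67)."""
    mouth = landmarks[48:68]                       # outer + inner lip points
    cx, cy = mouth.mean(axis=0).astype(int)        # mouth centre (x, y)
    half = size // 2
    h, w = frame.shape[:2]
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, w), min(cy + half, h)
    patch = frame[y0:y1, x0:x1]                    # note: frame is indexed [row, col]
    return cv2.resize(patch, (size, size))         # resize to a fixed-size ROI
```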
Download the checkpoints from the links below and move them to the src/pretrained_models directory.
You can evaluate the performance of the fine-tuned model using the scripts available in the scripts directory.
| Model | Training Datasets | Training Data (h) | Languages |
|---|---|---|---|
| mavhubert.pt | mTEDx + LRS3 + VoxCeleb2 + AVSpeech + LRS2 | 5,512 | En, Es, It, Fr, Pt, De, Ru, Ar, El |
| unit_pretrained.pt | mTEDx + LRS3 + VoxCeleb2 + AVSpeech | 4,545 | En, Es, It, Fr, Pt |
| finetuned.pt | mTEDx + LRS3 + VoxCeleb2 + AVSpeech | 4,545 | En, Es, It, Fr, Pt |
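As a quick check that a downloaded checkpoint is intact, it can be loaded on CPU and inspected. The path below matches the table above; the key names shown are typical of fairseq checkpoints and are not verified against these specific files.

```python
# Inspect a downloaded checkpoint; key names are typical for fairseq checkpoints
# and may differ slightly for these files.
import torch

ckpt = torch.load("src/pretrained_models/finetuned.pt", map_location="cpu")
print(list(ckpt.keys()))           # e.g. ['cfg', 'model', ...]
print(len(ckpt.get("model", {})))  # number of parameter tensors in the state dict
```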
If you find this work useful in your research, please cite the paper:
@inproceedings{kim2024efficient,
  title={Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation},
  author={Kim, Minsu and Yeo, Jeonghun and Park, Se Jin and Rha, Hyeongseop and Ro, Yong Man},
  booktitle={ACM Multimedia 2024},
  year={2024}
}
This project is based on the avhubert and fairseq codebases. We would like to thank the developers of these projects and the open-source community for making this work possible.