This repository contains the PyTorch implementation of the following paper:
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
(ACM MM 2024)
*Minsu Kim, *Jeonghun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro
[Paper]
conda create -n e-mvsr python=3.9 -y
conda activate e-mvsr
git clone https://github.com/JeongHun0716/e-mvsr
cd e-mvsr
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
(If your pip version is newer than 24.1, first run "python3 -m pip install --upgrade pip==24.0")
pip install numpy==1.23.5 editdistance omegaconf==2.0.6 hydra-core==1.0.7
pip install python_speech_features scipy opencv-python einops soundfile
pip install sentencepiece tqdm tensorboard
cd fairseq
pip install --editable ./
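After installation, a quick way to confirm the environment is usable is to import the core packages. This is only a minimal sanity check; the expected version strings follow the pinned versions above.

```python
# Minimal environment sanity check (assumes the installation steps above succeeded).
import torch
import fairseq  # installed in editable mode from the local ./fairseq directory

print(torch.__version__)           # expected: 1.13.1+cu117
print(torch.cuda.is_available())   # True if the CUDA 11.7 build can see a GPU
print(fairseq.__version__)
```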
For inference, the Multilingual TEDx (mTEDx) and LRS3 datasets are needed.
- Download the mTEDx dataset from the official mTEDx website.
- Download the LRS3 dataset from the official LRS3 website.
If you want to train the model, you additionally need to prepare the VoxCeleb2 and AVSpeech datasets.
- Download the VoxCeleb2 dataset from the official VoxCeleb2 website.
- Download the AVSpeech dataset from the official AVSpeech website.
After downloading the datasets, detect the facial landmarks for all videos and crop the mouth region using these landmarks. We recommend preprocessing the videos following Visual Speech Recognition for Multiple Languages; a minimal sketch of the cropping step is shown below.
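The sketch below illustrates the mouth-cropping idea only, assuming standard 68-point facial landmarks have already been detected per frame. The crop size, landmark indexing, and lack of temporal smoothing are assumptions for illustration; follow the referenced preprocessing pipeline for the exact settings.

```python
# Illustrative mouth-region crop from pre-computed 68-point facial landmarks.
# Crop size and indexing are assumptions; see the referenced pipeline for exact values.
import cv2
import numpy as np

def crop_mouth(frame: np.ndarray, landmarks: np.ndarray, size: int = 96) -> np.ndarray:
    """Crop a size x size patch centred on the mouth (landmark points 48-67)."""
    mouth = landmarks[48:68]                       # outer + inner lip points
    cx, cy = mouth.mean(axis=0).astype(int)        # mouth centre (x, y)
    half = size // 2
    h, w = frame.shape[:2]
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, w), min(cy + half, h)
    patch = frame[y0:y1, x0:x1]                    # note: frame is indexed [row, col]
    return cv2.resize(patch, (size, size))         # resize to a fixed-size ROI
```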
Download the checkpoints from the links below and move them to the src/pretrained_models directory.
You can evaluate the performance of the fine-tuned model using the scripts available in the scripts directory.
| Model | Training Datasets | Training Data (h) | Languages |
|---|---|---|---|
| mavhubert.pt | mTEDx + LRS3 + VoxCeleb2 + AVSpeech + LRS2 | 5,512 | En, Es, It, Fr, Pt, De, Ru, Ar, El |
| unit_pretrained.pt | mTEDx + LRS3 + VoxCeleb2 + AVSpeech | 4,545 | En, Es, It, Fr, Pt |
| finetuned.pt | mTEDx + LRS3 + VoxCeleb2 + AVSpeech | 4,545 | En, Es, It, Fr, Pt |
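As a quick check that a downloaded checkpoint is intact, it can be loaded on CPU and inspected. The path below matches the table above; the key names shown are typical of fairseq checkpoints and are not verified against these specific files.

```python
# Inspect a downloaded checkpoint; key names are typical for fairseq checkpoints
# and may differ slightly for these files.
import torch

ckpt = torch.load("src/pretrained_models/finetuned.pt", map_location="cpu")
print(list(ckpt.keys()))           # e.g. ['cfg', 'model', ...]
print(len(ckpt.get("model", {})))  # number of parameter tensors in the state dict
```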
If you find this work useful in your research, please cite the paper:
@inproceedings{kim2024efficient,
  title={Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation},
  author={Kim, Minsu and Yeo, Jeonghun and Park, Se Jin and Rha, Hyeongseop and Ro, Yong Man},
  booktitle={ACM Multimedia 2024},
  year={2024}
}
This project is based on the avhubert and fairseq codebases. We would like to thank the developers of these projects and the open-source community for making this work possible.