To validate speaker-adaptive lip reading methods in real-world scenarios, we introduce a new dataset named VoxLRS-SA, derived from the VoxCeleb2 and LRS3 datasets. This repository contains the speaker ID information for the VoxCeleb2 and LRS3 audio-visual datasets, which is an outcome of the following paper:
Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
Jeong Hun Yeo, Chae Won Kim, Hyunjun Kim, Hyeongseop Rha, Seunghee Han, Wen-Huang Cheng, Yong Man Ro
[Paper]
For training and inference of the model, the VoxCeleb2 and LRS3 datasets are needed.
- Download the VoxCeleb2 dataset from the VoxCeleb2 link of the official website.
- Download the LRS3 dataset from the LRS3 link of the official website.
After downloading the datasets, detect the facial landmarks in all videos and crop the mouth region using these landmarks. We recommend preprocessing the videos following Visual Speech Recognition for Multiple Languages.
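If you implement the cropping step yourself, the sketch below shows the general idea, assuming 68-point facial landmarks (indices 48–67 cover the mouth) have already been detected and stored as a per-frame array; the crop size and output format are illustrative, not the exact settings of Visual Speech Recognition for Multiple Languages.

```python
# Minimal sketch of mouth-region cropping from pre-detected landmarks.
# Assumes 68-point facial landmarks (indices 48-67 cover the mouth) stored as
# a (num_frames, 68, 2) array; crop size and codec are illustrative choices.
import cv2
import numpy as np

def crop_mouth(video_path, landmarks, out_path, crop_size=96):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = cv2.VideoWriter(
        out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (crop_size, crop_size)
    )
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok or frame_idx >= len(landmarks):
            break
        # Center the crop on the mean of the mouth landmarks for this frame.
        cx, cy = landmarks[frame_idx][48:68].mean(axis=0).astype(int)
        half = crop_size // 2
        x0, y0 = max(cx - half, 0), max(cy - half, 0)
        patch = frame[y0:y0 + crop_size, x0:x0 + crop_size]
        patch = cv2.resize(patch, (crop_size, crop_size))
        writer.write(patch)
        frame_idx += 1
    cap.release()
    writer.release()
```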
Path to Datasets (e.g., /home/Dataset/...)
├── lrs3
│ ├── lrs3_video_seg24s # Preprocessed video data
├── vox2 # VoxCeleb2
│ └── en
│ └── video # Preprocessed video data
The Speaker ID information is included in the *.tsv files.
# Example in line 2 of baseline/test.tsv
voxlrs-00001 /Path_to_Datasets/vox2/en/video/dev/mp4/id05998/nfWYhJyGsPU/00370_00.mp4 /Path_to_Datasets/vox2/en/video/dev/mp4/id05998/nfWYhJyGsPU/00370_00.mp4 241 241
(Speaker ID) (Path to Video) (Path to Video) (Number of frames in video) (Number of frames in video)
If you want to design a personalized audio-visual speech recognition system, you can replace the 3rd item (Path to Video) and the 5th item (Number of frames in video) with the corresponding audio information.
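The manifests can be read with a few lines of Python. The sketch below assumes tab-separated columns in the order shown above and skips the first line of the file (the example refers to "line 2", which suggests line 1 holds a root path or header); adjust if your copy differs.

```python
# Minimal sketch of parsing a VoxLRS-SA manifest, assuming tab-separated
# columns in the order shown above: speaker ID, video path, video (or audio)
# path, number of video frames, number of video (or audio) frames.
import csv

def read_manifest(tsv_path):
    samples = []
    with open(tsv_path, "r") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # skip the first line (assumed root/header)
        for row in reader:
            spk_id, video_path, audio_path, n_video, n_audio = row[:5]
            samples.append({
                "speaker": spk_id,
                "video": video_path,
                "audio": audio_path,  # same as the video path in the video-only setup
                "num_video_frames": int(n_video),
                "num_audio_frames": int(n_audio),
            })
    return samples

# Example:
# samples = read_manifest("baseline/test.tsv")
```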
All manifest files are provided in this repo. You need to replace the video paths in the manifest files with your preprocessed video paths using the following command:
python path_modification.py --dataset_pth /path/to/datasets
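If you prefer to script the substitution yourself, it amounts to replacing the placeholder prefix in each manifest with your dataset root. The sketch below is a minimal illustration, not the actual path_modification.py; the placeholder string /Path_to_Datasets follows the example above, and the script in this repo may handle more cases.

```python
# Minimal sketch of the kind of substitution path_modification.py performs:
# replace the placeholder dataset prefix in every manifest with your own root.
import glob

PLACEHOLDER = "/Path_to_Datasets"

def rewrite_manifests(dataset_root, manifest_glob="**/*.tsv"):
    for tsv_path in glob.glob(manifest_glob, recursive=True):
        with open(tsv_path, "r") as f:
            content = f.read()
        with open(tsv_path, "w") as f:
            f.write(content.replace(PLACEHOLDER, dataset_root.rstrip("/")))

# rewrite_manifests("/home/Dataset")
```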
To build an unseen-speaker scenario (train and test speakers do not overlap), 20 speakers are selected for testing and validation (adaptation).
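As a sanity check on the unseen-speaker split, you can confirm that no training speaker appears in the test or validation manifests. The sketch below reuses read_manifest from above; the manifest file names are illustrative.

```python
# Sanity check: training and test/validation speakers must not overlap.
# Manifest file names are illustrative; read_manifest is defined above.
def speaker_set(tsv_path):
    return {s["speaker"] for s in read_manifest(tsv_path)}

train_speakers = speaker_set("baseline/train.tsv")
test_speakers = speaker_set("baseline/test.tsv")
assert train_speakers.isdisjoint(test_speakers), "speaker overlap detected"
```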
For more information, please refer to our paper.
To make the speaker labels more accurate, we welcome contributions that improve the label information.
If you find this work useful in your research, please cite the paper:
@article{yeo2024personalized,
title={Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language},
author={Yeo, Jeong Hun and Kim, Chae Won and Kim, Hyunjun and Rha, Hyeongseop and Han, Seunghee and Cheng, Wen-Huang and Ro, Yong Man},
journal={arXiv preprint arXiv:2409.00986},
year={2024}
}