Pingchuan Ma, Brais Martinez, Stavros Petridis, Maja Pantic.
2021-06-09: We have released our official training code, see here.

2020-12-08: We have released the audio-only model, which achieves a test accuracy of 98.5% on LRW.
- Introduction
- Preprocessing
- How to install the environment
- How to prepare the dataset
- How to train
- How to test
- How to extract embeddings
## Introduction

This is the repository for Towards Practical Lipreading with Distilled and Efficient Models and Lipreading using Temporal Convolutional Networks. In this repository, we provide training code, pre-trained models, and network configurations for end-to-end visual speech recognition (lipreading). We trained our models on LRW. The network architecture consists of a 3D convolutional front-end, ResNet-18, and a multi-scale temporal convolutional network (MS-TCN).

Using this repository, you can achieve an accuracy of 87.9% on the LRW dataset. This repository also provides a script for feature extraction.
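For orientation, the sketch below shows how such a pipeline fits together in PyTorch. It is illustrative only, not the repository's implementation: the layer hyper-parameters are assumptions, and a plain 1D convolution stands in for the multi-scale TCN back-end.

```python
import torch
import torch.nn as nn
import torchvision

# Simplified sketch of the pipeline described above (not the repository's
# exact implementation): a 3D-conv stem over the grayscale frame sequence,
# a ResNet-18 trunk applied per frame, and a temporal back-end classifying
# the 500 LRW words. A plain Conv1d stands in for the MS-TCN.
class LipreadingSketch(nn.Module):
    def __init__(self, num_classes=500):
        super().__init__()
        # (B, 1, T, 88, 88) -> (B, 64, T, 22, 22)
        self.frontend3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # ResNet-18 without its 2D stem and classifier, applied frame by frame.
        resnet = torchvision.models.resnet18(weights=None)
        self.trunk = nn.Sequential(*list(resnet.children())[4:-1])  # layer1..avgpool
        self.temporal = nn.Conv1d(512, 512, kernel_size=3, padding=1)  # MS-TCN stand-in
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):                 # x: (B, 1, T, 88, 88)
        x = self.frontend3d(x)            # (B, 64, T, 22, 22)
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x).reshape(b, t, 512)     # per-frame 512-D features
        x = self.temporal(x.transpose(1, 2))     # temporal modelling over frames
        return self.classifier(x.mean(dim=2))    # average over time -> word logits

if __name__ == "__main__":
    print(LipreadingSketch()(torch.randn(2, 1, 29, 88, 88)).shape)  # (2, 500)
```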
## Preprocessing

As described in our paper, each video sequence from the LRW dataset is processed by 1) performing face detection and face alignment, 2) aligning each frame to a reference mean face shape, 3) cropping a fixed 96 × 96 pixel ROI from the aligned face image so that the mouth region is always roughly centered in the crop, and 4) converting the cropped image to grayscale.
You can run the pre-processing script provided in the preprocessing folder to extract the mouth ROIs.
Pipeline stages (example frames): 0. Original | 1. Detection | 2. Transformation | 3. Mouth ROIs
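As an illustration of steps 3) and 4), the minimal sketch below assumes the frame has already been detected and aligned and that the mouth centre is known from the landmarks; border handling and the alignment itself are left to the provided script.

```python
import cv2
import numpy as np

# Illustrative sketch of steps 3) and 4) above. It assumes the frame is
# already aligned to the mean face shape and that (cx, cy) is the mouth
# centre taken from the landmarks; clamping the crop to the image border
# is omitted for brevity. The provided crop_mouth_from_video.py handles
# the full detection/alignment pipeline.
def crop_mouth_roi(aligned_frame: np.ndarray, cx: int, cy: int, size: int = 96) -> np.ndarray:
    gray = cv2.cvtColor(aligned_frame, cv2.COLOR_BGR2GRAY)   # step 4: grayscale
    half = size // 2
    return gray[cy - half:cy + half, cx - half:cx + half]    # step 3: 96 x 96 crop
```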
## How to install the environment

- Clone the repository into a directory; we refer to that directory as `TCN_LIPREADING_ROOT`.
```
git clone --recursive https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks.git
```
- Install all required packages.
```
pip install -r requirements.txt
```
## How to prepare the dataset

- Collect your dataset (videos) and arrange it in the structure shown below.
```
dataset_folder
|
|-> participant 1
|       |
|       |-> videos
|
|-> participant 2
|       |
|       |-> videos
.
.
.
```
- Extract landmarks and save them in .npz format by executing `./preprocessing/landmark_extracting.py`. Note that you need to change the paths inside the script.
- Sort the dataset structure by classes by executing `./preprocessing/sort_by_classes.py`.
- Split the data into train, validation, and test sets using the script `./preprocessing/splitting_data.py`.
- Create a csv file that stores the data paths using the script `./preprocessing/csv_maker.py` (a sketch of this step follows the list).
- Pre-process mouth ROIs using the script `crop_mouth_from_video.py` in the preprocessing folder and save them to `$TCN_LIPREADING_ROOT/datasets/visual_data/`.
- Pre-process audio waveforms using the script `extract_audio_from_video.py` in the preprocessing folder and save them to `$TCN_LIPREADING_ROOT/datasets/audio_data/`.
- Download a pre-trained model from Model Zoo and put the model into the `$TCN_LIPREADING_ROOT/models/` folder.
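What `csv_maker.py` produces depends on the paths set inside it; the sketch below is only a hedged approximation of this step, assuming the samples have already been sorted by class and split, and that one `path,label` row per sample is wanted (both assumptions, not the script's documented behaviour).

```python
import csv
from pathlib import Path

# Hypothetical sketch of the csv_maker.py step (not the repository's actual
# script): walk one data split and write "path,label" rows, taking the label
# from the class folder name. The .npz extension and the CSV layout expected
# by the training code are assumptions.
def write_split_csv(split_dir: str, out_csv: str) -> None:
    rows = [(str(sample), sample.parent.name)
            for sample in sorted(Path(split_dir).rglob("*.npz"))]
    with open(out_csv, "w", newline="") as f:
        csv.writer(f).writerows(rows)

if __name__ == "__main__":
    # Hypothetical paths; adjust them to your own split folders.
    write_split_csv("datasets/visual_data/train", "train.csv")
```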
## How to train

- Train a visual-only model.
```
CUDA_VISIBLE_DEVICES=0 python main.py --config-path <MODEL-JSON-PATH> \
    --annonation-direc <ANNONATION-DIRECTORY> \
    --data-dir <MOUTH-ROIS-DIRECTORY>
```
- Train an audio-only model.
```
CUDA_VISIBLE_DEVICES=0 python main.py --modality raw_audio \
    --config-path <MODEL-JSON-PATH> \
    --annonation-direc <ANNONATION-DIRECTORY> \
    --data-dir <AUDIO-WAVEFORMS-DIRECTORY>
```
We refer to the original LRW directory that contains the timestamp files (.txt) as `<ANNONATION-DIRECTORY>`.
- Resume from last checkpoint. You can pass the checkpoint path (.pth.tar) `<CHECKPOINT-PATH>` to the argument `--model-path`, and set `--init-epoch` to 1 to resume training.
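If you want to sanity-check a checkpoint before resuming, a minimal sketch such as the following can be used; what the .pth.tar file actually contains is determined by main.py, so the printed keys are not guaranteed in advance.

```python
import torch

# Minimal sketch (not part of the repository): peek inside a checkpoint
# before resuming. The stored keys depend on what main.py saves, so this
# only lists whatever is actually in the file.
ckpt = torch.load("<CHECKPOINT-PATH>", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```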
## How to test

- Evaluate the visual-only performance (lipreading).
```
CUDA_VISIBLE_DEVICES=0 python main.py --config-path <MODEL-JSON-PATH> \
    --model-path <MODEL-PATH> \
    --data-dir <MOUTH-ROIS-DIRECTORY> \
    --test
```
- Evaluate the audio-only performance.
```
CUDA_VISIBLE_DEVICES=0 python main.py --modality raw_audio \
    --config-path <MODEL-JSON-PATH> \
    --model-path <MODEL-PATH> \
    --data-dir <AUDIO-WAVEFORMS-DIRECTORY> \
    --test
```
## How to extract embeddings

We assume you have cropped the mouth patches and placed them in `<MOUTH-PATCH-PATH>`. The mouth embeddings will be saved in .npz format.
- To extract 512-D feature embeddings from the top of ResNet-18:
```
CUDA_VISIBLE_DEVICES=0 python main.py --extract-feats \
    --config-path <MODEL-JSON-PATH> \
    --model-path <MODEL-PATH> \
    --mouth-patch-path <MOUTH-PATCH-PATH> \
    --mouth-embedding-out-path <OUTPUT-PATH>
```
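The resulting .npz file can be inspected with NumPy. The name under which the embeddings are stored is not fixed here, so the sketch below simply lists the stored arrays first; the (frames, 512) shape in the comment is an assumption.

```python
import numpy as np

# Minimal sketch: inspect the embeddings written to <OUTPUT-PATH>.
# List the stored array names first, then load one of them.
data = np.load("<OUTPUT-PATH>")
print(data.files)               # names of the arrays in the .npz file
emb = data[data.files[0]]       # e.g. a (num_frames, 512) feature matrix
print(emb.shape)
```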
## Model Zoo

We plan to include more models in the future. FLOPs are computed with a 29-frame input sequence of 88 × 88 pixels.
| Architecture | Acc. | FLOPs (G) | url | size (MB) |
|---|---|---|---|---|
| **Audio-only** | | | | |
| resnet18_mstcn(adamw) | 98.9 | 3.72 | GoogleDrive or BaiduDrive (key: xt66) | 111 |
| resnet18_mstcn | 98.5 | 3.72 | GoogleDrive or BaiduDrive (key: 3n25) | 111 |
| **Visual-only** | | | | |
| resnet18_mstcn(adamw_s3) | 87.9 | 10.31 | GoogleDrive or BaiduDrive (key: j5tw) | 139 |
| resnet18_mstcn | 85.5 | 10.31 | GoogleDrive or BaiduDrive (key: um1q) | 139 |
| snv1x_tcn2x | 84.6 | 1.31 | GoogleDrive or BaiduDrive (key: f79d) | 35 |
| snv1x_dsmstcn3x | 85.3 | 1.26 | GoogleDrive or BaiduDrive (key: 86s4) | 36 |
| snv1x_tcn1x | 82.7 | 1.12 | GoogleDrive or BaiduDrive (key: 3caa) | 15 |
| snv05x_tcn2x | 82.5 | 1.02 | GoogleDrive or BaiduDrive (key: ej9e) | 32 |
| snv05x_tcn1x | 79.9 | 0.58 | GoogleDrive or BaiduDrive (key: devg) | 11 |
We trained this model on our own collected dataset.
- Fixed mp4 data loading with torch.
- remote annotation files for own use.
- Added our new weights for our own dataset.
- Added our dataset labels.
- Added scripts to produce landmarks.