This is the PyTorch implementation of *Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing*. The code is built on top of AV-HuBERT.
- Add Colab demo
We propose a novel framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), that maximizes context modeling capability by leveraging the power of LLMs. Specifically, VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation, where the type of task is controlled by the given instruction. The input video is mapped into the input latent space of an LLM via a self-supervised visual speech model. Motivated by the observation that adjacent input frames carry redundant information, we propose a novel deduplication method that shortens the embedded visual features by using visual speech units. Through the proposed deduplication and Low-Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner.
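The deduplication idea can be pictured as run-length collapsing of consecutive frames that map to the same visual speech unit, keeping the run lengths (the role played by the `.cluster_counts` files below). The sketch is illustrative only: plain Python, with mean pooling as an assumed reduction; the repository's actual implementation may differ.

```python
def deduplicate(features, units):
    """Collapse runs of consecutive frames that share a visual speech unit.

    features: list of frame-level feature vectors (length T)
    units:    list of per-frame cluster (speech-unit) ids (length T)
    Returns mean-pooled features per run plus the run lengths.
    """
    runs, start = [], 0
    for t in range(1, len(units) + 1):
        if t == len(units) or units[t] != units[start]:
            runs.append((start, t - start))  # (run start, run length)
            start = t
    deduped = [
        [sum(col) / n for col in zip(*features[s:s + n])] for s, n in runs
    ]
    counts = [n for _, n in runs]
    return deduped, counts

# six frames, three runs of identical units (7 7 | 3 3 3 | 9)
features = [[float(t), float(t)] for t in range(6)]
units = [7, 7, 3, 3, 3, 9]
deduped, counts = deduplicate(features, units)
print(len(deduped), counts)  # 3 [2, 3, 1]
```

The six input frames shrink to three embeddings, one per run, which is the length reduction that makes LLM training cheaper.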
You can find the checkpoint of our model here.
Try our VSP-LLM demo on Colab.
conda create -n vsp-llm python=3.9 -y
conda activate vsp-llm
git clone https://github.com/Sally-SH/VSP-LLM.git
cd VSP-LLM
pip install -r requirements.txt
- Download the AV-HuBERT pre-trained model AV-HuBERT Large (LRS3 + VoxCeleb2) from here.
- Download LLaMA2-7B from here.
Follow the Auto-AVSR preparation to preprocess the LRS3 dataset.
Then, follow the AV-HuBERT preparation from step 3 to create the manifest of the LRS3 dataset.
Follow the steps in clustering (pre-train only) to create the {train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 25 Hz for AV-HuBERT features by default.
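Because the pseudo labels are frame-aligned, the number of labels on each line of a .km file should match the clip duration times the label rate. A quick sanity check (a sketch, not part of the repository; the function name is hypothetical):

```python
LABEL_RATE = 25  # Hz; default frame rate of AV-HuBERT features

def expected_num_labels(duration_sec, label_rate=LABEL_RATE):
    """Frame-aligned pseudo labels expected for a clip of this duration."""
    # round rather than truncate to avoid floating-point edge cases
    return round(duration_sec * label_rate)

print(expected_num_labels(4.0))  # a 4-second clip should yield 100 labels
```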
.
├── lrs3_video_seg24s # preprocessed video and audio data
├── lrs3_text_seg24s # preprocessed text data
└── lrs3_dataset
├── train.tsv # List of audio and video paths for training
├── train.wrd # List of target labels for training
├── train.cluster_counts # Cluster counts used to deduplicate speech units in training
├── test.tsv # List of audio and video paths for testing
├── test.wrd # List of target labels for testing
└── test.cluster_counts # Cluster counts used to deduplicate speech units in testing
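Once the directory is assembled, it is worth checking that the per-split files line up before launching a long training run. The helper below is a sketch built on assumptions (file layout inferred from the tree above; fairseq-style .tsv manifests carry the dataset root on their first line), not part of the repository:

```python
import tempfile
from pathlib import Path

def check_split(data_dir, split):
    """Check that manifest, labels, and cluster counts agree for one split."""
    data_dir = Path(data_dir)
    tsv = (data_dir / f"{split}.tsv").read_text().splitlines()
    wrd = (data_dir / f"{split}.wrd").read_text().splitlines()
    cnt = (data_dir / f"{split}.cluster_counts").read_text().splitlines()
    n = len(tsv) - 1  # first .tsv line is assumed to be the data root
    assert len(wrd) == n, f"{split}.wrd: {len(wrd)} lines, expected {n}"
    assert len(cnt) == n, f"{split}.cluster_counts: {len(cnt)} lines, expected {n}"
    for line in cnt:  # counts must be whitespace-separated integers
        assert all(tok.isdigit() for tok in line.split())
    return n

# demo on a tiny fake split
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "train.tsv").write_text("/data/root\nclip1\nclip2\n")
    (root / "train.wrd").write_text("hello world\ngood morning\n")
    (root / "train.cluster_counts").write_text("2 3 1\n4 4\n")
    n_clips = check_split(root, "train")
print(n_clips)  # 2
```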
Open the training script (scripts/train.sh) and replace these variables:
# path to downloaded pre-trained avhubert
PRETRAINED_MODEL_PATH=???
# path to train dataset dir
DATA_PATH=???
# path to llama checkpoint
LLM_PATH=???
# path where output trained models will be located
OUT_PATH=???
Run the training script:
$ bash scripts/train.sh
Open the decoding script (scripts/decode.sh) and replace these variables:
# language direction (e.g., "en" or "en-fr")
LANG=???
# path to the trained model
MODEL_PATH=???
# path to test dataset dir
DATA_PATH=???
# path to llama checkpoint
LLM_PATH=???
# path where decoding results and scores will be located
OUT_PATH=???
Run the decoding script:
$ bash scripts/decode.sh