
# **ImageCraft: Direct Image-to-Speech Synthesis**





## **Overview**

ImageCraft is a deep learning project designed to generate spoken descriptions directly from images. The goal is to create a model that combines vision and text-to-speech capabilities for accessibility tools, multimedia storytelling, and human-computer interaction. It utilizes a vision transformer (SigLIP) for image encoding, Gemma for text decoding, and VoiceCraft for speech synthesis.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bzmNvc-XM9RPbkZEYFdap-nNJkrCvfzu#scrollTo=-SoOHUJHsfTD) [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/nsandiman/uarizona-msis-capstone-group5-imagecraft)

![alt text](https://github.com/Jerdah/ImageCraft/blob/main/reports/figures/imagecraft-arch.jpeg)

## **Table of Contents**

1. [Project Objectives](#project-objectives)
2. [Directory Structure](#directory-structure)
3. [Dataset](#dataset)
4. [Model Architecture](#model-architecture)
5. [Installation](#installation)
6. [Usage](#usage)
7. [Training and Evaluation](#training-and-evaluation)
8. [Deployment](#deployment)
9. [Testing](#testing)
10. [Results and Visualization](#results-and-visualization)
11. [Future Work](#future-work)
12. [References](#references)

## **Project Objectives**

The primary objectives of ImageCraft are:

- To create a multimodal pipeline that converts input images into meaningful spoken descriptions.
- To utilize transformer-based models, specifically a vision transformer (SigLIP) as an image encoder and a Gemma decoder.
- To facilitate image-to-speech for accessibility use cases.

## **Directory Structure**

The repository is organized as follows:

```css
ImageCraft/
|
└── setup.py
```

## **Dataset**
### **MSCOCO**

The MSCOCO dataset is used for training and evaluation. It contains paired image-caption data, making it suitable for the image-to-speech task.

- **Download and Preparation**: The dataset is downloaded and organized into folders for training (`data/processed/coco/train`) and evaluation (`data/processed/coco/test`).

**Download dataset for training**:

```bash
python -m src.data.download --dataset "coco" --dataset_size "10%"
```

## **Model Architecture**

ImageCraft consists of three major components:

1. **Vision Transformer (SigLIP)**: Encodes the input image into embeddings.
2. **Gemma Decoder**: Generates a textual description from the image features.
3. **VoiceCraft**: A token-infilling neural codec language model that synthesizes speech from the generated text.
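
The three stages run as a single pipeline: SigLIP encodes the image, Gemma decodes the embeddings into a caption, and VoiceCraft synthesizes the caption into speech. The sketch below is conceptual only; the function and attribute names are illustrative, not the actual ImageCraft API (the real entry point is `ImageCraft.generate`, shown in the Usage section).

```python
# Conceptual data flow only -- see the Usage section for the real ImageCraft.generate call.
from PIL import Image


def describe_and_speak(image_path, vision_encoder, text_decoder, tts_model):
    """Illustrative pipeline: image -> SigLIP embeddings -> Gemma caption -> VoiceCraft audio."""
    image = Image.open(image_path).convert("RGB")
    image_embeddings = vision_encoder.encode(image)    # SigLIP vision transformer
    caption = text_decoder.generate(image_embeddings)  # Gemma decoder
    waveform = tts_model.synthesize(caption)           # VoiceCraft codec language model
    return caption, waveform
```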

## **Installation**

To set up the environment and install the necessary dependencies, follow the steps below:

1. **Clone the Repository**:
```bash
git clone https://github.com/Jerdah/ImageCraft.git
cd ImageCraft
```

2. **Install System-Level Dependencies**:

```bash
apt-get install -y espeak-ng espeak espeak-data libespeak1 libespeak-dev festival* build-essential flac libasound2-dev libsndfile1-dev vorbis-tools libxml2-dev libxslt-dev zlib1g-dev
```

3. **Install Python Libraries**:

```bash
pip install -r requirements.txt
```

4. **Download the resources used by the caption-evaluation metrics (SPICE, PTBTokenizer, METEOR and FENSE)**:

```bash
aac-metrics-download
```
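
Once the resources are downloaded, generated captions can be scored from Python. The snippet below is a sketch that assumes the `aac-metrics` package exposes an `evaluate(candidates, mult_references)` helper as described in its documentation; check that documentation for the exact API and the metrics included by default.

```python
# Sketch only: scoring generated captions against reference captions with aac-metrics.
from aac_metrics import evaluate

candidates = ["a man is speaking to a crowd"]               # model outputs
mult_references = [["a man speaks to a group of people",    # one list of references
                    "someone is giving a speech"]]          # per candidate

corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)  # corpus-level scores (metric set depends on the package defaults)
```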

## **Usage**

### **Inference**

You can use the provided Gradio interface or run the inference script to generate speech from an image.

#### **Using Gradio (basic interface)**:

```python
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["USER"] = "imagecraft"

import gradio as gr

from src.model.modules.imagecraft import ImageCraft

model = ImageCraft.from_pretrained("nsandiman/imagecraft-ft-co-224")

default_image = "media/images/2.jpeg"


def generate(image_path):
    """Process the input image and generate an audio response."""
    transcript, audio_buffer = model.generate(image_path, output_type="buffer")
    return audio_buffer, transcript


imagecraft_app = gr.Interface(
    fn=generate,
    inputs=[
        gr.Image(
            type="filepath",
            label="Upload an image",
            sources=["upload"],
            value=default_image,
        ),
    ],
    outputs=[gr.Audio(label="Speech"), gr.Textbox(label="Text")],
    title="ImageCraft",
    description="Upload an image and get the speech responses.",
    allow_flagging="never",
)

imagecraft_app.launch()
```

![alt text](https://github.com/Jerdah/ImageCraft/blob/main/reports/figures/imagecraft-basic-ui.png)

#### **Using Gradio (evaluation interface)**:

```python
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["USER"] = "imagecraft"

import gradio as gr
import evaluate

from src.model.modules.imagecraft import ImageCraft

model = ImageCraft.from_pretrained("nsandiman/imagecraft-ft-co-224")

bertscore_metric = evaluate.load("bertscore")
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

default_image = "media/images/2.jpeg"


def imagecraft_interface(image_path, reference_text):
    """Process the input image, generate an audio response and score the transcript."""
    transcript, audio_buffer = model.generate(image_path, output_type="buffer")

    # Assumed reconstruction: score the transcript against the reference text using
    # the metrics loaded above (the BERTScore and BLEU helpers mirror calculate_rouge_score).
    if reference_text:
        bertscore = calculate_bertscore(reference_text, transcript)
        bleu = calculate_bleu_score(reference_text, transcript)
        rouge = calculate_rouge_score(reference_text, transcript)
        evaluation_results = (
            f"BERTScore: {bertscore:.4f} | BLEU: {bleu:.4f} | ROUGE-L: {rouge:.4f}"
        )
    else:
        evaluation_results = "No reference text provided."

    return audio_buffer, transcript, evaluation_results


def calculate_bertscore(reference, hypothesis):
    results = bertscore_metric.compute(
        predictions=[hypothesis], references=[reference], lang="en"
    )
    return results["f1"][0]


def calculate_bleu_score(reference, hypothesis):
    results = bleu_metric.compute(predictions=[hypothesis], references=[[reference]])
    return results["bleu"]


def calculate_rouge_score(reference, hypothesis):
    results = rouge_metric.compute(predictions=[hypothesis], references=[[reference]])
    return results["rougeL"]


imagecraft_app = gr.Interface(
    fn=imagecraft_interface,
    inputs=[
        gr.Image(
            type="filepath",
            label="Upload an image",
            sources=["upload"],
            value=default_image,
        ),
        gr.Textbox(label="Reference Text (for evaluation)"),
    ],
    outputs=[
        gr.Audio(label="Speech"),
        gr.Textbox(label="Text"),
        gr.Textbox(label="Evaluation Results"),
    ],
    title="ImageCraft",
    description="Upload an image and get the speech responses.",
    allow_flagging="never",
)

imagecraft_app.launch()
```

![alt text](https://github.com/Jerdah/ImageCraft/blob/main/reports/figures/imagecraft-evaluation-ui.png)

#### **Using CLI**:

```bash
python -m src.model.inference --image_path "media/images/1.jpeg" --output_type "file"
```
## **Training and Evaluation**

### **Training**

The training pipeline uses the following setup:

- **Freezing Strategy**: Initially, only the Gemma decoder is trained while the SigLIP encoder remains frozen; later epochs unfreeze the vision transformer for end-to-end fine-tuning (see the sketch below).
- **Metrics**: Training loss and test loss are monitored along with perplexity, which measures the quality of text predictions.
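
A minimal PyTorch sketch of this freezing schedule, assuming the model exposes `vision_encoder` (SigLIP) and `decoder` (Gemma) submodules; the attribute names and the unfreeze epoch are illustrative, not the actual training code:

```python
import torch


def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = trainable


def configure_epoch(model, epoch: int, unfreeze_at: int = 3) -> None:
    """Hypothetical schedule: decoder-only warm-up, then end-to-end fine-tuning."""
    set_trainable(model.decoder, True)                         # Gemma decoder always trains
    set_trainable(model.vision_encoder, epoch >= unfreeze_at)  # SigLIP unfrozen in later epochs
```
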
Specify the training log collector:

- To use TensorBoard, pass `--log_to "tensorboard"` on the command line.
- To use Weights & Biases, pass `--log_to "wandb"` on the command line.

To train the model from scratch, run the command below. It downloads the dataset (if it does not already exist) and trains the model:

```bash
python -m src.model.train --dataset "coco" --dataset_size "20%" --batch_size 2 --max_epochs 10 --log_every_n_steps 2 --log_to "wandb"
```

### **Evaluation Metrics**

The following metrics are used to evaluate model performance:

- **Training Loss**: Measures the model's performance on the training set.
- **Test Loss**: Measures the generalization ability on unseen data.
- **Perplexity**: Evaluates how well the model predicts the sequence.
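
For reference, perplexity is the exponentiated average cross-entropy of the predicted token sequence:

```math
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
```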

### **TensorBoard**
Training metrics are logged to TensorBoard for easy visualization:

```bash
tensorboard --logdir runs
```

## **Deployment**

The model can be deployed as a Flask REST API. It can also be containerized with Docker for reproducibility and easy deployment on cloud platforms.

### **Run API**

```bash
python app.py
```

Navigate to `http://localhost:5000` to use the web interface.

## **Testing**

There are no unit tests implemented yet. Adding unit tests with a framework like `pytest` is recommended for:

- **Data Preprocessing**: Validate transformations and tokenization.
- **Model Forward Passes**: Ensure that both the SigLIP and Gemma modules work as expected.

To add unit tests, consider creating a `tests/` directory with files such as the following (see the sketch below):

- `test_data_preparation.py`
- `test_model_forward.py`
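
As an example, `test_model_forward.py` could exercise the documented `ImageCraft.generate` API end to end. This is a sketch only; it assumes the pretrained checkpoint and the sample image referenced elsewhere in this README are available locally.

```python
# tests/test_model_forward.py -- sketch of a forward-pass test using the documented API.
import pytest

from src.model.modules.imagecraft import ImageCraft


@pytest.fixture(scope="module")
def model():
    return ImageCraft.from_pretrained("nsandiman/imagecraft-ft-co-224")


def test_generate_returns_transcript_and_audio(model):
    transcript, audio_buffer = model.generate("media/images/2.jpeg", output_type="buffer")
    assert isinstance(transcript, str) and len(transcript) > 0
    assert audio_buffer is not None
```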

## **Results and Visualization**

- **Training Curves**: Loss and perplexity are plotted with matplotlib after each epoch to visualize performance (a minimal plotting sketch follows this list).
- **Generated Samples**: Audio samples from the model are saved and can be played back to evaluate the quality of speech generation.
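
A minimal sketch of such a plot, assuming per-epoch loss values have already been collected into Python lists (the values and variable names below are illustrative placeholders, not real results):

```python
import matplotlib.pyplot as plt

# Illustrative per-epoch values; in practice these come from the training loop logs.
train_loss = [3.2, 2.5, 2.1, 1.9]
test_loss = [3.4, 2.8, 2.4, 2.3]

epochs = range(1, len(train_loss) + 1)
plt.plot(epochs, train_loss, label="train loss")
plt.plot(epochs, test_loss, label="test loss")
plt.xlabel("Epoch")
plt.ylabel("Cross-entropy loss")
plt.title("ImageCraft training curves")
plt.legend()
plt.show()
```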

### **Gradio demo app**

![alt text](https://github.com/Jerdah/ImageCraft/blob/main/reports/figures/gradio_app_demo.png)

## **Future Work**

- **Real-Time Processing**: Optimize the model for real-time inference on edge devices.