ImageCraft is a deep learning project designed to generate spoken descriptions directly from images. The goal is to create a model that combines vision and text-to-speech capabilities for accessibility tools, multimedia storytelling, and human-computer interaction. It utilizes a vision transformer (SigLIP) for image encoding, Gemma for text decoding, and VoiceCraft for speech synthesis.
The primary objectives of ImageCraft are:
- To create a multimodal pipeline that converts input images into meaningful spoken descriptions.
- To utilize transformer-based models, specifically a vision transformer (SigLIP) as an image encoder and a Gemma decoder.
- To facilitate image-to-speech for accessibility use cases.
The MSCOCO dataset is used for training and evaluation. It contains paired image-caption data, making it suitable for the image-to-speech task.
- Download and Preparation: The datasets are downloaded and organized into the relevant folders for training (`data/processed/coco/train`) and evaluation (`data/processed/coco/test`).
Download dataset for training:
python -m src.data.download --dataset "coco" --dataset_size "10%"
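After the download finishes, a quick sanity check confirms that the expected folders were created. This is a minimal sketch that only assumes the train/test directories listed above exist and contain files.

import os

# Sanity check: count the files placed under the processed COCO folders.
for split in ("train", "test"):
    split_dir = os.path.join("data", "processed", "coco", split)
    n_files = len(os.listdir(split_dir)) if os.path.isdir(split_dir) else 0
    print(f"{split}: {n_files} files in {split_dir}")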
ImageCraft consists of three major components:
- Vision Transformer (SigLIP): computes the image embeddings.
- Gemma Decoder: generates the caption text from the image features.
- VoiceCraft: a token-infilling neural codec language model used for speech synthesis.
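Conceptually, the three stages chain together as sketched below. The function and argument names are illustrative placeholders, not the project's actual API (the real modules live under src/model/modules).

# Illustrative data flow only; the component names are hypothetical stand-ins.
def describe_image(image_path, siglip_encoder, gemma_decoder, voicecraft_tts):
    """image -> embeddings -> caption -> speech"""
    embeddings = siglip_encoder(image_path)  # SigLIP vision transformer
    caption = gemma_decoder(embeddings)      # Gemma text decoder
    audio = voicecraft_tts(caption)          # VoiceCraft token-infilling codec LM
    return caption, audio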
To set up the environment and install the necessary dependencies, follow the steps below:
- Clone the Repository:
git clone https://github.com/Jerdah/ImageCraft.git
cd ImageCraft
- Install System-Level Dependencies:
apt-get install -y espeak-ng espeak espeak-data libespeak1 libespeak-dev festival* build-essential flac libasound2-dev libsndfile1-dev vorbis-tools libxml2-dev libxslt-dev zlib1g-dev
- Install Python libraries:
pip install -r requirements.txt
- Download the resources required by the metrics used to evaluate automated image descriptions (SPICE, PTBTokenizer, METEOR, and FENSE):
aac-metrics-download
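Once those resources are downloaded, generated captions can be scored against references from Python. The sketch below assumes the aac-metrics package exposes an evaluate() helper taking a list of candidate captions and a list of reference lists; check the aac-metrics documentation for the exact interface and the metrics it returns.

from aac_metrics import evaluate  # assumed entry point of the aac-metrics package

# Hypothetical example: one generated caption scored against two references.
candidates = ["a dog runs across a grassy field"]
mult_references = [["a dog is running on the grass", "a brown dog runs through a field"]]

corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)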
You can use the provided Gradio interface or run the inference script to generate speech from an image.
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["USER"] = "imagecraft"

import gradio as gr

from src.model.modules.imagecraft import ImageCraft

model = ImageCraft.from_pretrained("nsandiman/imagecraft-ft-co-224")

default_image = "media/images/2.jpeg"


def generate(image_path):
    """Process image inputs and generate audio response."""
    transcript, audio_buffer = model.generate(image_path, output_type="buffer")
    return audio_buffer, transcript


imagecraft_app = gr.Interface(
    fn=generate,
    inputs=[
        gr.Image(
            type="filepath",
            label="Upload an image",
            sources=["upload"],
            value=default_image,
        ),
    ],
    outputs=[gr.Audio(label="Speech"), gr.Textbox(label="Text")],
    title="ImageCraft",
    description="Upload an image and get the speech responses.",
    allow_flagging="never",
)

imagecraft_app.launch()
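When running in a hosted notebook such as Colab, Gradio can expose the interface through a temporary public URL by passing share=True to launch():

imagecraft_app.launch(share=True)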
The second interface additionally reports reference-based caption metrics (BERTScore, BLEU, and ROUGE) computed with the Hugging Face evaluate library:

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["USER"] = "imagecraft"

import evaluate
import gradio as gr

from src.model.modules.imagecraft import ImageCraft

model = ImageCraft.from_pretrained("nsandiman/imagecraft-ft-co-224")

bertscore_metric = evaluate.load("bertscore")
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

default_image = "media/images/2.jpeg"


def imagecraft_interface(image_path, reference_text):
    """Process image inputs and generate audio response."""
    transcript, audio_buffer = model.generate(image_path, output_type="buffer")
    if not reference_text:
        evaluation_result = "No reference text provided for evaluation."
    else:
        reference_text = reference_text.strip().lower().rstrip(".")
        transcript = transcript.strip().lower().rstrip(".")
        bert_score_result = calculate_bert_score(reference_text, transcript)
        bleu_score_result = calculate_bleu_score(reference_text, transcript)
        rouge_score_result = calculate_rouge_score(reference_text, transcript)
        evaluation_result = (
            f"BERT Score: {bert_score_result:.4f}\n"
            f"BLEU Score: {bleu_score_result:.4f}\n"
            f"ROUGE Score: {rouge_score_result:.4f}"
        )
    return audio_buffer, transcript, evaluation_result


def calculate_bert_score(reference, hypothesis):
    scores = bertscore_metric.compute(predictions=[hypothesis], references=[reference], lang="en")
    return scores["f1"][0]


def calculate_bleu_score(reference, hypothesis):
    results = bleu_metric.compute(predictions=[hypothesis], references=[[reference]])
    return results["bleu"]


def calculate_rouge_score(reference, hypothesis):
    results = rouge_metric.compute(predictions=[hypothesis], references=[[reference]])
    return results["rougeL"]


imagecraft_app = gr.Interface(
    fn=imagecraft_interface,
    inputs=[
        gr.Image(
            type="filepath",
            label="Upload an image",
            sources=["upload"],
            value=default_image,
        ),
        gr.Textbox(label="Reference Text (for evaluation)"),
    ],
    outputs=[
        gr.Audio(label="Speech"),
        gr.Textbox(label="Text"),
        gr.Textbox(label="Evaluation Results"),
    ],
    title="ImageCraft",
    description="Upload an image and get the speech responses.",
    allow_flagging="never",
)

imagecraft_app.launch()
# run inference and return the audio file path
python -m src.model.inference --image_path "media/images/1.jpeg" --output_type "file"
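The same inference can be driven from Python, mirroring the Gradio example above. This sketch assumes that output_type="file" makes generate() return the transcript together with the path of the written audio file, as the CLI flag suggests.

from src.model.modules.imagecraft import ImageCraft

model = ImageCraft.from_pretrained("nsandiman/imagecraft-ft-co-224")
# output_type="file" is assumed to return (transcript, path_to_audio_file)
transcript, audio_path = model.generate("media/images/1.jpeg", output_type="file")
print(transcript, audio_path)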
Specify the training log collector:
- To use TensorBoard, pass `--log_to "tensorboard"` on the command line.
- To use Weights & Biases, pass `--log_to "wandb"` on the command line.
Download the dataset (if it doesn't exist) and train the model.
python -m src.model.train --dataset "coco" --dataset_size "20%" --batch_size 2 --max_epochs 10 --log_every_n_steps 2 --log_to "wandb"
- Real-Time Processing: Optimize the model for real-time inference on edge devices.
- Improvement in Text Generation: Integrate semantic analysis to enhance caption quality.
- VoiceCraft: The VoiceCraft text-to-speech module used in this project is based on the repository provided by Facebook Research. For more details, visit the VoiceCraft GitHub repository.
- Vision Transformer (SigLIP): The Vision Transformer architecture is inspired by "Sigmoid Loss for Language Image Pre-Training" by Zhai et al. (2023). Paper link
This codebase is under CC BY-NC-SA 4.0 (LICENSE-CODE). Note that we use some code from other repositories that is under different licenses: ./src/model/modules/voicecraft.py is under CC BY-NC-SA 4.0; ./src/model/modules/codebooks_patterns.py is under the MIT license; ./src/model/modules/tokenizer.py is under the Apache License, Version 2.0; and the phonemizer we use is under the GNU GPL 3.0 license.
- Thanks to nsandiman, ravinamore-ml, Masengug and Jerdah
- We thank Umar Jamil for his work on pytorch-paligemma, from which we took a lot of inspiration.