
# **ImageCraft: Direct Image-to-Speech Synthesis**





## **Overview**

ImageCraft is a deep learning project designed to generate spoken descriptions directly from images. The goal is to create a model that combines vision and text-to-speech capabilities for accessibility tools, multimedia storytelling, and human-computer interaction. It utilizes a vision transformer (SigLIP) for image encoding, Gemma for text decoding, and VoiceCraft for speech synthesis.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bzmNvc-XM9RPbkZEYFdap-nNJkrCvfzu#scrollTo=-SoOHUJHsfTD) [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/nsandiman/uarizona-msis-capstone-group5-imagecraft)

![alt text](https://github.com/Jerdah/ImageCraft/blob/main/reports/figures/imagecraft-arch.jpeg)

## **Table of Contents**

1. [Project Objectives](#project-objectives)
2. [Directory Structure](#directory-structure)
3. [Dataset](#dataset)
4. [Model Architecture](#model-architecture)
5. [Installation](#installation)
6. [Usage](#usage)
7. [Training and Evaluation](#training-and-evaluation)
8. [Deployment](#deployment)
9. [Testing](#testing)
10. [Results and Visualization](#results-and-visualization)
11. [Future Work](#future-work)
12. [References](#references)

## **Project Objectives**

The primary objectives of ImageCraft are:

- To create a multimodal pipeline that converts input images into meaningful spoken descriptions.
- To utilize transformer-based models, specifically a vision transformer (SigLIP) as an image encoder and a Gemma decoder.
- To facilitate image-to-speech for accessibility use cases.

## **Directory Structure**

The repository is organized as follows:

```css
ImageCraft/
|
└── setup.py
```

## **Dataset**
### **MSCOCO**

The MSCOCO dataset is used for training and evaluation. It contains paired image-caption data, making it suitable for the image-to-speech task.

- **Download and Preparation**: The dataset is downloaded and organized into folders for training (`data/processed/coco/train`) and evaluation (`data/processed/coco/test`).

**Download dataset for training**:

```bash
python -m src.data.download --dataset "coco" --dataset_size "10%"
```

## **Model Architecture**

ImageCraft consists of three major components:

1. **Vision Transformer (SigLIP)**: Encodes the input image into embeddings.
2. **Gemma Decoder**: Generates a textual description from the image features.
3. **VoiceCraft**: A token-infilling neural codec language model that synthesizes speech from the generated text.
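
The three stages run as a single pipeline: SigLIP encodes the image, Gemma decodes the embeddings into a caption, and VoiceCraft synthesizes the caption into speech. The sketch below is conceptual only; the function and attribute names are illustrative, not the actual ImageCraft API (the real entry point is `ImageCraft.generate`, shown in the Usage section).

```python
# Conceptual data flow only -- see the Usage section for the real ImageCraft.generate call.
from PIL import Image


def describe_and_speak(image_path, vision_encoder, text_decoder, tts_model):
    """Illustrative pipeline: image -> SigLIP embeddings -> Gemma caption -> VoiceCraft audio."""
    image = Image.open(image_path).convert("RGB")
    image_embeddings = vision_encoder.encode(image)    # SigLIP vision transformer
    caption = text_decoder.generate(image_embeddings)  # Gemma decoder
    waveform = tts_model.synthesize(caption)           # VoiceCraft codec language model
    return caption, waveform
```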

## **Installation**

To set up the environment and install the necessary dependencies, follow the steps below:

1. **Clone the Repository**:
```bash
git clone https://github.com/Jerdah/ImageCraft.git
cd ImageCraft
```

2. **Install System-Level Dependencies**:

```bash
apt-get install -y espeak-ng espeak espeak-data libespeak1 libespeak-dev festival* build-essential flac libasound2-dev libsndfile1-dev vorbis-tools libxml2-dev libxslt-dev zlib1g-dev
```

3. **Install Python Libraries**:

```bash
pip install -r requirements.txt
```

4. **Download the resources used by the caption-evaluation metrics (SPICE, PTBTokenizer, METEOR and FENSE)**:

```bash
aac-metrics-download
```
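
Once the resources are downloaded, generated captions can be scored from Python. The snippet below is a sketch that assumes the `aac-metrics` package exposes an `evaluate(candidates, mult_references)` helper as described in its documentation; check that documentation for the exact API and the metrics included by default.

```python
# Sketch only: scoring generated captions against reference captions with aac-metrics.
from aac_metrics import evaluate

candidates = ["a man is speaking to a crowd"]               # model outputs
mult_references = [["a man speaks to a group of people",    # one list of references
                    "someone is giving a speech"]]          # per candidate

corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)  # corpus-level scores (metric set depends on the package defaults)
```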

## **Usage**

### **Inference**

You can use the provided Gradio interface or run the inference script to generate speech from an image.

#### **Using Gradio (basic interface)**:

```python
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["USER"] = "imagecraft"

import gradio as gr

from src.model.modules.imagecraft import ImageCraft

model = ImageCraft.from_pretrained("nsandiman/imagecraft-ft-co-224")

default_image = "media/images/2.jpeg"


def generate(image_path):
    """Process the input image and generate an audio response."""
    transcript, audio_buffer = model.generate(image_path, output_type="buffer")
    return audio_buffer, transcript


imagecraft_app = gr.Interface(
    fn=generate,
    inputs=[
        gr.Image(
            type="filepath",
            label="Upload an image",
            sources=["upload"],
            value=default_image,
        ),
    ],
    outputs=[gr.Audio(label="Speech"), gr.Textbox(label="Text")],
    title="ImageCraft",
    description="Upload an image and get the speech responses.",
    allow_flagging="never",
)

imagecraft_app.launch()
```

![alt text](https://github.com/Jerdah/ImageCraft/blob/main/reports/figures/imagecraft-basic-ui.png)

#### **Using Gradio (evaluation interface)**:

```python
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["USER"] = "imagecraft"

import gradio as gr
import evaluate

from src.model.modules.imagecraft import ImageCraft

model = ImageCraft.from_pretrained("nsandiman/imagecraft-ft-co-224")

bertscore_metric = evaluate.load("bertscore")
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

default_image = "media/images/2.jpeg"


def imagecraft_interface(image_path, reference_text):
    """Process the input image, generate an audio response and score the transcript."""
    transcript, audio_buffer = model.generate(image_path, output_type="buffer")

    # Assumed reconstruction: score the transcript against the reference text using
    # the metrics loaded above (the BERTScore and BLEU helpers mirror calculate_rouge_score).
    if reference_text:
        bertscore = calculate_bertscore(reference_text, transcript)
        bleu = calculate_bleu_score(reference_text, transcript)
        rouge = calculate_rouge_score(reference_text, transcript)
        evaluation_results = (
            f"BERTScore: {bertscore:.4f} | BLEU: {bleu:.4f} | ROUGE-L: {rouge:.4f}"
        )
    else:
        evaluation_results = "No reference text provided."

    return audio_buffer, transcript, evaluation_results


def calculate_bertscore(reference, hypothesis):
    results = bertscore_metric.compute(
        predictions=[hypothesis], references=[reference], lang="en"
    )
    return results["f1"][0]


def calculate_bleu_score(reference, hypothesis):
    results = bleu_metric.compute(predictions=[hypothesis], references=[[reference]])
    return results["bleu"]


def calculate_rouge_score(reference, hypothesis):
    results = rouge_metric.compute(predictions=[hypothesis], references=[[reference]])
    return results["rougeL"]


imagecraft_app = gr.Interface(
    fn=imagecraft_interface,
    inputs=[
        gr.Image(
            type="filepath",
            label="Upload an image",
            sources=["upload"],
            value=default_image,
        ),
        gr.Textbox(label="Reference Text (for evaluation)"),
    ],
    outputs=[
        gr.Audio(label="Speech"),
        gr.Textbox(label="Text"),
        gr.Textbox(label="Evaluation Results"),
    ],
    title="ImageCraft",
    description="Upload an image and get the speech responses.",
    allow_flagging="never",
)

imagecraft_app.launch()
```

![alt text](https://github.com/Jerdah/ImageCraft/blob/main/reports/figures/imagecraft-evaluation-ui.png)

#### **Using CLI**:

```bash
python -m src.model.inference --image_path "media/images/1.jpeg" --output_type "file"
```
## **Training and Evaluation**

### **Training**

The training pipeline uses the following setup:

- **Freezing Strategy**: Initially, only the Gemma decoder is trained while the SigLIP encoder remains frozen; later epochs unfreeze the vision transformer for end-to-end fine-tuning (see the sketch below).
- **Metrics**: Training loss and test loss are monitored along with perplexity, which measures the quality of text predictions.
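
A minimal PyTorch sketch of this freezing schedule, assuming the model exposes `vision_encoder` (SigLIP) and `decoder` (Gemma) submodules; the attribute names and the unfreeze epoch are illustrative, not the actual training code:

```python
import torch


def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = trainable


def configure_epoch(model, epoch: int, unfreeze_at: int = 3) -> None:
    """Hypothetical schedule: decoder-only warm-up, then end-to-end fine-tuning."""
    set_trainable(model.decoder, True)                         # Gemma decoder always trains
    set_trainable(model.vision_encoder, epoch >= unfreeze_at)  # SigLIP unfrozen in later epochs
```
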
Specify the training log collector:

- To use TensorBoard, pass `--log_to "tensorboard"` on the command line.
- To use Weights & Biases, pass `--log_to "wandb"` on the command line.

To train the model from scratch, run the command below. It downloads the dataset (if it does not already exist) and trains the model:

```bash
python -m src.model.train --dataset "coco" --dataset_size "20%" --batch_size 2 --max_epochs 10 --log_every_n_steps 2 --log_to "wandb"
```

### **Evaluation Metrics**

The following metrics are used to evaluate model performance:

- **Training Loss**: Measures the model's performance on the training set.
- **Test Loss**: Measures the generalization ability on unseen data.
- **Perplexity**: Evaluates how well the model predicts the sequence.
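
For reference, perplexity is the exponentiated average cross-entropy of the predicted token sequence:

```math
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
```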

### **TensorBoard**
Training metrics are logged to TensorBoard for easy visualization:

```bash
tensorboard --logdir runs
```

## **Deployment**

The model can be deployed as a Flask REST API. It can also be containerized with Docker for reproducibility and easy deployment on cloud platforms.

### **Run API**

```bash
python app.py
```

Navigate to `http://localhost:5000` to use the web interface.

## **Testing**

There are no unit tests implemented yet. Adding unit tests with a framework like `pytest` is recommended for:

- **Data Preprocessing**: Validate transformations and tokenization.
- **Model Forward Passes**: Ensure that both the SigLIP and Gemma modules work as expected.

To add unit tests, consider creating a `tests/` directory with files such as the following (see the sketch below):

- `test_data_preparation.py`
- `test_model_forward.py`
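
As an example, `test_model_forward.py` could exercise the documented `ImageCraft.generate` API end to end. This is a sketch only; it assumes the pretrained checkpoint and the sample image referenced elsewhere in this README are available locally.

```python
# tests/test_model_forward.py -- sketch of a forward-pass test using the documented API.
import pytest

from src.model.modules.imagecraft import ImageCraft


@pytest.fixture(scope="module")
def model():
    return ImageCraft.from_pretrained("nsandiman/imagecraft-ft-co-224")


def test_generate_returns_transcript_and_audio(model):
    transcript, audio_buffer = model.generate("media/images/2.jpeg", output_type="buffer")
    assert isinstance(transcript, str) and len(transcript) > 0
    assert audio_buffer is not None
```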

## **Results and Visualization**

- **Training Curves**: Loss and perplexity are plotted with matplotlib after each epoch to visualize performance (a minimal plotting sketch follows this list).
- **Generated Samples**: Audio samples from the model are saved and can be played back to evaluate the quality of speech generation.
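
A minimal sketch of such a plot, assuming per-epoch loss values have already been collected into Python lists (the values and variable names below are illustrative placeholders, not real results):

```python
import matplotlib.pyplot as plt

# Illustrative per-epoch values; in practice these come from the training loop logs.
train_loss = [3.2, 2.5, 2.1, 1.9]
test_loss = [3.4, 2.8, 2.4, 2.3]

epochs = range(1, len(train_loss) + 1)
plt.plot(epochs, train_loss, label="train loss")
plt.plot(epochs, test_loss, label="test loss")
plt.xlabel("Epoch")
plt.ylabel("Cross-entropy loss")
plt.title("ImageCraft training curves")
plt.legend()
plt.show()
```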

### **Gradio demo app**

![alt text](https://github.com/Jerdah/ImageCraft/blob/main/reports/figures/gradio_app_demo.png)

## **Future Work**

- **Real-Time Processing**: Optimize the model for real-time inference on edge devices.