ImgCap is an image captioning model that automatically generates descriptive captions for images. It comes in two versions: a CNN + LSTM model and a CNN + LSTM + Attention model.
Clone the repository:
git clone https://github.com/Sh-31/ImgCap.git
Install the required dependencies:
pip3 install -r requirements.txt
python3 -q -m spacy download en_core_web_sm
Download the model checkpoint (manual step):
- ImgCap (CNN + LSTM): Download checkpoint
- ImgCap (CNN + LSTM + Attention): Download checkpoint

Place the model checkpoint in the appropriate directory:
- For CNN + LSTM + Attention: ImgCap/trainning/checkpoints/attention
- For CNN + LSTM: ImgCap/trainning/checkpoints
Run the main script (Gradio GUI for inference):
python3 main.py
Alternatively, you can use the model directly on Hugging Face Spaces: ImgCap on Hugging Face
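For orientation, here is a minimal sketch of what a Gradio captioning demo looks like; the function body, labels, and title are placeholders, not the repository's actual main.py:

```python
# Hypothetical sketch of a Gradio captioning demo (not the repository's actual main.py).
import gradio as gr
from PIL import Image

def caption_image(image: Image.Image) -> str:
    # Placeholder: load the trained ImgCap checkpoint and run caption decoding here.
    return "a generated caption would appear here"

demo = gr.Interface(
    fn=caption_image,
    inputs=gr.Image(type="pil", label="Input image"),
    outputs=gr.Textbox(label="Generated caption"),
    title="ImgCap - Image Captioning",
)

if __name__ == "__main__":
    demo.launch()
```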
The Flickr30k dataset consists of 30,000 images, each accompanied by five captions. It provides a wide variety of scenes and objects, making it ideal for diverse image captioning tasks.
To download the dataset, follow these steps:
- Enable Kaggle’s public API by following the instructions here: Kaggle API.
- Run the following command to download the dataset:
kaggle datasets download -d hsankesara/flickr-image-dataset -p /teamspace/studios/this_studio/data/Flickr30
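Once downloaded and unzipped, the images can be paired with their captions. A minimal loading sketch, assuming the Kaggle archive ships a pipe-separated results.csv next to the image folder; the paths and delimiter below are assumptions, so adjust them to your local copy:

```python
# Sketch: pair Flickr30k images with their captions.
# Assumes a pipe-separated results.csv next to the image folder (adjust paths/delimiter if needed).
import csv
from pathlib import Path

data_dir = Path("data/Flickr30/flickr30k_images")   # assumed extraction path
pairs = []                                          # list of (image_path, caption)

with open(data_dir / "results.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        if len(row) < 3 or row[0].strip() == "image_name":
            continue                                # skip the header and malformed rows
        image_name, caption = row[0].strip(), row[2].strip()
        pairs.append((data_dir / "flickr30k_images" / image_name, caption))

print(f"Loaded {len(pairs)} image-caption pairs")   # roughly 5 captions per image
```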
Additionally, I’ve documented a similar image captioning dataset, which you can review here: Image Caption Documentation.
The model architectures compared in this report consist of two versions of the ImgCap model, each with a different configuration. The models were trained in Float16 (mixed) precision and optimized with torch.compile for improved training efficiency on an NVIDIA L4 GPU (24 GB VRAM).
Number of Parameters:
- ImgCap with Attention: This model incorporates an additional attention mechanism that increases the parameter count. Specifically, the attention layer adds about 3.15M parameters, bringing the total to 85.79M. Out of these, 36.72M are trainable, with the rest being frozen in the ResNet50 encoder.
- ImgCap without Attention: The model without the attention mechanism has 52.89M total parameters, with 29.38M being trainable, as it simplifies the decoder by removing the attention layers.
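The parameter counts above can be reproduced with a standard PyTorch sweep over model.parameters(); a minimal sketch that works for either ImgCap variant:

```python
# Count total and trainable parameters of any nn.Module.
def count_parameters(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Example usage (imgcap_model: either ImgCap variant):
# total, trainable = count_parameters(imgcap_model)
# print(f"Total: {total/1e6:.2f}M | Trainable: {trainable/1e6:.2f}M | Frozen: {(total - trainable)/1e6:.2f}M")
```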
CNN Encoder (ResNet50) Freezing Strategy:
- Both models use ResNet50 as the CNN encoder. The convolutional layers in ResNet50 are frozen to reduce computational overhead and focus training on the LSTM-based decoder. Only the fully connected layers at the end of ResNet50 are trainable in both models.
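A minimal sketch of this freezing strategy, assuming the encoder wraps torchvision's ResNet50 and swaps in a new fully connected head; the class name and embedding size are illustrative, not the repository's exact code:

```python
# Sketch: freeze ResNet50's convolutional backbone, keep only the final FC trainable.
import torch.nn as nn
from torchvision import models

class CNNEncoder(nn.Module):
    def __init__(self, embed_dim=256):                # embed_dim is an assumed size
        super().__init__()
        self.resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        for param in self.resnet.parameters():
            param.requires_grad = False               # freeze all pretrained layers
        # Replace the classifier head; the new Linear is trainable by default.
        self.resnet.fc = nn.Linear(self.resnet.fc.in_features, embed_dim)

    def forward(self, images):
        return self.resnet(images)                    # (batch, embed_dim) image features
```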
LSTM Decoder and Embedding:
- Both models use an LSTM-based decoder with a trainable embedding layer. The attention-based decoder concatenates the context vector produced by the attention mechanism into its input at each decoding step, while the non-attention model processes the image features directly through projection layers (see the sketch after this list).
- The embedding dimension, hidden size, and number of layers in the LSTM remain consistent across both models.
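A minimal sketch of one decoding step for the attention variant; the layer sizes and the simple scoring function are illustrative assumptions rather than the repository's exact implementation. The non-attention variant would skip the context computation and instead feed projected image features to the LSTM:

```python
# Illustrative decoding step for the attention variant (sizes are assumptions).
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    def __init__(self, vocab_size=4096, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)            # simplistic attention scoring
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)  # input = [embedding ; context]
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, feats, h, c):
        # feats: (batch, num_regions, feat_dim) spatial features from the CNN encoder
        h_rep = h.unsqueeze(1).expand(-1, feats.size(1), -1)       # broadcast hidden state over regions
        alpha = torch.softmax(self.attn(torch.cat([h_rep, feats], dim=-1)), dim=1)
        context = (alpha * feats).sum(dim=1)                       # attention-weighted context vector
        h, c = self.lstm(torch.cat([self.embed(token), context], dim=-1), (h, c))
        return self.fc(h), h, c                                    # next-token logits + new states
```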
Vocabulary Construction:
- Caption Tokenization: Captions are tokenized using spaCy, and the resulting tokens are used to build the vocabulary (vocab size 4096).
- Vocabulary Content: The vocabulary includes the special tokens `<unk>`, `<pad>`, `<sos>`, and `<eos>`, plus tokens derived from the captions. Individual English alphabet characters and the space character are also added so that out-of-vocabulary words can be handled at the character level.
- Tokenization: Each caption is tokenized into tokens that are then mapped to indices in the vocabulary.
- Encoding: Each encoded caption starts with `<sos>`, followed by the token indices, and ends with `<eos>`, which marks the sequence boundaries for the LSTM decoder (see the sketch after this list).
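A minimal sketch of this pipeline, assuming a frequency-based vocabulary capped at 4096 entries; the actual builder in the repository may differ in details:

```python
# Sketch: spaCy tokenization, vocabulary construction, and <sos>/<eos> encoding.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]

def build_vocab(captions, max_size=4096):
    counter = Counter(tok.text.lower() for cap in captions for tok in nlp.tokenizer(cap))
    tokens = SPECIALS + list("abcdefghijklmnopqrstuvwxyz") + [" "]   # specials, characters, space
    for word, _ in counter.most_common():                            # then most frequent caption tokens
        if len(tokens) >= max_size:
            break
        if word not in tokens:
            tokens.append(word)
    return {tok: idx for idx, tok in enumerate(tokens)}

def encode(caption, vocab):
    ids = [vocab["<sos>"]]
    ids += [vocab.get(tok.text.lower(), vocab["<unk>"]) for tok in nlp.tokenizer(caption)]
    ids.append(vocab["<eos>"])
    return ids

# Example usage (all_training_captions: list of caption strings):
# vocab = build_vocab(all_training_captions)
# print(encode("A dog runs across the grass.", vocab))
```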
- Teacher Forcing Ratio: During training, both models used a teacher forcing ratio of 0.90, meaning that 90% of the time, the ground truth caption tokens were fed into the decoder during sequence generation, while the remaining 10% relied on the model's predictions.
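A minimal sketch of that training loop; decoder_step reuses the hypothetical step signature from the decoder sketch above, and all names are illustrative:

```python
# Sketch: teacher forcing with a 0.90 ratio inside the decoder's training loop.
import random
import torch

TEACHER_FORCING_RATIO = 0.90

def decode_with_teacher_forcing(decoder_step, features, targets, h, c):
    # targets: (batch, seq_len) ground-truth token indices starting with <sos>
    inputs = targets[:, 0]                          # first input is always <sos>
    outputs = []
    for t in range(1, targets.size(1)):
        logits, h, c = decoder_step(inputs, features, h, c)
        outputs.append(logits)
        if random.random() < TEACHER_FORCING_RATIO:
            inputs = targets[:, t]                  # 90% of steps: feed the ground-truth token
        else:
            inputs = logits.argmax(dim=-1)          # 10% of steps: feed the model's own prediction
    return torch.stack(outputs, dim=1)              # (batch, seq_len - 1, vocab_size)
```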
Training Configuration:
- Both models were trained using mixed precision (Float16) to improve memory efficiency and training speed.
- The training was executed on an NVIDIA L4 GPU (24 GB VRAM) using torch.compile for runtime optimization, enabling faster convergence and better GPU utilization (a sketch of this setup follows below).
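A minimal sketch of such a training loop, combining torch.compile with torch.amp autocast/GradScaler; the model call signature, learning rate, and data loader wiring are illustrative assumptions:

```python
# Sketch: mixed-precision (float16) training pass with torch.compile.
import torch

def training_loop(imgcap_model, train_loader, pad_idx, lr=3e-4):
    model = torch.compile(imgcap_model)             # compile the model for faster steps
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)     # lr is an illustrative value
    scaler = torch.amp.GradScaler("cuda")           # scales the loss to avoid float16 underflow
    criterion = torch.nn.CrossEntropyLoss(ignore_index=pad_idx)

    for images, captions in train_loader:           # captions: (batch, seq_len) token indices
        images, captions = images.cuda(), captions.cuda()
        optimizer.zero_grad(set_to_none=True)
        with torch.amp.autocast("cuda", dtype=torch.float16):
            logits = model(images, captions[:, :-1])               # assumed forward signature
            loss = criterion(logits.reshape(-1, logits.size(-1)),  # (batch*seq, vocab)
                             captions[:, 1:].reshape(-1))          # shifted targets
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```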
| Component | ImgCap with Attention | ImgCap without Attention |
|---|---|---|
| Total Parameters | 85.79M | 52.89M |
| Trainable Parameters | 36.72M | 29.38M |
| Non-trainable Parameters | 49.07M | 23.51M |
- The model with attention has more trainable parameters and introduces a more complex mechanism for context generation, leading to improved performance in captioning tasks, as seen in the evaluation metrics. However, due to the larger number of parameters, the model may require longer training time to fully converge.
| Model | Epochs | Beam Width | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr |
|---|---|---|---|---|---|---|---|
| ImgCap without Attention | 40 | 5 | 0.37 | 0.22 | 0.14 | 0.09 | 0.41 |
| ImgCap with Attention | 30 | 5 | 0.3959 | 0.2464 | 0.1619 | 0.1077 | 0.6213 |
Note: Both models are still undertrained, and their scores are expected to improve with further training. Extending the number of epochs should yield higher BLEU and CIDEr scores, particularly for the attention-based model, which already shows a clear performance boost.
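For reference, the scores above are obtained with beam-search decoding (beam width 5). A minimal sketch of beam search over the decoder's per-step log-probabilities, reusing the hypothetical decoder_step signature from the earlier sketch and omitting length normalization for simplicity:

```python
# Sketch: beam-search decoding (beam width 5) over per-step log-probabilities.
import torch

def beam_search(decoder_step, features, sos_idx, eos_idx, h, c, beam_width=5, max_len=30):
    # Each beam: (token_sequence, cumulative_log_prob, hidden_state, cell_state).
    beams = [([sos_idx], 0.0, h, c)]
    for _ in range(max_len):
        candidates = []
        for seq, score, h_t, c_t in beams:
            if seq[-1] == eos_idx:                  # finished beams are carried over unchanged
                candidates.append((seq, score, h_t, c_t))
                continue
            token = torch.tensor([seq[-1]], device=features.device)
            logits, h_new, c_new = decoder_step(token, features, h_t, c_t)
            log_probs = torch.log_softmax(logits, dim=-1).squeeze(0)
            top_lp, top_idx = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                candidates.append((seq + [idx], score + lp, h_new, c_new))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos_idx for seq, *_ in beams):
            break
    return beams[0][0]                              # best-scoring token sequence (indices)
```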
In the next phase, I plan to explore the Vision Transformer (ViT) architecture to develop a new variant of the ImgCap model. This variant will scale more effectively for complex visual understanding tasks. Additionally, I aim to expand the model's capabilities by training it for multilingual captioning in both English and Arabic.