An image captioning model built by combining a pre-trained VGG-16 image encoder with an LSTM-based language decoder.
Here I build the model by combining a pre-trained VGG-16 image encoder with an LSTM-based language decoder.
This is an implementation of the following paper:
Show and tell: A neural image caption generator, O. Vinyals, A. Toshev, S. Bengio, D. Erhan, CVPR, 2015.
Reproducing it requires the steps detailed below.
Here I load the pre-trained VGG-16 model with weights trained on ImageNet. I also remove the final softmax layer, so I end up with the fc2 layer producing a 4096-dimensional feature encoding for a given image i.
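A minimal sketch of this encoder setup, assuming the Keras VGG16 implementation (the layer name 'fc2' follows Keras conventions; adapt if another framework is used):

```python
# Sketch: load VGG-16 with ImageNet weights and truncate it at the fc2 layer.
# Assumes the Keras application model; layer names follow Keras conventions.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Load VGG-16 with ImageNet weights, including the fully connected layers.
base = VGG16(weights="imagenet", include_top=True)

# Drop the final softmax ('predictions') layer by taking the output of 'fc2',
# which yields a 4096-dimensional encoding per image.
encoder = Model(inputs=base.input, outputs=base.get_layer("fc2").output)
encoder.summary()
```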
Here I start on the language decoder model. I need to pass the image encoding as the initial hidden state of the first LSTM cell (i.e., h0 = xi). However, this only works if the hidden state has dimension 4096, which is far too high-dimensional. To get a more reasonably sized representation, I insert a linear layer that projects the encoding from 4096 down to 300.
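A minimal sketch of how this wiring could look, again assuming Keras; the vocabulary size, caption length, and the initialization of the cell state c0 are assumptions for illustration, not fixed by the description above:

```python
# Sketch: project the 4096-d image encoding to 300-d and use it to
# initialize the LSTM decoder's hidden state (h0 = projected xi).
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM
from tensorflow.keras.models import Model

vocab_size = 10000   # hypothetical vocabulary size
max_len = 20         # hypothetical maximum caption length
hidden_dim = 300     # projected encoding / LSTM hidden size

# Linear projection from 4096 to 300 (no activation, i.e., a linear layer).
image_encoding = Input(shape=(4096,))
h0 = Dense(hidden_dim)(image_encoding)
c0 = Dense(hidden_dim)(image_encoding)  # assumption: cell state initialized the same way

# Word inputs are embedded and fed to the LSTM, which starts from the image state.
caption_in = Input(shape=(max_len,))
embedded = Embedding(vocab_size, hidden_dim, mask_zero=True)(caption_in)
lstm_out = LSTM(hidden_dim, return_sequences=True)(embedded, initial_state=[h0, c0])

# Predict the next word at each timestep.
word_probs = Dense(vocab_size, activation="softmax")(lstm_out)
decoder = Model(inputs=[image_encoding, caption_in], outputs=word_probs)
```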