
Image-Captioning


Here I build an image captioning model by combining a pre-trained VGG-16 image encoder with an LSTM-based language decoder.

This is an implementation of the following paper:

Show and tell: A neural image caption generator, O. Vinyals, A. Toshev, S. Bengio, D. Erhan, CVPR, 2015.

The implementation follows the steps detailed below.

1. Setup Image Encoder

Here I load the pre-trained VGG-16 model with weights trained on ImageNet. I also remove the final softmax classification layer, so I end up with the fc2 layer producing a 4096-dimensional feature encoding for a given image i.
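A minimal sketch of this step, assuming Keras (where the VGG-16 fully connected layers are named fc1, fc2, and predictions); the repository's actual code may differ:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Load VGG-16 with ImageNet weights, keeping the fully connected layers.
base = VGG16(weights="imagenet", include_top=True)

# Drop the final softmax ("predictions") layer by taking the fc2 output,
# which yields a 4096-dimensional encoding for each input image.
encoder = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

# encoder.predict(preprocessed_image) now returns a 4096-d feature vector.
```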

2. Setup Language Decoder

Here I build the language decoder model. I need to pass the image encoding as the initial hidden state of the first LSTM cell (i.e., h0 = xi). However, this would only work if the hidden state had dimension 4096, which is far too high. To get a more reasonably sized representation, I insert a linear layer that projects the encoding from 4096 down to 300 dimensions.
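A minimal sketch of this step, again assuming Keras; the vocabulary size, the 300-d word embedding size, and the extra projection used for the LSTM cell state are illustrative assumptions, not details from the repository:

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM
from tensorflow.keras.models import Model

vocab_size = 10000   # assumed vocabulary size
hidden_dim = 300     # projected image / hidden-state dimension

# 4096-d image encoding produced by the VGG-16 encoder above.
image_encoding = Input(shape=(4096,))

# Linear projection 4096 -> 300, used as the initial hidden state h0.
h0 = Dense(hidden_dim)(image_encoding)
# Keras LSTMs also need an initial cell state; projecting it the same way
# is an assumption made for this sketch.
c0 = Dense(hidden_dim)(image_encoding)

# Caption word indices, embedded to 300-d vectors.
caption_tokens = Input(shape=(None,), dtype="int32")
embedded = Embedding(vocab_size, hidden_dim)(caption_tokens)

# LSTM decoder initialized with the projected image encoding.
lstm_out = LSTM(hidden_dim, return_sequences=True)(
    embedded, initial_state=[h0, c0]
)

# Per-timestep distribution over the vocabulary.
word_probs = Dense(vocab_size, activation="softmax")(lstm_out)

decoder = Model(inputs=[image_encoding, caption_tokens], outputs=word_probs)
```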
