An image captioning model built by combining a pre-trained VGG-16 image encoder with an LSTM-based language decoder.
Here I build the model by combining a pre-trained VGG-16 image encoder with an LSTM-based language decoder.
This is an implementation of the following paper:
Show and tell: A neural image caption generator, O. Vinyals, A. Toshev, S. Bengio, D. Erhan, CVPR, 2015.
Reproducing it requires the steps detailed below.
Here I load the pre-trained VGG-16 model with weights trained on ImageNet. I also remove the final softmax layer, so I end up with the fc2 layer producing a 4096-dimensional feature encoding for a given image i.
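A minimal sketch of this encoder setup, assuming the Keras VGG16 implementation (the layer name 'fc2' follows Keras conventions; adapt if another framework is used):

```python
# Sketch: load VGG-16 with ImageNet weights and truncate it at the fc2 layer.
# Assumes the Keras application model; layer names follow Keras conventions.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Load VGG-16 with ImageNet weights, including the fully connected layers.
base = VGG16(weights="imagenet", include_top=True)

# Drop the final softmax ('predictions') layer by taking the output of 'fc2',
# which yields a 4096-dimensional encoding per image.
encoder = Model(inputs=base.input, outputs=base.get_layer("fc2").output)
encoder.summary()
```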
Here I start on the language decoder model. I need to pass the image encoding as the initial hidden state of the first LSTM cell (i.e., h0 = xi). However, this only works if the hidden state has dimension 4096, which is far too high-dimensional. To get a more reasonably sized representation, I insert a linear layer that projects the encoding from 4096 down to 300.
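A minimal sketch of how this wiring could look, again assuming Keras; the vocabulary size, caption length, and the initialization of the cell state c0 are assumptions for illustration, not fixed by the description above:

```python
# Sketch: project the 4096-d image encoding to 300-d and use it to
# initialize the LSTM decoder's hidden state (h0 = projected xi).
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM
from tensorflow.keras.models import Model

vocab_size = 10000   # hypothetical vocabulary size
max_len = 20         # hypothetical maximum caption length
hidden_dim = 300     # projected encoding / LSTM hidden size

# Linear projection from 4096 to 300 (no activation, i.e., a linear layer).
image_encoding = Input(shape=(4096,))
h0 = Dense(hidden_dim)(image_encoding)
c0 = Dense(hidden_dim)(image_encoding)  # assumption: cell state initialized the same way

# Word inputs are embedded and fed to the LSTM, which starts from the image state.
caption_in = Input(shape=(max_len,))
embedded = Embedding(vocab_size, hidden_dim, mask_zero=True)(caption_in)
lstm_out = LSTM(hidden_dim, return_sequences=True)(embedded, initial_state=[h0, c0])

# Predict the next word at each timestep.
word_probs = Dense(vocab_size, activation="softmax")(lstm_out)
decoder = Model(inputs=[image_encoding, caption_in], outputs=word_probs)
```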