This is a reimplementation of the basic image captioning architecture (CNN-RNN).
CNN: ResNet18; RNN: LSTM; dataset: MSCOCO; toolkit: PyTorch
Image captioning is a set of techniques that help computers understand a given picture and describe it in natural language.
- Extract features from the input images with a convolutional neural network (a pretrained ResNet18 in this work).
- Input: batch of images of shape (N, C, H, W)
- Output: batch of features of shape (N, D)

N: batch size, C: image channels (RGB), H: image height, W: image width, D: feature dimension (512)
- Encode each sentence into a vector with a dictionary, inserting <start>, <end>, and <pad> tokens.
- Input: batch of strings of shape (N, *)
- Output: batch of vectors of shape (N, L)

N: batch size, *: length of the sentence, L: fixed length of the vector
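A minimal sketch of this encoding step is shown below. The `build_vocab`/`encode` helpers and the tiny two-sentence vocabulary are hypothetical stand-ins for illustration; the real dictionary would be built from the MSCOCO captions.

```python
# Special tokens; <unk> is an extra assumption for out-of-vocabulary words.
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"

def build_vocab(sentences):
    """Map each word (plus the special tokens) to an integer id."""
    vocab = {PAD: 0, START: 1, END: 2, UNK: 3}
    for s in sentences:
        for w in s.lower().split():
            vocab.setdefault(w, len(vocab))
    return vocab

def encode(sentence, vocab, max_len):
    """Turn one variable-length string into a fixed-length id vector (length L)."""
    ids = [vocab[START]]
    ids += [vocab.get(w, vocab[UNK]) for w in sentence.lower().split()]
    ids.append(vocab[END])
    ids = ids[:max_len]
    ids += [vocab[PAD]] * (max_len - len(ids))  # pad up to L
    return ids

vocab = build_vocab(["a dog runs", "a cat sleeps"])
print(encode("a dog sleeps", vocab, max_len=6))  # [1, 4, 5, 8, 2, 0]
```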
- Use a long short-term memory (LSTM) model as the RNN to realize the generation part.
- Input: batch of encoded captions of shape (N, L, C)
- Initial hidden state: extracted features of shape (N, D)
- Output: (N, L, C)

C: dictionary size
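The decoder step above can be sketched as follows. The class name, embedding size, and hidden size are assumptions for illustration, and the sketch feeds integer word ids through an `nn.Embedding` layer rather than one-hot (N, L, C) vectors, which is the usual equivalent formulation; the image features initialize the LSTM state, and the output scores over the dictionary have shape (N, L, C).

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """LSTM decoder: image features set the initial hidden state;
    encoded captions are unrolled into per-step word scores (N, L, C)."""
    def __init__(self, vocab_size, feature_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.init_h = nn.Linear(feature_dim, hidden_dim)  # features -> h0
        self.init_c = nn.Linear(feature_dim, hidden_dim)  # features -> c0
        self.fc = nn.Linear(hidden_dim, vocab_size)       # hidden -> word scores

    def forward(self, features, captions):
        # features: (N, D); captions: (N, L) integer word ids
        h0 = self.init_h(features).unsqueeze(0)  # (1, N, hidden_dim)
        c0 = self.init_c(features).unsqueeze(0)  # (1, N, hidden_dim)
        x = self.embed(captions)                 # (N, L, embed_dim)
        out, _ = self.lstm(x, (h0, c0))          # (N, L, hidden_dim)
        return self.fc(out)                      # (N, L, C), C = vocab_size

decoder = DecoderRNN(vocab_size=1000)
scores = decoder(torch.randn(2, 512), torch.randint(0, 1000, (2, 15)))
print(scores.shape)  # torch.Size([2, 15, 1000])
```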
The experiment metrics are as follows: