Heronalps/Visual_QA_Attn

Visual Question Answering with Hierarchical Co-Attention

1. Objective

The objective of this project is to build a deep learning model that answers open-ended questions about a given image.

2. Methodology

We implemented two models for Visual Question Answering (VQA), which we refer to as the Base Model and the Hierarchical Co-Attention Model. A typical VQA system takes an image and a question (represented as text) as inputs and produces an answer to the question as output. Systems differ in how the image and question features are encoded into a common vector space and in how that representation is decoded to obtain the answer. Typically, the image features are computed by a Convolutional Neural Network (CNN), whereas the text features are computed by a Recurrent Neural Network (RNN) to preserve the temporal information in the text. The Base Model determines the answer from aggregate features of the whole question and image, while the Hierarchical Co-Attention Model determines the answer from attended image and question features. We used the Base Model as a baseline for accuracy and results.

3. Baseline Model

The image and question are first embedded (encoded) into a common vector space; a decoder then decodes this representation to obtain the answer.

3.1 Encoder

The encoder part consists of image and question encoding.

3.1.1 Image Encoding

A VGG16 CNN pre-trained on ImageNet is used as the image encoder. VGG16 consists of 13 convolutional layers arranged in 5 blocks, followed by 3 fully connected layers, the last of which feeds a softmax. The output of a 4096-unit fully connected layer is used as the image feature vector.

3.1.2 Question Encoding

RNNs are used to encode the question into a vector space while preserving temporal information. We use an LSTM as the RNN module to mitigate the vanishing gradient problem. The LSTM is unrolled for a fixed number of steps, since we cap the maximum number of words a question can have. The state of the final LSTM step is used as the question feature. Each layer uses an LSTM with 512 units, so each layer yields a hidden state of size 512 and a cell state of size 512. The two states are concatenated into a vector of size 1024. Since we use two LSTM layers, the question feature is a vector of size 2048.

Since the image encoding (size 4096) and the question encoding (size 2048) differ in size, each is passed through its own fully connected layer that maps it to size 1024. The encoder therefore outputs two vectors of size 1024, one representing the image features and one representing the question features.
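
The size bookkeeping above can be checked with a short sketch. The function and constant names here are illustrative assumptions and are not taken from the repository; only the numbers come from the description.

```python
def question_feature_size(hidden_size: int, num_layers: int) -> int:
    """Size of the question vector built from the final LSTM states."""
    # Each layer contributes its hidden state and its cell state, concatenated.
    per_layer = hidden_size + hidden_size
    return per_layer * num_layers


IMAGE_FEATURE_SIZE = 4096  # output size of the VGG16 fully connected layer
COMMON_SIZE = 1024         # size both encodings are projected to

q_size = question_feature_size(hidden_size=512, num_layers=2)

# (in_features, out_features) of the two fully connected projection layers
image_projection_shape = (IMAGE_FEATURE_SIZE, COMMON_SIZE)
question_projection_shape = (q_size, COMMON_SIZE)

print(q_size, image_projection_shape, question_projection_shape)
```

With 512 units and two layers, the question feature has size 2048, so the question projection maps 2048 to 1024 while the image projection maps 4096 to 1024.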

3.2 Decoder

The decoder performs softmax classification on the image and question features produced by the encoder. It predicts the best answer among the 1000 most frequent answers in the dataset, which cover about 85% of all answers. The task is therefore treated as classification rather than answer generation. Classification proceeds as follows. First, the image and question features are multiplied pointwise to obtain a single vector of size 1024. This vector is fed to a fully connected layer of size 1000 followed by a softmax layer. The answer with the highest softmax output is the answer to the given question.
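
The decoding steps can be sketched in plain Python as follows. This is a minimal illustration with toy sizes (8 features and 5 answers instead of 1024 and 1000) and random weights; the function names are hypothetical and this is not the repository's implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode(image_feat, question_feat, weights, bias):
    """Fuse the features pointwise, apply a linear layer, pick the top answer."""
    fused = [i * q for i, q in zip(image_feat, question_feat)]
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy sizes stand in for the 1024-dimensional features and 1000 answers.
random.seed(0)
FEAT, ANSWERS = 8, 5
image_feat = [random.uniform(-1, 1) for _ in range(FEAT)]
question_feat = [random.uniform(-1, 1) for _ in range(FEAT)]
weights = [[random.uniform(-1, 1) for _ in range(FEAT)] for _ in range(ANSWERS)]
bias = [0.0] * ANSWERS

best, probs = decode(image_feat, question_feat, weights, bias)
print("predicted answer class:", best)
```

In a trained model the weights and bias would come from training rather than random initialization, and the predicted class would be mapped back to its answer string.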

4. Hierarchical Co-Attention Model

In the Base Model, the encoder takes the output of the final fully connected layer of the CNN and the state of the final LSTM step as its outputs. While these features represent the whole image and the whole question, no specific priority is given to particular parts of the question or particular regions of the image. In the Hierarchical Co-Attention Model, we consider multiple features of the image and the question and give priority to some of them. The priority given to certain features is called attention. Our model attends to image features based on the question features and attends to question features based on the image features. This is the so-called co-attention part of our model, which will be discussed further in the next section. Before explaining the model, we first introduce the attention mechanism, the features considered in the hierarchical model, and the co-attention mechanism.
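
As a rough illustration of attention, the sketch below weights a set of image region vectors by their similarity to a question vector. It uses generic dot-product attention as a stand-in; the exact scoring function used by the model is not specified here, and all names and toy vectors are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, candidates):
    """Dot-product attention: weight each candidate by similarity to the query."""
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    weights = softmax(scores)
    dim = len(candidates[0])
    # The attended feature is the attention-weighted sum of the candidates.
    attended = [sum(w * cand[d] for w, cand in zip(weights, candidates))
                for d in range(dim)]
    return weights, attended


# Toy example: one question vector attending over three image regions.
question_vec = [1.0, 0.0]
image_regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, attended = attend(question_vec, image_regions)
print(weights, attended)
```

The region most similar to the question receives the largest weight. Co-attention applies the same idea in both directions: an image vector can also serve as the query over the word features of the question.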

5. Datasets

In this project we used the Microsoft Common Objects in Context (COCO) dataset. The training and testing sets contain 82,783 and 40,504 images respectively. On the question side, the VQA project provides 443,757 questions for the training images, each with 10 answers. On average, each image has 5.4 corresponding questions. These answers were written by workers on Amazon Mechanical Turk and carry a confidence level of "yes", "no" or "maybe". Only answers with the "yes" confidence level are used for training and testing.

The questions in the dataset can be roughly grouped into five classes: fine-grained recognition ("What kind of cheese is on the pizza?"), object detection ("How many bikes are there?"), activity recognition ("Is this man laughing?"), knowledge base reasoning ("Is this a vegetarian pizza?"), and common sense reasoning ("Does this person have 20/20 vision?"). The top five question types are "What is" (13.84%), "What color" (8.98%), "What kind" (2.49%), "What are" (2.32%) and "What type" (1.78%). The five most frequent answers are yes (22.82%), no (15.35%), 2 (3.22%), 1 (1.87%) and white (1.68%).

5.1 Image

  • The CNN in the encoder is initialized from a pre-trained VGG16 model.
  • Every image is rescaled to [224 * 224 * 3] before being fed into the CNN.

5.2 Questions

  • To construct the vocabulary, we used the nltk library to tokenize all words present in the training questions.
  • Consequently, only validation questions whose words are all in the vocabulary are considered.
  • To feed them into the LSTM, all questions are left-padded to a maximum length of 25 in the encoder.
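
The question preprocessing steps can be sketched as below. To stay self-contained, this uses a simple regular expression in place of the nltk tokenizer; the padding token, the helper names and the truncation of over-long questions are assumptions made for illustration.

```python
import re

MAX_LEN = 25
PAD = "<pad>"  # hypothetical padding token, mapped to id 0


def tokenize(question):
    # Stand-in for the nltk tokenizer: lowercase words and punctuation marks.
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", question.lower())


def build_vocab(questions):
    """Assign an id to every token seen in the training questions."""
    vocab = {PAD: 0}
    for q in questions:
        for tok in tokenize(q):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(question, vocab, max_len=MAX_LEN):
    """Return left-padded token ids, or None if any word is out of vocabulary."""
    tokens = tokenize(question)
    if any(t not in vocab for t in tokens):
        return None
    # Assumption: questions longer than max_len keep their first max_len tokens.
    ids = [vocab[t] for t in tokens][:max_len]
    return [vocab[PAD]] * (max_len - len(ids)) + ids


train_questions = ["How many bikes are there?", "What color is the bike?"]
vocab = build_vocab(train_questions)

ids = encode("What color are the bikes?", vocab)
skipped = encode("Is this man laughing?", vocab)  # contains unseen words
print(ids, skipped)
```

Left padding places the real tokens at the end of the sequence, so the final LSTM step always sees the last word of the question rather than padding.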

5.3 Answers

  • Based on answer statistics, the 1000 most frequent answers in the dataset are kept; they account for about 85% of all answers.
  • Each answer has a confidence level. We pick the answers with confidence level 'yes'.
  • The answers are indexed from 1 to 1000 for softmax classification.
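
The answer preprocessing can be sketched as follows. The annotation layout (dictionaries with "answer" and "confidence" keys) and the function names are hypothetical, and the toy data uses the top 2 answers instead of the top 1000.

```python
from collections import Counter


def confident_answers(annotations):
    """Keep only answers whose confidence level is 'yes'."""
    return [a["answer"] for a in annotations if a["confidence"] == "yes"]


def build_answer_index(answers, top_k=1000):
    """Map the top_k most frequent answers to class ids 1..top_k."""
    most_common = Counter(answers).most_common(top_k)
    index = {ans: i for i, (ans, _) in enumerate(most_common, start=1)}
    # Fraction of all kept answers that fall inside the top_k classes.
    coverage = sum(count for _, count in most_common) / len(answers)
    return index, coverage


# Toy annotations; the real dataset has 10 answers per question.
annotations = [
    {"answer": "yes", "confidence": "yes"},
    {"answer": "no", "confidence": "yes"},
    {"answer": "yes", "confidence": "yes"},
    {"answer": "2", "confidence": "maybe"},
    {"answer": "white", "confidence": "yes"},
    {"answer": "yes", "confidence": "yes"},
    {"answer": "no", "confidence": "yes"},
    {"answer": "1", "confidence": "no"},
]

kept = confident_answers(annotations)
index, coverage = build_answer_index(kept, top_k=2)
print(index, round(coverage, 3))
```

On the real data the same coverage computation with top_k=1000 would correspond to the roughly 85% figure quoted above.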

6. Results

For the image and question pairs that are predicted correctly, the image tends to contain few objects that are clearly separated from each other. The question is also straightforward enough that anyone could answer it intuitively. For example, the second image in the top row asks "Which animal is this?" and both models predict the answer correctly because the bear is prominent in the image.

For the image and question pairs on which the base and co-attention models give different answers, we found that the cause is either the similarity between different objects in the image or the blending of the object of interest with the background. In such cases, the co-attention mechanism is able to focus on the specific area of the picture indicated by the question. As a result, the co-attention model gives the right answer where the base model does not.

For images containing numerous and vague objects, both models are unable to identify the right objects on which to base the answer. The polychrome nature of these images also contributes to the wrong predictions, as seen in the examples of wrong predictions. In future work, we expect to improve the encoder with a more sophisticated neural network that extracts fine-grained features and thereby improves the accuracy of the whole model.

References

[1] Xu H., Saenko K. (2016) Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In: Leibe B., Matas J., Sebe N., Welling M. (eds) Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol 9911. Springer, Cham.

[2] Lu, J., Yang, J., Batra, D., & Parikh, D. (2016). Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems (pp. 289-297).

[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015.

[4] Zhang, P., Goyal, Y., Summers-Stay, D., Batra, D., & Parikh, D. (2016, June). Yin and yang: Balancing and answering binary visual questions. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on (pp. 5014-5022). IEEE

[5] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017, July). Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR (Vol. 1, No. 6, p. 9).
