Vision and Language tasks (sebastianruder#443)
* Adding different VQA tasks and challenges

* Adding vqav2 and gqa

* Adding textvqa and vizwiz

* Incorporating suggestions

* Adding code links
shubhamagarwal92 authored Apr 26, 2020
1 parent ff960af commit 23c5e1d
Showing 1 changed file with 54 additions and 0 deletions: english/multimodal.md
@@ -45,4 +45,58 @@ The MOSI dataset ([Zadeh et al., 2016](https://arxiv.org/pdf/1606.06259.pdf)) is
| bc-LSTM (Poria et al., 2017) | 80.3% | [Context-Dependent Sentiment Analysis in User-Generated Videos](http://sentic.net/context-dependent-sentiment-analysis-in-user-generated-videos.pdf) |
| MARN (Zadeh et al., 2018) | 77.1% | [Multi-attention Recurrent Network for Human Communication Comprehension](https://arxiv.org/pdf/1802.00923.pdf) |

## Visual Question Answering

### VQAv2

Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

- [Website](https://visualqa.org)
- [Challenge](https://visualqa.org/challenge.html)

| Model | Accuracy | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| UNITER (Chen et al., 2019) | 73.4 | [UNITER: Learning Universal Image-Text Representations](https://arxiv.org/pdf/1909.11740.pdf) | [Link](https://github.com/ChenRocks/UNITER) |
| LXMERT (Tan et al., 2019) | 72.54 | [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) | [Link](https://github.com/airsplay/lxmert) |
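
The accuracy figures above come from the consensus-based VQA metric, in which a predicted answer is scored against the ten human-provided answers: an answer counts as fully correct if at least three annotators gave it. A simplified sketch of that metric follows (the official evaluation additionally normalizes answers and averages over subsets of annotators):

```python
def vqa_accuracy(predicted_answer, human_answers):
    """Simplified VQA accuracy: min(#annotators who gave the answer / 3, 1)."""
    matches = sum(ans == predicted_answer for ans in human_answers)
    return min(matches / 3.0, 1.0)

# A prediction matching at least 3 of the 10 human answers gets full credit.
print(vqa_accuracy("yellow", ["yellow"] * 4 + ["gold"] * 6))  # 1.0
print(vqa_accuracy("gold", ["yellow"] * 4 + ["gold"] * 6))    # 1.0
print(vqa_accuracy("blue", ["yellow"] * 4 + ["gold"] * 6))    # 0.0
```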

### GQA - Visual Reasoning in the Real World

GQA is a dataset for real-world visual reasoning and compositional question answering; its questions are generated from image scene graphs and require multi-step reasoning over the objects, attributes, and relations in a scene.

- [Website](https://cs.stanford.edu/people/dorarad/gqa/)
- [Challenge](https://cs.stanford.edu/people/dorarad/gqa/challenge.html)

| Model | Accuracy | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Kakao Brain | 73.24 | [GQA Challenge](https://drive.google.com/file/d/1CtFk0ldbN5w2qhwvfKrNzAFEj-I9Tjgy/view) | Unavailable |
| LXMERT (Tan et al., 2019) | 60.3 | [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) | [Link](https://github.com/airsplay/lxmert) |
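
LXMERT, which appears in both the VQAv2 and GQA tables above, is also available in the Hugging Face `transformers` library in addition to the official code linked in the tables. A minimal sketch of running its cross-modal encoder, assuming the `unc-nlp/lxmert-base-uncased` checkpoint and pre-extracted object-region features (replaced here by random tensors purely for illustration):

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Question text -> token ids
inputs = tokenizer("What color is the bus?", return_tensors="pt")

# Placeholder visual inputs: 36 region features (2048-d) and their normalized
# bounding boxes; real features would come from an object detector such as
# Faster R-CNN, as used in the LXMERT paper.
visual_feats = torch.rand(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
print(outputs.pooled_output.shape)  # cross-modal [CLS] representation, (1, 768)
```

A downstream VQA head would typically be a small classifier over a fixed answer vocabulary applied to this pooled representation.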

### TextVQA

TextVQA requires models to read and reason about text in images in order to answer questions about them.

- [Website](https://textvqa.org/)
- [Challenge](https://textvqa.org/challenge)

| Model | Accuracy | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| M4C (Hu et al., 2020) | 40.46 | [Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA](https://arxiv.org/pdf/1911.06258.pdf) | [Link](https://github.com/facebookresearch/pythia/tree/project/m4c/projects/M4C_Captioner) |


### VizWiz

This task focuses on answering visual questions that originate from a real use case: blind people submitted images together with recorded spoken questions in order to learn about their physical surroundings.

- [Website](https://vizwiz.org/tasks-and-datasets/vqa/)
- [Challenge](https://vizwiz.org/tasks-and-datasets/vqa/)

| Model | Accuracy | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Pythia | 54.22 | [FB's Pythia repository](https://github.com/facebookresearch/pythia/blob/master/docs/source/tutorials/pretrained_models.md) | [Link](https://github.com/facebookresearch/pythia/blob/master/docs/source/tutorials/pretrained_models.md) |
| BUTD Vizwiz (Gurari et al., 2018) | 46.9 | [VizWiz Grand Challenge: Answering Visual Questions from Blind People](https://arxiv.org/abs/1802.08218) | Unavailable |

## Other multimodal resources

- [awesome-multimodal-ml](https://github.com/pliang279/awesome-multimodal-ml)
- [awesome-vision-and-language-papers](https://github.com/sangminwoo/awesome-vision-and-language-papers)

[Go back to the README](../README.md)
