Vision and Language tasks (sebastianruder#443)
* Adding different VQA tasks and challenges

* Adding vqav2 and gqa

* Adding textvqa and vizwiz

* Incorporating suggestions

* Adding code links
shubhamagarwal92 authored Apr 26, 2020
1 parent ff960af commit 23c5e1d
Showing 1 changed file with 54 additions and 0 deletions: english/multimodal.md
@@ -45,4 +45,58 @@ The MOSI dataset ([Zadeh et al., 2016](https://arxiv.org/pdf/1606.06259.pdf)) is
| bc-LSTM (Poria et al., 2017) | 80.3% | [Context-Dependent Sentiment Analysis in User-Generated Videos](http://sentic.net/context-dependent-sentiment-analysis-in-user-generated-videos.pdf) |
| MARN (Zadeh et al., 2018) | 77.1% | [Multi-attention Recurrent Network for Human Communication Comprehension](https://arxiv.org/pdf/1802.00923.pdf) |

## Visual Question Answering

### VQAv2

Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

- [Website](https://visualqa.org)
- [Challenge](https://visualqa.org/challenge.html)

| Model | Accuracy | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| UNITER (Chen et al., 2019) | 73.4 | [UNITER: Learning Universal Image-Text Representations](https://arxiv.org/pdf/1909.11740.pdf) | [Link](https://github.com/ChenRocks/UNITER) |
| LXMERT (Tan et al., 2019) | 72.54 | [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) | [Link](https://github.com/airsplay/lxmert) |
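
The accuracy figures above come from the consensus-based VQA metric, in which a predicted answer is scored against the ten human-provided answers: an answer counts as fully correct if at least three annotators gave it. A simplified sketch of that metric follows (the official evaluation additionally normalizes answers and averages over subsets of annotators):

```python
def vqa_accuracy(predicted_answer, human_answers):
    """Simplified VQA accuracy: min(#annotators who gave the answer / 3, 1)."""
    matches = sum(ans == predicted_answer for ans in human_answers)
    return min(matches / 3.0, 1.0)

# A prediction matching at least 3 of the 10 human answers gets full credit.
print(vqa_accuracy("yellow", ["yellow"] * 4 + ["gold"] * 6))  # 1.0
print(vqa_accuracy("gold", ["yellow"] * 4 + ["gold"] * 6))    # 1.0
print(vqa_accuracy("blue", ["yellow"] * 4 + ["gold"] * 6))    # 0.0
```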

### GQA - Visual Reasoning in the Real World

GQA is a dataset for real-world visual reasoning and compositional question answering; its questions are generated from image scene graphs and require multi-step reasoning over the objects, attributes, and relations in a scene.

- [Website](https://cs.stanford.edu/people/dorarad/gqa/)
- [Challenge](https://cs.stanford.edu/people/dorarad/gqa/challenge.html)

| Model | Accuracy | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Kakao Brain | 73.24 | [GQA Challenge](https://drive.google.com/file/d/1CtFk0ldbN5w2qhwvfKrNzAFEj-I9Tjgy/view) | Unavailable |
| LXMERT (Tan et al., 2019) | 60.3 | [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) | [Link](https://github.com/airsplay/lxmert) |
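
LXMERT, which appears in both the VQAv2 and GQA tables above, is also available in the Hugging Face `transformers` library in addition to the official code linked in the tables. A minimal sketch of running its cross-modal encoder, assuming the `unc-nlp/lxmert-base-uncased` checkpoint and pre-extracted object-region features (replaced here by random tensors purely for illustration):

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Question text -> token ids
inputs = tokenizer("What color is the bus?", return_tensors="pt")

# Placeholder visual inputs: 36 region features (2048-d) and their normalized
# bounding boxes; real features would come from an object detector such as
# Faster R-CNN, as used in the LXMERT paper.
visual_feats = torch.rand(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
print(outputs.pooled_output.shape)  # cross-modal [CLS] representation, (1, 768)
```

A downstream VQA head would typically be a small classifier over a fixed answer vocabulary applied to this pooled representation.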

### TextVQA

TextVQA requires models to read and reason about text in images in order to answer questions about them.

- [Website](https://textvqa.org/)
- [Challenge](https://textvqa.org/challenge)

| Model | Accuracy | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| M4C (Hu et al., 2020) | 40.46 | [Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA](https://arxiv.org/pdf/1911.06258.pdf) | [Link](https://github.com/facebookresearch/pythia/tree/project/m4c/projects/M4C_Captioner) |


### VizWiz

This task focuses on answering visual questions that originate from a real use case: blind people submitted images together with recorded spoken questions in order to learn about their physical surroundings.

- [Website](https://vizwiz.org/tasks-and-datasets/vqa/)
- [Challenge](https://vizwiz.org/tasks-and-datasets/vqa/)

| Model | Accuracy | Paper / Source | Code |
| ------------- | :-----:| --- | --- |
| Pythia | 54.22 | [FB's Pythia repository](https://github.com/facebookresearch/pythia/blob/master/docs/source/tutorials/pretrained_models.md) | [Link](https://github.com/facebookresearch/pythia/blob/master/docs/source/tutorials/pretrained_models.md) |
| BUTD Vizwiz (Gurari et al., 2018) | 46.9 | [VizWiz Grand Challenge: Answering Visual Questions from Blind People](https://arxiv.org/abs/1802.08218) | Unavailable |

## Other multimodal resources

- [awesome-multimodal-ml](https://github.com/pliang279/awesome-multimodal-ml)
- [awesome-vision-and-language-papers](https://github.com/sangminwoo/awesome-vision-and-language-papers)

[Go back to the README](../README.md)
