diff --git a/english/multimodal.md b/english/multimodal.md
index 7f10513d..fee42041 100644
--- a/english/multimodal.md
+++ b/english/multimodal.md
@@ -45,4 +45,58 @@ The MOSI dataset ([Zadeh et al., 2016](https://arxiv.org/pdf/1606.06259.pdf)) is
 | bc-LSTM (Poria et al., 2017) | 80.3% | [Context-Dependent Sentiment Analysis in User-Generated Videos](http://sentic.net/context-dependent-sentiment-analysis-in-user-generated-videos.pdf) |
 | MARN (Zadeh et al., 2018) | 77.1% | [Multi-attention Recurrent Network for Human Communication Comprehension](https://arxiv.org/pdf/1802.00923.pdf) |
+## Visual Question Answering
+
+### VQAv2
+
+Given an image and a natural language question about the image, the task is to provide an accurate natural language answer (the scoring rule is sketched after the table below).
+
+- [Website](https://visualqa.org)
+- [Challenge](https://visualqa.org/challenge.html)
+
+| Model | Accuracy | Paper / Source | Code |
+| ------------- | :-----:| --- | --- |
+| UNITER (Chen et al., 2019) | 73.4 | [UNITER: Learning Universal Image-Text Representations](https://arxiv.org/pdf/1909.11740.pdf) | [Link](https://github.com/ChenRocks/UNITER) |
+| LXMERT (Tan et al., 2019) | 72.54 | [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) | [Link](https://github.com/airsplay/lxmert) |
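+
+The accuracy numbers above (and those for TextVQA and VizWiz further down) follow the standard VQA consensus metric: each question comes with ten human answers, and a prediction earns partial credit according to how many annotators agree with it. Below is a minimal, illustrative sketch of that scoring rule; answer normalization (lowercasing, stripping punctuation and articles) is omitted, and `vqa_accuracy` is a hypothetical helper, not part of any of the linked codebases.
+
+```python
+def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
+    """Soft VQA accuracy of one prediction against the 10 human answers."""
+    scores = []
+    for i in range(len(human_answers)):
+        # Leave out annotator i and count exact matches among the rest,
+        # averaging over the ten leave-one-out subsets of annotators.
+        others = human_answers[:i] + human_answers[i + 1:]
+        matches = sum(ans == predicted for ans in others)
+        # An answer gets full credit once at least 3 annotators agree with it.
+        scores.append(min(matches / 3.0, 1.0))
+    return sum(scores) / len(scores)
+
+
+if __name__ == "__main__":
+    # Seven of ten annotators answered "2", so "2" scores 1.0.
+    print(vqa_accuracy("2", ["2"] * 7 + ["3", "two", "4"]))
+```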
+
+### GQA - Visual Reasoning in the Real World
+
+GQA focuses on compositional question answering and visual reasoning over real-world images.
+
+- [Website](https://cs.stanford.edu/people/dorarad/gqa/)
+- [Challenge](https://cs.stanford.edu/people/dorarad/gqa/challenge.html)
+
+| Model | Accuracy | Paper / Source | Code |
+| ------------- | :-----:| --- | --- |
+| Kakao Brain | 73.24 | [GQA Challenge](https://drive.google.com/file/d/1CtFk0ldbN5w2qhwvfKrNzAFEj-I9Tjgy/view) | Unavailable |
+| LXMERT (Tan et al., 2019) | 60.3 | [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) | [Link](https://github.com/airsplay/lxmert) |
+
+### TextVQA
+
+TextVQA requires models to read and reason about the text present in an image in order to answer questions about it.
+
+- [Website](https://textvqa.org/)
+- [Challenge](https://textvqa.org/challenge)
+
+| Model | Accuracy | Paper / Source | Code |
+| ------------- | :-----:| --- | --- |
+| M4C (Hu et al., 2020) | 40.46 | [Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA](https://arxiv.org/pdf/1911.06258.pdf) | [Link](https://github.com/facebookresearch/pythia/tree/project/m4c/projects/M4C_Captioner) |
+
+### VizWiz dataset
+
+This task focuses on answering visual questions that originate from a real use case: blind people submitted images together with recorded spoken questions in order to learn about their physical surroundings.
+
+- [Website](https://vizwiz.org/tasks-and-datasets/vqa/)
+- [Challenge](https://vizwiz.org/tasks-and-datasets/vqa/)
+
+| Model | Accuracy | Paper / Source | Code |
+| ------------- | :-----:| --- | --- |
+| Pythia | 54.22 | [Facebook's Pythia repository](https://github.com/facebookresearch/pythia/blob/master/docs/source/tutorials/pretrained_models.md) | [Link](https://github.com/facebookresearch/pythia/blob/master/docs/source/tutorials/pretrained_models.md) |
+| BUTD Vizwiz (Gurari et al., 2018) | 46.9 | [VizWiz Grand Challenge: Answering Visual Questions from Blind People](https://arxiv.org/abs/1802.08218) | Unavailable |
+
+## Other multimodal resources
+
+- [awesome-multimodal-ml](https://github.com/pliang279/awesome-multimodal-ml)
+- [awesome-vision-and-language-papers](https://github.com/sangminwoo/awesome-vision-and-language-papers)
+
 [Go back to the README](../README.md)