The IEMOCAP database (Busso et al., 2008) contains acted two-way conversations between 10 speakers, segmented into utterances. All conversations are in English. The database provides the following categorical labels: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise, and other.
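As a rough illustration of how such a corpus is typically handled downstream, here is a minimal sketch of an utterance record carrying the label set above; the field names are hypothetical and do not mirror the official IEMOCAP file layout.

```python
from dataclasses import dataclass

# Label set taken from the dataset description above.
EMOTION_LABELS = {
    "anger", "happiness", "sadness", "neutral",
    "excitement", "frustration", "fear", "surprise", "other",
}

@dataclass
class Utterance:
    """One segmented utterance from a dyadic conversation (hypothetical schema)."""
    dialogue_id: str   # which two-way conversation the utterance belongs to
    speaker: str       # which of the two speakers produced it
    text: str          # English transcript of the utterance
    label: str         # one of EMOTION_LABELS

utt = Utterance("dialogue_001", "speaker_A", "Well, that's just great.", "frustration")
assert utt.label in EMOTION_LABELS
```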
Monologue:
Model | Accuracy | Paper / Source |
---|---|---|
CHFusion (Poria et al., 2017) | 76.5% | Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling |
bc-LSTM (Poria et al., 2017) | 74.10% | Context-Dependent Sentiment Analysis in User-Generated Videos |
Conversational: The conversational setting enables models to capture the emotions expressed by the speakers over the course of a conversation, taking inter-speaker dependencies into account. Results below are reported as weighted accuracy (WAA); a sketch of one common definition of this metric follows the table.
Model | Weighted Accuracy (WAA) | Paper / Source |
---|---|---|
CMN (Hazarika et al., 2018) | 77.62% | Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos |
MemN2N | 75.08% | Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos |
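A minimal sketch of one common reading of weighted accuracy in emotion-recognition work: per-class accuracy weighted by each class's share of the test set (which reduces to plain overall accuracy, in contrast to unweighted accuracy, the unweighted mean of per-class recall). Whether CMN and MemN2N report exactly this formulation should be verified against the paper.

```python
from collections import Counter

def weighted_accuracy(y_true, y_pred):
    """Per-class accuracy weighted by class frequency (one common WAA definition)."""
    class_counts = Counter(y_true)
    correct_per_class = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum(
        (class_counts[c] / total) * (correct_per_class[c] / class_counts[c])
        for c in class_counts
    )

# e.g. weighted_accuracy(["anger", "neutral", "neutral"],
#                        ["anger", "neutral", "sadness"]) == 2/3
```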
Mohammad et al. (2016) created a dataset of verb-noun pairs from WordNet that have multiple senses, and annotated these pairs for metaphoricity (metaphor or not a metaphor). The dataset is in English.
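As a rough, illustrative baseline for this kind of binary pair classification (not the multimodal word2vec-plus-CNN entry in the table below), one can simply threshold the cosine similarity between the two words' embeddings; the vector file path and the threshold here are placeholders.

```python
from gensim.models import KeyedVectors

# Placeholder path: any pretrained word2vec vectors in the standard binary format.
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def is_metaphorical(verb, noun, threshold=0.1):
    """Flag a verb-noun pair as metaphorical when the two words' embeddings
    are distributionally dissimilar; the threshold is a hypothetical, untuned value."""
    return vectors.similarity(verb, noun) < threshold

# e.g. compare is_metaphorical("devour", "book") with is_metaphorical("devour", "meal")
```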
Model | F1 Score | Paper / Source | Code |
---|---|---|---|
5-layer convolutional network (Krizhevsky et al., 2012), Word2Vec | 0.75 | Shutova et al., 2016 | Unavailable
Tsvetkov et al. (2014) created a dataset of adjective-noun pairs that they then annotated for metaphoricity. The dataset is in English.
Model | F1 Score | Paper / Source | Code |
---|---|---|---|
5-layer convolutional network (Krizhevsky et al., 2012), Word2Vec | 0.79 | Shutova et al., 2016 | Unavailable
The MOSI dataset (Zadeh et al., 2016) is rich in sentiment expressions: it consists of videos in which 93 people review topics in English. The videos are segmented, and each segment's sentiment is scored from +3 (strongly positive) to -3 (strongly negative) by 5 annotators.
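The accuracy figures below are commonly computed on binary sentiment obtained by reducing each segment's annotator scores to a single polarity; here is a minimal sketch of that reduction, assuming a simple mean and a zero threshold (individual papers may differ, e.g. in how neutral or zero scores are handled).

```python
def segment_polarity(annotator_scores, threshold=0.0):
    """Average a segment's scores (each in [-3, +3], from 5 annotators)
    and map the result to a binary sentiment label."""
    mean_score = sum(annotator_scores) / len(annotator_scores)
    return "positive" if mean_score >= threshold else "negative"

# e.g. segment_polarity([2, 3, 1, 2, 2]) -> "positive"
```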
Model | Accuracy | Paper / Source |
---|---|---|
bc-LSTM (Poria et al., 2017) | 80.3% | Context-Dependent Sentiment Analysis in User-Generated Videos |
MARN (Zadeh et al., 2018) | 77.1% | Multi-attention Recurrent Network for Human Communication Comprehension |
VQA: Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
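Accuracy here usually refers to the soft VQA metric, under which a predicted answer counts as fully correct when at least 3 of the 10 human-provided answers match it. A minimal sketch follows; the official evaluation additionally normalizes answer strings and averages over subsets of 9 annotators, which this omits.

```python
def vqa_accuracy(predicted, human_answers):
    """Soft VQA accuracy for one question: min(#matching human answers / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)

# e.g. vqa_accuracy("blue", ["blue", "blue"] + ["navy"] * 8) -> 0.67 (2 of 10 humans agreed)
```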
Model | Accuracy | Paper / Source | Code |
---|---|---|---|
UNITER (Chen et al., 2019) | 73.4 | UNITER: Learning Universal Image-Text Representations | Link
LXMERT (Tan & Bansal, 2019) | 72.54 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Link
GQA focuses on compositional question answering and reasoning over real-world images.
Model | Accuracy | Paper / Source | Code |
---|---|---|---|
Kakao Brain | 73.24 | GQA Challenge | Unavailable
LXMERT (Tan & Bansal, 2019) | 60.3 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Link
TextVQA requires models to read and reason about text in an image in order to answer questions about it.
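M4C (in the table below) predicts the answer iteratively and, per its title, uses a pointer mechanism so that answer words can be copied from OCR tokens detected in the image rather than drawn only from a fixed vocabulary. A rough conceptual sketch of that per-step choice, with hypothetical score inputs standing in for whatever the decoder actually produces:

```python
def pick_answer_token(vocab_scores, ocr_scores, vocab, ocr_tokens):
    """One decoding step: return either a fixed-vocabulary word or an OCR token
    read from the image, whichever the (hypothetical) decoder scores highest."""
    best_vocab = max(range(len(vocab)), key=lambda i: vocab_scores[i])
    candidates = [(vocab_scores[best_vocab], vocab[best_vocab])]
    if ocr_tokens:
        best_ocr = max(range(len(ocr_tokens)), key=lambda i: ocr_scores[i])
        candidates.append((ocr_scores[best_ocr], ocr_tokens[best_ocr]))
    return max(candidates)[1]

# e.g. an answer like "stop" may be copied directly from a detected "STOP" sign token.
```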
Model | Accuracy | Paper / Source | Code |
---|---|---|---|
M4C (Hu et al., 2020) | 40.46 | Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA | Link |
This task focuses on answering visual questions that originate from a real use case: blind people submitted images together with recorded spoken questions in order to learn about their physical surroundings.
Model | Accuracy | Paper / Source | Code |
---|---|---|---|
Pythia | 54.22 | FB's Pythia repository | Link |
BUTD VizWiz (Gurari et al., 2018) | 46.9 | VizWiz Grand Challenge: Answering Visual Questions from Blind People | Unavailable