- Multimodal Datasets https://huggingface.co/datasets?sort=likes&search=multimodal (see the query sketch after the tables in this section)
Name | Paper | Link | Notes |
---|---|---|---|
VEGA | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | Link | A dataset for enhancing model capabilities in comprehension of interleaved information |
ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | Vision and language caption and instruction dataset generated by GPT4V |
IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | Dehallucinative visual instruction for "I Know" hallucination |
CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset |
M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and the region-level visual prompting data |
LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset with word-level interleaved multi-image and text interactions, designed to strengthen the conversational competence of instruction-following LLMs across multiple images and dialogue turns |
StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collect visual instruction tuning data |
M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data |
SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs |
mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding |
PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo. |
ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multi-modal instruction-tuning dataset for chart understanding and generation |
LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for Text-rich Image Understanding |
MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset covering multiple human motion-related tasks |
LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | A visual instruction tuning dataset for addressing the hallucination issue |
Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset in terms of multi-turn dialogue |
LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | 100K high-quality video instruction dataset |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning |
M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | Large-scale, broad-coverage multimodal instruction tuning dataset |
LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | Multimodal instruction tuning dataset covering 16 multimodal tasks |
DetGPT | DetGPT: Detect What You Need via Reasoning | Link | Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs |
PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | Large-scale medical visual question-answering dataset |
VideoChat | VideoChat: Chat-Centric Video Understanding | Link | Video-centric multimodal instruction dataset |
X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | Chinese multimodal instruction dataset |
LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset |
cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | A multimodal aligned dataset for improving the model's usability and generation fluency |
LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
- Multimodal In-Context Learning
Name | Paper | Link | Notes |
---|---|---|---|
MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs. |
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction dataset |
- Multimodal Chain-of-Thought
Name | Paper | Link | Notes |
---|---|---|---|
EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for the explainable emotion reasoning task |
EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
VIP | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains |
- Multimodal RLHF
Name | Paper | Link | Notes |
---|---|---|---|
VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |
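The Hub search linked at the top of this section can also be run programmatically. Below is a minimal sketch (not tied to any specific entry above, and assuming the `huggingface_hub` and `datasets` packages are installed): it lists multimodal datasets sorted by likes and then tries to load the top hit for inspection. Some repos additionally need `data_files=`, a named configuration, or gated-access approval, so the loading step may require per-repo adjustments.

```python
# Minimal sketch: mirror https://huggingface.co/datasets?sort=likes&search=multimodal
# programmatically, then load one hit for a quick look.
# Requires: pip install huggingface_hub datasets
from huggingface_hub import HfApi
from datasets import load_dataset

api = HfApi()

# Ten most-liked datasets matching "multimodal".
results = list(api.list_datasets(search="multimodal", sort="likes", direction=-1, limit=10))
for info in results:
    print(info.id, info.likes)

# Try the top hit; note that some repos need extra arguments (data_files,
# a config name) or license acceptance before load_dataset will succeed.
if results:
    ds = load_dataset(results[0].id, split="train")
    print(ds[0])
```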
- Sentiment Analysis
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
EmoDB | A Database of German Emotional Speech | Paper | Dataset |
VAM | The Vera am Mittag German Audio-Visual Emotional Speech Database | Paper | Dataset |
IEMOCAP | IEMOCAP: interactive emotional dyadic motion capture database | Paper | Dataset |
Mimicry | A Multimodal Database for Mimicry Analysis | Paper | Dataset |
YouTube | Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web | [Paper](https://ict.usc.edu/pubs/Towards%20Multimodal%20Sentiment%20Analysis-%20Harvesting%20Opinions%20from%20The%20Web.pdf) | Dataset |
HUMAINE | The HUMAINE database | Paper | Dataset |
Large Movies | Sentiment classification on Large Movie Review | Paper | Dataset |
SEMAINE | The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent | Paper | Dataset |
AFEW | Collecting Large, Richly Annotated Facial-Expression Databases from Movies | Paper | Dataset |
SST | Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank | Paper | Dataset |
ICT-MMMO | YouTube Movie Reviews: Sentiment Analysis in an AudioVisual Context | Paper | Dataset |
RECOLA | Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions | Paper | Dataset |
MOUD | Utterance-Level Multimodal Sentiment Analysis | Paper | |
CMU-MOSI | MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos | Paper | Dataset |
POM | Multimodal Analysis and Prediction of Persuasiveness in Online Social Multimedia | Paper | Dataset |
MELD | MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations | Paper | Dataset |
CMU-MOSEI | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | Paper | Dataset |
AMMER | Towards Multimodal Emotion Recognition in German Speech Events in Cars using Transfer Learning | Paper | On Request |
SEWA | SEWA DB: A Rich Database for Audio-Visual Emotion and Sentiment Research in the Wild | Paper | Dataset |
Fakeddit | r/fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection | Paper | Dataset |
CMU-MOSEAS | CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French | Paper | Dataset |
MultiOFF | Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text | Paper | Dataset |
MEISD | MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations | Paper | Dataset |
TASS | Overview of TASS 2020: Introducing Emotion | Paper | Dataset |
CH SIMS | CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality | Paper | Dataset |
Creep-Image | A Multimodal Dataset of Images and Text | Paper | Dataset |
Entheos | Entheos: A Multimodal Dataset for Studying Enthusiasm | Paper | Dataset |
- Machine Translation
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
Multi30K | Multi30K: Multilingual English-German Image Descriptions | Paper | Dataset |
How2 | How2: A Large-scale Dataset for Multimodal Language Understanding | Paper | Dataset |
MLT | Multimodal Lexical Translation | Paper | Dataset |
IKEA | A Visual Attention Grounding Neural Model for Multimodal Machine Translation | Paper | Dataset |
Flickr30K (EN- (hi-IN)) | Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data | Paper | On Request |
Hindi Visual Genome | Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation | Paper | Dataset |
HowTo100M | Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models | Paper | Dataset |
- Information Retrieval
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
MUSICLEF | MusiCLEF: a Benchmark Activity in Multimodal Music Information Retrieval | Paper | Dataset |
Moodo | The Moodo dataset: Integrating user context with emotional and color perception of music for affective music information retrieval | Paper | Dataset |
ALF-200k | ALF-200k: Towards Extensive Multimodal Analyses of Music Tracks and Playlists | Paper | Dataset |
MQA | Can Image Captioning Help Passage Retrieval in Multimodal Question Answering? | Paper | Dataset |
WAT2019 | WAT2019: English-Hindi Translation on Hindi Visual Genome Dataset | Paper | Dataset |
ViTT | Multimodal Pretraining for Dense Video Captioning | Paper | Dataset |
MTD | MTD: A Multimodal Dataset of Musical Themes for MIR Research | Paper | Dataset |
MusiClef | A professionally annotated and enriched multimodal data set on popular music | Paper | Dataset |
Schubert Winterreise | Schubert Winterreise dataset: A multimodal scenario for music analysis | Paper | Dataset |
WIT | WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Paper | Dataset |
- Question Answering
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
MQA | A Dataset for Multimodal Question Answering in the Cultural Heritage Domain | Paper | - |
MovieQA | MovieQA: Understanding Stories in Movies through Question-Answering | Paper | Dataset |
PororoQA | DeepStory: Video Story QA by Deep Embedded Memory Networks | Paper | Dataset |
MemexQA | MemexQA: Visual Memex Question Answering | Paper | Dataset |
VQA | Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering | Paper | Dataset |
TDIUC | An analysis of visual question answering algorithms | Paper | Dataset |
TGIF-QA | TGIF-QA: Toward spatio-temporal reasoning in visual question answering | Paper | Dataset |
MSVD QA, MSRVTT QA | Video question answering via attribute augmented attention network learning | Paper | Dataset |
YouTube2Text | Video Question Answering via Gradually Refined Attention over Appearance and Motion | Paper | Dataset |
MovieFIB | A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering | Paper | Dataset |
Video Context QA | Uncovering the temporal context for video question answering | Paper | Dataset |
MarioQA | Marioqa: Answering questions by watching gameplay videos | Paper | Dataset |
TVQA | Tvqa: Localized, compositional video question answering | Paper | Dataset |
VQA-CP v2 | Don’t just assume; look and answer: Overcoming priors for visual question answering | Paper | Dataset |
RecipeQA | RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes | Paper | Dataset |
GQA | GQA: A new dataset for real-world visual reasoning and compositional question answering | Paper | Dataset |
Social IQ | Social-iq: A question answering benchmark for artificial social intelligence | Paper | Dataset |
MIMOQA | MIMOQA: Multimodal Input Multimodal Output Question Answering | Paper | - |
- Summarization
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
SumMe | Creating summaries from user videos | Paper | Dataset |
TVSum | TVSum: Summarizing web videos using titles | Paper | Dataset |
QFVS | Query-focused video summarization: Dataset, evaluation, and a memory network based approach | Paper | Dataset |
MMSS | Multi-modal Sentence Summarization with Modality Attention and Image Filtering | Paper | - |
MSMO | MSMO: Multimodal Summarization with Multimodal Output | Paper | - |
Screen2Words | Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning | Paper | Dataset |
AVIATE | See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization | Paper | Dataset |
Multimodal Microblog Summarization | On Multimodal Microblog Summarization | Paper | - |
- Human Computer Interaction
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
CUAVE | CUAVE: A new audio-visual database for multimodal human-computer interface research | Paper | Dataset |
MHAD | Berkeley mhad: A comprehensive multimodal human action database | Paper | Dataset |
Multi-party interactions | A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction | Paper | - |
MHHRI | Multimodal human-human-robot interactions (mhhri) dataset for studying personality and engagement | Paper | Dataset |
Red Hen Lab | Red Hen Lab: Dataset and Tools for Multimodal Human Communication Research | Paper | - |
EMRE | Generating a Novel Dataset of Multimodal Referring Expressions | Paper | Dataset |
Chinese Whispers | Chinese whispers: A multimodal dataset for embodied language grounding | Paper | Dataset |
uulmMAC | The uulmMAC database—A multimodal affective corpus for affective computing in human-computer interaction | Paper | Dataset |
- Semantic Analysis
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
WN9-IMG | Image-embodied Knowledge Representation Learning | Paper | Dataset |
Wikimedia Commons | A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions | Paper | Dataset |
Starsem18-multimodalKB | A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning | Paper | Dataset |
MUStARD | Towards Multimodal Sarcasm Detection | Paper | Dataset |
YouMakeup | YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension | Paper | Dataset |
MDID | Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts | Paper | Dataset |
Social media posts from Flickr (Mental Health) | Inferring Social Media Users’ Mental Health Status from Multimodal Information | Paper | Dataset |
Twitter MEL | Building a Multimodal Entity Linking Dataset From Tweets | Paper | Dataset |
MultiMET | MultiMET: A Multimodal Dataset for Metaphor Understanding | Paper | - |
MSDS | Multimodal Sarcasm Detection in Spanish: a Dataset and a Baseline | Paper | Dataset |
- Miscellaneous
Dataset | Title of the Paper | Link of the Paper | Link of the Dataset |
---|---|---|---|
MS COCO | Microsoft COCO: Common objects in context | Paper | Dataset |
ILSVRC | ImageNet Large Scale Visual Recognition Challenge | Paper | Dataset |
YFCC100M | YFCC100M: The new data in multimedia research | Paper | Dataset |
COGNIMUSE | COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization | Paper | Dataset |
SNAG | SNAG: Spoken Narratives and Gaze Dataset | Paper | Dataset |
UR-Funny | UR-FUNNY: A Multimodal Language Dataset for Understanding Humor | Paper | Dataset |
Bag-of-Lies | Bag-of-Lies: A Multimodal Dataset for Deception Detection | Paper | Dataset |
MARC | A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks | Paper | Dataset |
MuSE | MuSE: a Multimodal Dataset of Stressed Emotion | Paper | Dataset |
BabelPic | Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts | Paper | Dataset |
Eye4Ref | Eye4Ref: A Multimodal Eye Movement Dataset of Referentially Complex Situations | Paper | - |
Troll Memes | A Dataset for Troll Classification of TamilMemes | Paper | Dataset |
SEMD | EmoSen: Generating sentiment and emotion controlled responses in a multimodal dialogue system | Paper | - |
Chat talk Corpus | Construction and Analysis of a Multimodal Chat-talk Corpus for Dialog Systems Considering Interpersonal Closeness | Paper | - |
EMOTyDA | Towards Emotion-aided Multi-modal Dialogue Act Classification | Paper | Dataset |
MELINDA | MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification | Paper | Dataset |
NewsCLIPpings | NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media | Paper | Dataset |
R2VQ | Designing Multimodal Datasets for NLP Challenges | Paper | Dataset |
M2H2 | M2H2: A Multimodal Multiparty Hindi Dataset For Humor Recognition in Conversations |