# datasets-read-notes


## Awesome Datasets

### Datasets of Pre-Training for Alignment

| Name | Paper | Type | Modalities |
|------|-------|------|------------|
| ShareGPT4Video | ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | Caption | Video-Text |
| COYO-700M | COYO-700M: Image-Text Pair Dataset | Caption | Image-Text |
| ShareGPT4V | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Caption | Image-Text |
| AS-1B | The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | Hybrid | Image-Text |
| InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | Caption | Video-Text |
| MS-COCO | Microsoft COCO: Common Objects in Context | Caption | Image-Text |
| SBU Captions | Im2Text: Describing Images Using 1 Million Captioned Photographs | Caption | Image-Text |
| Conceptual Captions | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | Caption | Image-Text |
| LAION-400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | Caption | Image-Text |
| VG Captions | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Caption | Image-Text |
| Flickr30k | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Caption | Image-Text |
| AI-Caps | AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding | Caption | Image-Text |
| Wukong Captions | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | Caption | Image-Text |
| GRIT | Kosmos-2: Grounding Multimodal Large Language Models to the World | Caption | Image-Text-Bounding-Box |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Caption | Video-Text |
| MSR-VTT | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | Caption | Video-Text |
| WebVid-10M | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Caption | Video-Text |
| WavCaps | WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | Caption | Audio-Text |
| AISHELL-1 | AISHELL-1: An Open-Source Mandarin Speech Corpus and a Speech Recognition Baseline | ASR | Audio-Text |
| AISHELL-2 | AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | ASR | Audio-Text |
| VSDial-CN | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ASR | Image-Audio-Text |
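Most of the caption-style corpora above (MS-COCO, SBU Captions, Conceptual Captions, LAION-400M) reduce to image-text pairs. A minimal sketch of grouping captions per image, assuming the published MS-COCO captions schema (a top-level `annotations` list whose items carry `image_id` and `caption`); the file path in the comment is the one shipped with the COCO 2017 release:

```python
import json
from collections import defaultdict

def captions_by_image(coco: dict) -> dict:
    """Group caption strings by image id, following the MS-COCO
    captions schema ("annotations" items with "image_id"/"caption")."""
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[ann["image_id"]].append(ann["caption"])
    return dict(grouped)

# Typical usage against the COCO 2017 annotations file:
# coco = json.load(open("annotations/captions_train2017.json"))
# pairs = captions_by_image(coco)
```

Other caption datasets use their own serializations (TSV, parquet, webdataset shards), but the same image-id-to-captions grouping applies.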

### Datasets of Multimodal Instruction Tuning

| Name | Paper | Link | Notes |
|------|-------|------|-------|
| VEGA | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | Link | A dataset for enhancing model capabilities in comprehension of interleaved information |
| ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | Vision and language caption and instruction dataset generated by GPT-4V |
| IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | Dehallucinative visual instruction data for the "I Know" hallucination |
| CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset |
| M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
| ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and region-level visual prompting data |
| LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
| ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
| SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset of word-level interleaved multi-image and text interactions, built to strengthen the multi-image, multi-turn conversational competence of instruction-following LLMs |
| StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collecting visual instruction tuning data |
| M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset for training and benchmarking hallucination detection and prevention |
| MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
| BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text caption data and audio-image-text localization data |
| SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M visual instruction tuning samples, including conversations, detailed descriptions, complex reasoning, and referring QAs |
| mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset covering a wide range of visual-text understanding tasks, including OCR-free document understanding |
| PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo |
| ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multimodal instruction-tuning dataset for chart understanding and generation |
| LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for text-rich image understanding |
| MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset covering multiple human motion-related tasks |
| LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | A visual instruction tuning dataset that addresses hallucination |
| Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-turn multimodal dialogue instruction dataset |
| LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multimodal instruction tuning dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A 100K high-quality video instruction dataset |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning data |
| M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | A large-scale, broad-coverage multimodal instruction tuning dataset |
| LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
| GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
| MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | A multimodal instruction tuning dataset covering 16 multimodal tasks |
| DetGPT | DetGPT: Detect What You Need via Reasoning | Link | An instruction-tuning dataset with 5,000 images and around 30,000 query-answer pairs |
| PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | A large-scale medical visual question-answering dataset |
| VideoChat | VideoChat: Chat-Centric Video Understanding | Link | A video-centric multimodal instruction dataset |
| X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | A Chinese multimodal instruction dataset |
| LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multimodal instruction-tuning dataset |
| cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | A multimodal aligned dataset that improves model usability and generation fluency |
| LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
| MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
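Several of these sets (LLaVA-Instruct-150K being the best known) ship as JSON records with an image reference and a `conversations` list that alternates between a human turn and a model turn. A minimal sketch of flattening one such record into (instruction, response) pairs, assuming the commonly published LLaVA field names (`from`, `value`), which may differ in other datasets:

```python
def to_pairs(record: dict) -> list:
    """Flatten a LLaVA-style conversation record into
    (instruction, response) string pairs."""
    turns = record["conversations"]
    pairs = []
    # Assume strict human/gpt alternation, as in the LLaVA release.
    for human, gpt in zip(turns[::2], turns[1::2]):
        assert human["from"] == "human" and gpt["from"] == "gpt"
        pairs.append((human["value"], gpt["value"]))
    return pairs
```

The `<image>` placeholder inside the first human turn marks where the image features are spliced into the prompt at training time.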

### Datasets of In-Context Learning

| Name | Paper | Link | Notes |
|------|-------|------|-------|
| MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset with interleaved text-image inputs, inter-related multi-image inputs, and multimodal in-context learning inputs |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | A multimodal in-context instruction dataset |

### Datasets of Multimodal Chain-of-Thought

| Name | Paper | Link | Notes |
|------|-------|------|-------|
| EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for the explainable emotion reasoning task |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | A large-scale embodied planning dataset |
| VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset for evaluating video chain-of-thought |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | A large-scale multi-choice dataset featuring multimodal science questions across diverse domains |
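Multi-choice CoT benchmarks like ScienceQA are typically scored by extracting the model's final choice from its free-form reasoning and comparing it against the gold option. A common heuristic is a regex over the last "The answer is X" span; the pattern below is an illustrative assumption, not the official ScienceQA evaluation code:

```python
import re
from typing import Optional

# Heuristic: capture a single option letter after "the answer is",
# with or without parentheses. Option range A-E is an assumption.
ANSWER_RE = re.compile(r"[Tt]he answer is \(?([A-E])\)?")

def extract_choice(cot_answer: str) -> Optional[str]:
    """Return the last stated option letter in a chain-of-thought
    answer, or None if no such span is found."""
    matches = ANSWER_RE.findall(cot_answer)
    return matches[-1] if matches else None
```

Taking the last match, rather than the first, handles answers that restate intermediate options before committing to a final one.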

### Datasets of Multimodal RLHF

| Name | Paper | Link | Notes |
|------|-------|------|-------|
| VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |
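Feedback datasets of this kind score several candidate responses per prompt, and preference optimization consumes them as (chosen, rejected) pairs. A generic sketch of that conversion; the field names (`responses`, `score`, `text`) are illustrative, not the actual VLFeedback schema:

```python
def to_preference_pair(sample: dict) -> tuple:
    """Pick the highest- and lowest-scored responses of a sample
    as the (chosen, rejected) pair for preference training."""
    ranked = sorted(sample["responses"], key=lambda r: r["score"], reverse=True)
    return ranked[0]["text"], ranked[-1]["text"]
```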

## Multimodal Datasets for NLP Applications

### 1. Sentiment Analysis

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| EmoDB | A Database of German Emotional Speech | Paper | Dataset |
| VAM | The Vera am Mittag German Audio-Visual Emotional Speech Database | Paper | Dataset |
| IEMOCAP | IEMOCAP: Interactive Emotional Dyadic Motion Capture Database | Paper | Dataset |
| Mimicry | A Multimodal Database for Mimicry Analysis | Paper | Dataset |
| YouTube | Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web | [Paper](<https://ict.usc.edu/pubs/Towards Multimodal Sentiment Analysis- Harvesting Opinions from The Web.pdf>) | Dataset |
| HUMAINE | The HUMAINE Database | Paper | Dataset |
| Large Movies | Sentiment Classification on Large Movie Review | Paper | Dataset |
| SEMAINE | The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent | Paper | Dataset |
| AFEW | Collecting Large, Richly Annotated Facial-Expression Databases from Movies | Paper | Dataset |
| SST | Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank | Paper | Dataset |
| ICT-MMMO | YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context | Paper | Dataset |
| RECOLA | Introducing the RECOLA Multimodal Corpus of Remote Collaborative and Affective Interactions | Paper | Dataset |
| MOUD | Utterance-Level Multimodal Sentiment Analysis | Paper | - |
| CMU-MOSI | MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos | Paper | Dataset |
| POM | Multimodal Analysis and Prediction of Persuasiveness in Online Social Multimedia | Paper | Dataset |
| MELD | MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations | Paper | Dataset |
| CMU-MOSEI | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | Paper | Dataset |
| AMMER | Towards Multimodal Emotion Recognition in German Speech Events in Cars using Transfer Learning | Paper | On Request |
| SEWA | SEWA DB: A Rich Database for Audio-Visual Emotion and Sentiment Research in the Wild | Paper | Dataset |
| Fakeddit | Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection | Paper | Dataset |
| CMU-MOSEAS | CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French | Paper | Dataset |
| MultiOFF | Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text | Paper | Dataset |
| MEISD | MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations | Paper | Dataset |
| TASS | Overview of TASS 2020: Introducing Emotion | Paper | Dataset |
| CH-SIMS | CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality | Paper | Dataset |
| Creep-Image | A Multimodal Dataset of Images and Text | Paper | Dataset |
| Entheos | Entheos: A Multimodal Dataset for Studying Enthusiasm | Paper | Dataset |
### 2. Machine Translation

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| Multi30K | Multi30K: Multilingual English-German Image Descriptions | Paper | Dataset |
| How2 | How2: A Large-scale Dataset for Multimodal Language Understanding | Paper | Dataset |
| MLT | Multimodal Lexical Translation | Paper | Dataset |
| IKEA | A Visual Attention Grounding Neural Model for Multimodal Machine Translation | Paper | Dataset |
| Flickr30K (EN-(hi-IN)) | Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data | Paper | On Request |
| Hindi Visual Genome | Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation | Paper | Dataset |
| HowTo100M | Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models | Paper | Dataset |
### 3. Information Retrieval

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| MusiCLEF | MusiCLEF: a Benchmark Activity in Multimodal Music Information Retrieval | Paper | Dataset |
| Moodo | The Moodo Dataset: Integrating User Context with Emotional and Color Perception of Music for Affective Music Information Retrieval | Paper | Dataset |
| ALF-200k | ALF-200k: Towards Extensive Multimodal Analyses of Music Tracks and Playlists | Paper | Dataset |
| MQA | Can Image Captioning Help Passage Retrieval in Multimodal Question Answering? | Paper | Dataset |
| WAT2019 | WAT2019: English-Hindi Translation on Hindi Visual Genome Dataset | Paper | Dataset |
| ViTT | Multimodal Pretraining for Dense Video Captioning | Paper | Dataset |
| MTD | MTD: A Multimodal Dataset of Musical Themes for MIR Research | Paper | Dataset |
| MusiClef | A Professionally Annotated and Enriched Multimodal Data Set on Popular Music | Paper | Dataset |
| Schubert Winterreise | Schubert Winterreise Dataset: A Multimodal Scenario for Music Analysis | Paper | Dataset |
| WIT | WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Paper | Dataset |
### 4. Question Answering

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| MQA | A Dataset for Multimodal Question Answering in the Cultural Heritage Domain | Paper | - |
| MovieQA | MovieQA: Understanding Stories in Movies through Question-Answering | Paper | Dataset |
| PororoQA | DeepStory: Video Story QA by Deep Embedded Memory Networks | Paper | Dataset |
| MemexQA | MemexQA: Visual Memex Question Answering | Paper | Dataset |
| VQA | Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | Paper | Dataset |
| TDIUC | An Analysis of Visual Question Answering Algorithms | Paper | Dataset |
| TGIF-QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | Paper | Dataset |
| MSVD-QA, MSRVTT-QA | Video Question Answering via Attribute-Augmented Attention Network Learning | Paper | Dataset |
| YouTube2Text | Video Question Answering via Gradually Refined Attention over Appearance and Motion | Paper | Dataset |
| MovieFIB | A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering | Paper | Dataset |
| Video Context QA | Uncovering the Temporal Context for Video Question Answering | Paper | Dataset |
| MarioQA | MarioQA: Answering Questions by Watching Gameplay Videos | Paper | Dataset |
| TVQA | TVQA: Localized, Compositional Video Question Answering | Paper | Dataset |
| VQA-CP v2 | Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering | Paper | Dataset |
| RecipeQA | RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes | Paper | Dataset |
| GQA | GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | Paper | Dataset |
| Social IQ | Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence | Paper | Dataset |
| MIMOQA | MIMOQA: Multimodal Input Multimodal Output Question Answering | Paper | - |
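The VQA datasets above are usually scored with the soft accuracy metric introduced with the VQA benchmark: each question carries ten human answers, and a prediction's accuracy is averaged over every leave-one-out subset as min(#matching answers / 3, 1). A minimal sketch of that computation (the official evaluation additionally normalizes answer strings, which is omitted here):

```python
def vqa_accuracy(pred: str, human_answers: list) -> float:
    """Soft VQA accuracy: average over each leave-one-out subset of
    the human answers of min(#matches / 3, 1)."""
    accs = []
    for i in range(len(human_answers)):
        subset = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == pred for a in subset)
        accs.append(min(matches / 3.0, 1.0))
    return sum(accs) / len(accs)
```

The min(…, 1) cap means an answer given by three or more annotators counts as fully correct, which makes the metric robust to annotator disagreement.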
### 5. Summarization

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| SumMe | Creating Summaries from User Videos | Paper | Dataset |
| TVSum | TVSum: Summarizing Web Videos Using Titles | Paper | Dataset |
| QFVS | Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach | Paper | Dataset |
| MMSS | Multi-modal Sentence Summarization with Modality Attention and Image Filtering | Paper | - |
| MSMO | MSMO: Multimodal Summarization with Multimodal Output | Paper | - |
| Screen2Words | Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning | Paper | Dataset |
| AVIATE | See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization | Paper | Dataset |
| Multimodal Microblog Summarization | On Multimodal Microblog Summarization | Paper | - |
### 6. Human Computer Interaction

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| CUAVE | CUAVE: A New Audio-Visual Database for Multimodal Human-Computer Interface Research | Paper | Dataset |
| MHAD | Berkeley MHAD: A Comprehensive Multimodal Human Action Database | Paper | Dataset |
| Multi-party interactions | A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction | Paper | - |
| MHHRI | Multimodal Human-Human-Robot Interactions (MHHRI) Dataset for Studying Personality and Engagement | Paper | Dataset |
| Red Hen Lab | Red Hen Lab: Dataset and Tools for Multimodal Human Communication Research | Paper | - |
| EMRE | Generating a Novel Dataset of Multimodal Referring Expressions | Paper | Dataset |
| Chinese Whispers | Chinese Whispers: A Multimodal Dataset for Embodied Language Grounding | Paper | Dataset |
| uulmMAC | The uulmMAC Database: A Multimodal Affective Corpus for Affective Computing in Human-Computer Interaction | Paper | Dataset |
### 7. Semantic Analysis

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| WN9-IMG | Image-embodied Knowledge Representation Learning | Paper | Dataset |
| Wikimedia Commons | A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions | Paper | Dataset |
| Starsem18-multimodalKB | A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning | Paper | Dataset |
| MUStARD | Towards Multimodal Sarcasm Detection | Paper | Dataset |
| YouMakeup | YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension | Paper | Dataset |
| MDID | Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts | Paper | Dataset |
| Social media posts from Flickr (Mental Health) | Inferring Social Media Users' Mental Health Status from Multimodal Information | Paper | Dataset |
| Twitter MEL | Building a Multimodal Entity Linking Dataset From Tweets | Paper | Dataset |
| MultiMET | MultiMET: A Multimodal Dataset for Metaphor Understanding | Paper | - |
| MSDS | Multimodal Sarcasm Detection in Spanish: a Dataset and a Baseline | Paper | Dataset |
### 8. Miscellaneous

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| MS COCO | Microsoft COCO: Common Objects in Context | Paper | Dataset |
| ILSVRC | ImageNet Large Scale Visual Recognition Challenge | Paper | Dataset |
| YFCC100M | YFCC100M: The New Data in Multimedia Research | Paper | Dataset |
| COGNIMUSE | COGNIMUSE: A Multimodal Video Database Annotated with Saliency, Events, Semantics and Emotion with Application to Summarization | Paper | Dataset |
| SNAG | SNAG: Spoken Narratives and Gaze Dataset | Paper | Dataset |
| UR-FUNNY | UR-FUNNY: A Multimodal Language Dataset for Understanding Humor | Paper | Dataset |
| Bag-of-Lies | Bag-of-Lies: A Multimodal Dataset for Deception Detection | Paper | Dataset |
| MARC | A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks | Paper | Dataset |
| MuSE | MuSE: A Multimodal Dataset of Stressed Emotion | Paper | Dataset |
| BabelPic | Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts | Paper | Dataset |
| Eye4Ref | Eye4Ref: A Multimodal Eye Movement Dataset of Referentially Complex Situations | Paper | - |
| Troll Memes | A Dataset for Troll Classification of TamilMemes | Paper | Dataset |
| SEMD | EmoSen: Generating Sentiment and Emotion Controlled Responses in a Multimodal Dialogue System | Paper | - |
| Chat talk Corpus | Construction and Analysis of a Multimodal Chat-talk Corpus for Dialog Systems Considering Interpersonal Closeness | Paper | - |
| EMOTyDA | Towards Emotion-aided Multi-modal Dialogue Act Classification | Paper | Dataset |
| MELINDA | MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification | Paper | Dataset |
| NewsCLIPpings | NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media | Paper | Dataset |
| R2VQ | Designing Multimodal Datasets for NLP Challenges | Paper | Dataset |
| M2H2 | M2H2: A Multimodal Multiparty Hindi Dataset For Humor Recognition in Conversations | | |