# datasets-read-notes


## Awesome Datasets

### Datasets of Pre-Training for Alignment

| Name | Paper | Type | Modalities |
|------|-------|------|------------|
| ShareGPT4Video | ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | Caption | Video-Text |
| COYO-700M | COYO-700M: Image-Text Pair Dataset | Caption | Image-Text |
| ShareGPT4V | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Caption | Image-Text |
| AS-1B | The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | Hybrid | Image-Text |
| InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | Caption | Video-Text |
| MS-COCO | Microsoft COCO: Common Objects in Context | Caption | Image-Text |
| SBU Captions | Im2Text: Describing Images Using 1 Million Captioned Photographs | Caption | Image-Text |
| Conceptual Captions | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | Caption | Image-Text |
| LAION-400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | Caption | Image-Text |
| VG Captions | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Caption | Image-Text |
| Flickr30k | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Caption | Image-Text |
| AI-Caps | AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding | Caption | Image-Text |
| Wukong Captions | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | Caption | Image-Text |
| GRIT | Kosmos-2: Grounding Multimodal Large Language Models to the World | Caption | Image-Text-Bounding-Box |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Caption | Video-Text |
| MSR-VTT | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | Caption | Video-Text |
| WebVid-10M | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Caption | Video-Text |
| WavCaps | WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | Caption | Audio-Text |
| AISHELL-1 | AISHELL-1: An Open-Source Mandarin Speech Corpus and a Speech Recognition Baseline | ASR | Audio-Text |
| AISHELL-2 | AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | ASR | Audio-Text |
| VSDial-CN | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ASR | Image-Audio-Text |
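Most of the caption-style corpora above (MS-COCO, SBU Captions, Conceptual Captions, LAION-400M) reduce to image-text pairs. A minimal sketch of grouping captions per image, assuming the published MS-COCO captions schema (a top-level `annotations` list whose items carry `image_id` and `caption`); the file path in the comment is the one shipped with the COCO 2017 release:

```python
import json
from collections import defaultdict

def captions_by_image(coco: dict) -> dict:
    """Group caption strings by image id, following the MS-COCO
    captions schema ("annotations" items with "image_id"/"caption")."""
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[ann["image_id"]].append(ann["caption"])
    return dict(grouped)

# Typical usage against the COCO 2017 annotations file:
# coco = json.load(open("annotations/captions_train2017.json"))
# pairs = captions_by_image(coco)
```

Other caption datasets use their own serializations (TSV, parquet, webdataset shards), but the same image-id-to-captions grouping applies.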

### Datasets of Multimodal Instruction Tuning

| Name | Paper | Link | Notes |
|------|-------|------|-------|
| VEGA | VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models | Link | A dataset for enhancing model capabilities in comprehension of interleaved information |
| ALLaVA-4V | ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Link | Vision and language caption and instruction dataset generated by GPT-4V |
| IDK | Visually Dehallucinative Instruction Generation: Know What You Don't Know | Link | Dehallucinative visual instruction data for the "I Know" hallucination |
| CAP2QA | Visually Dehallucinative Instruction Generation | Link | Image-aligned visual instruction dataset |
| M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
| ViP-LLaVA-Instruct | Making Large Multimodal Models Understand Arbitrary Visual Prompts | Link | A mixture of LLaVA-1.5 instruction data and region-level visual prompting data |
| LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
| ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
| SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset of word-level interleaved multi-image and text interactions, built to strengthen the multi-image, multi-turn conversational competence of instruction-following LLMs |
| StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collecting visual instruction tuning data |
| M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset for training and benchmarking hallucination detection and prevention |
| MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
| BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text caption data and audio-image-text localization data |
| SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M visual instruction tuning samples, including conversations, detailed descriptions, complex reasoning, and referring QAs |
| mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset covering a wide range of visual-text understanding tasks, including OCR-free document understanding |
| PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo |
| ChartLlama | ChartLlama: A Multimodal LLM for Chart Understanding and Generation | Link | A multimodal instruction-tuning dataset for chart understanding and generation |
| LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for text-rich image understanding |
| MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset covering multiple human motion-related tasks |
| LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | A visual instruction tuning dataset that addresses hallucination |
| Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-turn multimodal dialogue instruction dataset |
| LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multimodal instruction tuning dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A 100K high-quality video instruction dataset |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning data |
| M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | A large-scale, broad-coverage multimodal instruction tuning dataset |
| LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
| GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
| MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | A multimodal instruction tuning dataset covering 16 multimodal tasks |
| DetGPT | DetGPT: Detect What You Need via Reasoning | Link | An instruction-tuning dataset with 5,000 images and around 30,000 query-answer pairs |
| PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | A large-scale medical visual question-answering dataset |
| VideoChat | VideoChat: Chat-Centric Video Understanding | Link | A video-centric multimodal instruction dataset |
| X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | A Chinese multimodal instruction dataset |
| LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multimodal instruction-tuning dataset |
| cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | A multimodal aligned dataset that improves model usability and generation fluency |
| LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
| MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
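Several of these sets (LLaVA-Instruct-150K being the best known) ship as JSON records with an image reference and a `conversations` list that alternates between a human turn and a model turn. A minimal sketch of flattening one such record into (instruction, response) pairs, assuming the commonly published LLaVA field names (`from`, `value`), which may differ in other datasets:

```python
def to_pairs(record: dict) -> list:
    """Flatten a LLaVA-style conversation record into
    (instruction, response) string pairs."""
    turns = record["conversations"]
    pairs = []
    # Assume strict human/gpt alternation, as in the LLaVA release.
    for human, gpt in zip(turns[::2], turns[1::2]):
        assert human["from"] == "human" and gpt["from"] == "gpt"
        pairs.append((human["value"], gpt["value"]))
    return pairs
```

The `<image>` placeholder inside the first human turn marks where the image features are spliced into the prompt at training time.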

### Datasets of In-Context Learning

| Name | Paper | Link | Notes |
|------|-------|------|-------|
| MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset with interleaved text-image inputs, inter-related multi-image inputs, and multimodal in-context learning inputs |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | A multimodal in-context instruction dataset |

### Datasets of Multimodal Chain-of-Thought

| Name | Paper | Link | Notes |
|------|-------|------|-------|
| EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for the explainable emotion reasoning task |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | A large-scale embodied planning dataset |
| VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset for evaluating video chain-of-thought |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | A large-scale multi-choice dataset featuring multimodal science questions across diverse domains |
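Multi-choice CoT benchmarks like ScienceQA are typically scored by extracting the model's final choice from its free-form reasoning and comparing it against the gold option. A common heuristic is a regex over the last "The answer is X" span; the pattern below is an illustrative assumption, not the official ScienceQA evaluation code:

```python
import re
from typing import Optional

# Heuristic: capture a single option letter after "the answer is",
# with or without parentheses. Option range A-E is an assumption.
ANSWER_RE = re.compile(r"[Tt]he answer is \(?([A-E])\)?")

def extract_choice(cot_answer: str) -> Optional[str]:
    """Return the last stated option letter in a chain-of-thought
    answer, or None if no such span is found."""
    matches = ANSWER_RE.findall(cot_answer)
    return matches[-1] if matches else None
```

Taking the last match, rather than the first, handles answers that restate intermediate options before committing to a final one.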

### Datasets of Multimodal RLHF

| Name | Paper | Link | Notes |
|------|-------|------|-------|
| VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |
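Feedback datasets of this kind score several candidate responses per prompt, and preference optimization consumes them as (chosen, rejected) pairs. A generic sketch of that conversion; the field names (`responses`, `score`, `text`) are illustrative, not the actual VLFeedback schema:

```python
def to_preference_pair(sample: dict) -> tuple:
    """Pick the highest- and lowest-scored responses of a sample
    as the (chosen, rejected) pair for preference training."""
    ranked = sorted(sample["responses"], key=lambda r: r["score"], reverse=True)
    return ranked[0]["text"], ranked[-1]["text"]
```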

## Multimodal Datasets for NLP Applications

### 1. Sentiment Analysis

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| EmoDB | A Database of German Emotional Speech | Paper | Dataset |
| VAM | The Vera am Mittag German Audio-Visual Emotional Speech Database | Paper | Dataset |
| IEMOCAP | IEMOCAP: Interactive Emotional Dyadic Motion Capture Database | Paper | Dataset |
| Mimicry | A Multimodal Database for Mimicry Analysis | Paper | Dataset |
| YouTube | Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web | [Paper](<https://ict.usc.edu/pubs/Towards Multimodal Sentiment Analysis- Harvesting Opinions from The Web.pdf>) | Dataset |
| HUMAINE | The HUMAINE Database | Paper | Dataset |
| Large Movies | Sentiment Classification on Large Movie Review | Paper | Dataset |
| SEMAINE | The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent | Paper | Dataset |
| AFEW | Collecting Large, Richly Annotated Facial-Expression Databases from Movies | Paper | Dataset |
| SST | Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank | Paper | Dataset |
| ICT-MMMO | YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context | Paper | Dataset |
| RECOLA | Introducing the RECOLA Multimodal Corpus of Remote Collaborative and Affective Interactions | Paper | Dataset |
| MOUD | Utterance-Level Multimodal Sentiment Analysis | Paper | - |
| CMU-MOSI | MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos | Paper | Dataset |
| POM | Multimodal Analysis and Prediction of Persuasiveness in Online Social Multimedia | Paper | Dataset |
| MELD | MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations | Paper | Dataset |
| CMU-MOSEI | Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph | Paper | Dataset |
| AMMER | Towards Multimodal Emotion Recognition in German Speech Events in Cars using Transfer Learning | Paper | On Request |
| SEWA | SEWA DB: A Rich Database for Audio-Visual Emotion and Sentiment Research in the Wild | Paper | Dataset |
| Fakeddit | Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection | Paper | Dataset |
| CMU-MOSEAS | CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French | Paper | Dataset |
| MultiOFF | Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text | Paper | Dataset |
| MEISD | MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations | Paper | Dataset |
| TASS | Overview of TASS 2020: Introducing Emotion | Paper | Dataset |
| CH-SIMS | CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality | Paper | Dataset |
| Creep-Image | A Multimodal Dataset of Images and Text | Paper | Dataset |
| Entheos | Entheos: A Multimodal Dataset for Studying Enthusiasm | Paper | Dataset |
### 2. Machine Translation

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| Multi30K | Multi30K: Multilingual English-German Image Descriptions | Paper | Dataset |
| How2 | How2: A Large-scale Dataset for Multimodal Language Understanding | Paper | Dataset |
| MLT | Multimodal Lexical Translation | Paper | Dataset |
| IKEA | A Visual Attention Grounding Neural Model for Multimodal Machine Translation | Paper | Dataset |
| Flickr30K (EN-(hi-IN)) | Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data | Paper | On Request |
| Hindi Visual Genome | Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation | Paper | Dataset |
| HowTo100M | Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models | Paper | Dataset |
### 3. Information Retrieval

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| MusiCLEF | MusiCLEF: a Benchmark Activity in Multimodal Music Information Retrieval | Paper | Dataset |
| Moodo | The Moodo Dataset: Integrating User Context with Emotional and Color Perception of Music for Affective Music Information Retrieval | Paper | Dataset |
| ALF-200k | ALF-200k: Towards Extensive Multimodal Analyses of Music Tracks and Playlists | Paper | Dataset |
| MQA | Can Image Captioning Help Passage Retrieval in Multimodal Question Answering? | Paper | Dataset |
| WAT2019 | WAT2019: English-Hindi Translation on Hindi Visual Genome Dataset | Paper | Dataset |
| ViTT | Multimodal Pretraining for Dense Video Captioning | Paper | Dataset |
| MTD | MTD: A Multimodal Dataset of Musical Themes for MIR Research | Paper | Dataset |
| MusiClef | A Professionally Annotated and Enriched Multimodal Data Set on Popular Music | Paper | Dataset |
| Schubert Winterreise | Schubert Winterreise Dataset: A Multimodal Scenario for Music Analysis | Paper | Dataset |
| WIT | WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | Paper | Dataset |
### 4. Question Answering

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| MQA | A Dataset for Multimodal Question Answering in the Cultural Heritage Domain | Paper | - |
| MovieQA | MovieQA: Understanding Stories in Movies through Question-Answering | Paper | Dataset |
| PororoQA | DeepStory: Video Story QA by Deep Embedded Memory Networks | Paper | Dataset |
| MemexQA | MemexQA: Visual Memex Question Answering | Paper | Dataset |
| VQA | Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | Paper | Dataset |
| TDIUC | An Analysis of Visual Question Answering Algorithms | Paper | Dataset |
| TGIF-QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | Paper | Dataset |
| MSVD-QA, MSRVTT-QA | Video Question Answering via Attribute-Augmented Attention Network Learning | Paper | Dataset |
| YouTube2Text | Video Question Answering via Gradually Refined Attention over Appearance and Motion | Paper | Dataset |
| MovieFIB | A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering | Paper | Dataset |
| Video Context QA | Uncovering the Temporal Context for Video Question Answering | Paper | Dataset |
| MarioQA | MarioQA: Answering Questions by Watching Gameplay Videos | Paper | Dataset |
| TVQA | TVQA: Localized, Compositional Video Question Answering | Paper | Dataset |
| VQA-CP v2 | Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering | Paper | Dataset |
| RecipeQA | RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes | Paper | Dataset |
| GQA | GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | Paper | Dataset |
| Social IQ | Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence | Paper | Dataset |
| MIMOQA | MIMOQA: Multimodal Input Multimodal Output Question Answering | Paper | - |
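The VQA datasets above are usually scored with the soft accuracy metric introduced with the VQA benchmark: each question carries ten human answers, and a prediction's accuracy is averaged over every leave-one-out subset as min(#matching answers / 3, 1). A minimal sketch of that computation (the official evaluation additionally normalizes answer strings, which is omitted here):

```python
def vqa_accuracy(pred: str, human_answers: list) -> float:
    """Soft VQA accuracy: average over each leave-one-out subset of
    the human answers of min(#matches / 3, 1)."""
    accs = []
    for i in range(len(human_answers)):
        subset = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == pred for a in subset)
        accs.append(min(matches / 3.0, 1.0))
    return sum(accs) / len(accs)
```

The min(…, 1) cap means an answer given by three or more annotators counts as fully correct, which makes the metric robust to annotator disagreement.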
### 5. Summarization

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| SumMe | Creating Summaries from User Videos | Paper | Dataset |
| TVSum | TVSum: Summarizing Web Videos Using Titles | Paper | Dataset |
| QFVS | Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach | Paper | Dataset |
| MMSS | Multi-modal Sentence Summarization with Modality Attention and Image Filtering | Paper | - |
| MSMO | MSMO: Multimodal Summarization with Multimodal Output | Paper | - |
| Screen2Words | Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning | Paper | Dataset |
| AVIATE | See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization | Paper | Dataset |
| Multimodal Microblog Summarization | On Multimodal Microblog Summarization | Paper | - |
### 6. Human Computer Interaction

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| CUAVE | CUAVE: A New Audio-Visual Database for Multimodal Human-Computer Interface Research | Paper | Dataset |
| MHAD | Berkeley MHAD: A Comprehensive Multimodal Human Action Database | Paper | Dataset |
| Multi-party interactions | A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction | Paper | - |
| MHHRI | Multimodal Human-Human-Robot Interactions (MHHRI) Dataset for Studying Personality and Engagement | Paper | Dataset |
| Red Hen Lab | Red Hen Lab: Dataset and Tools for Multimodal Human Communication Research | Paper | - |
| EMRE | Generating a Novel Dataset of Multimodal Referring Expressions | Paper | Dataset |
| Chinese Whispers | Chinese Whispers: A Multimodal Dataset for Embodied Language Grounding | Paper | Dataset |
| uulmMAC | The uulmMAC Database: A Multimodal Affective Corpus for Affective Computing in Human-Computer Interaction | Paper | Dataset |
### 7. Semantic Analysis

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| WN9-IMG | Image-embodied Knowledge Representation Learning | Paper | Dataset |
| Wikimedia Commons | A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions | Paper | Dataset |
| Starsem18-multimodalKB | A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning | Paper | Dataset |
| MUStARD | Towards Multimodal Sarcasm Detection | Paper | Dataset |
| YouMakeup | YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension | Paper | Dataset |
| MDID | Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts | Paper | Dataset |
| Social media posts from Flickr (Mental Health) | Inferring Social Media Users' Mental Health Status from Multimodal Information | Paper | Dataset |
| Twitter MEL | Building a Multimodal Entity Linking Dataset From Tweets | Paper | Dataset |
| MultiMET | MultiMET: A Multimodal Dataset for Metaphor Understanding | Paper | - |
| MSDS | Multimodal Sarcasm Detection in Spanish: a Dataset and a Baseline | Paper | Dataset |
### 8. Miscellaneous

| Dataset | Paper | Paper Link | Dataset Link |
|---------|-------|------------|--------------|
| MS COCO | Microsoft COCO: Common Objects in Context | Paper | Dataset |
| ILSVRC | ImageNet Large Scale Visual Recognition Challenge | Paper | Dataset |
| YFCC100M | YFCC100M: The New Data in Multimedia Research | Paper | Dataset |
| COGNIMUSE | COGNIMUSE: A Multimodal Video Database Annotated with Saliency, Events, Semantics and Emotion with Application to Summarization | Paper | Dataset |
| SNAG | SNAG: Spoken Narratives and Gaze Dataset | Paper | Dataset |
| UR-FUNNY | UR-FUNNY: A Multimodal Language Dataset for Understanding Humor | Paper | Dataset |
| Bag-of-Lies | Bag-of-Lies: A Multimodal Dataset for Deception Detection | Paper | Dataset |
| MARC | A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks | Paper | Dataset |
| MuSE | MuSE: A Multimodal Dataset of Stressed Emotion | Paper | Dataset |
| BabelPic | Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts | Paper | Dataset |
| Eye4Ref | Eye4Ref: A Multimodal Eye Movement Dataset of Referentially Complex Situations | Paper | - |
| Troll Memes | A Dataset for Troll Classification of TamilMemes | Paper | Dataset |
| SEMD | EmoSen: Generating Sentiment and Emotion Controlled Responses in a Multimodal Dialogue System | Paper | - |
| Chat talk Corpus | Construction and Analysis of a Multimodal Chat-talk Corpus for Dialog Systems Considering Interpersonal Closeness | Paper | - |
| EMOTyDA | Towards Emotion-aided Multi-modal Dialogue Act Classification | Paper | Dataset |
| MELINDA | MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification | Paper | Dataset |
| NewsCLIPpings | NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media | Paper | Dataset |
| R2VQ | Designing Multimodal Datasets for NLP Challenges | Paper | Dataset |
| M2H2 | M2H2: A Multimodal Multiparty Hindi Dataset For Humor Recognition in Conversations | | |