
# A survey on image-text multimodal models

This is the repository for *A Survey on Image-text Multimodal Models*. The article offers a thorough review of the current state of research on applying large pretrained models to image-text tasks and provides a perspective on future development trends. For details, please refer to:

Paper: [A Survey on Image-text Multimodal Models](https://arxiv.org/abs/2309.15857)


Feel free to contact us or open a pull request if you find any related papers that are not included here.

## Abstract

With the significant advancements of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), the development of image-text multimodal models has garnered widespread attention. These models demonstrate immense potential in processing and integrating visual and textual information, particularly in areas such as multimodal robotics, document intelligence, and biomedicine. This paper provides a comprehensive review of the technological evolution of image-text multimodal models, from early explorations of feature space to the latest large model architectures. It emphasizes the pivotal role of attention mechanisms and their derivative architectures in advancing multimodal model development. Through case studies in the biomedical domain, we reveal the symbiotic relationship between the development of general technologies and their domain-specific applications, showcasing the practical applications and technological improvements of image-text multimodal models in addressing specific domain challenges. Our research not only offers an in-depth analysis of the technological progression of image-text multimodal models but also highlights the importance of integrating technological innovation with practical applications, providing guidance for future research directions. Despite the significant breakthroughs in the development of image-text multimodal models, they still face numerous challenges in domain applications. This paper categorizes these challenges into external factors and intrinsic factors, further subdividing them and proposing targeted strategies and directions for future research. For more details and data, please visit our GitHub page: https://github.com/i2vec/A-survey-on-image-text-multimodal-models.

## Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{guo2023survey,
      title={A Survey on Image-text Multimodal Models},
      author={Ruifeng Guo and Jingxuan Wei and Linzhuang Sun and Bihui Yu and Guiyong Chang and Dawei Liu and Sibo Zhang and Zhengbing Yao and Mingjun Xu and Liping Bu},
      year={2023},
      eprint={2309.15857},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Menu

## Development Process

### Technical Evolution

#### Initial Stage and Early Stage

| Paper | Published in |
| --- | --- |
| Framing image description as a ranking task: Data, models and evaluation metrics | IJCAI 2015 |
| Mind’s eye: A recurrent visual representation for image caption generation | CVPR 2015 |
| Deep visual-semantic alignments for generating image descriptions | CVPR 2015 |
| Show, attend and tell: Neural image caption generation with visual attention | PMLR 2015 |
| Show and tell: A neural image caption generator | CVPR 2015 |

#### Attention Mechanism and the Rise of Transformers

| Paper | Published in |
| --- | --- |
| Large-scale approximate kernel canonical correlation analysis | ICLR 2016 |
| Bilinear attention networks | NeurIPS 2018 |
| ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks | NeurIPS 2019 |
| LXMERT: Learning cross-modality encoder representations from transformers | EMNLP 2019 |
| VisualBERT: A simple and performant baseline for vision and language | arXiv 2019 |
| Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training | AAAI 2020 |
| VL-BERT: Pre-training of generic visual-linguistic representations | ICLR 2020 |
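
The common primitive behind the models above is the Transformer's scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. As a reference point only, here is a minimal PyTorch sketch (not code from any of the listed papers):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d) tensors; mask: optional (seq_q, seq_k) mask
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (batch, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v                                 # weighted sum of values

# Toy usage: self-attention over a batch of 2 sequences of length 5
q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v)            # shape (2, 5, 64)
```

Cross-modal models such as ViLBERT and LXMERT reuse this primitive as cross-attention, with queries from one modality attending over keys and values from the other.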

#### Recent Image-text Multimodal Models

| Paper | Published in |
| --- | --- |
| ViLT: Vision-and-language transformer without convolution or region supervision | PMLR 2021 |
| Learning transferable visual models from natural language supervision | PMLR 2021 |
| An image is worth 16x16 words: Transformers for image recognition at scale | ICLR 2021 |
| VLMo: Unified vision-language pre-training with mixture-of-modality-experts | NeurIPS 2022 |
| BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation | PMLR 2022 |
| OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework | PMLR 2022 |
| Learning from FM communications: Toward accurate, efficient, all-terrain vehicle localization | IEEE 2022 |
| BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models | arXiv 2023 |
| InstructBLIP: Towards general-purpose vision-language models with instruction tuning | NeurIPS 2023 |
| mPLUG-2: A modularized multi-modal foundation model across text, image and video | arXiv 2023 |
| MmAP: Multi-modal alignment prompt for cross-domain multi-task learning | arXiv 2023 |
| Image as a foreign language: BEiT pretraining for all vision and vision-language tasks | CVPR 2023 |
| Visual instruction tuning | NeurIPS 2023 |
| Sparks of artificial general intelligence: Early experiments with GPT-4 | arXiv 2023 |
| MiniGPT-4: Enhancing vision-language understanding with advanced large language models | arXiv 2023 |
| MiniGPT-5: Interleaved vision-and-language generation via generative vokens | ICLR 2024 |
| Structure-CLIP: Enhance multi-modal language representations with structure knowledge | AAAI 2024 |
| MM-Interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer | arXiv 2024 |
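
A recurring pretraining objective among these models (introduced at scale by CLIP and reused, for example, as one of BLIP's losses) is a symmetric image-text contrastive loss: matched pairs within a batch are pulled together in a shared embedding space while mismatched pairs are pushed apart. A minimal sketch, assuming precomputed embeddings from two encoders (a hypothetical helper, not any paper's reference implementation):

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, d); row i of each comes from the same pair
    image_emb = F.normalize(image_emb, dim=-1)          # unit-norm rows
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (batch, batch) cosine sims
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)         # image -> its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)     # caption -> its image
    return (loss_i2t + loss_t2i) / 2
```

The temperature (0.07 here, a commonly used default) sharpens the softmax over in-batch candidates; CLIP learns it as a parameter rather than fixing it.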

### Evolution of Application Technology

#### Initial Stage and Early Stage

| Paper | Published in |
| --- | --- |
| A combined convolutional and recurrent neural network for enhanced glaucoma detection | Nature 2021 |

#### Attention Mechanism and the Rise of Transformers

| Paper | Published in |
| --- | --- |
| GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition | IEEE 2021 |
| MMBERT: Multimodal BERT pretraining for improved medical VQA | IEEE 2021 |

#### Recent Image-text Multimodal Models

| Paper | Published in |
| --- | --- |
| Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports | Nature 2022 |
| MedCLIP: Contrastive learning from unpaired medical images and text | EMNLP 2022 |
| RoentGen: Vision-language foundation model for chest X-ray generation | arXiv 2022 |
| LViT: Language meets vision transformer in medical image segmentation | IEEE 2023 |
| MMTN: Multi-modal memory transformer network for image-report consistent medical report generation | AAAI 2023 |
| LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day | NeurIPS 2023 |
| XrayGPT: Chest radiographs summarization using medical vision-language models | arXiv 2023 |
| Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data | arXiv 2023 |

## Applications of Multimodal Models in Image-Text Tasks

### Tasks

#### Pre-training Task

| Paper | Published in |
| --- | --- |
| Every picture tells a story: Generating sentences from images | ECCV 2010 |
| Similarity reasoning and filtration for image-text matching | AAAI 2021 |
| Visual relationship detection: A survey | AAAI 2021 |

#### Model Components

| Paper | Published in |
| --- | --- |
| Very deep convolutional networks for large-scale image recognition | ICLR 2015 |
| Deformable DETR: Deformable transformers for end-to-end object detection | ICLR 2021 |

### Generic Model

#### Model architecture

| Paper | Published in |
| --- | --- |
| Learning transferable visual models from natural language supervision | PMLR 2021 |
| BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation | PMLR 2022 |
| BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models | arXiv 2023 |
| MiniGPT-4: Enhancing vision-language understanding with advanced large language models | arXiv 2023 |
| PandaGPT: One model to instruction-follow them all | ACL 2023 |
| MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices | arXiv 2023 |
| Qwen-VL: A frontier large vision-language model with versatile abilities | arXiv 2023 |
| MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning | ICLR 2024 |
| SpatialVLM: Endowing vision-language models with spatial reasoning capabilities | arXiv 2024 |
| MobileVLM V2: Faster and stronger baseline for vision language model | arXiv 2024 |
| LLaVA-Plus: Learning to use tools for creating multimodal agents | ICLR 2024 |

#### Data

| Paper | Published in |
| --- | --- |
| Im2Text: Describing images using 1 million captioned photographs | NeurIPS 2011 |
| GQA: A new dataset for real-world visual reasoning and compositional question answering | CVPR 2019 |
| The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale | IJCV 2020 |
| Fashion IQ: A new dataset towards retrieving images by natural language feedback | CVPR 2021 |
| Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts | CVPR 2021 |
| UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning | ACL 2021 |
| WenLan: Bridging vision and language by large-scale multi-modal pre-training | arXiv 2021 |
| RedCaps: Web-curated image-text data created by the people, for the people | NeurIPS 2021 |
| WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning | arXiv 2021 |
| FLAVA: A foundational language and vision alignment model | CVPR 2022 |
| UNIMO-2: End-to-end unified vision-language grounded learning | ACL 2022 |
| LAION-5B: An open large-scale dataset for training next generation image-text models | NeurIPS 2022 |

### Medical Model

#### Model architecture

| Paper | Published in |
| --- | --- |
| MedBLIP: Bootstrapping language-image pre-training from 3D medical images and text | arXiv 2023 |
| Med-Flamingo: A multimodal medical few-shot learner | ML4H (PMLR) 2023 |
| PMC-VQA: Visual instruction tuning for medical visual question answering | arXiv 2023 |
| Masked vision and language pre-training with unimodal and multimodal contrastive losses for medical visual question answering | MICCAI 2023 |
| PMC-CLIP: Contrastive language-image pre-training using biomedical documents | MICCAI 2023 |
| PMC-LLaMA: Further finetuning LLaMA on medical papers | arXiv 2023 |
| MEDITRON-70B: Scaling medical pretraining for large language models | arXiv 2023 |
| BiomedGPT: Open multimodal generative pre-trained transformer for biomedicine | arXiv 2023 |
| LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day | NeurIPS 2023 |
| MedJourney: Counterfactual medical image generation by instruction-learning from multimodal patient journeys | ICLR 2024 |
#### Data

| Paper | Published in |
| --- | --- |
| Radiology Objects in COntext (ROCO): A multimodal image dataset | MICCAI 2018 |
| A dataset of clinically generated visual questions and answers about radiology images | Nature 2018 |
| CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison | AAAI 2019 |
| MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs | Nature 2019 |
| SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering | IEEE 2021 |
| K-PathVQA: Knowledge-aware multimodal representation for pathology visual question answering | IEEE 2022 |
| A foundational multimodal vision language AI assistant for human pathology | arXiv 2023 |
| One model to rule them all: Towards universal segmentation for medical images with text prompt | arXiv 2023 |
| Towards generalist foundation model for radiology | arXiv 2023 |

## Challenges and Future Directions of Multimodal Models in Image-Text Tasks

### External Factors

#### Challenges for Multimodal Datasets

| Paper | Published in |
| --- | --- |
| Annotation and processing of continuous emotional attributes: Challenges and opportunities | IEEE 2013 |
| Multimodal machine learning: A survey and taxonomy | IEEE 2018 |
| A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets | IEEE 2018 |
| Between subjectivity and imposition: Power dynamics in data annotation for computer vision | CSCW 2020 |
| Algorithmic fairness in computational medicine | - |
| Bias and Non-Diversity of Big Data in Artificial Intelligence: Focus on Retinal Diseases | - |

#### Computational Resource Demand

| Paper | Published in |
| --- | --- |
| Model compression for deep neural networks: A survey | 2023 |
| A survey on model compression for large language models | arXiv 2023 |
| Weakly supervised machine learning | CAAI 2023 |
| Semi-supervised and un-supervised clustering: A review and experimental evaluation | Information Systems 2023 |
| Deep learning model compression techniques: Advances, opportunities, and perspective | 2023 |

### Intrinsic Factors

#### Unique Challenges for Image-Text Tasks

| Paper | Published in |
| --- | --- |
| Cross-domain image captioning via cross-modal retrieval and model adaptation | IEEE 2020 |
| Transformers in medical image analysis | Intelligent Medicine 2022 |
| What you see is what you read? Improving text-image alignment evaluation | NeurIPS 2023 |
| Foundational models in medical imaging: A comprehensive survey and future vision | arXiv 2023 |
| A scoping review on multimodal deep learning in biomedical images and texts | arXiv 2023 |
| Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review | 2023 |
| Transformers in medical imaging: A survey | Medical Image Analysis 2023 |
| A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges | Information Fusion 2023 |
| Incorporating domain knowledge for biomedical text analysis into deep learning: A survey | Journal of Biomedical Informatics 2023 |
| Towards electronic health record-based medical knowledge graph construction, completion, and applications: A literature study | 2023 |
| ECoFLaP: Efficient coarse-to-fine layer-wise pruning for vision-language models | ICLR 2024 |
| A novel attention-based cross-modal transfer learning framework for predicting cardiovascular disease | Computers in Biology and Medicine 2024 |
| A survey on hallucination in large vision-language models | arXiv 2024 |

#### Multimodal Alignment and Co-learning

| Paper | Published in |
| --- | --- |
| Aligning temporal data by sentinel events: discovering patterns in electronic health records | 2008 |
| Resilient learning of computational models with noisy labels | IEEE 2019 |
| A label-noise robust active learning sample collection method for multi-temporal urban land-cover classification and change analysis | ISPRS 2020 |
| Bayesian DivideMix++ for enhanced learning with noisy labels | 2023 |
| A survey on deep learning in medical image registration: New technologies, uncertainty, evaluation metrics, and beyond | arXiv 2023 |
| On the resurgence of recurrent models for long sequences: Survey and research opportunities in the transformer era | arXiv 2024 |
| Multi-Modal Machine Learning in Engineering Design: A Review and Future Directions | 2024 |
| A survey of multimodal information fusion for smart healthcare: Mapping the journey from data to wisdom | Information Fusion 2024 |

#### Catastrophic Forgetting

| Paper | Published in |
| --- | --- |
| Multiscale Modeling Meets Machine Learning: What Can We Learn? | 2020 |
| Mitigating Forgetting in Online Continual Learning with Neuron Calibration | NeurIPS 2021 |
| RDFM: An alternative approach for representing, storing, and maintaining meta-knowledge in web of data | 2021 |
| CNN Models Using Chest X-Ray Images for COVID-19 Detection: A Survey | 2023 |
| Advancing security in the industrial internet of things using deep progressive neural networks | 2023 |
| A progressive neural network for acoustic echo cancellation | IEEE 2023 |
| How our understanding of memory replay evolves | 2023 |
| Replay as context-driven memory reactivation | bioRxiv 2023 |
| Unleashing the power of meta-knowledge: Towards cumulative learning in interpreter training | 2023 |

#### Model Interpretability and Transparency

| Paper | Published in |
| --- | --- |
| Layer-Wise Relevance Propagation: An Overview | 2019 |
| Human factors in model interpretability: Industry practices, challenges, and needs | CSCW 2020 |
| Interpretation and visualization techniques for deep learning models in medical imaging | 2021 |
| Case studies of clinical decision-making through prescriptive models based on machine learning | 2023 |
| Interpreting black-box models: A review on explainable artificial intelligence | Cognitive Computation 2023 |
| Terminology, Ontology and their Implementations | 2023 |
| AttnLRP: Attention-Aware Layer-wise Relevance Propagation for Transformers | arXiv 2024 |
| From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI | 2023 |

#### Model Bias and Fairness Issues

| Paper | Published in |
| --- | --- |
| Towards fairness-aware federated learning | arXiv 2021 |
| Toward fairness in artificial intelligence for medical image analysis: identification and mitigation of potential biases in the roadmap from data collection to model deployment | 2023 |
| Evaluating and mitigating unfairness in multimodal remote mental health assessments | medRxiv 2023 |
| A Unified Approach to Demographic Data Collection for Research With Young Children Across Diverse Cultures | Developmental Psychology 2024 |
| Bias Detection and Mitigation within Decision Support System: A Comprehensive Survey | 2023 |
| Automated monitoring and evaluation of highway subgrade compaction quality using artificial neural networks | Automation in Construction 2023 |

