
# A survey on image-text multimodal models

This is the repository for *A Survey on Image-text Multimodal Models*. The article offers a thorough review of the current state of research on applying large pretrained models to image-text tasks and provides a perspective on future development trends. For details, please refer to:

Paper: [A Survey on Image-text Multimodal Models](https://arxiv.org/abs/2309.15857)


Feel free to contact us or open a pull request if you find any related papers that are not included here.

## Abstract

With the significant advancements of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), the development of image-text multimodal models has garnered widespread attention. These models demonstrate immense potential in processing and integrating visual and textual information, particularly in areas such as multimodal robotics, document intelligence, and biomedicine. This paper provides a comprehensive review of the technological evolution of image-text multimodal models, from early explorations of feature space to the latest large model architectures. It emphasizes the pivotal role of attention mechanisms and their derivative architectures in advancing multimodal model development. Through case studies in the biomedical domain, we reveal the symbiotic relationship between the development of general technologies and their domain-specific applications, showcasing the practical applications and technological improvements of image-text multimodal models in addressing specific domain challenges. Our research not only offers an in-depth analysis of the technological progression of image-text multimodal models but also highlights the importance of integrating technological innovation with practical applications, providing guidance for future research directions. Despite the significant breakthroughs in the development of image-text multimodal models, they still face numerous challenges in domain applications. This paper categorizes these challenges into external factors and intrinsic factors, further subdividing them and proposing targeted strategies and directions for future research. For more details and data, please visit our GitHub page: https://github.com/i2vec/A-survey-on-image-text-multimodal-models.

## Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{guo2023survey,
      title={A Survey on Image-text Multimodal Models},
      author={Ruifeng Guo and Jingxuan Wei and Linzhuang Sun and Bihui Yu and Guiyong Chang and Dawei Liu and Sibo Zhang and Zhengbing Yao and Mingjun Xu and Liping Bu},
      year={2023},
      eprint={2309.15857},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Menu

## Development Process

### Technical Evolution

#### Initial Stage and Early Stage

| Paper | Published in |
| --- | --- |
| Framing image description as a ranking task: Data, models and evaluation metrics | IJCAI 2015 |
| Mind’s eye: A recurrent visual representation for image caption generation | CVPR 2015 |
| Deep visual-semantic alignments for generating image descriptions | CVPR 2015 |
| Show, attend and tell: Neural image caption generation with visual attention | PMLR 2015 |
| Show and tell: A neural image caption generator | CVPR 2015 |

#### Attention Mechanism and the Rise of Transformers

| Paper | Published in |
| --- | --- |
| Large-scale approximate kernel canonical correlation analysis | ICLR 2016 |
| Bilinear attention networks | NeurIPS 2018 |
| ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks | NeurIPS 2019 |
| LXMERT: Learning cross-modality encoder representations from transformers | EMNLP 2019 |
| VisualBERT: A simple and performant baseline for vision and language | arXiv 2019 |
| Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training | AAAI 2020 |
| VL-BERT: Pre-training of generic visual-linguistic representations | ICLR 2020 |
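
The common primitive behind the models above is the Transformer's scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. As a reference point only, here is a minimal PyTorch sketch (not code from any of the listed papers):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d) tensors; mask: optional (seq_q, seq_k) mask
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (batch, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v                                 # weighted sum of values

# Toy usage: self-attention over a batch of 2 sequences of length 5
q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v)            # shape (2, 5, 64)
```

Cross-modal models such as ViLBERT and LXMERT reuse this primitive as cross-attention, with queries from one modality attending over keys and values from the other.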

#### Recent Image-text Multimodal Models

| Paper | Published in |
| --- | --- |
| ViLT: Vision-and-language transformer without convolution or region supervision | PMLR 2021 |
| Learning transferable visual models from natural language supervision | PMLR 2021 |
| An image is worth 16x16 words: Transformers for image recognition at scale | ICLR 2021 |
| VLMo: Unified vision-language pre-training with mixture-of-modality-experts | NeurIPS 2022 |
| BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation | PMLR 2022 |
| OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework | PMLR 2022 |
| Learning from FM communications: Toward accurate, efficient, all-terrain vehicle localization | IEEE 2022 |
| BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models | arXiv 2023 |
| InstructBLIP: Towards general-purpose vision-language models with instruction tuning | NeurIPS 2023 |
| mPLUG-2: A modularized multi-modal foundation model across text, image and video | arXiv 2023 |
| MmAP: Multi-modal alignment prompt for cross-domain multi-task learning | arXiv 2023 |
| Image as a foreign language: BEiT pretraining for all vision and vision-language tasks | CVPR 2023 |
| Visual instruction tuning | NeurIPS 2023 |
| Sparks of artificial general intelligence: Early experiments with GPT-4 | arXiv 2023 |
| MiniGPT-4: Enhancing vision-language understanding with advanced large language models | arXiv 2023 |
| MiniGPT-5: Interleaved vision-and-language generation via generative vokens | ICLR 2024 |
| Structure-CLIP: Enhance multi-modal language representations with structure knowledge | AAAI 2024 |
| MM-Interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer | arXiv 2024 |
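
A recurring pretraining objective among these models (introduced at scale by CLIP and reused, for example, as one of BLIP's losses) is a symmetric image-text contrastive loss: matched pairs within a batch are pulled together in a shared embedding space while mismatched pairs are pushed apart. A minimal sketch, assuming precomputed embeddings from two encoders (a hypothetical helper, not any paper's reference implementation):

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, d); row i of each comes from the same pair
    image_emb = F.normalize(image_emb, dim=-1)          # unit-norm rows
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (batch, batch) cosine sims
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)         # image -> its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)     # caption -> its image
    return (loss_i2t + loss_t2i) / 2
```

The temperature (0.07 here, a commonly used default) sharpens the softmax over in-batch candidates; CLIP learns it as a parameter rather than fixing it.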

### Evolution of Application Technology

#### Initial Stage and Early Stage

| Paper | Published in |
| --- | --- |
| A combined convolutional and recurrent neural network for enhanced glaucoma detection | Nature 2021 |

#### Attention Mechanism and the Rise of Transformers

| Paper | Published in |
| --- | --- |
| GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition | IEEE 2021 |
| MMBERT: Multimodal BERT pretraining for improved medical VQA | IEEE 2021 |

#### Recent Image-text Multimodal Models

| Paper | Published in |
| --- | --- |
| Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports | Nature 2022 |
| MedCLIP: Contrastive learning from unpaired medical images and text | EMNLP 2022 |
| RoentGen: Vision-language foundation model for chest X-ray generation | arXiv 2022 |
| LViT: Language meets vision transformer in medical image segmentation | IEEE 2023 |
| MMTN: Multi-modal memory transformer network for image-report consistent medical report generation | AAAI 2023 |
| LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day | NeurIPS 2023 |
| XrayGPT: Chest radiographs summarization using medical vision-language models | arXiv 2023 |
| Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data | arXiv 2023 |

## Applications of Multimodal Models in Image-Text Tasks

### Tasks

#### Pre-training Task

| Paper | Published in |
| --- | --- |
| Every picture tells a story: Generating sentences from images | ECCV 2010 |
| Similarity reasoning and filtration for image-text matching | AAAI 2021 |
| Visual relationship detection: A survey | AAAI 2021 |

#### Model Components

| Paper | Published in |
| --- | --- |
| Very deep convolutional networks for large-scale image recognition | ICLR 2015 |
| Deformable DETR: Deformable transformers for end-to-end object detection | ICLR 2021 |

### Generic Model

#### Model architecture

| Paper | Published in |
| --- | --- |
| Learning transferable visual models from natural language supervision | PMLR 2021 |
| BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation | PMLR 2022 |
| BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models | arXiv 2023 |
| MiniGPT-4: Enhancing vision-language understanding with advanced large language models | arXiv 2023 |
| PandaGPT: One model to instruction-follow them all | ACL 2023 |
| MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices | arXiv 2023 |
| Qwen-VL: A frontier large vision-language model with versatile abilities | arXiv 2023 |
| MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning | ICLR 2024 |
| SpatialVLM: Endowing vision-language models with spatial reasoning capabilities | arXiv 2024 |
| MobileVLM V2: Faster and stronger baseline for vision language model | arXiv 2024 |
| LLaVA-Plus: Learning to use tools for creating multimodal agents | ICLR 2024 |

#### Data

| Paper | Published in |
| --- | --- |
| Im2Text: Describing images using 1 million captioned photographs | NeurIPS 2011 |
| GQA: A new dataset for real-world visual reasoning and compositional question answering | CVPR 2019 |
| The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale | IJCV 2020 |
| Fashion IQ: A new dataset towards retrieving images by natural language feedback | CVPR 2021 |
| Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts | CVPR 2021 |
| UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning | ACL 2021 |
| WenLan: Bridging vision and language by large-scale multi-modal pre-training | arXiv 2021 |
| RedCaps: Web-curated image-text data created by the people, for the people | NeurIPS 2021 |
| WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning | arXiv 2021 |
| FLAVA: A foundational language and vision alignment model | CVPR 2022 |
| UNIMO-2: End-to-end unified vision-language grounded learning | ACL 2022 |
| LAION-5B: An open large-scale dataset for training next generation image-text models | NeurIPS 2022 |

### Medical Model

#### Model architecture

| Paper | Published in |
| --- | --- |
| MedBLIP: Bootstrapping language-image pre-training from 3D medical images and text | arXiv 2023 |
| Med-Flamingo: A multimodal medical few-shot learner | ML4H (PMLR) 2023 |
| PMC-VQA: Visual instruction tuning for medical visual question answering | arXiv 2023 |
| Masked vision and language pre-training with unimodal and multimodal contrastive losses for medical visual question answering | MICCAI 2023 |
| PMC-CLIP: Contrastive language-image pre-training using biomedical documents | MICCAI 2023 |
| PMC-LLaMA: Further finetuning LLaMA on medical papers | arXiv 2023 |
| MEDITRON-70B: Scaling medical pretraining for large language models | arXiv 2023 |
| BiomedGPT: Open multimodal generative pre-trained transformer for biomedicine | arXiv 2023 |
| LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day | NeurIPS 2023 |
| MedJourney: Counterfactual medical image generation by instruction-learning from multimodal patient journeys | ICLR 2024 |
#### Data

| Paper | Published in |
| --- | --- |
| Radiology Objects in COntext (ROCO): A multimodal image dataset | MICCAI 2018 |
| A dataset of clinically generated visual questions and answers about radiology images | Nature 2018 |
| CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison | AAAI 2019 |
| MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs | Nature 2019 |
| SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering | IEEE 2021 |
| K-PathVQA: Knowledge-aware multimodal representation for pathology visual question answering | IEEE 2022 |
| A foundational multimodal vision language AI assistant for human pathology | arXiv 2023 |
| One model to rule them all: Towards universal segmentation for medical images with text prompt | arXiv 2023 |
| Towards generalist foundation model for radiology | arXiv 2023 |

## Challenges and Future Directions of Multimodal Models in Image-Text Tasks

### External Factors

#### Challenges for Multimodal Datasets

| Paper | Published in |
| --- | --- |
| Annotation and processing of continuous emotional attributes: Challenges and opportunities | IEEE 2013 |
| Multimodal machine learning: A survey and taxonomy | IEEE 2018 |
| A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets | IEEE 2018 |
| Between subjectivity and imposition: Power dynamics in data annotation for computer vision | CSCW 2020 |
| Algorithmic fairness in computational medicine | - |
| Bias and Non-Diversity of Big Data in Artificial Intelligence: Focus on Retinal Diseases | - |

#### Computational Resource Demand

| Paper | Published in |
| --- | --- |
| Model compression for deep neural networks: A survey | 2023 |
| A survey on model compression for large language models | arXiv 2023 |
| Weakly supervised machine learning | CAAI 2023 |
| Semi-supervised and un-supervised clustering: A review and experimental evaluation | Information Systems 2023 |
| Deep learning model compression techniques: Advances, opportunities, and perspective | 2023 |

### Intrinsic Factors

#### Unique Challenges for Image-Text Tasks

| Paper | Published in |
| --- | --- |
| Cross-domain image captioning via cross-modal retrieval and model adaptation | IEEE 2020 |
| Transformers in medical image analysis | Intelligent Medicine 2022 |
| What you see is what you read? Improving text-image alignment evaluation | NeurIPS 2023 |
| Foundational models in medical imaging: A comprehensive survey and future vision | arXiv 2023 |
| A scoping review on multimodal deep learning in biomedical images and texts | arXiv 2023 |
| Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review | 2023 |
| Transformers in medical imaging: A survey | Medical Image Analysis 2023 |
| A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges | Information Fusion 2023 |
| Incorporating domain knowledge for biomedical text analysis into deep learning: A survey | Journal of Biomedical Informatics 2023 |
| Towards electronic health record-based medical knowledge graph construction, completion, and applications: A literature study | 2023 |
| ECoFLaP: Efficient coarse-to-fine layer-wise pruning for vision-language models | ICLR 2024 |
| A novel attention-based cross-modal transfer learning framework for predicting cardiovascular disease | Computers in Biology and Medicine 2024 |
| A survey on hallucination in large vision-language models | arXiv 2024 |

#### Multimodal Alignment and Co-learning

| Paper | Published in |
| --- | --- |
| Aligning temporal data by sentinel events: discovering patterns in electronic health records | 2008 |
| Resilient learning of computational models with noisy labels | IEEE 2019 |
| A label-noise robust active learning sample collection method for multi-temporal urban land-cover classification and change analysis | ISPRS 2020 |
| Bayesian DivideMix++ for enhanced learning with noisy labels | 2023 |
| A survey on deep learning in medical image registration: New technologies, uncertainty, evaluation metrics, and beyond | arXiv 2023 |
| On the resurgence of recurrent models for long sequences: Survey and research opportunities in the transformer era | arXiv 2024 |
| Multi-Modal Machine Learning in Engineering Design: A Review and Future Directions | 2024 |
| A survey of multimodal information fusion for smart healthcare: Mapping the journey from data to wisdom | Information Fusion 2024 |

#### Catastrophic Forgetting

| Paper | Published in |
| --- | --- |
| Multiscale Modeling Meets Machine Learning: What Can We Learn? | 2020 |
| Mitigating Forgetting in Online Continual Learning with Neuron Calibration | NeurIPS 2021 |
| RDFM: An alternative approach for representing, storing, and maintaining meta-knowledge in web of data | 2021 |
| CNN Models Using Chest X-Ray Images for COVID-19 Detection: A Survey | 2023 |
| Advancing security in the industrial internet of things using deep progressive neural networks | 2023 |
| A progressive neural network for acoustic echo cancellation | IEEE 2023 |
| How our understanding of memory replay evolves | 2023 |
| Replay as context-driven memory reactivation | bioRxiv 2023 |
| Unleashing the power of meta-knowledge: Towards cumulative learning in interpreter training | 2023 |

#### Model Interpretability and Transparency

| Paper | Published in |
| --- | --- |
| Layer-Wise Relevance Propagation: An Overview | 2019 |
| Human factors in model interpretability: Industry practices, challenges, and needs | CSCW 2020 |
| Interpretation and visualization techniques for deep learning models in medical imaging | 2021 |
| Case studies of clinical decision-making through prescriptive models based on machine learning | 2023 |
| Interpreting black-box models: A review on explainable artificial intelligence | Cognitive Computation 2023 |
| Terminology, Ontology and their Implementations | 2023 |
| AttnLRP: Attention-Aware Layer-wise Relevance Propagation for Transformers | arXiv 2024 |
| From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI | 2023 |

#### Model Bias and Fairness Issues

| Paper | Published in |
| --- | --- |
| Towards fairness-aware federated learning | arXiv 2021 |
| Toward fairness in artificial intelligence for medical image analysis: identification and mitigation of potential biases in the roadmap from data collection to model deployment | 2023 |
| Evaluating and mitigating unfairness in multimodal remote mental health assessments | medRxiv 2023 |
| A Unified Approach to Demographic Data Collection for Research With Young Children Across Diverse Cultures | Developmental Psychology 2024 |
| Bias Detection and Mitigation within Decision Support System: A Comprehensive Survey | 2023 |
| Automated monitoring and evaluation of highway subgrade compaction quality using artificial neural networks | Automation in Construction 2023 |

