This is the repository for *A Survey on Image-text Multimodal Models*. The article offers a thorough review of the current state of research on applying large pretrained models to image-text tasks and provides a perspective on future development trends. For details, please refer to:
A Survey on Image-text Multimodal Models
Paper
Feel free to contact us or open a pull request if you find any related papers that are not included here.
With the significant advancements of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), the development of image-text multimodal models has garnered widespread attention. These models demonstrate immense potential in processing and integrating visual and textual information, particularly in areas such as multimodal robotics, document intelligence, and biomedicine. This paper provides a comprehensive review of the technological evolution of image-text multimodal models, from early explorations of feature space to the latest large model architectures. It emphasizes the pivotal role of attention mechanisms and their derivative architectures in advancing multimodal model development. Through case studies in the biomedical domain, we reveal the symbiotic relationship between the development of general technologies and their domain-specific applications, showcasing the practical applications and technological improvements of image-text multimodal models in addressing specific domain challenges. Our research not only offers an in-depth analysis of the technological progression of image-text multimodal models but also highlights the importance of integrating technological innovation with practical applications, providing guidance for future research directions. Despite the significant breakthroughs in the development of image-text multimodal models, they still face numerous challenges in domain applications. This paper categorizes these challenges into external factors and intrinsic factors, further subdividing them and proposing targeted strategies and directions for future research. For more details and data, please visit our GitHub page: https://github.com/i2vec/A-survey-on-image-text-multimodal-models.
If you find our work useful in your research, please consider citing:
@misc{guo2023survey,
      title={A Survey on Image-text Multimodal Models},
      author={Ruifeng Guo and Jingxuan Wei and Linzhuang Sun and Bihui Yu and Guiyong Chang and Dawei Liu and Sibo Zhang and Zhengbing Yao and Mingjun Xu and Liping Bu},
      year={2023},
      eprint={2309.15857},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
- Development Process
- Applications of Multimodal Models in Image-Text Tasks
- Challenges and Future Directions of Multimodal Models in Image-Text Tasks
Paper | Published in |
---|---|
A combined convolutional and recurrent neural network for enhanced glaucoma detection | Nature 2021 |
Paper | Published in |
---|---|
Every picture tells a story: Generating sentences from images | ECCV 2010 |
Similarity reasoning and filtration for image-text matching | AAAI 2021 |
Visual relationship detection: A survey | AAAI 2021 |
Paper | Published in |
---|---|
Very deep convolutional networks for large-scale image recognition | ICLR 2015 |
Deformable DETR: Deformable Transformers for End-to-End Object Detection | ICLR 2021 |
Paper | Published in |
---|---|
Model compression for deep neural networks: A survey | 2023 |
A survey on model compression for large language models | arXiv 2023 |
Weakly supervised machine learning | CAAI 2023 |
Semi-supervised and un-supervised clustering: A review and experimental evaluation | Information Systems 2023 |
Deep learning model compression techniques: Advances, opportunities, and perspective | 2023 |