jaeyun95/pre-trained-vlk-model

Pretrained model summary


pretrained language model

| title | paper link | code link |
| --- | --- | --- |
| Improving Language Understanding by Generative Pre-Training | [paper] | [code(pytorch)] |
| ELMo: Deep contextualized word representations | [paper] | [code(tensorflow)] |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | [paper] | [code(tensorflow)] [code(pytorch)] |
| ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS | [paper] | [code(tensorflow)] [code(pytorch)] |
| RoBERTa: A Robustly Optimized BERT Pretraining Approach | [paper] | [code(pytorch)] |
| Language Models are Unsupervised Multitask Learners | [paper] | [code(tensorflow)] |
| Language Models are Few-Shot Learners | [paper] | [code] |
| XLNet: Generalized Autoregressive Pretraining for Language Understanding | [paper] | [code(tensorflow)] |

pretrained image model

| title | paper link | code link |
| --- | --- | --- |
| Identity Mappings in Deep Residual Networks | [paper] | [code] |
| Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | [paper] | [code(pytorch)] |
| Mask R-CNN | [paper] | [code(tensorflow)] [code(pytorch)] |
| You Only Look Once: Unified, Real-Time Object Detection | [paper] | [code(tensorflow)] |
| YOLOv3: An Incremental Improvement | [paper] | [code(tensorflow)] [code(pytorch)] |
| YOLOv4: Optimal Speed and Accuracy of Object Detection | [paper] | [code(tensorflow)] |
| YOLOv5 | [paper] | [code(pytorch)] |
| Image Transformer | [paper] | [code(pytorch)] |

pretrained video model

| title | paper link | code link |
| --- | --- | --- |
| Looking Fast and Slow: Memory-Guided Mobile Video Object Detection | [paper] | [code(tensorflow)] |
| Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection | [paper] | [code(tensorflow)] |
| Optimizing Video Object Detection via a Scale-Time Lattice | [paper] | [code(pytorch)] |
| Mobile Video Object Detection with Temporally-Aware Feature Maps | [paper] | [code(pytorch)] |
| X3D: Expanding Architectures for Efficient Video Recognition | [paper] | [code(pytorch)] |
| SibNet: Sibling Convolutional Encoder for Video Captioning | [paper] | [code] |
| SAM: Modeling Scene, Object and Action with Semantics Attention Modules for Video Recognition | [paper] | [code] |
| Bottleneck Transformers for Visual Recognition | [paper] | [code(pytorch)] |

pretrained image and language model

summary table

[image: summary table]

paper and code

| title | paper link | code link |
| --- | --- | --- |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | [paper] | [code(pytorch)] |
| 12-in-1: Multi-Task Vision and Language Representation Learning | [paper] | [code(pytorch)] |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | [paper] | [code(pytorch)] |
| VISUALBERT: A SIMPLE AND PERFORMANT BASELINE FOR VISION AND LANGUAGE | [paper] | [code(pytorch)] |
| VL-BERT: Pre-training of Generic Visual-Linguistic Representations | [paper] | [code(pytorch)] |
| UNITER: LEARNING UNIVERSAL IMAGE-TEXT REPRESENTATIONS | [paper] | [code(pytorch)] |
| Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | [paper] | [code(pytorch)] |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | [paper] | [code(pytorch)] |
| Fusion of Detected Objects in Text for Visual Question Answering | [paper] | [code(tensorflow)] |
| ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph | [paper] | [code] |
| X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers | [paper] | [code] |
| Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | [paper] | [code] |
| M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining | [paper] | [code] |
| Unified Vision-Language Pre-Training for Image Captioning and VQA | [paper] | [code] |
| Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs | [paper] | [code] |
| VinVL: Making Visual Representations Matter in Vision-Language Models | [paper] | [code] |
| Seeing past words: Testing the cross-modal capabilities of pretrained V&L models | [paper] | [code] |
| Inferring spatial relations from textual descriptions of images | [paper] | [code] |
| DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation | [paper] | [code] |
| Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network | [paper] | [code(pytorch)] |
| Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts | [paper] | [code] |
| Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | [paper] | [code] |
| Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer | [paper] | [code] |
| WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning | [paper] | [code] |

pretrained video and language model

| title | paper link | code link |
| --- | --- | --- |
| VideoBERT: A Joint Model for Video and Language Representation Learning | [paper] | [code] |
| UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | [paper] | [code] |
| Multi-modal Circulant Fusion for Video-to-Language and Backward | [paper] | [code] |
| Video-Grounded Dialogues with Pretrained Generation Language Models | [paper] | [code] |
| Deep Extreme Cut: From Extreme Points to Object Segmentation | [paper] | [code(pytorch)] |
| Integrating Multimodal Information in Large Pretrained Transformers | [paper] | [code(pytorch)] |
| Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text | [paper] | [code(caffe)] |
| PARAMETER EFFICIENT MULTIMODAL TRANSFORMERS FOR VIDEO REPRESENTATION LEARNING | [paper] | [code] |
| LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval | [paper] | [code] |
| VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs | [paper] | [code] |
| HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | [paper] | [code(pytorch)] |
| Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling | [paper] | [code(pytorch)] |

pretrained knowledge and language model

| title | paper link | code link |
| --- | --- | --- |
| Knowledge Enhanced Contextual Word Representations | [paper] | [code(pytorch)] |
| Why Do Masked Neural Language Models Still Need Commonsense Repositories to Handle Semantic Variations in Question Answering? | [paper] | [code] |
| SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge | [paper] | [code] |
| Acquiring Knowledge from Pre-trained Model to Neural Machine Translation | [paper] | [code] |
| Knowledge-Aware Language Model Pretraining | [paper] | [code] |
| Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model | [paper] | [code] |

pretrained vision and language and knowledge model

| title | paper link | code link |
| --- | --- | --- |
| Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge | [paper] | [code] |
| KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning | [paper] | [code] |

About

pre-trained vision and language model summary
