Stars
Official implementation of SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
[ECCV 2024] The official code of paper "Open-Vocabulary SAM".
Official repository of "SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory"
Official implementation of "Why are Visually-Grounded Language Models Bad at Image Classification?" (NeurIPS 2024)
Image Classification Testing with LLMs
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
Code for Label Propagation for Zero-shot Classification with Vision-Language Models (CVPR2024)
A curated list of resources for Partial-Multi-Label-Learning
Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [CVPR 2023]
[ICCV 2023] StreamPETR: Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection
Multi-label Image Recognition with Partial Labels (IJCV'24, ESWA'24, AAAI'22)
[2024 ACM MM] Official PyTorch implementation of the paper "Text-Region Matching for Multi-Label Image Recognition with Missing Labels"
Official implementation of "TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt" (IJCAI 2024).
Unofficial implementation of CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification [ICCV'23]
[ICLR 2024] Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models.
Collection of awesome test-time (domain/batch/instance) adaptation methods
CatVTON is a simple and efficient virtual try-on diffusion model with 1) Lightweight Network (899.06M parameters in total), 2) Parameter-Efficient Training (49.57M parameters trainable) and 3) Simpl…
[ICML 2024] Official implementation for "Image Fusion via Vision-Language Model".
Holds code for our CVPR'23 tutorial: All Things ViTs: Understanding and Interpreting Attention in Vision.
[CVPR 2023] Official repository of paper titled "Fine-tuned CLIP models are efficient video learners".
[CVPR 2023] Efficient Frequency Domain-based Transformer for High-Quality Image Deblurring