Stars
calflops is designed to calculate FLOPs, MACs, and Parameters in various neural networks, such as Linear, CNN, RNN, GCN, and Transformer models (BERT, LLaMA, and other large language models)
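A rough usage sketch for calflops, assuming its `calculate_flops` entry point; the torchvision ResNet-18 model and the 224x224 input shape below are only illustrative choices, not part of the original description:

```python
# Minimal sketch: count FLOPs, MACs, and parameters of a model with calflops.
from torchvision import models
from calflops import calculate_flops

model = models.resnet18()            # any nn.Module works; ResNet-18 is an example
batch_size = 1
input_shape = (batch_size, 3, 224, 224)

# output_as_string=True returns human-readable values such as "1.82 GFLOPs"
flops, macs, params = calculate_flops(
    model=model,
    input_shape=input_shape,
    output_as_string=True,
    output_precision=4,
)
print(f"ResNet-18  FLOPs: {flops}  MACs: {macs}  Params: {params}")
```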
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
Official Repo for AAAI 2025 G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o
Janus-Series: Unified Multimodal Understanding and Generation Models
[ICLR 2025] Autoregressive Video Generation without Vector Quantization
SwinIR: Image Restoration Using Swin Transformer (official repository)
This repo contains code for "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation"
[ArXiv] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
PyTorch implementation of RCG https://arxiv.org/abs/2312.03701
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
👁️ 🖼️ 🔥PyTorch Toolbox for Image Quality Assessment, including PSNR, SSIM, LPIPS, FID, NIQE, NRQM(Ma), MUSIQ, TOPIQ, NIMA, DBCNN, BRISQUE, PI and more...
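A minimal sketch of scoring images with this toolbox (pyiqa), assuming its `create_metric` factory; the metric choices and image paths are placeholders for illustration:

```python
# Minimal sketch: full-reference and no-reference image quality assessment with pyiqa.
import torch
import pyiqa

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Full-reference metric: compares a distorted image against its reference.
lpips_metric = pyiqa.create_metric("lpips", device=device)
lpips_score = lpips_metric("distorted.png", "reference.png")

# No-reference metric: scores a single image on its own.
niqe_metric = pyiqa.create_metric("niqe", device=device)
niqe_score = niqe_metric("distorted.png")

print(f"LPIPS: {lpips_score.item():.4f}  NIQE: {niqe_score.item():.4f}")
# Each metric reports whether lower values mean better quality.
print("lower is better:", lpips_metric.lower_better, niqe_metric.lower_better)
```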
[ECCV 2024] Code for DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior
[ECCV 2024] Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization
Official code for "FeatUp: A Model-Agnostic Framework for Features at Any Resolution" (ICLR 2024)
PyTorch code and models for the DINOv2 self-supervised learning method.
A curated list of face restoration & enhancement papers and resources
[ECCV 2024] Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
A lightweight, flexible Video-MLLM developed by the TencentQQ Multimedia Research Team.
Graphic notes on Gilbert Strang's "Linear Algebra for Everyone"
LVBench: An Extreme Long Video Understanding Benchmark
Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions"
[ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Code for Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Awesome papers & datasets specifically focused on long-term videos.
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
[ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou
✨✨Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Dataset introduced in PlotQA: Reasoning over Scientific Plots