Stars
Rethinking Step-by-step Visual Reasoning in LLMs
An AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, web scraping, and large language models. The goal of this repo is to provide the si…
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
This repository collects research papers on large vision-language models in autonomous driving and intelligent transportation systems. The repository will be continuously updated to track the lates…
Fully open reproduction of DeepSeek-R1
[ICCV2023] Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
[CVPR 2024] MAPLM: A Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding
TensorDict is a PyTorch-dedicated tensor container.
Clean, minimal, accessible reproduction of DeepSeek R1-Zero
A family of versatile, state-of-the-art video tokenizers.
VideoX: a collection of video cross-modal models
PyTorch implementation of the CVPR 2024 paper: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
This is a simple demonstration of more advanced, agentic patterns built on top of the Realtime API.
Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)
Official implementation of "Why are Visually-Grounded Language Models Bad at Image Classification?" (NeurIPS 2024)
An open-source implementation for fine-tuning SmolVLM.
Agent Laboratory is an end-to-end autonomous research workflow meant to assist you as the human researcher toward implementing your research ideas
A suite of image and video neural tokenizers
Cosmos is a world model development platform that consists of world foundation models, tokenizers, and a video processing pipeline to accelerate the development of Physical AI at Robotics & AV labs. C…
Infinity is a high-throughput, low-latency serving engine for text embeddings, reranking models, CLIP, CLAP, and ColPali.
Fast Python port of arc90's readability tool, updated to match the latest readability.js.
🤗 smolagents: a barebones library for agents. Agents write Python code to call tools and orchestrate other agents.
Implementing the 4 agentic patterns from scratch