Stars
A generative world for general-purpose robotics & embodied AI learning.
CVPR-24 | Official codebase for ZONE: Zero-shot InstructiON-guided Local Editing
OneTrainer is a one-stop solution for all your stable diffusion training needs.
Multi-lingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
A port of muerrilla's sd-webui-Detail-Daemon as a node for ComfyUI, to adjust sigmas that control detail.
ChatGPT Advanced Voice Mode Gets an Avatar!
Netflix-level subtitle cutting, translation, alignment, and even dubbing - a one-click, fully automated AI video subtitle team
[ECCV 2024] PowerPaint, a versatile image inpainting model that supports text-guided object inpainting, object removal, image outpainting and shape-guided object inpainting with only a single model…
[NeurIPS 2024 Oral][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-sim…
Official Implementation of HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
The official repository of "SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model"
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control
Unofficial implementation of ComfyUI Magic Clothing
HunyuanVideo: A Systematic Framework For Large Video Generation Model
[arXiv 2024] Novel View Extrapolation with Video Diffusion Priors
Finetune Llama 3.3, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 70% less memory
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
WWW2025 Multimodal Intent Recognition for Dialogue Systems Challenge
[ECCV 2024 - Oral] HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
🌈 A cross-platform application for selected-text translation and OCR.
The state-of-the-art image restoration model without nonlinear activation functions.
LLM-powered multiagent persona simulation for imagination enhancement and business insights.
[RSS 2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
GIF is a photorealistic generative face model with explicit 3D geometric and photometric control.