- Kuaishou (快手)
- Beijing, China.
Stars
TransMLA: Equivalently Transforms Group Query Attention into Multi-head Latent Attention.
This repository contains the Hugging Face Agents Course.
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc.
An AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, web scraping, and large language models. The goal of this repo is to provide the si…
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
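AISystem mainly refers to AI systems, covering full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks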
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
FlashInfer: Kernel Library for LLM Serving
Universal LLM Deployment Engine with ML Compilation
ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs)
FlagGems is an operator library for large language models implemented in Triton Language.
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Bringing BERT into modernity via both architecture changes and scaling
PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
Triton backend that enables pre-processing, post-processing, and other logic to be implemented in Python.
The core library and APIs implementing the Triton Inference Server.
SGLang is a fast serving framework for large language models and vision language models.
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
This repository contains tutorials and examples for Triton Inference Server
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
A multi-purpose IDE specialized in C/C++/Rust/Python/PHP and Node.js. Written in C++.