  • Kuaishou
  • Beijing, China


CUDA Kernel Benchmarking Library

Cuda 558 71 Updated Nov 20, 2024

TransMLA: Equivalently Transforms Group Query Attention into Multi-head Latent Attention.

Python 68 6 Updated Feb 10, 2025

This repository contains the Hugging Face Agents Course.

MDX 9,266 490 Updated Feb 14, 2025

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python 6,372 364 Updated Feb 15, 2025

🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc.

Python 182 12 Updated Feb 8, 2025

An AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, web scraping, and large language models. The goal of this repo is to provide the si…

TypeScript 11,659 1,130 Updated Feb 11, 2025

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

Python 971 80 Updated Feb 15, 2025

AISystem covers AI systems broadly, including AI chips, AI compilers, AI inference and training frameworks, and other full-stack low-level AI technologies.

Jupyter Notebook 12,341 1,787 Updated Jan 2, 2025
Python 677 74 Updated Feb 10, 2025

CUDA Core Compute Libraries

C++ 1,459 188 Updated Feb 14, 2025

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Cuda 1,715 448 Updated Oct 9, 2023

A high-performance deep learning training platform with task-level time-sharing scheduling of GPU compute.

Python 481 66 Updated Oct 24, 2023

FlashInfer: Kernel Library for LLM Serving

Cuda 2,015 207 Updated Feb 14, 2025

Universal LLM Deployment Engine with ML Compilation

Python 19,964 1,663 Updated Feb 12, 2025

ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs)

Python 37 4 Updated Feb 15, 2025

FlagGems is an operator library for large language models implemented in Triton Language.

Python 417 64 Updated Feb 14, 2025

[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Cuda 660 41 Updated Feb 14, 2025

Bringing BERT into modernity via both architecture changes and scaling

Python 1,179 78 Updated Feb 13, 2025

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.

Python 771 53 Updated Feb 12, 2025

Triton backend that enables pre-processing, post-processing, and other logic to be implemented in Python.

C++ 587 158 Updated Feb 11, 2025
Python 13 3 Updated Dec 7, 2024

The core library and APIs implementing the Triton Inference Server.

C++ 115 104 Updated Feb 14, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 9,610 913 Updated Feb 14, 2025

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Python 519 39 Updated Feb 14, 2025

This repository contains tutorials and examples for Triton Inference Server

Python 641 105 Updated Feb 15, 2025

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 2,540 150 Updated Feb 14, 2025

Rust bindings to the Triton Inference Server

Rust 11 2 Updated Mar 14, 2024

Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

Cuda 948 58 Updated Jan 30, 2025

A multi-purpose IDE specialized in C/C++/Rust/Python/PHP and Node.js, written in C++.

C++ 2,191 469 Updated Feb 13, 2025