
- Beihang University (北京航空航天大学)
- Beijing (UTC+08:00)
Starred repositories
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
Analyze computation-communication overlap in V3/R1.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
A tool for bandwidth measurements on NVIDIA GPUs.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling (see the per-block scaling sketch after this list)
DeepEP: an efficient expert-parallel communication library
FlashInfer: Kernel Library for LLM Serving
Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper
FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to failures.
SGLang is a fast serving framework for large language models and vision language models.
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models (see the quantized-attention sketch after this list).
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
verl: Volcano Engine Reinforcement Learning for LLMs
A list of awesome compiler projects and papers for tensor computation and deep learning.
kwai / Megatron-Kwai
Forked from NVIDIA/Megatron-LM. [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
NVIDIA Linux open GPU kernel modules with P2P support
Machine Learning Engineering Open Book
A Primer on Memory Consistency and Cache Coherence (Second Edition) translation project
CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the Linux kernel, CPU, disks, Intel PT, GPUs, etc. Dynolog also …
An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)
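
As referenced above, here is a minimal NumPy sketch of the fine-grained scaling idea behind FP8 GEMM kernels such as DeepGEMM: each 128-element block of the operands gets its own scale, block products are computed on the quantized values, and the scales are folded back in during accumulation. The block size, the e4m3 range constant, and all function names here are illustrative assumptions, not DeepGEMM's API.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed e4m3 dynamic range
BLOCK = 128           # assumed per-block scaling granularity

def quantize_blocks(x):
    """Give every BLOCK-wide chunk along the last axis its own scale."""
    xb = x.reshape(*x.shape[:-1], -1, BLOCK)
    scale = np.abs(xb).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.round(xb / scale)  # integer grid stands in for a real fp8 cast
    return q, scale

def gemm_fp8_finegrained(a, b):
    """C = A @ B with per-block dequantization folded into the accumulation."""
    qa, sa = quantize_blocks(a)    # shapes: (M, K/BLOCK, BLOCK)
    qb, sb = quantize_blocks(b.T)  # shapes: (N, K/BLOCK, BLOCK)
    c = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for k in range(qa.shape[1]):                # accumulate one K-block at a time
        partial = qa[:, k, :] @ qb[:, k, :].T   # "low-precision" block product
        c += partial * (sa[:, k] * sb[:, k].T)  # rescale with both blocks' scales
    return c

a = np.random.randn(64, 256).astype(np.float32)
b = np.random.randn(256, 32).astype(np.float32)
print(np.abs(gemm_fp8_finegrained(a, b) - a @ b).max())  # small quantization error
```

Real kernels keep the block products in fp8 on tensor cores and accumulate in higher precision; the loop above only demonstrates where the per-block scales enter the math.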
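Also referenced above, a toy PyTorch sketch of the quantized-attention idea behind SageAttention: Q and K are quantized to INT8 with symmetric scales, the score matrix is computed on the quantized values, and the scales are reapplied before the softmax. This sketches the general technique under simplified assumptions (per-tensor scales, full-precision V), not SageAttention's actual kernels, which use finer-grained scaling and fused GPU code.

```python
import torch

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization; returns quantized values and scale."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127)
    return q, scale

def quantized_attention(q, k, v):
    """Attention with an INT8-quantized QK^T; V stays in full precision."""
    qq, sq = quantize_int8(q)
    qk, sk = quantize_int8(k)
    # The quantized entries are small integers, so this fp32 matmul is exact
    # and stands in for a real INT8 tensor-core matmul.
    scores = (qq @ qk.transpose(-1, -2)) * (sq * sk) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(8, 64, 128) for _ in range(3))
ref = torch.softmax(q @ k.transpose(-1, -2) / 128 ** 0.5, dim=-1) @ v
print((quantized_attention(q, k, v) - ref).abs().max())  # deviation stays small
```

Dequantizing before the softmax is what keeps end-to-end metrics intact; the integer score matmul is where the speedup comes from.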