
- Beihang University (北京航空航天大学)
- Beijing (UTC+08:00)
Starred repositories
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
Analyze computation-communication overlap in V3/R1.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
A tool for bandwidth measurements on NVIDIA GPUs.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling (see the per-block scaling sketch after this list)
DeepEP: an efficient expert-parallel communication library
FlashInfer: Kernel Library for LLM Serving
Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper
FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to failures.
SGLang is a fast serving framework for large language models and vision language models.
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models (see the quantized-attention sketch after this list).
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
verl: Volcano Engine Reinforcement Learning for LLMs
A list of awesome compiler projects and papers for tensor computation and deep learning.
kwai / Megatron-Kwai
Forked from NVIDIA/Megatron-LM. [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
NVIDIA Linux open GPU kernel modules with P2P support
Machine Learning Engineering Open Book
A Primer on Memory Consistency and Cache Coherence (Second Edition) translation project
CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the Linux kernel, CPU, disks, Intel PT, GPUs, etc. Dynolog also …
An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)
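
As referenced above, here is a minimal NumPy sketch of the fine-grained scaling idea behind FP8 GEMM kernels such as DeepGEMM: each 128-element block of the operands gets its own scale, block products are computed on the quantized values, and the scales are folded back in during accumulation. The block size, the e4m3 range constant, and all function names here are illustrative assumptions, not DeepGEMM's API.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed e4m3 dynamic range
BLOCK = 128           # assumed per-block scaling granularity

def quantize_blocks(x):
    """Give every BLOCK-wide chunk along the last axis its own scale."""
    xb = x.reshape(*x.shape[:-1], -1, BLOCK)
    scale = np.abs(xb).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.round(xb / scale)  # integer grid stands in for a real fp8 cast
    return q, scale

def gemm_fp8_finegrained(a, b):
    """C = A @ B with per-block dequantization folded into the accumulation."""
    qa, sa = quantize_blocks(a)    # shapes: (M, K/BLOCK, BLOCK)
    qb, sb = quantize_blocks(b.T)  # shapes: (N, K/BLOCK, BLOCK)
    c = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for k in range(qa.shape[1]):                # accumulate one K-block at a time
        partial = qa[:, k, :] @ qb[:, k, :].T   # "low-precision" block product
        c += partial * (sa[:, k] * sb[:, k].T)  # rescale with both blocks' scales
    return c

a = np.random.randn(64, 256).astype(np.float32)
b = np.random.randn(256, 32).astype(np.float32)
print(np.abs(gemm_fp8_finegrained(a, b) - a @ b).max())  # small quantization error
```

Real kernels keep the block products in fp8 on tensor cores and accumulate in higher precision; the loop above only demonstrates where the per-block scales enter the math.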
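Also referenced above, a toy PyTorch sketch of the quantized-attention idea behind SageAttention: Q and K are quantized to INT8 with symmetric scales, the score matrix is computed on the quantized values, and the scales are reapplied before the softmax. This sketches the general technique under simplified assumptions (per-tensor scales, full-precision V), not SageAttention's actual kernels, which use finer-grained scaling and fused GPU code.

```python
import torch

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization; returns quantized values and scale."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127)
    return q, scale

def quantized_attention(q, k, v):
    """Attention with an INT8-quantized QK^T; V stays in full precision."""
    qq, sq = quantize_int8(q)
    qk, sk = quantize_int8(k)
    # The quantized entries are small integers, so this fp32 matmul is exact
    # and stands in for a real INT8 tensor-core matmul.
    scores = (qq @ qk.transpose(-1, -2)) * (sq * sk) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(8, 64, 128) for _ in range(3))
ref = torch.softmax(q @ k.transpose(-1, -2) / 128 ** 0.5, dim=-1) @ v
print((quantized_attention(q, k, v) - ref).abs().max())  # deviation stays small
```

Dequantizing before the softmax is what keeps end-to-end metrics intact; the integer score matmul is where the speedup comes from.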