oyanghd
  • Beihang University (北京航空航天大学)
  • Beijing

Organizations

@buaasupercomputersociety

Starred repositories

A high-performance distributed file system designed to address the challenges of AI training and inference workloads.

C++ 4,386 344 Updated Mar 1, 2025

Expert Parallelism Load Balancer

Python 853 101 Updated Feb 27, 2025
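
An expert-parallelism load balancer decides which GPU hosts which (possibly replicated) experts so that expected token traffic is spread evenly. A minimal sketch of that objective, using a greedy least-loaded placement in plain Python (illustrative only; this is not the repository's actual algorithm or API):

```python
# Hypothetical sketch of expert-parallel load balancing: greedily place each
# expert (weighted by its recent token count) on the currently least-loaded GPU.
import heapq

def balance_experts(expert_loads: list[int], num_gpus: int) -> dict[int, list[int]]:
    """Return a mapping gpu_id -> list of expert ids, roughly balancing total load."""
    heap = [(0, gpu) for gpu in range(num_gpus)]  # min-heap of (accumulated_load, gpu_id)
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}
    # Place the heaviest experts first so large loads are spread out early.
    for expert_id in sorted(range(len(expert_loads)), key=lambda e: -expert_loads[e]):
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert_id)
        heapq.heappush(heap, (load + expert_loads[expert_id], gpu))
    return placement

# Example: 8 experts with skewed token counts spread over 4 GPUs.
print(balance_experts([900, 120, 80, 300, 310, 50, 40, 700], num_gpus=4))
```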

Analyze computation-communication overlap in V3/R1.

686 76 Updated Feb 27, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.

Python 2,178 165 Updated Feb 28, 2025

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

C++ 11,251 2,160 Updated Feb 1, 2025

A tool for bandwidth measurements on NVIDIA GPUs.

C++ 375 31 Updated Feb 7, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 4,404 383 Updated Feb 28, 2025
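
Fine-grained scaling means each small tile of the matrix carries its own scale factor, so the narrow numeric format (FP8 on Hopper) loses little accuracy. A rough NumPy illustration of per-block scaling, using int8 as the narrow type since NumPy has no FP8 (conceptual only, not DeepGEMM's kernels or API):

```python
# Conceptual sketch of fine-grained (per-block) scaling for a quantized GEMM.
# Each block of A is quantized to a narrow type with its own scale factor and
# dequantized before the matmul; real FP8 kernels fuse this on-GPU.
import numpy as np

def quantize_blocks(a: np.ndarray, block: int = 32):
    """Quantize each (block x block) tile of `a` to int8 with its own scale."""
    m, k = a.shape
    q = np.zeros_like(a, dtype=np.int8)
    scales = np.zeros((m // block, k // block), dtype=np.float32)
    for i in range(0, m, block):
        for j in range(0, k, block):
            tile = a[i:i+block, j:j+block]
            s = np.abs(tile).max() / 127.0 + 1e-12
            scales[i // block, j // block] = s
            q[i:i+block, j:j+block] = np.round(tile / s).astype(np.int8)
    return q, scales

def dequantize_blocks(q: np.ndarray, scales: np.ndarray, block: int = 32) -> np.ndarray:
    out = q.astype(np.float32)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            out[i*block:(i+1)*block, j*block:(j+1)*block] *= scales[i, j]
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)
qa, sa = quantize_blocks(a)
approx = dequantize_blocks(qa, sa) @ b
print("max abs error vs. full precision:", np.abs(approx - a @ b).max())
```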

DeepEP: an efficient expert-parallel communication library

Cuda 6,625 524 Updated Feb 28, 2025
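
Expert-parallel communication is essentially a dispatch (route each token's hidden state to the rank that owns its expert) followed by a combine (restore the original token order). A single-process sketch of that bookkeeping with index permutations, leaving out the actual all-to-all collectives the library implements:

```python
# Single-process sketch of MoE dispatch/combine. In a real expert-parallel setup
# the grouping step is followed by an all-to-all collective across ranks;
# here everything stays local to illustrate the index bookkeeping only.
import torch

def dispatch(tokens: torch.Tensor, expert_ids: torch.Tensor):
    """Group tokens by their assigned expert; return grouped tokens and the permutation."""
    order = torch.argsort(expert_ids)
    return tokens[order], order

def combine(expert_outputs: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    """Scatter expert outputs back to the original token order."""
    restored = torch.empty_like(expert_outputs)
    restored[order] = expert_outputs
    return restored

tokens = torch.arange(12, dtype=torch.float32).reshape(6, 2)
expert_ids = torch.tensor([2, 0, 1, 0, 2, 1])
grouped, order = dispatch(tokens, expert_ids)
assert torch.equal(combine(grouped, order), tokens)
print(grouped)
```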

FlashInfer: Kernel Library for LLM Serving

Cuda 2,218 231 Updated Feb 28, 2025

Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper

Python 488 18 Updated Feb 28, 2025

FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs

C++ 10,735 697 Updated Feb 27, 2025

🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"

Python 458 24 Updated Feb 27, 2025
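
One ingredient of hardware-aligned sparse attention is selecting a small set of key/value blocks per query from coarse block-level scores and attending only inside them. A toy dense-NumPy illustration of such block selection (not the paper's full method nor either repository's kernels):

```python
# Toy block-selection attention: score each query against block-level key
# summaries, keep only the top-k blocks, and run dense attention inside them.
import numpy as np

def block_sparse_attention(q, k, v, block: int = 16, top_k_blocks: int = 2):
    n, d = k.shape
    num_blocks = n // block
    # Coarse score: each query against the mean key of every block.
    block_keys = k.reshape(num_blocks, block, d).mean(axis=1)        # (B, d)
    block_scores = q @ block_keys.T                                   # (nq, B)
    keep = np.argsort(-block_scores, axis=-1)[:, :top_k_blocks]       # (nq, top_k)
    out = np.zeros((q.shape[0], d), dtype=q.dtype)
    for i in range(q.shape[0]):
        cols = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep[i]])
        scores = q[i] @ k[cols].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ v[cols]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)).astype(np.float32) for _ in range(3))
print(block_sparse_attention(q, k, v).shape)  # (64, 32)
```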

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

5,566 103 Updated Feb 28, 2025

NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to failures.

Python 93 7 Updated Feb 21, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 11,072 1,106 Updated Mar 1, 2025
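
SGLang serves models behind an OpenAI-compatible HTTP endpoint, so a deployed server can be queried with the standard openai client. The launch command, port, and served model name below are assumptions for illustration; check the SGLang documentation for the current flags:

```python
# Query a locally running SGLang server through its OpenAI-compatible API.
# Assumes a server was started with something like:
#   python -m sglang.launch_server --model-path <model> --port 30000
# (launch flags, port, and model name are assumptions; consult the SGLang docs).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",  # served model name depends on the launch configuration
    messages=[{"role": "user", "content": "Summarize expert parallelism in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```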

Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

Cuda 1,019 64 Updated Feb 28, 2025
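
The speedup comes from computing attention scores in low precision with per-tensor (or finer-grained) scales and dequantizing before the softmax. A toy NumPy version of that idea, not the repository's CUDA kernels:

```python
# Toy sketch of INT8-quantized attention scores: quantize Q and K to int8 with
# per-tensor scales, accumulate Q @ K^T in int32, then dequantize before softmax.
import numpy as np

def int8_quantize(x: np.ndarray):
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def quantized_attention(q, k, v):
    qi, qs = int8_quantize(q)
    ki, ks = int8_quantize(k)
    # Integer score matrix accumulated in int32, then rescaled back to float.
    scores = qi.astype(np.int32) @ ki.astype(np.int32).T * (qs * ks)
    scores = scores / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # V kept in full precision here for simplicity

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 64)).astype(np.float32) for _ in range(3))
print(quantized_attention(q, k, v).shape)  # (16, 64)
```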

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python 11,934 775 Updated Feb 28, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python 3,963 362 Updated Feb 28, 2025

A list of awesome compiler projects and papers for tensor computation and deep learning.

2,495 307 Updated Oct 19, 2024

[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism

Python 52 2 Updated Jul 31, 2024

📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉

3,527 242 Updated Feb 27, 2025

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

Cuda 2,540 262 Updated Feb 24, 2025

NVIDIA Linux open GPU kernel modules with P2P support

C 1,034 101 Updated Dec 18, 2024

Machine Learning Engineering Open Book

Python 12,986 792 Updated Feb 28, 2025

TORCH_LOGS parser for PT2

Rust 33 14 Updated Feb 28, 2025

A Primer on Memory Consistency and Cache Coherence (Second Edition) translation project

207 41 Updated May 5, 2024

CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.

C++ 2,448 220 Updated Mar 1, 2025

Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, like the Linux kernel, CPU, disks, Intel PT, GPUs, etc. Dynolog also …

C++ 297 52 Updated Feb 25, 2025

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)

Python 5,152 512 Updated Feb 28, 2025

A fast MoE impl for PyTorch

Python 1,641 191 Updated Feb 10, 2025
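
A Mixture-of-Experts layer routes each token through a small subset of expert MLPs chosen by a learned gate. A minimal PyTorch sketch of top-k routing (illustrative only; FastMoE's actual API additionally handles distributed expert placement and batched dispatch):

```python
# Minimal top-k Mixture-of-Experts layer: a gate scores experts per token and
# each token's output is a weighted sum of its top-k experts' outputs.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        weights, idx = self.gate(x).softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TinyMoE(d_model=32)
print(layer(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```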