guanrenyang
  • Shanghai Jiao Tong University
  • Shanghai

13 stars written in Cuda

LLM training in simple, raw C/CUDA

Cuda · 24,760 stars · 2,804 forks · Updated Oct 2, 2024

cuGraph - RAPIDS Graph Analytics Library

Cuda · 1,813 stars · 309 forks · Updated Dec 19, 2024

Tile primitives for speedy kernels

Cuda · 1,752 stars · 79 forks · Updated Dec 13, 2024

How to optimize some algorithms in CUDA.

Cuda · 1,746 stars · 144 forks · Updated Dec 18, 2024

📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attention-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS 🎉🎉).

Cuda · 1,685 stars · 176 forks · Updated Dec 19, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda · 1,573 stars · 160 forks · Updated Dec 19, 2024

A series of GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including: elementwise, reduce, s…

Cuda · 860 stars · 135 forks · Updated Jul 29, 2023

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 657 stars · 58 forks · Updated Apr 7, 2024

Examples demonstrating the available options for programming multiple GPUs in a single node or a cluster

Cuda · 578 stars · 112 forks · Updated Oct 30, 2024

Fast CUDA matrix multiplication from scratch

Cuda · 540 stars · 66 forks · Updated Dec 28, 2023

A simple, high-performance CUDA GEMM implementation.

Cuda · 338 stars · 37 forks · Updated Jan 4, 2024

A CUDA tutorial for learning CUDA programming from scratch

Cuda · 200 stars · 54 forks · Updated Jul 9, 2024

CUDA Matrix Multiplication Optimization

Cuda · 147 stars · 13 forks · Updated Jul 19, 2024