Stars
How to optimize some algorithms in CUDA.
📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attention-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS 🎉🎉).
FlashInfer: Kernel Library for LLM Serving
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…
Flash Attention in ~100 lines of CUDA (forward pass only)
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
A simple high performance CUDA GEMM implementation.
A CUDA tutorial for learning CUDA programming from scratch
CUDA Matrix Multiplication Optimization