Starred repositories
Samples for CUDA developers demonstrating features in the CUDA Toolkit
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups at medium batch sizes of up to 16-32 tokens.
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
Multithreaded matrix multiplication and performance analysis based on OpenMP and Pthreads
A pure C++ cross-platform LLM acceleration library, callable from Python. ChatGLM-6B-class models can reach 10000+ tokens/s on a single GPU. Supports GLM, LLaMA, and MOSS base models, and runs smoothly on mobile devices.
FlashInfer: Kernel Library for LLM Serving
Flash Attention in ~100 lines of CUDA (forward pass only)
A great project for campus recruiting (fall/spring hiring) and internships! Implement a high-performance deep learning inference library from scratch, step by step, supporting inference for models such as the Llama 2 large model, UNet, YOLOv5, and ResNet.
Transformer-related optimizations, including BERT and GPT
Fast and memory-efficient exact attention
Step-by-step optimization of CUDA SGEMM
Xiao's CUDA Optimization Guide [Actively Adding New Content]
Convolutional Neural Network with CUDA (MNIST 99.23%)
A simple, high-performance CUDA GEMM implementation.
A large collection of CUDA/TensorRT examples to learn from.
This is a list of useful libraries and resources for CUDA development.
A CNN accelerated with CUDA, tested on MNIST and finally reaching 99.76% accuracy.
📚 200+ Tensor/CUDA Core kernels, ⚡️ flash-attn-mma, ⚡️ hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉).
An implementation of the Transformer architecture as an NVIDIA CUDA kernel
A series of GPU optimization topics introducing in detail how to optimize CUDA kernels, covering several basic kernel optimizations, including elementwise, reduce, s… A minimal elementwise kernel sketch is given below.
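The last entry starts its kernel-optimization series from elementwise kernels. As a rough illustration of that baseline, here is a minimal grid-stride elementwise add in CUDA; the kernel name vecAdd and the problem size are illustrative assumptions, not taken from any repository listed here.

```cuda
// Minimal sketch of a baseline elementwise kernel (vector add).
// Names and sizes are illustrative, not from any listed repository.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // Grid-stride loop: each thread processes multiple elements, so the
    // kernel is correct for any n regardless of launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; tuned kernels typically use
    // explicit cudaMalloc/cudaMemcpy instead.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int block = 256;
    const int grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.000000
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Optimization series like this one typically begin from such a baseline and then measure achieved bandwidth, vectorize loads (e.g. with float4), and tune launch parameters from there.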