felicitydu's starred repositories

Samples for CUDA developers demonstrating features in the CUDA Toolkit.

C · 6,826 stars · 1,911 forks · Updated Jul 26, 2024

2 stars · Updated Jul 14, 2024

Notes from learning.

Jupyter Notebook · 4 stars · Updated Sep 3, 2024

FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python · 690 stars · 55 forks · Updated Sep 4, 2024

Yinghan's Code Sample

Cuda · 305 stars · 55 forks · Updated Jul 25, 2022

Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.

Cuda · 316 stars · 47 forks · Updated Jan 2, 2025

Cuda · 2 stars · Updated Mar 15, 2023
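The SGEMM-optimization projects in this list all start from the same core technique: tiling the multiply through shared memory so each global-memory element is loaded once per tile rather than once per output element. A minimal sketch of that first step (the kernel name and the 32x32 tile size are illustrative assumptions, not code from any of these repos):

```cuda
#define TILE 32

// C = A * B for row-major N x N matrices; one 32x32 output tile per block.
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Cooperative load with zero-padding at the matrix edges.
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();               // tile fully staged before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // finish reads before the next overwrite
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

Launched with `dim3 block(TILE, TILE)` and `dim3 grid((N+TILE-1)/TILE, (N+TILE-1)/TILE)`; the later steps these repos document (register blocking, vectorized loads, double buffering) all build on this baseline.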

Multithreaded matrix multiplication and analysis using OpenMP and Pthreads.

Cuda · 146 stars · 35 forks · Updated Nov 25, 2023

A pure-C++ cross-platform LLM acceleration library, callable from Python; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile devices.

C++ · 3,372 stars · 349 forks · Updated Jan 24, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda · 1,898 stars · 189 forks · Updated Jan 31, 2025

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda · 686 stars · 61 forks · Updated Dec 30, 2024

Inference Llama 2 in one file of pure C

C · 17,957 stars · 2,181 forks · Updated Aug 6, 2024

A great project for campus recruitment and internships! Build a high-performance deep learning inference library from scratch, supporting inference for models such as Llama 2, UNet, YOLOv5, and ResNet. Implement a high-performance deep learning inference library step by step.

C++ · 2,680 stars · 303 forks · Updated Oct 26, 2024

Transformer-related optimization, including BERT and GPT

C++ · 6,001 stars · 897 forks · Updated Mar 27, 2024

Fast and memory-efficient exact attention

Python · 15,237 stars · 1,439 forks · Updated Jan 30, 2025

LLM inference in C/C++

C++ · 72,394 stars · 10,431 forks · Updated Jan 31, 2025

Step-by-step optimization of CUDA SGEMM

Cuda · 277 stars · 43 forks · Updated Mar 30, 2022

Xiao's CUDA Optimization Guide [actively adding new content]

260 stars · 18 forks · Updated Nov 8, 2022

Implementation of a simple CNN using CUDA

Cuda · 66 stars · 20 forks · Updated May 2, 2017
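The CNN-in-CUDA projects above hinge on mapping the convolution layer onto the GPU, most simply with one thread per output pixel. A minimal sketch of that direct, unoptimized approach (the function name and the "valid" padding choice are illustrative assumptions, not taken from any of these repos):

```cuda
// Direct single-channel 2D convolution, one thread per output pixel.
// "Valid" padding: the output is (H-K+1) x (W-K+1). No shared memory;
// this is the naive starting point before tiling the input.
__global__ void conv2d_valid(const float *in, const float *kernel,
                             float *out, int H, int W, int K) {
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int outH = H - K + 1, outW = W - K + 1;
    if (oy >= outH || ox >= outW) return;   // guard threads past the edge

    float acc = 0.0f;
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
            acc += in[(oy + ky) * W + (ox + kx)] * kernel[ky * K + kx];
    out[oy * outW + ox] = acc;
}
```

Because neighboring output pixels reuse overlapping input windows, the natural next optimization is staging the input tile (plus its K-1 halo) in shared memory, the same idea as the tiled GEMM kernels elsewhere in this list.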

Fast CUDA matrix multiplication from scratch

Cuda · 605 stars · 75 forks · Updated Dec 28, 2023

Convolutional Neural Network with CUDA (MNIST 99.23%)

C++ · 185 stars · 39 forks · Updated Apr 4, 2022

A simple high-performance CUDA GEMM implementation.

Cuda · 344 stars · 38 forks · Updated Jan 4, 2024

Matrix multiplication in CUDA

Cuda · 119 stars · 66 forks · Updated Aug 10, 2023

A large number of CUDA/TensorRT examples to learn from.

C++ · 121 stars · 141 forks · Updated Jul 24, 2022

A list of useful libraries and resources for CUDA development.

541 stars · 45 forks · Updated Oct 8, 2017

A CNN accelerated by CUDA, tested on MNIST and finally reaching 99.76% accuracy.

Cuda · 184 stars · 85 forks · Updated Oct 15, 2017

CUDA official sample codes

C++ · 356 stars · 177 forks · Updated Oct 6, 2015

📚 200+ Tensor/CUDA Cores kernels, ⚡️ flash-attn-mma, ⚡️ hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).

Cuda · 2,176 stars · 230 forks · Updated Jan 27, 2025

An implementation of the transformer architecture as Nvidia CUDA kernels

Cuda · 167 stars · 11 forks · Updated Sep 24, 2023

A series of GPU optimization topics that introduces in detail how to optimize CUDA kernels, covering several basic kernel optimizations, including: elementwise, reduce, s…

Cuda · 896 stars · 142 forks · Updated Jul 29, 2023
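Of the basic kernel optimizations that series names, reduce is the classic case study: tutorials typically progress from naive shared-memory trees to warp-shuffle versions that need no shared memory within a warp. A hedged sketch of the warp-shuffle stage (function names are illustrative; assumes blockDim.x is a multiple of 32):

```cuda
// Reduce one value per lane to lane 0 of each warp using shuffles.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);  // fold upper lanes down
    return v;
}

// Sum-reduce n floats into *out (out must be zero-initialized).
__global__ void reduce_sum(const float *in, float *out, int n) {
    __shared__ float warp_sums[32];            // one partial sum per warp
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];                            // grid-stride accumulation

    v = warp_reduce_sum(v);                    // reduce within each warp
    int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
    if (lane == 0) warp_sums[warp] = v;
    __syncthreads();

    if (warp == 0) {                           // first warp reduces the partials
        v = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        v = warp_reduce_sum(v);
        if (lane == 0) atomicAdd(out, v);      // combine across blocks
    }
}
```

The elementwise case is the simpler end of the same progression: a grid-stride loop like the one above, with vectorized (`float4`) loads as the usual follow-up.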