Starred repositories
Samples for CUDA developers demonstrating features in the CUDA Toolkit
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups at medium batch sizes of up to 16-32 tokens.
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
Multithreaded matrix multiplication and performance analysis based on OpenMP and Pthreads
A pure C++ cross-platform LLM acceleration library, callable from Python. ChatGLM-6B-class models can reach 10000+ tokens/s on a single GPU. Supports GLM, LLaMA, and MOSS base models, and runs smoothly on mobile devices.
FlashInfer: Kernel Library for LLM Serving
Flash Attention in ~100 lines of CUDA (forward pass only)
A great project for campus recruiting (fall/spring hiring) and internships! Implement a high-performance deep learning inference library from scratch, step by step, supporting inference for models such as the Llama 2 large model, UNet, YOLOv5, and ResNet.
Transformer-related optimizations, including BERT and GPT
Fast and memory-efficient exact attention
Step-by-step optimization of CUDA SGEMM
Xiao's CUDA Optimization Guide [Actively Adding New Content]
Convolutional Neural Network with CUDA (MNIST 99.23%)
A simple, high-performance CUDA GEMM implementation.
A large collection of CUDA/TensorRT examples to learn from.
This is a list of useful libraries and resources for CUDA development.
A CNN accelerated with CUDA, tested on MNIST and finally reaching 99.76% accuracy.
📚 200+ Tensor/CUDA Core kernels, ⚡️ flash-attn-mma, ⚡️ hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉).
An implementation of the Transformer architecture as an NVIDIA CUDA kernel
A series of GPU optimization topics introducing in detail how to optimize CUDA kernels, covering several basic kernel optimizations, including elementwise, reduce, s… A minimal elementwise kernel sketch is given below.
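The last entry starts its kernel-optimization series from elementwise kernels. As a rough illustration of that baseline, here is a minimal grid-stride elementwise add in CUDA; the kernel name vecAdd and the problem size are illustrative assumptions, not taken from any repository listed here.

```cuda
// Minimal sketch of a baseline elementwise kernel (vector add).
// Names and sizes are illustrative, not from any listed repository.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // Grid-stride loop: each thread processes multiple elements, so the
    // kernel is correct for any n regardless of launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; tuned kernels typically use
    // explicit cudaMalloc/cudaMemcpy instead.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int block = 256;
    const int grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.000000
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Optimization series like this one typically begin from such a baseline and then measure achieved bandwidth, vectorize loads (e.g. with float4), and tune launch parameters from there.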