- China University of Petroleum-Beijing
- Beijing, China
- https://www.ssslab.cn/people.html
- https://orcid.org/0009-0000-5551-7603
Stars
FlagGems is an operator library for large language models implemented in the Triton language (a minimal Triton kernel sketch follows this list).
Memory-efficient multi-layer perceptron implementation in OpenAI Triton.
Helpful tools and examples for working with flex-attention
[NeurIPS'24 Spotlight] To speed up long-context LLMs' inference, attention is computed approximately with dynamic sparsity, which reduces inference latency by up to 10x for pre-filling on an A100 whil…
Samples for CUDA developers that demonstrate features in the CUDA Toolkit.
How to optimize some algorithms in CUDA.
Real Transformer TeraFLOPS on various GPUs.
Development repository for the Triton language and compiler
A local chatbot fine-tuned on bilibili user comments.
AndSonder / CUDATutorial
Forked from PaddleJitLab/CUDATutorial. A self-learning tutorial for CUDA high-performance programming.
Study notes on high-performance computing, including notes and code demos for the related topics; still being improved. If it helps, please give it a Star; that means a lot to the author, thanks!
An easy-to-understand TensorOp Matmul tutorial.
Practice on cifar100 (ResNet, DenseNet, VGG, GoogleNet, InceptionV3, InceptionV4, Inception-ResNetv2, Xception, Resnet In Resnet, ResNext, ShuffleNet, ShuffleNetv2, MobileNet, MobileNetv2, SqueezeNet…
My solutions to the Glomers Challenge: a series of distributed systems challenges.
Materials for the Learn PyTorch for Deep Learning: Zero to Mastery course.
A high-throughput and memory-efficient inference and serving engine for LLMs
Abstraction your words: never mind the scandal and libel.
Fast and memory-efficient exact attention
Fast Synchronization-Free Algorithms for Parallel Sparse Triangular Solves with Multiple Right-Hand Sides (SpTRSM)
AISystem mainly refers to AI systems, covering full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks.
Transformer-related optimization, including BERT and GPT.
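The Triton-related entries above (FlagGems, the Triton compiler, the Triton MLP) all build on the same block-programming model. Below is a minimal, illustrative sketch of that model, assuming a CUDA-capable GPU with torch and triton installed; it is not code from any of the starred repositories, and the kernel and function names are made up for this example.

```python
# Minimal Triton elementwise-add sketch (illustrative only).
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one contiguous BLOCK_SIZE-element tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard out-of-bounds lanes in the last tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = (triton.cdiv(n_elements, 1024),)  # one program per 1024-element tile
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Calling add(torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda")) launches the kernel on a 1D grid sized with triton.cdiv; the mask handles the tail tile when the tensor length is not a multiple of the block size.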