- China University of Petroleum-Beijing
- Beijing, China
- https://www.ssslab.cn/people.html
- https://orcid.org/0009-0000-5551-7603
Stars
FlagGems is an operator library for large language models implemented in the Triton language (a minimal Triton kernel sketch follows this list).
Memory-efficient multi-layer perceptron implementation in OpenAI Triton.
Helpful tools and examples for working with flex-attention
[NeurIPS'24 Spotlight] To speed up long-context LLMs' inference, attention is computed approximately with dynamic sparsity, which reduces inference latency by up to 10x for pre-filling on an A100 whil…
Samples for CUDA developers that demonstrate features in the CUDA Toolkit.
How to optimize some algorithms in CUDA.
Real Transformer TeraFLOPS on various GPUs.
Development repository for the Triton language and compiler
A local chatbot fine-tuned on bilibili user comments.
AndSonder / CUDATutorial
Forked from PaddleJitLab/CUDATutorial. A self-learning tutorial for CUDA high-performance programming.
Study notes on high-performance computing, including notes and code demos for the related topics; still being improved. If it helps, please give it a Star; that means a lot to the author, thanks!
An easy-to-understand TensorOp Matmul tutorial.
Practice on cifar100 (ResNet, DenseNet, VGG, GoogleNet, InceptionV3, InceptionV4, Inception-ResNetv2, Xception, Resnet In Resnet, ResNext, ShuffleNet, ShuffleNetv2, MobileNet, MobileNetv2, SqueezeNet…
My solutions to the Glomers Challenge: a series of distributed systems challenges.
Materials for the Learn PyTorch for Deep Learning: Zero to Mastery course.
A high-throughput and memory-efficient inference and serving engine for LLMs
Abstraction your words: never mind the scandal and libel.
Fast and memory-efficient exact attention
Fast Synchronization-Free Algorithms for Parallel Sparse Triangular Solves with Multiple Right-Hand Sides (SpTRSM)
AISystem mainly refers to AI systems, covering full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks.
Transformer-related optimization, including BERT and GPT.
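The Triton-related entries above (FlagGems, the Triton compiler, the Triton MLP) all build on the same block-programming model. Below is a minimal, illustrative sketch of that model, assuming a CUDA-capable GPU with torch and triton installed; it is not code from any of the starred repositories, and the kernel and function names are made up for this example.

```python
# Minimal Triton elementwise-add sketch (illustrative only).
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one contiguous BLOCK_SIZE-element tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard out-of-bounds lanes in the last tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = (triton.cdiv(n_elements, 1024),)  # one program per 1024-element tile
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Calling add(torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda")) launches the kernel on a 1D grid sized with triton.cdiv; the mask handles the tail tile when the tensor length is not a multiple of the block size.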