Stars
Train transformer language models with reinforcement learning.
An unofficial CUDA assembler for all generations of SASS, hopefully :)
A framework for few-shot evaluation of language models.
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
LeaderWorkerSet: An API for deploying a group of pods as a unit of replication
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
✔ (Completed) The most comprehensive deep learning notes [Tudui PyTorch] [Mu Li, Dive into Deep Learning] [Andrew Ng, Deep Learning]
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
My learning notes and code for ML systems.
A throughput-oriented high-performance serving framework for LLMs
Clean, minimal, accessible reproduction of DeepSeek R1-Zero
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
A highly optimized LLM inference acceleration engine for Llama and its variants.
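This project aims to share the technical principles behind large models and hands-on experience with them (large-model engineering and deploying large-model applications in production)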
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling
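A collection of industry practice articles on search, recommendation, advertising, user growth, and more (sources: Zhihu, Datafuntalk, technical WeChat official accounts)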
DLRover: An Automatic Distributed Deep Learning System
Efficient Triton Kernels for LLM Training
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…