Train transformer language models with reinforcement learning.

Python 12,387 1,670 Updated Mar 7, 2025

CUTLASS and CuTe Examples

Cuda 40 4 Updated Jan 4, 2025

An unofficial cuda assembler, for all generations of SASS, hopefully :)

Python 460 79 Updated Apr 20, 2023

A framework for few-shot evaluation of language models.

Python 8,193 2,182 Updated Mar 10, 2025

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ 656 54 Updated Jan 21, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 7,112 615 Updated Mar 11, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 4,882 479 Updated Mar 10, 2025

LeaderWorkerSet: An API for deploying a group of pods as a unit of replication

Go 317 52 Updated Mar 10, 2025

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

6,719 198 Updated Mar 4, 2025

✔ (Completed) The most comprehensive deep learning notes [Tudui PyTorch] [Mu Li, Dive into Deep Learning] [Andrew Ng, Deep Learning]

Jupyter Notebook 7,954 998 Updated Dec 26, 2024

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python 12,585 837 Updated Mar 7, 2025
Jupyter Notebook 88 10 Updated Mar 10, 2025

My learning notes/codes for ML SYS.

Python 1,357 69 Updated Mar 10, 2025

A throughput-oriented high-performance serving framework for LLMs

Cuda 750 29 Updated Sep 21, 2024

Clean, minimal, accessible reproduction of DeepSeek R1-Zero

Python 11,081 1,411 Updated Mar 10, 2025

Optimize GEMM with tensorcore step by step

23 5 Updated Dec 17, 2023

Samples for CUDA developers that demonstrate features of the CUDA Toolkit

C 7,077 1,955 Updated Mar 10, 2025
Python 50 3 Updated Dec 27, 2024

A highly optimized LLM inference acceleration engine for Llama and its variants.

C++ 872 102 Updated Mar 10, 2025

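This project aims to share the technical principles and hands-on experience behind large models (large-model engineering and real-world deployment of large-model applications)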

HTML 15,128 1,743 Updated Mar 2, 2025

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Python 540 40 Updated Feb 14, 2025

HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling

Python 287 38 Updated Oct 4, 2024

Material for gpu-mode lectures

Jupyter Notebook 3,952 399 Updated Feb 9, 2025

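A collection of industry-practice articles on search, recommendation, advertising, user growth, and more (sources: Zhihu, Datafuntalk, technical WeChat official accounts)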

Python 3,090 370 Updated Mar 11, 2025

DLRover: An Automatic Distributed Deep Learning System

Python 1,361 171 Updated Mar 11, 2025

A PyTorch Native LLM Training Framework

Python 748 40 Updated Dec 27, 2024

Efficient Triton Kernels for LLM Training

Python 4,601 278 Updated Mar 8, 2025

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 940 150 Updated Jul 29, 2023