retonym

Mao Yunfei retonym

7 followers · 65 following

intel

Achievements

Lists (1)

intel

Stars

huggingface / trl

Train transformer language models with reinforcement learning.

Python 12,387 1,670 Updated Mar 7, 2025

leimao / CUTLASS-Examples

CUTLASS and CuTe Examples

Cuda 40 4 Updated Jan 4, 2025

cloudcores / CuAssembler

An unofficial CUDA assembler, for all generations of SASS, hopefully :)

Python 460 79 Updated Apr 20, 2023

EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Python 8,193 2,182 Updated Mar 10, 2025

alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ 656 54 Updated Jan 21, 2025

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda 7,112 615 Updated Mar 11, 2025

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 4,882 479 Updated Mar 10, 2025

kubernetes-sigs / lws

LeaderWorkerSet: An API for deploying a group of pods as a unit of replication

Go 317 52 Updated Mar 10, 2025

deepseek-ai / open-infra-index

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

6,719 198 Updated Mar 4, 2025

AccumulateMore / CV

✔ (Completed) The most comprehensive deep learning notes [Tudui PyTorch] [Mu Li, Dive into Deep Learning] [Andrew Ng, Deep Learning]

Jupyter Notebook 7,954 998 Updated Dec 26, 2024

kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python 12,585 837 Updated Mar 7, 2025

spcl / muliticast-based-allgather

C 12 3 Updated Feb 12, 2025

mdy666 / mdy_triton

Jupyter Notebook 88 10 Updated Mar 10, 2025

zhaochenyang20 / Awesome-ML-SYS-Tutorial

My learning notes/codes for ML SYS.

Python 1,357 69 Updated Mar 10, 2025

efeslab / Nanoflow

A throughput-oriented high-performance serving framework for LLMs

Cuda 750 29 Updated Sep 21, 2024

Jiayi-Pan / TinyZero

Clean, minimal, accessible reproduction of DeepSeek R1-Zero

Python 11,081 1,411 Updated Mar 10, 2025

gty111 / GEMM_MMA

Optimize GEMM with tensorcore step by step

23 5 Updated Dec 17, 2023

NVIDIA / cuda-samples

Samples for CUDA developers that demonstrate features in the CUDA Toolkit

C 7,077 1,955 Updated Mar 10, 2025

yifuwang / symm-mem-recipes

Python 50 3 Updated Dec 27, 2024

hkproj / triton-flash-attention

Python 131 11 Updated Jan 2, 2025

zhihu / ZhiLight

A highly optimized LLM inference acceleration engine for Llama and its variants.

C++ 872 102 Updated Mar 10, 2025

liguodongiot / llm-action

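This project shares the technical principles behind large models along with hands-on experience (large-model engineering and deploying large-model applications)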

HTML 15,128 1,743 Updated Mar 2, 2025

microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Python 540 40 Updated Feb 14, 2025

bytedance / HLLM

HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling

Python 287 38 Updated Oct 4, 2024

gpu-mode / lectures

Material for gpu-mode lectures

Jupyter Notebook 3,952 399 Updated Feb 9, 2025

Doragd / Algorithm-Practice-in-Industry

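A collection of industry practice articles on search, recommendation, advertising, user growth, and more (sources: Zhihu, Datafuntalk, technical WeChat official accounts)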

Python 3,090 370 Updated Mar 11, 2025

intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Python 1,361 171 Updated Mar 11, 2025

volcengine / veScale

A PyTorch Native LLM Training Framework

Python 748 40 Updated Dec 27, 2024

linkedin / Liger-Kernel

Efficient Triton Kernels for LLM Training

Python 4,601 278 Updated Mar 8, 2025

Liu-xiandong / How_to_optimize_in_GPU

This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 940 150 Updated Jul 29, 2023