TAN Xin Txxx926

PhD Candidate@CUHK CSE

15 followers · 15 following

The Chinese University of Hong Kong
Hong Kong SAR
https://txxx926.github.io/

Stars

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda · 7,033 stars · 600 forks · Updated Mar 6, 2025

Open-Reasoner-Zero / Open-Reasoner-Zero

Official Repo for Open-Reasoner-Zero

Python · 1,518 stars · 68 forks · Updated Mar 5, 2025

stepfun-ai / Step-Video-T2V

Python · 2,582 stars · 219 forks · Updated Feb 27, 2025

eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)

Python · 2,425 stars · 202 forks · Updated Aug 11, 2024

huggingface / picotron

Minimalistic 4D-parallelism distributed training framework for educational purposes

Python · 909 stars · 65 forks · Updated Mar 3, 2025

llvm / torch-mlir

The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.

C++ · 1,453 stars · 537 forks · Updated Mar 6, 2025

asplos-contest / 2025

The ASPLOS 2025 / EuroSys 2025 Contest Track

27 stars · 2 forks · Updated Mar 4, 2025

fla-org / flash-linear-attention

🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton

Python · 2,060 stars · 126 forks · Updated Mar 5, 2025

shawntan / scattermoe

Triton-based implementation of Sparse Mixture of Experts.

Python · 203 stars · 17 forks · Updated Nov 28, 2024

natolambert / rlhf-book

Textbook on reinforcement learning from human feedback

TeX · 470 stars · 34 forks · Updated Mar 6, 2025

mattjj / autodidact

A pedagogical implementation of Autograd

Jupyter Notebook · 971 stars · 100 forks · Updated May 26, 2020

Significant-Gravitas / AutoGPT

AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.

Python · 172,359 stars · 45,214 forks · Updated Mar 7, 2025

DefTruth / CUDA-Learn-Notes

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

Cuda · 2,733 stars · 283 forks · Updated Mar 4, 2025

RRZE-HPC / gpu-benches

Collection of benchmarks to measure basic GPU capabilities

C++ · 304 stars · 45 forks · Updated Feb 11, 2025

stanford-futuredata / stk

Python · 100 stars · 18 forks · Updated Aug 26, 2024

BobMcDear / attorch

A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.

Python · 517 stars · 26 forks · Updated Feb 19, 2025

linkedin / Liger-Kernel

Efficient Triton Kernels for LLM Training

Python · 4,565 stars · 276 forks · Updated Mar 7, 2025

NVIDIA / multi-gpu-programming-models

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Cuda · 631 stars · 119 forks · Updated Feb 21, 2025

awslabs / raf

C++ · 141 stars · 21 forks · Updated Jan 30, 2025

awslabs / Lancet-Accelerating-MoE-Training-via-Whole-Graph-Computation-Communication-Overlapping

Official implementation for the paper Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, published in MLSys'24.

C++ · 9 stars · 5 forks · Updated Sep 25, 2024

siliconflow / onediff

OneDiff: An out-of-the-box acceleration library for diffusion models.

Jupyter Notebook · 1,832 stars · 123 forks · Updated Jan 13, 2025

xdit-project / xDiT

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

Python · 1,423 stars · 127 forks · Updated Mar 3, 2025

comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.

Python · 69,944 stars · 7,532 forks · Updated Mar 6, 2025

flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving

Cuda · 2,298 stars · 240 forks · Updated Mar 6, 2025

EachSheep / ShortcutsBench

ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents

Python · 83 stars · 2 forks · Updated Feb 18, 2025

aigc-apps / PAI-RAG

An easy-to-use framework for modular RAG

Python · 331 stars · 49 forks · Updated Mar 7, 2025

KnowingNothing / MatmulTutorial

An Easy-to-understand TensorOp Matmul Tutorial

C++ · 324 stars · 36 forks · Updated Sep 21, 2024

feifeibear / Odysseus-Transformer

Odysseus: Playground of LLM Sequence Parallelism

Python · 65 stars · 3 forks · Updated Jun 17, 2024

FlagOpen / FlagGems

FlagGems is an operator library for large language models implemented in Triton Language.

Python · 441 stars · 72 forks · Updated Mar 7, 2025

zhuzilin / ring-flash-attention

Ring attention implementation with flash attention

Python · 702 stars · 60 forks · Updated Feb 24, 2025