DeepEP: an efficient expert-parallel communication library

Cuda 7,033 600 Updated Mar 6, 2025

Official Repo for Open-Reasoner-Zero

Python 1,518 68 Updated Mar 5, 2025

Reference implementation for DPO (Direct Preference Optimization)

Python 2,425 202 Updated Aug 11, 2024

Minimalistic 4D-parallelism distributed training framework for educational purposes

Python 909 65 Updated Mar 3, 2025

The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.

C++ 1,453 537 Updated Mar 6, 2025

The ASPLOS 2025 / EuroSys 2025 Contest Track

27 2 Updated Mar 4, 2025

🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton

Python 2,060 126 Updated Mar 5, 2025

Triton-based implementation of Sparse Mixture of Experts.

Python 203 17 Updated Nov 28, 2024

Textbook on reinforcement learning from human feedback

TeX 470 34 Updated Mar 6, 2025

A pedagogical implementation of Autograd

Jupyter Notebook 971 100 Updated May 26, 2020

AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.

Python 172,359 45,214 Updated Mar 7, 2025

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

Cuda 2,733 283 Updated Mar 4, 2025

Collection of benchmarks to measure basic GPU capabilities

C++ 304 45 Updated Feb 11, 2025

Python 100 18 Updated Aug 26, 2024

A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.

Python 517 26 Updated Feb 19, 2025

Efficient Triton Kernels for LLM Training

Python 4,565 276 Updated Mar 7, 2025

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Cuda 631 119 Updated Feb 21, 2025

C++ 141 21 Updated Jan 30, 2025

Official implementation for the paper Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, published in MLSys'24.

C++ 9 5 Updated Sep 25, 2024

OneDiff: An out-of-the-box acceleration library for diffusion models.

Jupyter Notebook 1,832 123 Updated Jan 13, 2025

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

Python 1,423 127 Updated Mar 3, 2025

The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.

Python 69,944 7,532 Updated Mar 6, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda 2,298 240 Updated Mar 6, 2025

ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents

Python 83 2 Updated Feb 18, 2025

An easy-to-use framework for modular RAG

Python 331 49 Updated Mar 7, 2025

An Easy-to-understand TensorOp Matmul Tutorial

C++ 324 36 Updated Sep 21, 2024

Odysseus: Playground of LLM Sequence Parallelism

Python 65 3 Updated Jun 17, 2024

FlagGems is an operator library for large language models implemented in Triton Language.

Python 441 72 Updated Mar 7, 2025

Ring attention implementation with flash attention

Python 702 60 Updated Feb 24, 2025