
Minimalistic 4D-parallelism distributed training framework for educational purposes

Python 274 18 Updated Dec 20, 2024

The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.

C++ 1,394 516 Updated Jan 2, 2025

The ASPLOS 2025 / EuroSys 2025 Contest Track

22 2 Updated Jan 1, 2025

Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton

Python 1,526 75 Updated Jan 1, 2025

Triton-based implementation of Sparse Mixture of Experts.

Python 192 15 Updated Nov 28, 2024

Textbook on reinforcement learning from human feedback

TeX 88 10 Updated Dec 19, 2024

A pedagogical implementation of Autograd

Jupyter Notebook 959 101 Updated May 26, 2020

AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.

Python 169,999 44,717 Updated Jan 1, 2025

📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

Cuda 1,837 191 Updated Jan 2, 2025

Collection of benchmarks to measure basic GPU capabilities

Jupyter Notebook 272 41 Updated Jun 21, 2024

Python 95 20 Updated Aug 26, 2024

A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.

Python 501 26 Updated Oct 25, 2024

Efficient Triton Kernels for LLM Training

Python 4,080 233 Updated Jan 2, 2025

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

Cuda 583 113 Updated Oct 30, 2024

C++ 141 21 Updated Nov 25, 2024

Official implementation for the paper Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping, published in MLSys'24.

C++ 6 3 Updated Sep 25, 2024

OneDiff: An out-of-the-box acceleration library for diffusion models.

Jupyter Notebook 1,754 110 Updated Dec 30, 2024

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism

Python 1,112 103 Updated Jan 1, 2025

The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.

Python 61,993 6,626 Updated Jan 2, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda 1,665 164 Updated Jan 1, 2025

ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents

Python 78 Updated Nov 29, 2024

An easy-to-use framework for modular RAG

Python 304 45 Updated Jan 2, 2025

An easy-to-understand TensorOp Matmul tutorial

C++ 303 32 Updated Sep 21, 2024

Odysseus: Playground of LLM Sequence Parallelism

Python 63 3 Updated Jun 17, 2024

FlagGems is an operator library for large language models implemented in Triton Language.

Python 384 55 Updated Jan 2, 2025

Ring attention implementation with flash attention

Python 624 52 Updated Dec 19, 2024

A collection of memory efficient attention operators implemented in the Triton language.

Python 228 16 Updated Jun 5, 2024

Collection of kernels written in Triton language

87 3 Updated Oct 28, 2024

LLM training in simple, raw C/CUDA

Cuda 24,904 2,826 Updated Oct 2, 2024

Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface.

Python 361 44 Updated Sep 11, 2024