Stars
Accelerating Diffusion Transformers with Token-wise Feature Caching
A Python library for transferring PyTorch tensors between CPU and NVMe
A very simple and barebones tensor decomposition library for CP decomposition a.k.a. PARAFAC a.k.a. TCA
Stateful load balancer custom-tailored for llama.cpp 🏓🦙
Making large AI models cheaper, faster and more accessible
Open-Sora: Democratizing Efficient Video Production for All
jax-triton contains integrations between JAX and OpenAI Triton
Zero-copy MPI communication of JAX arrays, for turbo-charged HPC applications in Python ⚡
RUDOLPH: One Hyper-Tasking Transformer that can be as creative as DALL-E and GPT-3 and as smart as CLIP
Efficient LLM Inference over Long Sequences
PyTorch bindings for CUTLASS grouped GEMM for MoE.
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm. PyTorch bindings for CUTLASS grouped GEMM.
A tool for bandwidth measurements on NVIDIA GPUs.
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more (a short sketch follows this list)
Library for reading and processing ML training data.
A language for fast, portable data-parallel computation
Best practices & guides on how to write distributed PyTorch training code
A native PyTorch library for large model training
A PyTorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models
PyTorch library for cost-effective, fast and easy serving of MoE models.
Development repository for the Triton language and compiler
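A minimal sketch (not taken from any of the repositories above) of the composable transformations named in the JAX entry: `jax.grad` differentiates a plain Python+NumPy-style function, `jax.vmap` vectorizes it over a batch, and `jax.jit` compiles the composition for GPU/TPU. The loss function and variable names are illustrative, not from any starred project.

```python
import jax
import jax.numpy as jnp

# A plain Python + NumPy-style scalar loss (illustrative).
def loss(w, x, y):
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

# Differentiate: gradient of the loss with respect to w.
grad_loss = jax.grad(loss)

# Vectorize: map the per-example gradient over a batch of (x, y) pairs.
per_example_grads = jax.vmap(grad_loss, in_axes=(None, 0, 0))

# JIT-compile the composed transformation for CPU/GPU/TPU.
fast_grads = jax.jit(per_example_grads)

w = jnp.ones(3)
xs = jnp.arange(12.0).reshape(4, 3)   # batch of 4 examples
ys = jnp.array([1.0, 2.0, 3.0, 4.0])
print(fast_grads(w, xs, ys).shape)    # (4, 3): one gradient per example
```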