Stars
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm. PyTorch bindings for CUTLASS grouped GEMM.
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, and distillation. It compresses deep learning models for downstream deployment.
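A minimal sketch of what post-training quantization looks like with Model Optimizer's PyTorch API; `get_model` and `get_loader` are hypothetical placeholders, not part of the library:

```python
import modelopt.torch.quantization as mtq

# Placeholder model and calibration loader; swap in your own (hypothetical helpers).
model = get_model()
calib_loader = get_loader()

def forward_loop(m):
    # Run a few calibration batches through the model to collect statistics.
    for batch in calib_loader:
        m(batch)

# Quantize with the library's default INT8 config; the returned model can then
# be exported to a downstream deployment framework such as TensorRT.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```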
The Triton TensorRT-LLM Backend
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
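For orientation, a hedged sketch of the high-level `LLM` Python API in the quick-start style; the checkpoint and sampling settings are illustrative, and the exact surface varies by release:

```python
from tensorrt_llm import LLM, SamplingParams

# Engine building happens under the hood on first use of the model.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```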
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
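The usual Diffusers entry point is a pretrained pipeline; a minimal text-to-image sketch (the checkpoint id is one common example):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image pipeline in half precision on the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("an astronaut riding a horse").images[0]
image.save("astronaut.png")
```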
WebUI extension for ControlNet
AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Stable Diffusion web UI
Fast and memory-efficient exact attention
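flash-attn exposes the kernel through `flash_attn_func`; a sketch assuming fp16 CUDA tensors in the library's (batch, seqlen, heads, head_dim) layout:

```python
import torch
from flash_attn import flash_attn_func

# FlashAttention expects (batch, seqlen, nheads, headdim) fp16/bf16 CUDA tensors.
q = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Exact attention, computed without materializing the full seqlen x seqlen matrix.
out = flash_attn_func(q, k, v, causal=True)  # shape (2, 1024, 8, 64)
```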
GLIDE: a diffusion-based text-conditional image synthesis model
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Transformer-related optimization, including BERT and GPT
Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning Framework on a GPU (JMLR 2022)
A language for fast, portable data-parallel computation
mailyyin / server
Forked from triton-inference-server/server. The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Production First and Production Ready End-to-End Speech Recognition Toolkit
Simple samples for TensorRT programming
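A hedged sketch of the build flow such samples revolve around, in the TensorRT 8.x Python style ("model.onnx" is a placeholder):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse an ONNX model into the network definition ("model.onnx" is a placeholder).
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

# Build a serialized engine that can later be deserialized for inference.
config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)
```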
Implementation for PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation (CVPR 2020)
An Open Source Machine Learning Framework for Everyone
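A minimal Keras sketch, just to show the TensorFlow workflow (tiny MLP on MNIST, one epoch):

```python
import tensorflow as tf

# Load and normalize MNIST.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

# Tiny MLP classifier.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=1)
```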
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
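The quickest entry point is the `pipeline` API, which picks a default checkpoint per task; a minimal sketch:

```python
from transformers import pipeline

# The pipeline API downloads a default model for the requested task.
classifier = pipeline("sentiment-analysis")
print(classifier("Starred repositories make great reading lists."))
# [{'label': 'POSITIVE', 'score': ...}]
```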
Source code examples from the Parallel Forall Blog
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
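A hedged client-side sketch using the `tritonclient` HTTP API; the server address, model name, and tensor names are placeholders for whatever your deployment serves:

```python
import numpy as np
import tritonclient.http as httpclient

# Assumes a Triton server on localhost:8000 serving a model named "my_model"
# with one FP32 input "INPUT0" and one output "OUTPUT0" (all placeholders).
client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 16).astype(np.float32)
inp = httpclient.InferInput("INPUT0", data.shape, "FP32")
inp.set_data_from_numpy(data)

result = client.infer("my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```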
OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
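A minimal sketch following the Baselines DQN example; the environment id and hyperparameters are illustrative, and it assumes the TF1-era package with the classic `gym` API:

```python
import gym
from baselines import deepq

# Train DQN on CartPole; network choice and timestep budget are illustrative.
env = gym.make("CartPole-v0")
act = deepq.learn(env, network="mlp", total_timesteps=100000)
act.save("cartpole_model.pkl")
```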
A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT-2, decoders, etc.) on CPU and GPU.