Stars
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…
High-speed Large Language Model Serving for Local Deployment
Transformer-related optimization, including BERT and GPT
A language for fast, portable data-parallel computation
Lightning-fast C++/CUDA neural network framework
A machine learning compiler for GPUs, CPUs, and ML accelerators
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
The Tensor Algebra Compiler (taco) computes sparse tensor expressions on CPUs and GPUs
GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as…
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
Antares: an automatic engine for multi-platform kernel generation and optimization. Supporting CPU, CUDA, ROCm, DirectX12, GraphCore, SYCL for CPU/GPU, OpenCL for AMD/NVIDIA, Android CPU/GPU backends.
Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.
An Easy-to-understand TensorOp Matmul Tutorial
A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
MSCCL++: A GPU-driven communication stack for scalable AI applications
A fast communication-overlapping library for tensor parallelism on GPUs.
A library of GPU kernels for sparse matrix operations.
A GPU-driven system framework for scalable AI applications
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning