Stars
nnScaler: Compiling DNN models for Parallel Training
Curated collection of papers in machine learning systems
A library to analyze PyTorch traces.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
This project covers convolution operator optimization on GPUs, including GEMM-based (implicit GEMM) convolution.
This repository contains datasets and baselines for benchmarking Chinese text recognition.
Build and Train a GPT-2 from scratch using PyTorch
A GPU accelerated error-bounded lossy compression for scientific data.
Code for the paper "Language Models are Unsupervised Multitask Learners"
A PyTorch implementation of a Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models
🔥 🔥 🔥 Self-hosted Docker image acceleration service based on the official Docker Registry. One-click deployment of image acceleration/management services for Docker, K8s, Quay, Ghcr, Mcr, Nvcr, and more. Supports serverless deployment to Render/Koyeb.
Tile primitives for speedy kernels
Awesome LLM compression research papers and tools.
Create user-notifications on macOS with swiftDialog
CUDA Matrix Multiplication Optimization
[TMLR 2024] Efficient Large Language Models: A Survey
How to optimize some algorithms in CUDA.
Google Drive Public File Downloader when Curl/Wget Fails
A library for efficient similarity search and clustering of dense vectors.
Sample codes for my CUDA programming book
This is a series of GPU optimization topics, introducing in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, s…
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
Source code examples from the Parallel Forall Blog