  • University of Toronto
  • Toronto, CA

30 stars written in C++

MLX: An array framework for Apple silicon

C++ 19,237 1,096 Updated Feb 23, 2025

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ 9,494 1,112 Updated Feb 21, 2025

High-speed Large Language Model Serving for Local Deployment

C++ 8,114 424 Updated Feb 19, 2025

CUDA Templates for Linear Algebra Subroutines

C++ 6,429 1,087 Updated Feb 21, 2025

Transformer related optimization, including BERT, GPT

C++ 6,037 899 Updated Mar 27, 2024

a language for fast, portable data-parallel computation

C++ 5,972 1,078 Updated Feb 20, 2025

Lightning fast C++/CUDA neural network framework

C++ 3,887 479 Updated Jan 27, 2025

A machine learning compiler for GPUs, CPUs, and ML accelerators

C++ 2,972 509 Updated Feb 23, 2025

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 2,615 161 Updated Feb 23, 2025

The Tensor Algebra Compiler (taco) computes sparse tensor expressions on CPUs and GPUs

C++ 1,279 191 Updated Apr 14, 2024

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as…

C++ 1,228 535 Updated Feb 15, 2025

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

C++ 959 149 Updated Feb 18, 2025

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure

C++ 816 337 Updated Feb 20, 2025

Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA

C++ 747 44 Updated Feb 21, 2025

IPADS Lab newcomer training, lecture 2: CMake (2021.11.3)

C++ 623 83 Updated Feb 16, 2025

Antares: an automatic engine for multi-platform kernel generation and optimization. Supporting CPU, CUDA, ROCm, DirectX12, GraphCore, SYCL for CPU/GPU, OpenCL for AMD/NVIDIA, Android CPU/GPU backends.

C++ 456 48 Updated Feb 19, 2025

Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.

C++ 364 107 Updated Feb 10, 2025

Microsoft Collective Communication Library

C++ 336 31 Updated Sep 20, 2023

An Easy-to-understand TensorOp Matmul Tutorial

C++ 317 35 Updated Sep 21, 2024

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

C++ 307 54 Updated Feb 23, 2025

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ 298 45 Updated Feb 22, 2025

A fast communication-overlapping library for tensor parallelism on GPUs.

C++ 298 25 Updated Oct 30, 2024

A library of GPU kernels for sparse matrix operations.

C++ 258 52 Updated Nov 24, 2020

A GPU-driven system framework for scalable AI applications

C++ 112 17 Updated Feb 5, 2025

A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer

C++ 88 22 Updated Feb 22, 2025

Thunder Research Group's Collective Communication Library

C++ 33 3 Updated Apr 25, 2024

Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning

C++ 23 1 Updated Dec 12, 2024

An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only

C++ 11 2 Updated Jun 13, 2024

High-Performance Parallel Programming and Optimization - course slides

C++ 8 Updated Sep 24, 2023
C++ 4 1 Updated Jul 1, 2020