Stars
Archived implementation of BLAS using the SYCL open standard. See oneMath for a replacement.
GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as…
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstra…
An easy-to-understand TensorOp Matmul tutorial
Tile primitives for speedy kernels
A minimal GPU design in Verilog to learn how GPUs work from the ground up
Resources on LLM applications built with the RAG pattern
A comprehensive guide to building RAG-based LLM applications for production.
A simple high performance CUDA GEMM implementation.
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
A tool for bandwidth measurements on NVIDIA GPUs.
Benchmark code for the "Online normalizer calculation for softmax" paper
A collection of benchmarks to measure basic GPU capabilities
C++ project template with unit-tests, documentation, ci-testing and workflows.
An extension library for the WMMA API (Tensor Core API)
A collection of out-of-tree Clang plugins for teaching and learning
Rich is a Python library for rich text and beautiful formatting in the terminal.
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
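Illustrated guides to computer networks, operating systems, computer organization, and databases: 1000 diagrams plus 500,000 characters of text that demystify obscure computer fundamentals, so that no standard interview question stays hard to understand! 🚀 Read online: https://xiaolincoding.com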