Starred repositories
Learning how to write "Less Slow" code in C++20, C99, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking and user-space IO
An open-source invisible desktop application to help you pass your technical interviews.
Meta Lingua: a lean, efficient, and easy-to-hack codebase to research LLMs.
SPIRV-Cross is a practical tool and library for performing reflection on SPIR-V and disassembling SPIR-V back to high level languages.
Any model. Any hardware. Zero compromise. Built with @ziglang / @openxla / MLIR / @bazelbuild
A high-performance, zero-overhead, extensible Python compiler with built-in NumPy support
Everything we actually know about the Apple Neural Engine (ANE)
Efficient Triton Kernels for LLM Training
regrettable-username / llm.metal
Forked from karpathy/llm.c
LLM training in simple, raw C/Metal Shading Language
🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton
Hackable and optimized Transformers building blocks, supporting a composable construction.
sm64-port / sm64-port
Forked from n64decomp/sm64
A port of https://www.github.com/n64decomp/sm64 for modern devices.
ONNX Serving is a project written in C++ to serve onnx-mlir compiled models with gRPC and other protocols. Benefiting from its C++ implementation, ONNX Serving has very low latency overhead and high t…
A Super Mario 64 decompilation, brought to you by a bunch of clever folks.
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
Official implementation of Half-Quadratic Quantization (HQQ)
A CocoaPods plugin to add SPM dependencies to CocoaPods-based projects
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.