Starred repositories
A playbook for systematically maximizing the performance of deep learning models.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
A parser, editor and profiler tool for ONNX models.
Accessible large language models via k-bit quantization for PyTorch.
QLoRA: Efficient Finetuning of Quantized LLMs
Transformer related optimization, including BERT, GPT
Universal LLM Deployment Engine with ML Compilation
《Machine Learning Systems: Design and Implementation》- Chinese Version
AISystem mainly refers to AI systems, covering full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks
A technical report on convolution arithmetic in the context of deep learning
Production First and Production Ready End-to-End Speech Recognition Toolkit
A tutorial for getting started with the Deep Learning Accelerator (DLA) on NVIDIA Jetson
GAN-based Mel-Spectrogram Inversion Network for Text-to-Speech Synthesis
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc.
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
Code for our paper at ECCV 2020: Post-Training Piecewise Linear Quantization for Deep Neural Networks
CPU INFOrmation library (x86/x86-64/ARM/ARM64, Linux/Windows/Android/macOS/iOS)
High-efficiency floating-point neural network inference operators for mobile, server, and Web
Open standard for machine learning interoperability
Tensors and Dynamic neural networks in Python with strong GPU acceleration
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
Application Binary Interface for the Arm® Architecture
A cross-platform C99 library to get CPU features at runtime.
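Several of the repositories above (neural-compressor, bitsandbytes, the ECCV 2020 piecewise linear quantization paper) center on low-bit quantization. As a generic illustration of the core idea — not code from any of these projects — here is a minimal sketch of symmetric per-tensor INT8 quantization, where a single scale maps floats into the signed 8-bit range and back:

```python
def quantize_int8(values):
    """Map floats to int8 codes using a single symmetric scale.

    The scale is chosen so the largest-magnitude value lands at +/-127;
    each float is then rounded to the nearest representable code.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0
    codes = [max(-128, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.75]
codes, scale = quantize_int8(weights)
recovered = dequantize_int8(codes, scale)
# Round-trip error per element is bounded by half the scale step.
```

Production toolkits refine this basic scheme with per-channel scales, asymmetric zero-points, and calibration over real activation statistics, but the quantize/dequantize round trip above is the common core.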