Stars
VPTQ: a flexible, extremely low-bit quantization algorithm
[NeurIPS'24 Spotlight] To speed up long-context LLMs' inference, attention is computed approximately with dynamic sparsity, which reduces inference latency by up to 10x for pre-filling on an A100 whil…
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
Fast Hadamard transform in CUDA, with a PyTorch interface
Code for the NeurIPS'24 paper "QuaRot": end-to-end 4-bit inference of large language models.
Official PyTorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.pdf and PV-Tuning: Beyond Straight-Through Estimation for Ext…
Official implementation of Half-Quadratic Quantization (HQQ)
The Truth Is In There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
Official inference library for Mistral models
Code for the paper "Towards the Law of Capacity Gap in Distilling Language Models"
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
A curated list for Efficient Large Language Models
[ICLR'24 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Fast and memory-efficient exact attention
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Implementation of "Attention Is Off By One" by Evan Miller
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
Flexible simulator for mixed precision and format simulation of LLMs and vision transformers.
Awesome LLM compression research papers and tools.
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
4-bit quantization of LLaMA using GPTQ (see the sketch below for the basic quantize/dequantize step such methods build on)
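
Several of the repositories above (GPTQ, OmniQuant, HQQ, EfficientQAT) center on low-bit weight quantization. As a point of reference, here is a minimal sketch of plain per-channel 4-bit round-to-nearest (RTN) quantization in PyTorch. This is not GPTQ itself (GPTQ additionally compensates rounding error with second-order information) and the function names are illustrative only; it just shows the quantize/dequantize step that 4-bit weight-only methods refine.

```python
# Minimal sketch: per-channel 4-bit round-to-nearest (RTN) weight quantization.
# Illustrative only; not the GPTQ algorithm.
import torch

def quantize_rtn_4bit(w: torch.Tensor):
    """Quantize a 2-D weight matrix to 4-bit codes, one scale/zero-point per output row."""
    qmin, qmax = 0, 15                                  # 4-bit unsigned range
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / (qmax - qmin)
    zero = torch.round(-w_min / scale)                  # zero-point maps w_min to qmin
    q = torch.clamp(torch.round(w / scale) + zero, qmin, qmax)
    return q.to(torch.uint8), scale, zero

def dequantize(q: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor):
    """Reconstruct an approximate float weight matrix from the 4-bit codes."""
    return (q.float() - zero) * scale

if __name__ == "__main__":
    w = torch.randn(4096, 4096)
    q, scale, zero = quantize_rtn_4bit(w)
    w_hat = dequantize(q, scale, zero)
    print("mean abs reconstruction error:", (w - w_hat).abs().mean().item())
```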