Production-ready LLM compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
- 12/13/2024 1.4.1: Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey patch `patch_vllm()` and `patch_hf()` APIs added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are pending.
- 12/10/2024 1.4.0: `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel to be `DynamicCuda` kernel. `Triton` kernel is now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely for quantization. Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merge. Deprecated the saving of `Marlin` weight format since `Marlin` supports auto-conversion of `gptq` format to `Marlin` at runtime.
- 11/29/2024 1.3.1: Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformers compat fix due to API deprecation in HF. Removed triton dependency. Triton kernel is now optionally dependent on the triton package.
- 11/26/2024 1.3.0: Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependency.
- 11/24/2024 1.2.3: HF GLM model support. ClearML logging integration. Replaced `gputil` + `psutil` dependencies with `device-smi`. Fixed model unit tests.
- 11/11/2024 1.2.1: Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX CPU inference merged, replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New `.load()` and `.save()` API.
- 10/29/2024 1.1.0: IBM Granite model support. Full auto-buildless wheel install from PyPI. Reduced max CPU memory usage by >20% during quantization. 100% CI model/feature coverage.
Archived News:
- 10/12/2024 [1.0.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.9): Moved AutoRound to optional and fixed pip install regression in v1.0.8.
- 10/11/2024 1.0.8: Added wheel for Python 3.12 and CUDA 11.8.
- 10/08/2024 1.0.7: Fixed marlin (faster) kernel not being auto-selected for some models.
- 09/26/2024 1.0.6: Fixed the loader for quantized Llama 3.2 Vision models.
- 09/26/2024 1.0.5: Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.
- 09/26/2024 1.0.4: Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added control toggle to disable parallel packing.
- 09/18/2024 1.0.3: Added Microsoft GRIN-MoE and MiniCPM3 support.
- 08/16/2024 1.0.2: Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release.
- 08/14/2024 1.0.0: 40% faster `packing`, fixed Python 3.9 compat, added `lm_eval` API.
- 08/10/2024 0.9.11: Added LG EXAONE 3.0 model support. New `dynamic` per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values.
- 07/31/2024 0.9.10: Ported vllm/nm `gptq_marlin` inference kernel with expanded bits (8bits), group_size (64,32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto calculate auto-round nsamples/seglen parameters based on calibration dataset. Fixed save_quantized() called on pre-quantized models with non-supported backends. HF transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.
- 07/25/2024 0.9.9: Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.
- 07/13/2024 0.9.8: Run quantized models directly with GPTQModel using the fast `vLLM` or `SGLang` backend! Both vLLM and SGLang are optimized for dynamic batching inference for maximum `TPS` (check usage under examples). Marlin backend also got full end-to-end in/out features padding to enhance current/future model compatibility.
- 07/08/2024 0.9.7: InternLM 2.5 model support added.
- 07/08/2024 0.9.6: Intel/AutoRound QUANT_METHOD support added for a potentially higher quality quantization, with `lm_head` module quantization support for even more VRAM reduction. Format export to `FORMAT.GPTQ` for max inference compatibility.
- 07/05/2024 0.9.5: Cuda kernels have been fully deprecated in favor of Exllama(v1/v2)/Marlin/Triton.
- 07/03/2024 0.9.4: HF Transformers integration added and Gemma 2 support bug fixed.
- 07/02/2024 0.9.3: Added Gemma 2 support, faster PPL calculations on GPU, and more code/arg refactoring.
- 06/30/2024 0.9.2: Added auto-padding of model in/out-features for exllama and exllama v2. Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.
- 06/29/2024 0.9.1: 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), new BITBLAS format/kernel, proper batching of calibration dataset resulting in > 50% quantization speedup, security hash check of loaded model weights, tons of refactoring/usability improvements, bug fixes, and much more.
- 06/20/2024 0.9.0: Thanks to the ModelCloud team and the open-source ML community for all their contributions!
GPTQModel started out as a major refactor (fork) of AutoGPTQ but has now morphed into a full stand-in replacement with a cleaner API, up-to-date model support, faster inference, faster quantization, higher quality quants, and a pledge that ModelCloud, together with the open-source ML community, will make every effort to keep the library up-to-date with the latest advancements and model support.
Public tests/papers and ModelCloud's internal tests have shown that GPTQ is on par with or exceeds other 4bit quantization methods in terms of both quality recovery and production-level inference speed, in both token latency and rps. GPTQ has the optimal blend of quality and inference speed you would want to use in a real-world production system.
- Extensive model support for: `Llama 1-3.3`, `Qwen2-VL`, `Olmo2`, `Hymba`, `GLM`, `IBM Granite`, `Llama 3.2 Vision`, `MiniCPM3`, `GRIN-MoE`, `Phi 1-4`, `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Qwen2MoE`, `DBRX`.
- 100% CI unit-test coverage for all supported models and kernels including post-quantization quality regression.
- `Dynamic`/Mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization altogether.
- vLLM and SGLang inference integration for quantized models where format = `FORMAT.GPTQ`.
- Intel/IPEX 4bit quantization/inference support on CPU (recent Intel/AMD) and Intel/XPU.
- Microsoft/BITBLAS format + dynamically compiled inference.
- Intel/AutoRound QUANT_METHOD support added for a potentially higher quality quantization.
- Asymmetric `Sym=False` support via `FORMAT.GPTQ_V2`.
- `lm_head` module quant inference support for further VRAM reduction (auto-round only).
- Faster quantization: More than 50% faster for TinyLlama + 4090 with batching and large calibration dataset.
- Better quality quants as measured by PPL. (Test config: defaults + `sym=True` + `FORMAT.GPTQ`, TinyLlama)
- Model weights sharding support.
- Security: hash check of model weights on load.
- Over 50% faster PPL calculations (OPT).
- Over 40% faster `packing` stage in quantization (Llama 3.1 8B).
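To illustrate the per-module `Dynamic` control listed above, here is a minimal sketch of how prefix-based rules could be resolved to a per-module config. It is not GPTQModel's implementation: the function `resolve_module_config`, the rule dictionary layout, the module names, and the "first matching rule wins" ordering are all hypothetical. Only the `+:` (positive, default) and `-:` (negative, skip) prefixes come from the release notes.

```python
import re
from typing import Optional

# Hypothetical base settings, mirroring QuantizeConfig(bits=4, group_size=128).
BASE = {"bits": 4, "group_size": 128}

# Hypothetical rules: regex pattern -> overrides. A "-:" prefix means "skip",
# a "+:" prefix (or no prefix at all) means "apply these overrides".
RULES = {
    r"-:.*\.layers\.0\..*": {},
    r"+:.*\.mlp\.down_proj$": {"bits": 8},
    r".*\.self_attn\.q_proj$": {"group_size": 64},
}


def resolve_module_config(name: str, base: dict, rules: dict) -> Optional[dict]:
    """Return the config for a module, or None if the module is skipped.

    Assumption for this sketch: rules are checked in insertion order and the
    first matching rule wins.
    """
    for pattern, overrides in rules.items():
        if pattern.startswith("-:"):
            negative, regex = True, pattern[2:]
        elif pattern.startswith("+:"):
            negative, regex = False, pattern[2:]
        else:
            negative, regex = False, pattern  # positive matching is the default
        if re.match(regex, name):
            if negative:
                return None  # excluded from quantization entirely
            return {**base, **overrides}
    return dict(base)  # no rule matched: use the base config


if __name__ == "__main__":
    for module in (
        "model.layers.0.mlp.down_proj",
        "model.layers.3.mlp.down_proj",
        "model.layers.3.self_attn.q_proj",
        "model.layers.3.self_attn.k_proj",
    ):
        print(module, "->", resolve_module_config(module, BASE, RULES))
```

Under these assumed rules, every module in layer 0 is skipped, `down_proj` modules elsewhere use 8 bits, `q_proj` modules use a smaller group size, and everything else falls back to the base config.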
🤗 ModelCloud quantized ultra-high recovery vortex-series models on HF
Model | | Model | | Model | | Model | |
---|---|---|---|---|---|---|---|
Baichuan | ✅ | Falcon | ✅ | Llama 1-3.3 | ✅ | OLMo2 | 🚀 |
Bloom | ✅ | Gemma 2 | 🚀 | Llama 3.2 Vision | 🚀 | Phi 1-4 | 🚀 |
ChatGLM | 🚀 | GPTBigCode | ✅ | LongLLaMA | ✅ | Qwen | ✅ |
CodeGen | ✅ | GPTNeoX | ✅ | MiniCPM3 | ✅ | Qwen2MoE | 🚀 |
Cohere | ✅ | GPT-2 | ✅ | Mistral | ✅ | Qwen2VL | 🚀 |
DBRX Converted | 🚀 | GPT-J | ✅ | Mixtral | ✅ | RefinedWeb | ✅ |
Deci | ✅ | Granite | 🚀 | MobileLLM | 🚀 | StableLM | ✅ |
DeepSeek-V2 | 🚀 | GRIN-MoE | 🚀 | MOSS | ✅ | StarCoder2 | ✅ |
DeepSeek-V2-Lite | 🚀 | Hymba | 🚀 | MPT | ✅ | XVERSE | ✅ |
EXAONE 3.0 | 🚀 | InternLM 1/2.5 | 🚀 | OPT | ✅ | Yi | ✅ |
GPTQModel is validated for Linux x86_64 with the following devices:
Device | | Optimized Arch | Kernels |
---|---|---|---|
Nvidia GPU | ✅ | Ampere or Higher | Marlin, Exllama V2, Exllama V1, Triton, DynamicCuda, Torch |
Intel/AMD CPU | ✅ | avx512 or amx | IPEX, Torch |
Intel XPU | ✅ | Intel Arc + Datacenter Max | IPEX, Torch |
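As a quick way to relate the "Ampere or Higher" entry to a local machine, the sketch below compares a CUDA compute capability against 8.0, the first Ampere capability. The helper names are hypothetical and not part of GPTQModel. The optional probe relies on PyTorch's `torch.cuda.get_device_capability` and simply returns `None` when PyTorch or a CUDA device is unavailable.

```python
from typing import Optional, Tuple

# Ampere GPUs start at CUDA compute capability 8.0 (for example, Turing is 7.5).
AMPERE_MIN_CAPABILITY = (8, 0)


def is_ampere_or_newer(capability: Tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability is Ampere or newer."""
    return tuple(capability) >= AMPERE_MIN_CAPABILITY


def local_gpu_capability(device_index: int = 0) -> Optional[Tuple[int, int]]:
    """Return the local GPU's compute capability, or None if it cannot be read."""
    try:
        import torch  # optional: only needed for probing the local device
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


if __name__ == "__main__":
    capability = local_gpu_capability()
    if capability is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    else:
        print(f"Compute capability {capability}: "
              f"Ampere or newer = {is_ampere_or_newer(capability)}")
```

Note that the table lists Ampere as the optimized architecture; this check says nothing about whether older GPUs are supported.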
# Optional modules can be installed as extras: auto_round, ipex, vllm, sglang, bitblas.
# Example: pip install -v --no-build-isolation "gptqmodel[vllm,sglang,bitblas,ipex,auto_round]"
pip install -v gptqmodel --no-build-isolation
# or, with uv:
uv pip install -v gptqmodel --no-build-isolation
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
# pip: compile and install
# Optional modules can be installed as extras: auto_round, ipex, vllm, sglang, bitblas.
# Example: pip install -v --no-build-isolation ".[vllm,sglang,bitblas,ipex,auto_round]"
pip install -v . --no-build-isolation
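After either install path, one simple way to confirm that the package is visible to the current Python environment is to query its installed version through the standard library. This is a generic sketch rather than a GPTQModel feature; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "gptqmodel") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("gptqmodel is not installed in this environment.")
    else:
        print(f"gptqmodel {version} is installed.")
```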
Below is a basic sample using `GPTQModel` to quantize an LLM and perform post-quantization inference:
from datasets import load_dataset
from transformers import AutoTokenizer

from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize 1024 samples from one C4 shard to use as calibration data.
calibration_dataset = [
    tokenizer(example["text"])
    for example in load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00001-of-01024.json.gz",
        split="train",
    ).select(range(1024))
]

# 4-bit quantization with a group size of 128.
quant_config = QuantizeConfig(bits=4, group_size=128)

# Load the original model, quantize it, and save the result.
model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset)
model.save(quant_path)

# Reload the quantized model and run inference.
model = GPTQModel.load(quant_path)
result = model.generate(
    **tokenizer(
        "Uncovering deep insights begins with", return_tensors="pt"
    ).to(model.device)
)[0]

# Decode the generated token ids back to text.
print(tokenizer.decode(result))
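As rough arithmetic for what `bits=4, group_size=128` means for storage, the sketch below estimates the effective bits per weight. It assumes a common GPTQ-style layout with one 16-bit scale and one packed zero point per group. The helper names and the 16-bit scale width are assumptions for illustration, and real checkpoints also contain unquantized layers and metadata that this estimate ignores.

```python
def effective_bits_per_weight(bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Quantized bits per weight plus the amortized per-group overhead."""
    # Assumed overhead per group: one scale (scale_bits) and one zero point (bits).
    per_group_overhead = scale_bits + bits
    return bits + per_group_overhead / group_size


def estimated_weight_gib(num_params: float, bits: int, group_size: int) -> float:
    """Approximate size in GiB of num_params quantized weights."""
    total_bits = num_params * effective_bits_per_weight(bits, group_size)
    return total_bits / 8 / 1024**3


if __name__ == "__main__":
    print(effective_bits_per_weight(4, 128))  # 4 + 20/128 = 4.15625
    print(f"{estimated_weight_gib(1e9, 4, 128):.2f} GiB per billion quantized weights")
```

Smaller group sizes raise the per-weight overhead (for example, 64 instead of 128 doubles the amortized scale and zero-point cost), which is the usual trade-off against quantization accuracy.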
For more advanced model quantization features, please refer to this script.
Read the `gptqmodel/models/llama.py` code, which explains in detail via comments how model support is defined. Use it as a guide when opening a PR for new models. Most models follow the same pattern.
GPTQModel inference is integrated into both `lm-eval` and `evalplus`.
We highly recommend avoiding `ppl` and using `lm-eval`/`evalplus` to validate post-quantization model quality. `ppl` should only be used for regression tests and is not a good indicator of model output quality.
# gptqmodel is integrated into lm-eval >= v0.4.6
pip install "lm-eval>=0.4.6"
# gptqmodel is integrated into evalplus[main]
pip install -U "evalplus @ git+https://github.com/evalplus/evalplus"
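For context on why `ppl` is better suited to regression tests than to judging output quality, the sketch below shows the standard perplexity definition: the exponential of the mean negative log-likelihood of reference tokens. It measures how well a model predicts a fixed text, not how well it performs on a task. The function name and the example log-probabilities are illustrative and not taken from GPTQModel.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities of a reference text."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # Illustrative values: a model assigning probability 0.25 to every token
    # has a perplexity of about 4 on that text.
    uniform = [math.log(0.25)] * 8
    print(perplexity(uniform))
```

Comparing this number before and after quantization on the same text can flag regressions, while task-level harnesses such as `lm-eval` and `evalplus` check the generated outputs themselves.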
Below is a basic sample using the `GPTQModel.eval` API:
from gptqmodel import GPTQModel
from gptqmodel.utils import EVAL
model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"
# Use `lm-eval` as framework to evaluate the model
lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.LM_EVAL.ARC_CHALLENGE], output_file='lm-eval_result.json')
# Use `evalplus` as framework to evaluate the model
evalplus_results = GPTQModel.eval(model_id, framework=EVAL.EVALPLUS, tasks=[EVAL.EVALPLUS.HUMAN], output_file='evalplus_result.json')
@misc{gptqmodel,
  author = {ModelCloud.ai},
  title = {GPTQModel},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
}
@article{frantar-gptq,
  title = {{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
  author = {Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year = {2022},
  journal = {arXiv preprint arXiv:2210.17323}
}
@article{frantar2024marlin,
  title = {MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
  author = {Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
  journal = {arXiv preprint arXiv:2408.11743},
  year = {2024}
}