8-bit quantization of BLOOM using GPTQ
GPTQ is a state-of-the-art one-shot weight quantization method. This code is based on GPTQ-for-LLaMa.
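For intuition about what the quantized checkpoints below contain, here is a minimal, illustrative sketch of per-group round-to-nearest weight quantization in PyTorch. It only shows the storage format (low-bit integers plus one scale and one zero point per group of 128 weights); the actual GPTQ algorithm additionally applies second-order (Hessian-based) error compensation while rounding, which this sketch does not implement, and it is not the code used by this repo.

```python
# Illustrative per-group round-to-nearest quantization (NOT the full GPTQ
# algorithm, which also applies Hessian-based error compensation).
import torch

def quantize_per_group(weight: torch.Tensor, wbits: int = 8, groupsize: int = 128):
    out_features, in_features = weight.shape
    assert in_features % groupsize == 0
    qmax = 2 ** wbits - 1
    w = weight.reshape(out_features, in_features // groupsize, groupsize)
    wmin = w.min(dim=-1, keepdim=True).values
    wmax = w.max(dim=-1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax   # one scale per group
    zero = torch.round(-wmin / scale)              # one zero point per group
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return q.to(torch.uint8), scale, zero

def dequantize_per_group(q, scale, zero):
    return ((q.float() - zero) * scale).reshape(q.shape[0], -1)

w = torch.randn(4096, 4096)
q, scale, zero = quantize_per_group(w, wbits=8, groupsize=128)
w_hat = dequantize_per_group(q, scale, zero)
print("max reconstruction error:", (w - w_hat).abs().max().item())
```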
| model name | file size | GPU memory usage |
|---|---|---|
| base model (unquantized) | 27G | ~28.2G |
| bloom7b-2m-8bit-128g.pt | 9.7G | ~11.4G |
| bloom7b-2m-4bit-128g.pt | 6.9G | ~8.4G |
| bloom7b-0.2m-8bit-128g.pt | 9.7G | ~11.4G |
| bloom7b-0.2m-4bit-128g.pt | 6.9G | ~8.4G |
All experiments were run on a single NVIDIA A100.
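Below is a minimal sketch of one way such peak-memory numbers can be read from PyTorch's allocator (the table's figures may instead come from nvidia-smi). The tiny linear layer is only a stand-in for loading the real model and running a forward pass.

```python
# Sketch: one way to measure peak GPU memory.
# The Linear layer is a placeholder; in practice, load the (quantized) model
# and run one generation step between the reset and the readout.
import torch

torch.cuda.reset_peak_memory_stats()
model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)
x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
_ = model(x)
torch.cuda.synchronize()
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```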
If you don't have conda, install it first.
conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
python setup_cuda.py install
# Benchmark performance for FC2 layer of LLaMa-7B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py
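If `setup_cuda.py` built successfully, the kernel extension should import cleanly. A quick sanity check (assuming the extension module is named `quant_cuda`, as in upstream GPTQ-for-LLaMa):

```bash
python -c "import quant_cuda; print('quant_cuda kernels available')"
```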
Dependencies:
- `torch`: tested on v2.0.0+cu117
- `transformers`: tested on v4.28.0.dev0
- `datasets`: tested on v2.10.1
- `safetensors`: tested on v0.3.0
- To run 4-bit kernels: a setup for compiling PyTorch CUDA extensions is required (see also https://pytorch.org/tutorials/advanced/cpp_extension.html); tested on CUDA 11.7.
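A small optional sketch to compare the installed packages against the tested versions listed above:

```python
# Compare installed package versions with the tested versions listed above.
from importlib.metadata import version

tested = {
    "torch": "2.0.0",          # tested on v2.0.0+cu117
    "transformers": "4.28.0",  # tested on v4.28.0.dev0
    "datasets": "2.10.1",
    "safetensors": "0.3.0",
}
for pkg, want in tested.items():
    have = version(pkg)
    note = "ok" if have.startswith(want) else f"differs from tested {want}"
    print(f"{pkg:<12} {have:<15} {note}")
```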
# BELLE-7B-gptq: local path of the model downloaded from Hugging Face
git lfs install
git clone https://huggingface.co/BelleGroup/BELLE-7B-gptq
# model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python bloom_inference.py BELLE-7B-gptq --wbits 8 --groupsize 128 --load BELLE-7B-gptq/bloom7b-2m-8bit-128g.pt --text "hello"
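The 4-bit checkpoint from the table can presumably be loaded the same way, changing only `--wbits` and the checkpoint file (same flag pattern as above, not separately verified here):

```bash
CUDA_VISIBLE_DEVICES=0 python bloom_inference.py BELLE-7B-gptq --wbits 4 --groupsize 128 --load BELLE-7B-gptq/bloom7b-2m-4bit-128g.pt --text "hello"
```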
# BELLE-7B-gptq: local path where the compressed model will be saved
# quantize the model (using wikitext2 as calibration data) and save the compressed checkpoint
CUDA_VISIBLE_DEVICES=0 python bloom.py BelleGroup/BELLE-7B-2M wikitext2 --wbits 8 --groupsize 128 --save BELLE-7B-gptq/bloom7b-2m-8bit-128g.pt
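Similarly, a 4-bit checkpoint should be producible by changing `--wbits` and the output name (same pattern as the command above, not separately verified here):

```bash
CUDA_VISIBLE_DEVICES=0 python bloom.py BelleGroup/BELLE-7B-2M wikitext2 --wbits 4 --groupsize 128 --save BELLE-7B-gptq/bloom7b-2m-4bit-128g.pt
```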
The CUDA kernels support 2-, 3-, 4-, and 8-bit quantization. In general, 8-bit quantization with a groupsize of 128 is recommended.
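For a rough sense of why groupsize 128 adds little overhead, the back-of-the-envelope calculation below amortizes the per-group metadata over the weights. It assumes one fp16 scale and one zero point at the weight bit-width per group, which is a common packing but not necessarily the one used by this repo's kernels.

```python
# Back-of-the-envelope storage cost of group-wise quantization.
# Assumption (not taken from this repo's packing code): each group stores
# one fp16 scale (16 bits) plus one zero point at the weight bit-width.
groupsize = 128
for wbits in (2, 3, 4, 8):
    bits_per_weight = wbits + (16 + wbits) / groupsize
    print(f"{wbits}-bit, groupsize {groupsize}: ~{bits_per_weight:.2f} bits/weight")
```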
This code is based on GPTQ-for-LLaMa. Thanks also to BLOOM, a powerful LLM.