
Commit

Merge branch 'main' into integration
BlackSamorez committed Feb 20, 2024
2 parents d6b49e7 + 48132f6 commit 0e8e64a
Showing 3 changed files with 30 additions and 16 deletions.
19 changes: 15 additions & 4 deletions README.md
@@ -8,13 +8,13 @@ Official PyTorch implementation for [Extreme Compression of Large Language Model

Learn how to run the prequantized models using this Google Colab examples:

Generating with GPU
Running `Mixtral` on a single T4 GPU:

<a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/>
</a>

Streaming with GPU/CPU
Streaming with GPU/CPU:

<a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/streaming_example.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/>
@@ -33,9 +33,21 @@ We provide a number of prequantized models:
| Llama-2-7b | 8x8 | 7.83 | 2.2 | [Link](https://huggingface.co/BlackSamorez/Llama-2-7b-AQLM-2Bit-8x8-hf) |
| Llama-2-13b| 1x16 | 5.41 | 4.1 | [Link](https://huggingface.co/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf)|
| Llama-2-70b| 1x16 | 3.96 | 18.8 | [Link](https://huggingface.co/BlackSamorez/Llama-2-70b-AQLM-2Bit-1x16-hf)|
| Mixtral-8x7b| 1x15 | 4.61 | 12.6 | [Link](https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x15-hf)|
| Llama-2-70b| 2x8 | 4.83 | 18.2 | [Link](https://huggingface.co/BlackSamorez/Llama-2-70b-AQLM-2Bit-2x8-hf) |
| Mixtral-8x7b| 1x16 | 4.37 | 12.6 | [Link](https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf)|


### Inference kernels

AQLM quantization setups vary mainly in the number of codebooks used and the codebook size in bits. The most popular setups, as well as the inference kernels they support, are listed below (a sketch for checking which scheme a checkpoint uses follows the table):

| Kernel | Number of codebooks | Codebook size, bits | Scheme Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------|
| Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ |
| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ |
| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ |
| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ |
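The scheme of a prequantized checkpoint is stored in its Hugging Face config, so the matching kernel can be picked up at load time. Below is a minimal sketch for inspecting it, assuming the config exposes a `quantization_config` entry with `num_codebooks` and `nbits_per_codebook` fields (check the checkpoint's `config.json` if the names differ):

```python
# Sketch: inspect which AQLM scheme (e.g. 1x16, 2x8) a prequantized checkpoint uses.
# The quantization_config field names are assumptions; consult config.json if they differ.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf", trust_remote_code=True
)
quant_config = getattr(config, "quantization_config", None)
print(quant_config)  # expect num_codebooks=1 and nbits_per_codebook=16 for a "1x16" model
```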

### Installation


@@ -58,7 +70,6 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
```
Notice that `torch_dtype` should be set to either `torch.float16` or `"auto"` on GPU and `torch.float32` on CPU. After that, the model can be used exactly as one would use an unquantized model.

As of now, we provide efficient implementations for matrix-vector multiplications for the `1x16` and `2x8` AQLM schemes on GPU, and for the `Kx8` scheme on CPU.
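For CPU-only setups, here is a minimal sketch along the same lines, assuming the `8x8` (`Kx8`) checkpoint from the table above and that `aqlm` is installed with its Numba CPU kernel available (e.g. via a CPU extra, if one is provided):

```python
# Sketch: CPU inference with a Kx8 (here 8x8) AQLM checkpoint.
# Assumes `aqlm` is installed with its Numba CPU kernel available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BlackSamorez/Llama-2-7b-AQLM-2Bit-8x8-hf"

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,  # float32 on CPU, as noted above
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = quantized_model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```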


## Quantization
6 changes: 3 additions & 3 deletions inference_lib/setup.cfg
@@ -6,9 +6,9 @@ author_email = [email protected]
description = Efficiently run models quantized with AQLM
long_description = file: README.md
long_description_content_type = text/markdown
url = https://github.com/Vage1994/AQLM
url = https://github.com/Vahe1994/AQLM
project_urls =
Bug Tracker = https://github.com/Vage1994/AQLM/issues
Bug Tracker = https://github.com/Vahe1994/AQLM/issues
classifiers =
Development Status :: 4 - Beta
Intended Audience :: Developers
@@ -32,7 +32,7 @@ include_package_data = True
python_requires = >=3.10
install_requires =
torch>=2.1.1
transformers==4.37.0
transformers>=4.37.0
[options.extras_require]
gpu =
triton>=2.1
21 changes: 12 additions & 9 deletions notebooks/colab_example.ipynb
@@ -19,8 +19,9 @@
"id": "6egoxPVyckBF"
},
"source": [
"**Install the `aqlm` library**\n",
"- the only extra dependency to run AQLM models."
"**Install the requirements**\n",
"- `aqlm` is the only extra dependency to run AQLM models.\n",
"- Install the latest `accelerate` to pull the latest bugfixes."
]
},
{
@@ -32,7 +33,8 @@
"outputs": [],
"source": [
"%%capture\n",
"!pip install aqlm[gpu]==1.0.0"
"!pip install aqlm[gpu]==1.0.0\n",
"!pip install git+https://github.com/huggingface/accelerate.git@main"
]
},
{
@@ -44,10 +46,11 @@
"**Load the model as usual**\n",
"\n",
"Just don't forget to add:\n",
" - `trust_remote_code=True` to pull the inference code\n",
" - `trust_remote_code=True` to pull the inference code.\n",
" - `torch_dtype=\"auto\"` to load the model in it's native dtype.\n",
" - `device_map=\"cuda\"` to load the model on GPU straight away, saving RAM.\n",
"\n",
"The tokenizer is just a normal `Llama 2` tokenizer."
"The tokenizer is just a normal `Mixtral` tokenizer."
]
},
{
@@ -167,10 +170,10 @@
"from transformers import AutoTokenizer, AutoModelForCausalLM\n",
"\n",
"quantized_model = AutoModelForCausalLM.from_pretrained(\n",
" \"BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf\",\n",
" trust_remote_code=True, torch_dtype=\"auto\"\n",
" \"BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf\",\n",
" trust_remote_code=True, torch_dtype=\"auto\", device_map=\"cuda\"\n",
").cuda()\n",
"tokenizer = AutoTokenizer.from_pretrained(\"BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf\")"
"tokenizer = AutoTokenizer.from_pretrained(\"mistralai/Mixtral-8x7B-v0.1\")"
]
},
{
@@ -243,7 +246,7 @@
"id": "nvShqlguccep"
},
"source": [
"**Check that the output is what one would expect from Llama-2-7b**"
"**Check that the output is what one would expect from Mixtral**"
]
},
{
