
Commit

Merge branch 'main' into integration
BlackSamorez committed Feb 20, 2024
2 parents d6b49e7 + 48132f6 commit 0e8e64a
Showing 3 changed files with 30 additions and 16 deletions.
19 changes: 15 additions & 4 deletions README.md
@@ -8,13 +8,13 @@ Official PyTorch implementation for [Extreme Compression of Large Language Model

Learn how to run the prequantized models using this Google Colab examples:

Generating with GPU
Running `Mixtral` on a single T4 GPU:

<a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/>
</a>

Streaming with GPU/CPU
Streaming with GPU/CPU:

<a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/streaming_example.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="AQLM In Colab"/>
@@ -33,9 +33,21 @@ We provide a number of prequantized models:
| Llama-2-7b | 8x8 | 7.83 | 2.2 | [Link](https://huggingface.co/BlackSamorez/Llama-2-7b-AQLM-2Bit-8x8-hf) |
| Llama-2-13b| 1x16 | 5.41 | 4.1 | [Link](https://huggingface.co/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf)|
| Llama-2-70b| 1x16 | 3.96 | 18.8 | [Link](https://huggingface.co/BlackSamorez/Llama-2-70b-AQLM-2Bit-1x16-hf)|
| Mixtral-8x7b| 1x15 | 4.61 | 12.6 | [Link](https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x15-hf)|
| Llama-2-70b| 2x8 | 4.83 | 18.2 | [Link](https://huggingface.co/BlackSamorez/Llama-2-70b-AQLM-2Bit-2x8-hf) |
| Mixtral-8x7b| 1x16 | 4.37 | 12.6 | [Link](https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf)|


### Inference kernels

AQLM quantization setups vary mainly in the number of codebooks used and the codebook size in bits. The most popular setups, as well as the inference kernels they support, are listed below (a sketch for checking which scheme a checkpoint uses follows the table):

| Kernel | Number of codebooks | Codebook size, bits | Scheme Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------|
| Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ |
| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ |
| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ |
| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ |
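The scheme of a prequantized checkpoint is stored in its Hugging Face config, so the matching kernel can be picked up at load time. Below is a minimal sketch for inspecting it, assuming the config exposes a `quantization_config` entry with `num_codebooks` and `nbits_per_codebook` fields (check the checkpoint's `config.json` if the names differ):

```python
# Sketch: inspect which AQLM scheme (e.g. 1x16, 2x8) a prequantized checkpoint uses.
# The quantization_config field names are assumptions; consult config.json if they differ.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf", trust_remote_code=True
)
quant_config = getattr(config, "quantization_config", None)
print(quant_config)  # expect num_codebooks=1 and nbits_per_codebook=16 for a "1x16" model
```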

### Installation


@@ -58,7 +70,6 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
```
Notice that `torch_dtype` should be set to either `torch.float16` or `"auto"` on GPU and `torch.float32` on CPU. After that, the model can be used exactly as one would use an unquantized model.

As of now, we provide efficient implementations for matrix-vector multiplications for the `1x16` and `2x8` AQLM schemes on GPU, and for the `Kx8` scheme on CPU.
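For CPU-only setups, here is a minimal sketch along the same lines, assuming the `8x8` (`Kx8`) checkpoint from the table above and that `aqlm` is installed with its Numba CPU kernel available (e.g. via a CPU extra, if one is provided):

```python
# Sketch: CPU inference with a Kx8 (here 8x8) AQLM checkpoint.
# Assumes `aqlm` is installed with its Numba CPU kernel available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BlackSamorez/Llama-2-7b-AQLM-2Bit-8x8-hf"

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float32,  # float32 on CPU, as noted above
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = quantized_model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```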


## Quantization
6 changes: 3 additions & 3 deletions inference_lib/setup.cfg
@@ -6,9 +6,9 @@ author_email = [email protected]
description = Efficiently run models quantized with AQLM
long_description = file: README.md
long_description_content_type = text/markdown
url = https://github.com/Vage1994/AQLM
url = https://github.com/Vahe1994/AQLM
project_urls =
Bug Tracker = https://github.com/Vage1994/AQLM/issues
Bug Tracker = https://github.com/Vahe1994/AQLM/issues
classifiers =
Development Status :: 4 - Beta
Intended Audience :: Developers
@@ -32,7 +32,7 @@ include_package_data = True
python_requires = >=3.10
install_requires =
torch>=2.1.1
transformers==4.37.0
transformers>=4.37.0
[options.extras_require]
gpu =
triton>=2.1
21 changes: 12 additions & 9 deletions notebooks/colab_example.ipynb
@@ -19,8 +19,9 @@
"id": "6egoxPVyckBF"
},
"source": [
"**Install the `aqlm` library**\n",
"- the only extra dependency to run AQLM models."
"**Install the requirements**\n",
"- `aqlm` is the only extra dependency to run AQLM models.\n",
"- Install the latest `accelerate` to pull the latest bugfixes."
]
},
{
@@ -32,7 +33,8 @@
"outputs": [],
"source": [
"%%capture\n",
"!pip install aqlm[gpu]==1.0.0"
"!pip install aqlm[gpu]==1.0.0\n",
"!pip install git+https://github.com/huggingface/accelerate.git@main"
]
},
{
@@ -44,10 +46,11 @@
"**Load the model as usual**\n",
"\n",
"Just don't forget to add:\n",
" - `trust_remote_code=True` to pull the inference code\n",
" - `trust_remote_code=True` to pull the inference code.\n",
" - `torch_dtype=\"auto\"` to load the model in it's native dtype.\n",
" - `device_map=\"cuda\"` to load the model on GPU straight away, saving RAM.\n",
"\n",
"The tokenizer is just a normal `Llama 2` tokenizer."
"The tokenizer is just a normal `Mixtral` tokenizer."
]
},
{
@@ -167,10 +170,10 @@
"from transformers import AutoTokenizer, AutoModelForCausalLM\n",
"\n",
"quantized_model = AutoModelForCausalLM.from_pretrained(\n",
" \"BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf\",\n",
" trust_remote_code=True, torch_dtype=\"auto\"\n",
" \"BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf\",\n",
" trust_remote_code=True, torch_dtype=\"auto\", device_map=\"cuda\"\n",
").cuda()\n",
"tokenizer = AutoTokenizer.from_pretrained(\"BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf\")"
"tokenizer = AutoTokenizer.from_pretrained(\"mistralai/Mixtral-8x7B-v0.1\")"
]
},
{
@@ -243,7 +246,7 @@
"id": "nvShqlguccep"
},
"source": [
"**Check that the output is what one would expect from Llama-2-7b**"
"**Check that the output is what one would expect from Mixtral**"
]
},
{
