Update docs about 4bit and 8bit inference.
haotian-liu committed Jul 23, 2023
1 parent 641666d commit 03ad790
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions README.md
@@ -116,6 +116,14 @@ If the VRAM of your GPU is less than 24GB (e.g., RTX 3090, RTX 4090, etc.),
```Shell
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ./checkpoints/LLaVA-13B-v0 --num-gpus 2
```

#### Launch a model worker (4-bit, 8-bit inference, quantized)

You can launch the model worker with quantized bits (4-bit or 8-bit), which reduces the GPU memory footprint and potentially allows you to run inference on a GPU with as little as 12GB of VRAM. Note that inference with quantized bits may not be as accurate as with the full-precision model. Simply append `--load-4bit` or `--load-8bit` to the **model worker** command that you are executing. Below is an example of running with 4-bit quantization.

```Shell
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-llama-2-13b-chat-lightning-preview --load-4bit
```
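
Equivalently, 8-bit quantization only swaps the flag; below is a sketch of the same worker command with `--load-8bit` (all other arguments as in the 4-bit example above).

```Shell
# Same worker launch as above, with 8-bit quantization instead of 4-bit.
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-llama-2-13b-chat-lightning-preview --load-8bit
```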

### CLI Inference

A starting script for inference with LLaVA without the need for a Gradio interface. The current implementation only supports single-turn Q&A sessions, and the interactive CLI is a work in progress. This also serves as an example for users building customized inference scripts.
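
As a rough illustration, a single-turn run might look like the sketch below; the `llava.serve.cli` entry point and the `--model-path`/`--image-file` arguments are assumptions about the repository layout rather than details taken from this diff.

```Shell
# Hypothetical single-turn CLI invocation (entry point and flags are assumptions,
# not confirmed by this diff); the quantization flags described above also apply here.
python -m llava.serve.cli --model-path liuhaotian/llava-llama-2-13b-chat-lightning-preview --image-file "path/to/image.jpg" --load-4bit
```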
