Hardware infrastructure

All my model runs and inference results are captured in my notebooks.

You can see the GPU requirements (NVIDIA card name, VRAM size) for my experiments there.

Below is a collection of model and hardware test results, including mine.

Guide on GPU requirements

How much GPU VRAM do I need to run the 7B model? Can 65B fit on 4x RTX 3090? Out of memory?

Here's a table of working combinations of model size, GPU, and VRAM requirements. Some data points were reported by other users.

| Model | GPU | Min. VRAM | VRAM Used | Precision | Environment | Throughput |
| --- | --- | --- | --- | --- | --- | --- |
| 7B | A100 40GB | 40GB | No OOM | TF32 | Colab Pro | <1 tok/s |
| 7B^ | 3090 24GB | 24GB | No OOM | FP32 | Home PC | ?? |
| 7B | A4000 16GB | 16GB | No OOM | TF32 | Home PC | ?? |
| 7B** | 3060 12GB | 10GB | 9.1GB | int8 | Unspecified | 4-9 it/s |
| 7B** | 3080 10GB | 10GB | 9.2GB | int8 | Unspecified | ?? |
| 7B** | 3090 24GB | 10GB | 9.4GB | int8 | Unspecified | 29-55 it/s |
| 7B^^ | 2080 8GB | 8GB | | FP32 | Home PC | ?? |
| 13B** | 3090Ti 24GB | 20GB | 16.2GB | int8 | Unspecified | 13-29 it/s |
| 13B** | 4090 24GB | 20GB | 16.5GB | int8 | Unspecified | 11-32 it/s |
| 33B** | A6000 48GB | 40GB | 35.8GB | int8 | Unspecified | 19-38 it/s |
| 33B** | A100 40GB | 40GB | 36.2GB | int8 | Unspecified | 21-39 it/s |
| 65B** | A100 80GB | 80GB | ~74.3GB | int8 | Unspecified | 15-35 it/s |

(WIP: incomplete; accuracy is not fully confirmed yet.)

^ A modified LLaMA model (model.py), configured to run on a single GPU (the default setup distributes the model across GPUs). The batch size is lowered to 1 so the model fits within VRAM.

^^ A modified LLaMA model that keeps only a single transformer block on the GPU at a time and swaps the fairscale layers for torch.nn.Linear. For details, see this GitHub Issue.

** 8-bit quantized model.
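
The environment for the 8-bit rows above is "Unspecified", so here is a minimal sketch of one common way to run an int8 LLaMA: Hugging Face transformers with bitsandbytes. It assumes the original consolidated.XX.pth weights have already been converted to the Hugging Face format; the model path is a placeholder, and this is not necessarily the setup the reporters used.

```python
# Minimal sketch: load a LLaMA checkpoint in 8-bit via transformers + bitsandbytes.
# Requires: pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-7b-hf"  # placeholder: local dir with HF-converted weights

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,    # int8 quantization; ~9-10GB of VRAM for 7B per the table above
    device_map="auto",    # let accelerate place the layers on the available GPU(s)
    torch_dtype=torch.float16,
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```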

Memory requirements for each model size

Model arguments:

  • max_batch_size: 1 (IMPORTANT)
  • max_seq_length: 1024
| Model Params (Billions) | 6.7 | 13 | 32.5 | 65.2 |
| --- | --- | --- | --- | --- |
| n_layers | 32 | 40 | 60 | 80 |
| n_heads | 32 | 40 | 52 | 64 |
| dim | 4096 | 5120 | 6656 | 8192 |

The above numbers are gathered from the paper (Table 2).

Memory requirements in fp16 precision (before int8 quantization):

| Model Params (Billions) | 6.7 | 13 | 32.5 | 65.2 |
| --- | --- | --- | --- | --- |
| Model on disk (GB)*** | 13 | 26 | 65 | 130 |
| Cache (GB) | 1 | 1 | 2 | 3 |
| Total (GB) | 14 | 27 | 67 | 133 |
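
The "Model on disk" row is roughly just the parameter count times two bytes (fp16). A quick sketch reproducing it, plus the int8 halves, from the Table 2 parameter counts (my arithmetic, not the source spreadsheet):

```python
# Back-of-the-envelope weight memory: parameters * bytes per parameter.
# Parameter counts (in billions) are from Table 2 of the LLaMA paper.
PARAMS_B = {"7B": 6.7, "13B": 13.0, "33B": 32.5, "65B": 65.2}

for name, params in PARAMS_B.items():
    fp16_gb = params * 2  # 2 bytes per parameter in fp16
    int8_gb = params * 1  # 1 byte per parameter in int8
    print(f"{name}: ~{fp16_gb:.0f}GB weights in fp16, ~{int8_gb:.1f}GB in int8")
```

Add the decoding cache from the next section to get the "Total (GB)" row.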

Transformer kv cache (decoding cache) formula:

  • Per token (bytes): cache_per_token = 2 * 2 * n_layers * n_heads * head_dim, where the first 2 covers the key and value tensors and the second 2 is the bytes per fp16 value. [1]

    Example for 7B: head_dim = dim / n_heads = 4096 / 32 = 128, so cache_per_token = 2 * 2 * 32 * 32 * 128 = 524288.

  • Total (bytes): total = cache_per_token * max_batch_size * max_seq_len

    Example for 7B: total = 524288 * 1 * 1024 = 536870912 bytes (~0.5GB).
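
The same formula as a small script, using the hyperparameters from the paper table above; with max_batch_size=1 and max_seq_length=1024 it reproduces the 7B example exactly.

```python
# kv cache: 2 (K and V) * 2 (bytes per fp16 value) * n_layers * n_heads * head_dim
# bytes per token, times batch size and sequence length.
CONFIGS = {  # n_layers, n_heads, dim from Table 2 of the LLaMA paper
    "7B":  dict(n_layers=32, n_heads=32, dim=4096),
    "13B": dict(n_layers=40, n_heads=40, dim=5120),
    "33B": dict(n_layers=60, n_heads=52, dim=6656),
    "65B": dict(n_layers=80, n_heads=64, dim=8192),
}

def kv_cache_bytes(n_layers, n_heads, dim, max_batch_size=1, max_seq_len=1024):
    head_dim = dim // n_heads
    cache_per_token = 2 * 2 * n_layers * n_heads * head_dim  # bytes per token
    return cache_per_token * max_batch_size * max_seq_len

for name, cfg in CONFIGS.items():
    total = kv_cache_bytes(**cfg)
    print(f"{name}: {total} bytes (~{total / 2**30:.2f} GiB)")
# 7B prints 536870912 bytes (~0.50 GiB)
```

Increase max_batch_size or max_seq_len to see how quickly the cache grows relative to the weights.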

Example: 7B requires at least 14GB of VRAM in 16-bit (fp16) precision, or about 7GB in 8-bit (int8) precision (half as much).

(LLaMA's FAQ about this is confusing. I calculated it based on the spreadsheet below (not mine); the previous formula here was based on it.)

*** Model (on disk) is the total file size of the consolidated.XX.pth files for a model. For example, 13B is 24GB because it has two consolidated.XX.pth files of 12GB each. Weight file sizes are shown below.

[Image: llama-model-weights-resized — LLaMA model weight file sizes]
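
To check the on-disk size of a checkpoint you actually have, you can sum its consolidated.XX.pth shards; the directory path below is a placeholder.

```python
# Sum the sizes of the consolidated.XX.pth shards in a checkpoint directory.
from pathlib import Path

ckpt_dir = Path("llama/13B")  # placeholder path to the downloaded weights
total_bytes = sum(p.stat().st_size for p in ckpt_dir.glob("consolidated.*.pth"))
print(f"{total_bytes / 1e9:.1f} GB ({total_bytes / 2**30:.1f} GiB) on disk")
```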

So 65B needs 133GB in fp16 or 66.5GB in int8 precision; in int8, that fits on a single A100 80GB.

A spreadsheet to calculate the memory requirements for each model size, following the FAQ and the paper. You can make a copy to adjust the batch size and sequence length. [2]

Some people have made just enough code changes to run the 7B model on the CPU. [3] I can't confirm this, though.

Troubleshooting

Footnotes

  1. Transformer Inference Arithmetic

  2. GitHub Issue: Post your hardware specs here if you got it to work

  3. GitHub Issue reply by "gmorenz"