
Commit

fix: Add latest result on 3090/ 4090
hiro-v committed Mar 20, 2024
1 parent 70e10fc commit c885d59
Showing 1 changed file with 27 additions and 27 deletions: docs/blog/2024-03-19-TensorRT-LLM.md
Jan now supports [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an alternative …

We've made a few TensorRT-LLM models available in the Jan Hub for download:

- TinyLlama-1.1b
- Mistral 7b
- TinyJensen-1.1b 😂

You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).

## Performance Benchmarks

TensorRT-LLM is mainly used on datacenter-grade GPUs, where it achieves speeds on the order of [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html). Naturally, we were curious to see how it would perform on consumer-grade GPUs.

We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.

| GPU | Architecture | VRAM (GB) | CUDA Cores | Tensor Cores | Memory Bus (bit) | Memory Bandwidth (GB/s) |
| --- | ------------ | --------- | ---------- | ------------ | ---------------- | ----------------------- |
| RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
| RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
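
For intuition on why these specs matter: single-batch decoding is roughly memory-bandwidth bound, so throughput is capped at about memory bandwidth divided by the bytes read per token, which is roughly the size of the model weights. The snippet below is only a back-of-envelope sketch using the bandwidth figures from the table; the model sizes are our own rough assumptions, not measured values.

```python
# Rough upper bound on single-batch decode speed:
# tokens/s <= memory_bandwidth / bytes_read_per_token (~ model weight size).
# Bandwidth numbers come from the table above; model sizes are approximate assumptions.

GPUS_GB_PER_S = {
    "RTX 3090": 935.8,
    "RTX 4060": 272.0,
}

MODELS_WEIGHT_GB = {
    "TinyLlama-1.1b FP16": 2.2,  # ~1.1B params * 2 bytes (approximate)
    "Mistral-7b int4": 3.9,      # ~7B params * ~0.55 bytes incl. overhead (approximate)
}

for gpu, bandwidth in GPUS_GB_PER_S.items():
    for model, weight_gb in MODELS_WEIGHT_GB.items():
        print(f"{gpu}: {model} ~ {bandwidth / weight_gb:.0f} tokens/s upper bound")
```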

- We tested with batch_size 1, input length 2048, and output length 512, as this is the most common use case.
- We ran each test 5 times and report the average (a minimal timing sketch is shown after this list).
- CPU and memory usage were obtained from Windows Task Manager.
- GPU Metrics were obtained from `nvidia-smi` or `htop`/`nvtop`
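
To give a concrete picture of the setup above, here is a minimal timing sketch against a local OpenAI-compatible completion endpoint (such as the server Jan can expose). The URL, port, and model name are assumptions to adjust for your machine, and the measurement lumps prompt processing and generation together, so it only approximates our internal harness.

```python
# Minimal throughput sketch: send a long prompt, request 512 output tokens,
# repeat 5 times, and report the average tokens/s.
# The endpoint URL and model name are assumptions -- adjust for your setup.
import time

import requests

ENDPOINT = "http://localhost:1337/v1/completions"  # assumed local OpenAI-compatible server
MODEL = "mistral-7b"                                # assumed model identifier
PROMPT = "lorem ipsum " * 1024                      # rough stand-in for a ~2048-token input
RUNS = 5

speeds = []
for _ in range(RUNS):
    start = time.time()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": 512, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.time() - start
    # Note: elapsed includes prompt processing, not just decode time.
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    speeds.append(completion_tokens / elapsed)

print(f"Average throughput over {RUNS} runs: {sum(speeds) / len(speeds):.1f} tokens/s")
```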

### RTX 4090 on Windows PC

TensorRT-LLM handily outperformed llama.cpp on the 4090.

- CPU: Intel 13th series
- GPU: NVIDIA RTX 4090 (Ada)
- RAM: 32GB
- OS: Windows 11 Pro

#### TinyLlama-1.1b FP16

| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | No support | 257.76 |
| VRAM Used (GB) | No support | 3.3 |
| RAM Used (GB) | No support | 0.54 |
| Disk Size (GB) | No support | 2 |

#### Mistral-7b int4

| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 101.3 | 159 |
| VRAM Used (GB) | 5.5 | 6.3 |
| RAM Used (GB) | 0.54 | 0.42 |
| Disk Size (GB) | 4.07 | 3.66 |
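
To put the 4090 numbers in relative terms, the short sketch below computes the TensorRT-LLM speedup over GGUF from the Mistral-7b throughput figures in the table above.

```python
# Relative speedup of TensorRT-LLM over GGUF (llama.cpp) on the RTX 4090,
# using the Mistral-7b throughput figures from the table above.
gguf_tps = 101.3
trt_llm_tps = 159.0

speedup = trt_llm_tps / gguf_tps
print(f"TensorRT-LLM: {speedup:.2f}x faster ({(speedup - 1) * 100:.0f}% higher throughput)")
# -> TensorRT-LLM: 1.57x faster (57% higher throughput)
```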

### RTX 3090 on Windows PC

- RAM: 64GB
- OS: Windows

#### TinyLlama-1.1b FP16

| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | No support | 203 |
| VRAM Used (GB) | No support | 3.8 |
| RAM Used (GB) | No support | 0.54 |
| Disk Size (GB) | No support | 2 |

#### Mistral-7b int4

| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 90 | 140.27 |
| VRAM Used (GB) | 6.0 | 6.8 |
| RAM Used (GB) | 0.54 | 0.42 |
| Disk Size (GB) | 4.07 | 3.66 |

### RTX 4060 on Windows Laptop

- RAM: 16GB
- GPU: NVIDIA RTX 4060 Laptop GPU (Ada)

#### TinyLlama-1.1b FP16

| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
