
Commit

fix: Add latest result on 3090/ 4090
hiro-v committed Mar 20, 2024
1 parent 70e10fc commit c885d59
Showing 1 changed file with 27 additions and 27 deletions: docs/blog/2024-03-19-TensorRT-LLM.md
Jan now supports [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an alternative …

We've made a few TensorRT-LLM models available in the Jan Hub for download:

- TinyLlama-1.1b
- Mistral 7b
- TinyJensen-1.1b 😂

You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).

## Performance Benchmarks

TensorRT-LLM is mainly used on datacenter-grade GPUs, where it achieves speeds on the order of [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html). Naturally, we were curious to see how it would perform on consumer-grade GPUs.

We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.

| GPU | Architecture | VRAM (GB) | CUDA Cores | Tensor Cores | Memory Bus (bit) | Memory Bandwidth (GB/s) |
| --- | ------------ | --------- | ---------- | ------------ | ---------------- | ----------------------- |
| RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
| RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
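
For intuition on why these specs matter: single-batch decoding is roughly memory-bandwidth bound, so throughput is capped at about memory bandwidth divided by the bytes read per token, which is roughly the size of the model weights. The snippet below is only a back-of-envelope sketch using the bandwidth figures from the table; the model sizes are our own rough assumptions, not measured values.

```python
# Rough upper bound on single-batch decode speed:
# tokens/s <= memory_bandwidth / bytes_read_per_token (~ model weight size).
# Bandwidth numbers come from the table above; model sizes are approximate assumptions.

GPUS_GB_PER_S = {
    "RTX 3090": 935.8,
    "RTX 4060": 272.0,
}

MODELS_WEIGHT_GB = {
    "TinyLlama-1.1b FP16": 2.2,  # ~1.1B params * 2 bytes (approximate)
    "Mistral-7b int4": 3.9,      # ~7B params * ~0.55 bytes incl. overhead (approximate)
}

for gpu, bandwidth in GPUS_GB_PER_S.items():
    for model, weight_gb in MODELS_WEIGHT_GB.items():
        print(f"{gpu}: {model} ~ {bandwidth / weight_gb:.0f} tokens/s upper bound")
```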

- We tested with batch_size 1, input length 2048, and output length 512, as this is the most common use case.
- We ran each test 5 times and report the average (a minimal timing sketch is shown after this list).
- CPU and memory usage were obtained from Windows Task Manager.
- GPU Metrics were obtained from `nvidia-smi` or `htop`/`nvtop`
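
To give a concrete picture of the setup above, here is a minimal timing sketch against a local OpenAI-compatible completion endpoint (such as the server Jan can expose). The URL, port, and model name are assumptions to adjust for your machine, and the measurement lumps prompt processing and generation together, so it only approximates our internal harness.

```python
# Minimal throughput sketch: send a long prompt, request 512 output tokens,
# repeat 5 times, and report the average tokens/s.
# The endpoint URL and model name are assumptions -- adjust for your setup.
import time

import requests

ENDPOINT = "http://localhost:1337/v1/completions"  # assumed local OpenAI-compatible server
MODEL = "mistral-7b"                                # assumed model identifier
PROMPT = "lorem ipsum " * 1024                      # rough stand-in for a ~2048-token input
RUNS = 5

speeds = []
for _ in range(RUNS):
    start = time.time()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": 512, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.time() - start
    # Note: elapsed includes prompt processing, not just decode time.
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    speeds.append(completion_tokens / elapsed)

print(f"Average throughput over {RUNS} runs: {sum(speeds) / len(speeds):.1f} tokens/s")
```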

### RTX 4090 on Windows PC

TensorRT-LLM handily outperformed llama.cpp on the 4090.

- CPU: Intel 13th series
- GPU: NVIDIA RTX 4090 (Ada)
- RAM: 32GB
- OS: Windows 11 Pro

#### TinyLlama-1.1b FP16

| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | No support | 257.76 |
| VRAM Used (GB) | No support | 3.3 |
| RAM Used (GB) | No support | 0.54 |
| Disk Size (GB) | No support | 2 |

#### Mistral-7b int4

| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 101.3 | 159 |
| VRAM Used (GB) | 5.5 | 6.3 |
| RAM Used (GB) | 0.54 | 0.42 |
| Disk Size (GB) | 4.07 | 3.66 |
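
To put the 4090 numbers in relative terms, the short sketch below computes the TensorRT-LLM speedup over GGUF from the Mistral-7b throughput figures in the table above.

```python
# Relative speedup of TensorRT-LLM over GGUF (llama.cpp) on the RTX 4090,
# using the Mistral-7b throughput figures from the table above.
gguf_tps = 101.3
trt_llm_tps = 159.0

speedup = trt_llm_tps / gguf_tps
print(f"TensorRT-LLM: {speedup:.2f}x faster ({(speedup - 1) * 100:.0f}% higher throughput)")
# -> TensorRT-LLM: 1.57x faster (57% higher throughput)
```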

### RTX 3090 on Windows PC

- RAM: 64GB
- OS: Windows

#### TinyLlama-1.1b FP16

| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | No support | 203 |
| VRAM Used (GB) | No support | 3.8 |
| RAM Used (GB) | No support | 0.54 |
| Disk Size (GB) | No support | 2 |

#### Mistral-7b int4

| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 90 | 140.27 |
| VRAM Used (GB) | 6.0 | 6.8 |
| RAM Used (GB) | 0.54 | 0.42 |
| Disk Size (GB) | 4.07 | 3.66 |

### RTX 4060 on Windows Laptop

- RAM: 16GB
- GPU: NVIDIA RTX 4060 Laptop GPU (Ada)

#### TinyLlama-1.1b FP16

| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
