diff --git a/docs/blog/2024-03-19-TensorRT-LLM.md b/docs/blog/2024-03-19-TensorRT-LLM.md
index 8b41b81789..0cd61adc19 100644
--- a/docs/blog/2024-03-19-TensorRT-LLM.md
+++ b/docs/blog/2024-03-19-TensorRT-LLM.md
@@ -11,15 +11,15 @@ Jan now supports [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an al
 
 We've made a few TensorRT-LLM models available in the Jan Hub for download:
 
-- TinyLlama-1.1b 
+- TinyLlama-1.1b
 - Mistral 7b
-- TinyJensen-1.1b πŸ˜‚ 
+- TinyJensen-1.1b πŸ˜‚
 
-You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm). 
+You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
 
 ## Performance Benchmarks
 
-TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds. Naturally, we were curious to see how this would perform on consumer-grade GPUs. 
+TensorRT-LLM is mainly used on datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) speeds. Naturally, we were curious to see how it would perform on consumer-grade GPUs.
 
 We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.
 
@@ -29,7 +29,7 @@ We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://githu
 | RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
 | RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
 
-- We tested using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use. 
+- We tested using batch_size 1, input length 2048, and output length 512, as this is a common usage pattern.
 - We ran the tests 5 times to get the average.
 - CPU and memory usage were obtained from Windows Task Manager.
 - GPU metrics were obtained from `nvidia-smi` or `nvtop`.
 
@@ -38,30 +38,30 @@ We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://githu
 
 ### RTX 4090 on Windows PC
 
-TensorRT-LLM handily outperformed llama.cpp in for the 4090s. Interestingly, 
+TensorRT-LLM handily outperformed llama.cpp on the 4090. Interestingly,
 
 - CPU: Intel 13th Gen
 - GPU: NVIDIA RTX 4090 (Ada - sm 89)
-- RAM: 120GB
-- OS: Windows
+- RAM: 32GB
+- OS: Windows 11 Pro
 
-#### TinyLlama-1.1b q4
+#### TinyLlama-1.1b FP16
 
 | Metrics              | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 104                  | βœ… 131       |
-| VRAM Used (GB)       | 2.1                  | 😱 21.5      |
-| RAM Used (GB)        | 0.3                  | 😱 15        |
-| Disk Size (GB)       | 4.07                 | 4.07         |
+| Throughput (token/s) | No support           | βœ… 257.76    |
+| VRAM Used (GB)       | No support           | 3.3          |
+| RAM Used (GB)        | No support           | 0.54         |
+| Disk Size (GB)       | No support           | 2            |
 
 #### Mistral-7b int4
 
 | Metrics              | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 80                   | βœ… 97.9      |
-| VRAM Used (GB)       | 2.1                  | 😱 23.5      |
-| RAM Used (GB)        | 0.3                  | 😱 15        |
-| Disk Size (GB)       | 4.07                 | 4.07         |
+| Throughput (token/s) | 101.3                | βœ… 159       |
+| VRAM Used (GB)       | 5.5                  | 6.3          |
+| RAM Used (GB)        | 0.54                 | 0.42         |
+| Disk Size (GB)       | 4.07                 | 3.66         |
 
 ### RTX 3090 on Windows PC
 
@@ -70,23 +70,23 @@ TensorRT-LLM handily outperformed llama.cpp in for the 4090s. Interestingly,
 - RAM: 64GB
 - OS: Windows
 
-#### TinyLlama-1.1b q4
+#### TinyLlama-1.1b FP16
 
 | Metrics              | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 131.28               | βœ… 194       |
-| VRAM Used (GB)       | 2.1                  | 😱 21.5      |
-| RAM Used (GB)        | 0.3                  | 😱 15        |
-| Disk Size (GB)       | 4.07                 | 4.07         |
+| Throughput (token/s) | No support           | βœ… 203       |
+| VRAM Used (GB)       | No support           | 3.8          |
+| RAM Used (GB)        | No support           | 0.54         |
+| Disk Size (GB)       | No support           | 2            |
 
 #### Mistral-7b int4
 
 | Metrics              | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 88                   | βœ… 137       |
-| VRAM Used (GB)       | 6.0                  | 😱 23.8      |
-| RAM Used (GB)        | 0.3                  | 😱 25        |
-| Disk Size (GB)       | 4.07                 | 4.07         |
+| Throughput (token/s) | 90                   | 140.27       |
+| VRAM Used (GB)       | 6.0                  | 6.8          |
+| RAM Used (GB)        | 0.54                 | 0.42         |
+| Disk Size (GB)       | 4.07                 | 3.66         |
 
 ### RTX 4060 on Windows Laptop
 
@@ -95,7 +95,7 @@ TensorRT-LLM handily outperformed llama.cpp in for the 4090s. Interestingly,
 - RAM: 16GB
 - GPU: NVIDIA Laptop GPU 4060 (Ada)
 
-#### TinyLlama-1.1b q4
+#### TinyLlama-1.1b FP16
 
 | Metrics              | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
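
A note on reproducing the methodology described in the benchmark bullets (batch_size 1, 2048-token input, 512-token output, averaged over 5 runs): the sketch below shows one way the throughput (tokens/s) and VRAM numbers can be computed. It is illustrative only, not the harness used for these tables: `generate` is a hypothetical callback standing in for whichever engine is being benchmarked (TensorRT-LLM or llama.cpp), while the `nvidia-smi` query flags are standard.

```python
import statistics
import subprocess
import time


def vram_used_mib() -> int:
    """Sample current GPU memory usage (MiB) via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    # One line per GPU; take the first GPU only.
    return int(out.stdout.splitlines()[0])


def benchmark(generate, prompt: str, max_tokens: int = 512, runs: int = 5):
    """Average decode throughput (tokens/s) over `runs` runs at batch_size 1.

    `generate` is a hypothetical stand-in: it takes (prompt, max_tokens)
    and returns the number of tokens actually produced.
    """
    throughputs = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt, max_tokens)
        throughputs.append(n_tokens / (time.perf_counter() - start))
    return statistics.mean(throughputs), vram_used_mib()
```

Note that `nvidia-smi` reports memory usage for the whole GPU, so the VRAM reading should be taken with other GPU workloads closed, and the prompt should be padded to the 2048-token input length used in the tests.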