From 4e001bb2459c2201196079391ab04da03daf35b5 Mon Sep 17 00:00:00 2001
From: Daniel
Date: Wed, 20 Mar 2024 12:12:49 +0800
Subject: [PATCH] Initial commit for TensorRT-LLM blog

---
 docs/blog/2024-03-19-TensorRT-LLM.md | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/docs/blog/2024-03-19-TensorRT-LLM.md b/docs/blog/2024-03-19-TensorRT-LLM.md
index 08f1a1d1a3..8b41b81789 100644
--- a/docs/blog/2024-03-19-TensorRT-LLM.md
+++ b/docs/blog/2024-03-19-TensorRT-LLM.md
@@ -13,20 +13,15 @@ We've made a few TensorRT-LLM models available in the Jan Hu
 
 - TinyLlama-1.1b
 - Mistral 7b
-- TinyJensen-1.1b, which is trained on Jensen Huang's 👀
+- TinyJensen-1.1b 😂
 
-## What is TensorRT-LLM?
-
-Please read our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
-
-TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds.
+You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
 
 ## Performance Benchmarks
 
+TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds. Naturally, we were curious to see how this would perform on consumer-grade GPUs.
 
-We were curious to see how this would perform on consumer-grade GPUs, as most of Jan's users use consumer-grade GPUs.
-
-- We’ve done a comparison of how TensorRT-LLM does vs. llama.cpp, our default inference engine.
+We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.
 | NVIDIA GPU | Architecture | VRAM Used (GB) | CUDA Cores | Tensor Cores | Memory Bus Width (bit) | Memory Bandwidth (GB/s) |
 | ---------- | ------------ | -------------- | ---------- | ------------ | ---------------------- | ----------------------- |
@@ -34,14 +29,17 @@ We were curious to see how this would perform on consumer-grade GPUs, as most of
 | RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
 | RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
 
-> We test using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use. We run 5 times and get the Average.
-
-> We use Windows task manager and Linux NVIDIA-SMI/ Htop to get CPU/ Memory/ NVIDIA GPU metrics per process.
-
-> We turn off all user application and only open Jan app with Nitro tensorrt-llm or NVIDIA benchmark script in python
+- We tested using batch_size 1, input length 2048, and output length 512, as this is a common use case.
+- We ran each test 5 times and report the average.
+- CPU and memory usage were obtained from Windows Task Manager, or `htop` on Linux.
+- GPU metrics were obtained from `nvidia-smi` or `nvtop`.
+- All tests were run on bare-metal PCs with no other apps open.
+- There is a slight difference between the models: TensorRT-LLM uses AWQ-quantized models, while llama.cpp uses its own quantization technique.
 
 ### RTX 4090 on Windows PC
 
+TensorRT-LLM handily outperformed llama.cpp for the 4090. Interestingly,
+
 - CPU: Intel 13th series
 - GPU: NVIDIA GPU 4090 (Ada - sm 89)
 - RAM: 120GB
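The benchmark methodology described in the patch (batch size 1, 2048 input tokens, 512 output tokens, averaged over 5 runs) can be sketched as a simple timing harness. This is an illustrative sketch, not Jan's actual benchmark script: `benchmark` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in you would replace with a real TensorRT-LLM or llama.cpp inference call.

```python
import statistics
import time

def benchmark(generate, prompt_tokens, output_tokens, runs=5):
    """Return mean tokens/s for `generate` over several runs.

    `generate` is any callable taking (prompt_tokens, output_tokens);
    swap in a real engine call to measure an actual backend.
    """
    speeds = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt_tokens, output_tokens)
        elapsed = time.perf_counter() - start
        # Decode throughput: generated tokens divided by wall-clock time.
        speeds.append(output_tokens / elapsed)
    return statistics.mean(speeds)

def fake_generate(prompt_tokens, output_tokens):
    # Placeholder for an inference call; sleeps instead of generating.
    time.sleep(0.01)

avg = benchmark(fake_generate, prompt_tokens=2048, output_tokens=512)
print(f"{avg:.0f} tokens/s")
```

Measuring only decode throughput this way ignores prompt-processing time; a fuller harness would report time-to-first-token separately.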