[Docs] Update W4A16 News (InternLM#227)
* update news and add supported models

* fix typo

* add ampere note

* update supported models

* replace icon with yes or no

* avoid ambiguity

* fix typo
pppppM authored Aug 14, 2023
1 parent 43f75f7 commit af517a4
Showing 2 changed files with 52 additions and 2 deletions.
README.md: 27 changes (26 additions, 1 deletion)
@@ -13,7 +13,9 @@ ______________________________________________________________________

## News 🎉

- \[2023/08\] TurboMind supports 4-bit quantization and inference.
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation🚀.
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.
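The HuggingFace Hub item in the news above can be exercised with a few lines of Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed; the repo id `lmdeploy/llama2-chat-7b-w4` is an illustrative guess at one of the prebuilt 4-bit models, not something this commit names.

```python
# Hypothetical: fetch a prebuilt 4-bit model from the lmdeploy Hub org.
# The repo id below is an assumption for illustration; browse
# https://huggingface.co/lmdeploy for the models actually published.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="lmdeploy/llama2-chat-7b-w4")
print(f"4-bit weights downloaded to {local_dir}")
```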
@@ -34,6 +36,29 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Supported Models

`LMDeploy` has two inference backends, `Pytorch` and `TurboMind`.

### TurboMind

> **Note**<br />
> W4A16 inference requires an NVIDIA GPU with Ampere architecture or above.

| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :-------------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | Yes | Yes | No |
| Llama2 | Yes | Yes | Yes | Yes | No |
| InternLM | Yes | Yes | Yes | Yes | No |
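Per the note above, the W4A16 path only runs on Ampere-or-newer GPUs, which corresponds to CUDA compute capability 8.0. A small sketch (using PyTorch) to check this before deploying:

```python
import torch

# Ampere GPUs (A100, RTX 30xx, ...) report compute capability >= 8.0;
# earlier architectures cannot run the W4A16 kernels.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible")
major, minor = torch.cuda.get_device_capability(0)
if major >= 8:
    print(f"sm_{major}{minor}: this GPU can run W4A16 inference")
else:
    print(f"sm_{major}{minor}: pre-Ampere GPU, fall back to FP16 or KV INT8")
```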

### Pytorch

| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :-------------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | No | No | No |
| Llama2 | Yes | Yes | No | No | No |
| InternLM | Yes | Yes | No | No | No |
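For scripting against these matrices, the two tables above can be collapsed into a simple lookup. Purely illustrative: the dictionary and helper below are ours, not part of the LMDeploy API.

```python
# The support matrices above (Llama / Llama2 / InternLM behave identically),
# encoded as a (backend, feature) -> bool lookup. Illustrative only.
SUPPORT = {
    ("turbomind", "tensor parallel"): True,
    ("turbomind", "fp16"): True,
    ("turbomind", "kv int8"): True,
    ("turbomind", "w4a16"): True,
    ("turbomind", "w8a8"): False,
    ("pytorch", "tensor parallel"): True,
    ("pytorch", "fp16"): True,
    ("pytorch", "kv int8"): False,
    ("pytorch", "w4a16"): False,
    ("pytorch", "w8a8"): False,
}

def supports(backend: str, feature: str) -> bool:
    """True if the given backend supports the given feature."""
    return SUPPORT.get((backend.lower(), feature.lower()), False)

assert supports("TurboMind", "W4A16")
assert not supports("Pytorch", "KV INT8")
```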

## Performance

**Case I**: output token throughput with a fixed number of input and output tokens (1, 2048)
README_zh-CN.md: 27 changes (26 additions, 1 deletion)
@@ -13,7 +13,9 @@ ______________________________________________________________________

## News 🎉

- \[2023/08\] TurboMind supports 4-bit weight quantization and inference
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation 🚀
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm
- \[2023/07\] TurboMind supports Llama-2 70B with GQA
- \[2023/07\] TurboMind supports Llama-2 7B/13B models
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM
@@ -35,6 +37,29 @@ LMDeploy is jointly developed by [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](ht

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Supported Models

`LMDeploy` supports two inference backends: `TurboMind` and `Pytorch`.

### TurboMind

> **Note**<br />
> W4A16 inference requires an NVIDIA GPU with Ampere architecture or above.

| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | Yes | Yes | No |
| Llama2 | Yes | Yes | Yes | Yes | No |
| InternLM | Yes | Yes | Yes | Yes | No |

### Pytorch

| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | No | No | No |
| Llama2 | Yes | Yes | No | No | No |
| InternLM | Yes | Yes | No | No | No |

## Performance

**Case I**: output token throughput with a fixed number of input and output tokens (1, 2048)
