[Docs] Update W4A16 News (InternLM#227)
* update news and add supported models

* fix typo

* add ampere note

* update supported models

* replace icon with yes or no

* avoid ambiguity

* fix typo
pppppM authored Aug 14, 2023
1 parent 43f75f7 commit af517a4
Showing 2 changed files with 52 additions and 2 deletions.
README.md: 27 changes (26 additions, 1 deletion)
@@ -13,7 +13,9 @@ ______________________________________________________________________

## News 🎉

- \[2023/08\] TurboMind supports 4-bit quantization and inference.
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation🚀.
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.
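The HuggingFace Hub item in the news above can be exercised with a few lines of Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed; the repo id `lmdeploy/llama2-chat-7b-w4` is an illustrative guess at one of the prebuilt 4-bit models, not something this commit names.

```python
# Hypothetical: fetch a prebuilt 4-bit model from the lmdeploy Hub org.
# The repo id below is an assumption for illustration; browse
# https://huggingface.co/lmdeploy for the models actually published.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="lmdeploy/llama2-chat-7b-w4")
print(f"4-bit weights downloaded to {local_dir}")
```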
@@ -34,6 +36,29 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Supported Models

`LMDeploy` has two inference backends, `Pytorch` and `TurboMind`.

### TurboMind

> **Note**<br />
> W4A16 inference requires an NVIDIA GPU with Ampere architecture or above.

| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :-------------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | Yes | Yes | No |
| Llama2 | Yes | Yes | Yes | Yes | No |
| InternLM | Yes | Yes | Yes | Yes | No |
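Per the note above, the W4A16 path only runs on Ampere-or-newer GPUs, which corresponds to CUDA compute capability 8.0. A small sketch (using PyTorch) to check this before deploying:

```python
import torch

# Ampere GPUs (A100, RTX 30xx, ...) report compute capability >= 8.0;
# earlier architectures cannot run the W4A16 kernels.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible")
major, minor = torch.cuda.get_device_capability(0)
if major >= 8:
    print(f"sm_{major}{minor}: this GPU can run W4A16 inference")
else:
    print(f"sm_{major}{minor}: pre-Ampere GPU, fall back to FP16 or KV INT8")
```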

### Pytorch

| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :-------------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | No | No | No |
| Llama2 | Yes | Yes | No | No | No |
| InternLM | Yes | Yes | No | No | No |
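For scripting against these matrices, the two tables above can be collapsed into a simple lookup. Purely illustrative: the dictionary and helper below are ours, not part of the LMDeploy API.

```python
# The support matrices above (Llama / Llama2 / InternLM behave identically),
# encoded as a (backend, feature) -> bool lookup. Illustrative only.
SUPPORT = {
    ("turbomind", "tensor parallel"): True,
    ("turbomind", "fp16"): True,
    ("turbomind", "kv int8"): True,
    ("turbomind", "w4a16"): True,
    ("turbomind", "w8a8"): False,
    ("pytorch", "tensor parallel"): True,
    ("pytorch", "fp16"): True,
    ("pytorch", "kv int8"): False,
    ("pytorch", "w4a16"): False,
    ("pytorch", "w8a8"): False,
}

def supports(backend: str, feature: str) -> bool:
    """True if the given backend supports the given feature."""
    return SUPPORT.get((backend.lower(), feature.lower()), False)

assert supports("TurboMind", "W4A16")
assert not supports("Pytorch", "KV INT8")
```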

## Performance

**Case I**: output token throughput with a fixed number of input and output tokens (1, 2048)
README_zh-CN.md: 27 changes (26 additions, 1 deletion)
@@ -13,7 +13,9 @@ ______________________________________________________________________

## News 🎉

- \[2023/08\] TurboMind supports 4-bit weight quantization and inference
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation 🚀
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm
- \[2023/07\] TurboMind supports Llama-2 70B with GQA
- \[2023/07\] TurboMind supports Llama-2 7B/13B models
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM
@@ -35,6 +37,29 @@ LMDeploy is jointly developed by [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](ht

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Supported Models

`LMDeploy` supports two inference backends: `TurboMind` and `Pytorch`.

### TurboMind

> **Note**<br />
> W4A16 inference requires an NVIDIA GPU with Ampere architecture or above.

| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | Yes | Yes | No |
| Llama2 | Yes | Yes | Yes | Yes | No |
| InternLM | Yes | Yes | Yes | Yes | No |

### Pytorch

| Models | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :------: | :--: | :-----: | :---: | :--: |
| Llama | Yes | Yes | No | No | No |
| Llama2 | Yes | Yes | No | No | No |
| InternLM | Yes | Yes | No | No | No |

## Performance

**Case I**: output token throughput with a fixed number of input and output tokens (1, 2048)
