Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md
gradio_chat_client.py		gradio_chat_client.py
infer.py		infer.py
openai_client.py		openai_client.py
run_gradio_client.sh		run_gradio_client.sh
run_vllm_server.sh		run_vllm_server.sh

README.md

vllm部署工具及paged attention

🎥 视频教程

1. Vllm介绍

📝 1. 准备模型文件

THUDM/glm-4-9b-chat

使用huggingface 镜像站下载

export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download THUDM/glm-4-9b-chat --local-dir /root/autodl-tmp/models/glm-4-9b-chat

使用modelscope下载

pip install modelscope
modelscope download --model ZhipuAI/glm-4-9b-chat --local_dir /root/autodl-tmp/models/glm-4-9b-chat

# int4 量化模型
modelscope download --model qwen/Qwen2-1.5B --local_dir /root/autodl-tmp/models/qwen2-1.5b

2. 安装vllm

pip install vllm

3. 使用vllm

3.1 推理

python infer.py

3.2 部署服务

bash vllm_server.sh

3.3 调用服务

# 1.使用openai 风格的客户端调用
python openai_client.py

# 2. 使用gradio客户端
bash run_gradio_client.sh

4. Paged Attention介绍

Paged Attention

在 vLLM 中，LLM 服务的性能瓶颈在于内存。
KV Cache 的分块存储。
共享内存对多输出序列的优化。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vllm

vllm

README.md

vllm部署工具及paged attention

1. Vllm介绍

📝 1. 准备模型文件

2. 安装vllm

3. 使用vllm

4. Paged Attention介绍

Files

vllm

Directory actions

More options

Directory actions

More options

Latest commit

History

vllm

Folders and files

parent directory

README.md

vllm部署工具及paged attention

1. Vllm介绍

📝 1. 准备模型文件

2. 安装vllm

3. 使用vllm

4. Paged Attention介绍