AgentScope supports developers in building local model API serving with different inference engines and libraries. This document introduces how to quickly set up local API serving with the provided scripts.
ollama is a CPU inference engine for LLMs. With ollama, developers can build their local model API serving without GPU requirements.
- First, install ollama from its official repository, according to your system (e.g. macOS, Windows, or Linux).
- Then, follow ollama's guidance to pull or create a model and start serving it. Taking llama2 as an example, you can run the following command to pull the model files.
```bash
ollama pull llama2
```
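Once the model is pulled and ollama is serving, you can optionally check that it is available before configuring AgentScope. A minimal sketch, assuming ollama listens on its default port 11434:

```python
import requests

# List the models that the local ollama server has pulled.
# Assumes ollama's default endpoint at http://localhost:11434.
resp = requests.get("http://localhost:11434/api/tags")
resp.raise_for_status()
print([m["name"] for m in resp.json().get("models", [])])  # e.g. ['llama2:latest']
```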
In AgentScope, you can use the following model configurations to load the model.
- For ollama chat API:
```python
{
    "config_name": "my_ollama_chat_config",
    "model_type": "ollama_chat",

    # Required parameters
    "model_name": "{model_name}",  # The model name used in ollama API, e.g. llama2

    # Optional parameters
    "options": {                   # Parameters passed to the model when calling
        # e.g. "temperature": 0., "seed": 123,
    },
    "keep_alive": "5m",            # Controls how long the model will stay loaded into memory
}
```
- For ollama generate API:
```python
{
    "config_name": "my_ollama_generate_config",
    "model_type": "ollama_generate",

    # Required parameters
    "model_name": "{model_name}",  # The model name used in ollama API, e.g. llama2

    # Optional parameters
    "options": {                   # Parameters passed to the model when calling
        # "temperature": 0., "seed": 123,
    },
    "keep_alive": "5m",            # Controls how long the model will stay loaded into memory
}
```
- For ollama embedding API:
```python
{
    "config_name": "my_ollama_embedding_config",
    "model_type": "ollama_embedding",

    # Required parameters
    "model_name": "{model_name}",  # The model name used in ollama API, e.g. llama2

    # Optional parameters
    "options": {                   # Parameters passed to the model when calling
        # "temperature": 0., "seed": 123,
    },
    "keep_alive": "5m",            # Controls how long the model will stay loaded into memory
}
```
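With one of the configs above saved to a JSON file, loading and calling the local model from AgentScope looks roughly as follows. This is a sketch: the file name `./model_configs.json` is just an example, and the exact imports may differ slightly between AgentScope versions.

```python
import agentscope
from agentscope.agents import DialogAgent
from agentscope.message import Msg

# Register the model config(s) defined above.
agentscope.init(model_configs="./model_configs.json")

# Create an agent backed by the local ollama chat model.
agent = DialogAgent(
    name="assistant",
    sys_prompt="You are a helpful assistant.",
    model_config_name="my_ollama_chat_config",
)

# Send one message and print the reply.
reply = agent(Msg(name="user", content="Hi, what can you do?", role="user"))
print(reply.content)
```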
Flask is a lightweight web application framework, and it is easy to build local model API serving with it. Here we provide two Flask examples, built on the Transformers and ModelScope libraries respectively, which you can adapt to your own model API serving with a few modifications.

For the Transformers-based example, install Flask and Transformers with the following command.
```bash
pip install flask torch transformers accelerate
```
Taking the model `meta-llama/Llama-2-7b-chat-hf` and port `8000` as an example, set up the model API serving by running the following command.
```bash
python flask_transformers/setup_hf_service.py \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --device "cuda:0" \
    --port 8000
```
You can replace `meta-llama/Llama-2-7b-chat-hf` with any model card in the Hugging Face model hub. In AgentScope, you can load the model with the following model config (`./flask_transformers/model_config.json`):
```json
{
    "model_type": "post_api_chat",
    "config_name": "flask_llama2-7b-chat-hf",
    "api_url": "http://127.0.0.1:8000/llm/",
    "json_args": {
        "max_length": 4096,
        "temperature": 0.5
    }
}
```
In this model serving, the messages in POST requests should be in STRING format. You can use the chat templates provided by transformers with a little modification in `./flask_transformers/setup_hf_service.py`, e.g. along the lines of the sketch below.
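For instance, if you change the handler to accept a list of chat messages instead of a single prompt string, you could flatten them with the model's chat template before generation. The snippet below is only a sketch of that kind of modification, not part of the provided script:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Chat history in the usual role/content format.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AgentScope?"},
]

# Render the messages into a single prompt string that follows the
# model's chat template, ready to be tokenized and generated from.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```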
For the ModelScope-based example, install Flask and ModelScope with the following command.
```bash
pip install flask torch modelscope
```
Taking the model `modelscope/Llama-2-7b-chat-ms` and port `8000` as an example, set up the model API serving by running the following command.
```bash
python flask_modelscope/setup_ms_service.py \
    --model_name_or_path modelscope/Llama-2-7b-chat-ms \
    --device "cuda:0" \
    --port 8000
```
You can replace `modelscope/Llama-2-7b-chat-ms` with any model card in the ModelScope model hub. In AgentScope, you can load the model with the following model config (`flask_modelscope/model_config.json`):
```json
{
    "model_type": "post_api_chat",
    "config_name": "flask_llama2-7b-chat-ms",
    "api_url": "http://127.0.0.1:8000/llm/",
    "json_args": {
        "max_length": 4096,
        "temperature": 0.5
    }
}
```
Similar to the Transformers example, the messages in POST requests should be in STRING format.
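To sanity-check either Flask service outside of AgentScope, you can POST to it directly with `requests`. Note that the exact payload schema is defined by `setup_hf_service.py` / `setup_ms_service.py`; the field names below are assumptions for illustration and may need to be adjusted to match the script:

```python
import requests

# Hypothetical smoke test: adjust the payload keys to whatever the
# Flask script actually expects.
payload = {
    "messages": "Hello, who are you?",  # the prompt as a single string
    "max_length": 4096,
    "temperature": 0.5,
}
resp = requests.post("http://127.0.0.1:8000/llm/", json=payload)
resp.raise_for_status()
print(resp.json())
```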
FastChat is an open platform that provides quick setup for model serving with OpenAI-compatible RESTful APIs.
To install FastChat, run

```bash
pip install "fschat[model_worker,webui]"
```
Taking the model `meta-llama/Llama-2-7b-chat-hf` and port `8000` as an example, run the following command to set up the model serving.

```bash
bash fastchat/fastchat_setup.sh -m meta-llama/Llama-2-7b-chat-hf -p 8000
```
Refer to the supported model list of FastChat.
Now you can load the model in AgentScope with the following model config (`fastchat/model_config.json`):
```json
{
    "model_type": "openai_chat",
    "config_name": "fastchat_llama2-7b-chat-hf",
    "model_name": "meta-llama/Llama-2-7b-chat-hf",
    "api_key": "EMPTY",
    "client_args": {
        "base_url": "http://127.0.0.1:8000/v1/"
    },
    "generate_args": {
        "temperature": 0.5
    }
}
```
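Since FastChat exposes an OpenAI-compatible API, you can also query the server directly with the official `openai` Python client (v1+) by pointing `base_url` at the local endpoint. A minimal sketch; if the request fails with an unknown-model error, check the served model name with `client.models.list()`:

```python
from openai import OpenAI

# The local FastChat server does not check the API key; "EMPTY" is conventional.
client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1/")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # should match the name the server reports
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    temperature=0.5,
)
print(response.choices[0].message.content)
```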
vllm is a high-throughput inference and serving engine for LLMs.
To install vllm, run

```bash
pip install vllm
```
Taking the model `meta-llama/Llama-2-7b-chat-hf` and port `8000` as an example, run the following command to set up the model API serving.

```bash
./vllm/vllm_setup.sh -m meta-llama/Llama-2-7b-chat-hf -p 8000
```
Please refer to the supported models list of vllm.
Now you can load the model in AgentScope with the following model config (`vllm/model_config.json`):
```json
{
    "model_type": "openai_chat",
    "config_name": "vllm_llama2-7b-chat-hf",
    "model_name": "meta-llama/Llama-2-7b-chat-hf",
    "api_key": "EMPTY",
    "client_args": {
        "base_url": "http://127.0.0.1:8000/v1/"
    },
    "generate_args": {
        "temperature": 0.5
    }
}
```
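The vllm server is OpenAI-compatible as well, so the same client code as in the FastChat example works here. As a lighter-weight check, you can also hit the REST endpoint directly:

```python
import requests

# Query the OpenAI-compatible chat completions endpoint served by vllm.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.5,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```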
Both Hugging Face and ModelScope provide model inference APIs, which can be used with the AgentScope post API model wrapper. Taking `gpt2` in the Hugging Face inference API as an example, you can use the following model config in AgentScope:
```json
{
    "model_type": "post_api_chat",
    "config_name": "gpt2",
    "headers": {
        "Authorization": "Bearer {YOUR_API_TOKEN}"
    },
    "api_url": "https://api-inference.huggingface.co/models/gpt2"
}
```
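To verify the token and endpoint outside of AgentScope, you can call the Hugging Face Inference API directly. A sketch, where `{YOUR_API_TOKEN}` is your Hugging Face access token; since `gpt2` is a plain text-generation model, the payload is a simple `inputs` string:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer {YOUR_API_TOKEN}"}  # replace with a real token

# Send a text-generation request and print the raw JSON response.
resp = requests.post(API_URL, headers=headers, json={"inputs": "Hello, I am"})
resp.raise_for_status()
print(resp.json())
```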