
Commit 6bb48a0

mzegla and dtrawins authored Jun 14, 2024
Serving LLMs - documentation update (#2497)
Co-authored-by: Dariusz Trawinski <[email protected]>
1 parent 271f5ef commit 6bb48a0

9 files changed: +300 −83
 

‎README.md

+3 −2

@@ -21,8 +21,9 @@ Start using OpenVINO Model Server with a fast-forward serving example from the [
 Read [release notes](https://github.com/openvinotoolkit/model_server/releases) to find out what’s new.

 ### Key features:
-- **[NEW]** [Python code execution](https://docs.openvino.ai/nightly/ovms_docs_python_support_reference.html)
-- **[NEW]** [gRPC streaming](https://docs.openvino.ai/nightly/ovms_docs_streaming_endpoints.html)
+- **[NEW]** [Efficient Text Generation via OpenAI API - preview](https://docs.openvino.ai/nightly/ovms_docs_llm_reference.html)
+- [Python code execution](https://docs.openvino.ai/nightly/ovms_docs_python_support_reference.html)
+- [gRPC streaming](https://docs.openvino.ai/nightly/ovms_docs_streaming_endpoints.html)
 - [MediaPipe graphs serving](https://docs.openvino.ai/nightly/ovms_docs_mediapipe.html)
 - Model management - including [model versioning](https://docs.openvino.ai/nightly/ovms_docs_model_version_policy.html) and [model updates in runtime](https://docs.openvino.ai/nightly/ovms_docs_online_config_changes.html)
 - [Dynamic model inputs](https://docs.openvino.ai/nightly/ovms_docs_shape_batch_layout.html)

‎docs/clients_openai.md

+1

@@ -7,6 +7,7 @@ hidden:
 ---

 Chat API <ovms_docs_rest_api_chat>
+Completions API <ovms_docs_rest_api_completion>
 demo <https://github.com/openvinotoolkit/model_server/tree/main/demos/continuous_batching/>
 LLM calculator <ovms_docs_llm_caclulator>
 ```

‎docs/features.md

+6

@@ -6,6 +6,7 @@ maxdepth: 1
 hidden:
 ---

+ovms_docs_llm_reference
 ovms_docs_dag
 ovms_docs_mediapipe
 ovms_docs_streaming_endpoints

@@ -22,6 +23,11 @@ ovms_docs_c_api
 ovms_docs_advanced
 ```

+## Efficient LLM Serving
+Serve LLMs enhanced with state-of-the-art optimization techniques for best performance and resource utilization on generative workloads.
+
+[Learn more](llm/reference.md)
+
 ## Python Code Execution
 Write Python code that will do your custom processing and serve it in the Model Server.
 Take advantage of a rich environment of Python modules in domains like data processing and data science to create flexible solutions without the need to write C++ code.

‎docs/home.md

+3 −2

@@ -37,8 +37,9 @@ The models used by the server need to be stored locally or hosted remotely by ob
 Start using OpenVINO Model Server with a fast-forward serving example from the [Quickstart guide](ovms_quickstart.md) or explore [Model Server features](features.md).

 ### Key features:
-- **[NEW]** [Python code execution](python_support/reference.md)
-- **[NEW]** [gRPC streaming](streaming_endpoints.md)
+- **[NEW]** [Efficient Text Generation - preview](llm/reference.md)
+- [Python code execution](python_support/reference.md)
+- [gRPC streaming](streaming_endpoints.md)
 - [MediaPipe graphs serving](mediapipe.md)
 - Model management - including [model versioning](model_version_policy.md) and [model updates in runtime](online_config_changes.md)
 - [Dynamic model inputs](shape_batch_size_and_layout.md)

‎docs/llm/quickstart.md

+140 (new file)
# Efficient LLM Serving - quickstart {#ovms_docs_llm_quickstart}

Let's deploy the [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) model and request generation.

1. Install Python dependencies for the conversion script:
```bash
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly"

pip3 install --pre "optimum-intel[nncf,openvino]"@git+https://github.com/huggingface/optimum-intel.git openvino-tokenizers
```

2. Run optimum-cli to download and quantize the model:
```bash
mkdir workspace && cd workspace

optimum-cli export openvino --disable-convert-tokenizer --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int8 TinyLlama-1.1B-Chat-v1.0

convert_tokenizer -o TinyLlama-1.1B-Chat-v1.0 --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

3. Create a `graph.pbtxt` file in the model directory:
```bash
echo '
input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: "LOOPBACK:0",
    back_edge: true
  }
  node_options: {
    [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
      models_path: "./"
    }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}
' > TinyLlama-1.1B-Chat-v1.0/graph.pbtxt
```

4. Create the server `config.json` file:
```bash
echo '
{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            "base_path": "TinyLlama-1.1B-Chat-v1.0"
        }
    ]
}
' > config.json
```
5. Deploy:

```bash
docker run -d --rm -p 8000:8000 -v $(pwd)/:/workspace:ro openvino/model_server --rest_port 8000 --config_path /workspace/config.json
```
Wait for the model to load. You can check the status with a simple command:
```bash
curl http://localhost:8000/v1/config
{
  "TinyLlama/TinyLlama-1.1B-Chat-v1.0" :
  {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}
```
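
If you prefer to script the readiness check, a minimal sketch along these lines should work (it assumes the `requests` package and the model name and port used above):
```python
import time

import requests

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # name defined in config.json above

# Poll the configuration endpoint until the model reports the AVAILABLE state.
while True:
    try:
        config = requests.get("http://localhost:8000/v1/config", timeout=5).json()
        statuses = config.get(MODEL_NAME, {}).get("model_version_status", [])
        if any(version.get("state") == "AVAILABLE" for version in statuses):
            print("Model is ready")
            break
    except (requests.RequestException, ValueError):
        pass  # the container may still be starting
    time.sleep(1)
```
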
6. Run generation:
```bash
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "max_tokens": 30,
    "stream": false,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is OpenVINO?"
      }
    ]
  }' | jq .
```
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "OpenVINO is a software development kit (SDK) for machine learning (ML) and deep learning (DL) applications. It is developed",
        "role": "assistant"
      }
    }
  ],
  "created": 1718401064,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "object": "chat.completion"
}
```
**Note:** If you want the response chunks streamed back as they are generated, change the `stream` parameter in the request to `true`.
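
The same request can also be issued, and streamed, with the OpenAI Python client; the sketch below is an illustration assuming `pip install openai` and the port and model name used in this quickstart:
```python
from openai import OpenAI

# base_url points at the /v3 endpoint used above; the api_key value is only required by the client library.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_tokens=30,
    stream=True,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is OpenVINO?"},
    ],
)

# Print the generated chunks as they arrive.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```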

## References:
- [Efficient LLM Serving - reference](./reference.md)
- [Chat Completions API](./model_server_rest_api_chat.md)
- [Completions API](./model_server_rest_api_completions.md)
- [Demo with Llama3 serving](./../demos/continuous_batching/)

‎docs/llm/reference.md

+145 (new file)
# Efficient LLM Serving {#ovms_docs_llm_reference}

**THIS IS A PREVIEW FEATURE**

## Overview

With the rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are being introduced to make the best use of the hardware and provide the best generation performance. OpenVINO implements these state-of-the-art methods in its [GenAI Library](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library), such as:
- Continuous Batching
- Paged Attention
- Dynamic Split Fuse
- *and more...*

It is now integrated into OpenVINO Model Server, providing an efficient way to run generative workloads.

Check out the [quickstart guide](quickstart.md) for a simple example that shows how to use this feature.

## LLM Calculator
As you can see in the quickstart above, a big part of the configuration resides in the `graph.pbtxt` file. That's because model server text generation servables are deployed as MediaPipe graphs with a dedicated LLM calculator that works with the latest [OpenVINO GenAI](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library) solutions. The calculator is designed to run in cycles and return chunks of responses to the client.

On the input, it expects an `HttpPayload` struct passed by the Model Server frontend:
```cpp
struct HttpPayload {
    std::string uri;
    std::vector<std::pair<std::string, std::string>> headers;
    std::string body;                // always
    rapidjson::Document* parsedJson; // pre-parsed body = null
};
```
The input JSON content should be compatible with the [chat completions](./model_server_rest_api_chat.md) or [completions](./model_server_rest_api_completions.md) API.

The input also includes a side packet with a reference to `LLM_NODE_RESOURCES`, which is a shared object representing an LLM engine. It loads the model, runs the generation cycles, and reports the generated results to the LLM calculator via a generation handler.

**Every node based on the LLM calculator MUST have exactly this specification of the side packet:**

`input_side_packet: "LLM_NODE_RESOURCES:llm"`

**If it is modified, the model server will fail to provide the graph with the model.**

On the output, the calculator creates a `std::string` with the JSON content, which is returned to the client as a single response or in chunks when streaming.

Let's have a look at the graph configuration from the quickstart:
```protobuf
input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: 'LOOPBACK:0',
    back_edge: true
  }
  node_options: {
    [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
      models_path: "./"
    }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}
```

The above node configuration should be used as a template since the user is not expected to change most of its content. Fields that can be safely changed are:
- `name`
- `input_stream: "HTTP_REQUEST_PAYLOAD:input"` - in case you want to change the input name
- `output_stream: "HTTP_RESPONSE_PAYLOAD:output"` - in case you want to change the output name
- `node_options`

Of these options, only `node_options` really requires user attention, as it specifies the LLM engine parameters. The rest can remain unchanged.

The calculator supports the following `node_options` for tuning the pipeline configuration:
- `required string models_path` - location of the model directory (can be relative);
- `optional uint64 max_num_batched_tokens` - max number of tokens processed in a single iteration [default = 256];
- `optional uint64 cache_size` - memory size in GB for storing KV cache [default = 8];
- `optional uint64 block_size` - number of tokens for which KV is stored in a single block (Paged Attention related) [default = 32];
- `optional uint64 max_num_seqs` - max number of sequences actively processed by the engine [default = 256];
- `optional bool dynamic_split_fuse` - use Dynamic Split Fuse token scheduling [default = true];
- `optional string device` - device to load models to. Supported values: "CPU" [default = "CPU"]
- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format as for regular [models configuration](./parameters.md#model-configuration-options) [default = ""]

The value of `cache_size` might have performance implications. It is used for storing LLM model KV cache data. Adjust it based on your environment capabilities, model size and expected level of concurrency.

## Models Directory

In the node configuration we set `models_path`, indicating the location of the directory with files loaded by the LLM engine. It loads the following files:

```
├── openvino_detokenizer.bin
├── openvino_detokenizer.xml
├── openvino_model.bin
├── openvino_model.xml
├── openvino_tokenizer.bin
├── openvino_tokenizer.xml
├── tokenizer_config.json
├── template.jinja
```

The main model as well as the tokenizer and detokenizer are loaded from the `.xml` and `.bin` files, and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read the information required for chat template processing. The chat template is used only on the `/chat/completions` endpoint. The template is not applied for calls to `/completions`, so it doesn't have to exist if you plan to work only with `/completions`.

### Chat template

Loading the chat template proceeds as follows:
1. If `template.jinja` is present, try to load the template from it.
2. If there is no `template.jinja` and `tokenizer_config.json` exists, try to read the template from its `chat_template` field. If it's not present, use the default template.
3. If `tokenizer_config.json` exists, try to read the `eos_token` and `bos_token` fields. If they are not present, both values are set to an empty string.

**Note**: If both the `template.jinja` file and the `chat_template` field from `tokenizer_config.json` are successfully loaded, `template.jinja` takes precedence over `tokenizer_config.json`.

If there are errors in loading or reading the files or fields (they exist but are wrong), no template is loaded and the servable will not respond to `/chat/completions` calls.

If no chat template has been specified, the default template is applied. The template looks as follows:
```
"{% if messages|length > 1 %} {{ raise_exception('This servable accepts only single message requests') }}{% endif %}{{ messages[0]['content'] }}"
```

When the default template is loaded, the servable accepts `/chat/completions` calls only when the `messages` list contains a single element (otherwise it returns an error) and treats the `content` value of that single message as the input prompt for the model.
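
For illustration, with the default template in effect a valid request carries exactly one message. A minimal sketch with the `requests` package; the endpoint and model name are placeholders borrowed from the quickstart:
```python
import requests

# Placeholder endpoint and model name; adjust to your deployment.
payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "max_tokens": 30,
    # With the default template only a single message is accepted;
    # its "content" is passed to the model as the raw prompt.
    "messages": [{"role": "user", "content": "What is OpenVINO?"}],
}
response = requests.post("http://localhost:8000/v3/chat/completions", json=payload, timeout=60)
print(response.json()["choices"][0]["message"]["content"])
```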

## Limitations

As this feature is in preview, it has a set of limitations:

- Limited support for [API parameters](./model_server_rest_api_chat.md#request),
- Only one node with the LLM calculator can be deployed at once,
- No metrics related to text generation yet - they are planned to be added later,
- Improvements in stability and recovery mechanisms are also expected.

## References:
- [Chat Completions API](./model_server_rest_api_chat.md)
- [Completions API](./model_server_rest_api_completions.md)
- [Demo](./../demos/continuous_batching/)

‎docs/llm_calculator.md

-75
This file was deleted.

‎docs/model_server_rest_api_chat.md

+1 −2

@@ -1,7 +1,6 @@
 # OpenAI API {#ovms_docs_rest_api_chat}

-
-
+**Note**: This endpoint works only with [LLM graphs](./llm/reference.md).

 ## API Reference
 OpenVINO Model Server includes now the `chat/completions` endpoint using OpenAI API.

‎docs/model_server_rest_api_completions.md

+1 −2

@@ -1,7 +1,6 @@
 # OpenAI API {#ovms_docs_rest_api_completion}

-
-
+**Note**: This endpoint works only with [LLM graphs](./llm/reference.md).

 ## API Reference
 OpenVINO Model Server includes now the `completions` endpoint using OpenAI API.
