# Efficient LLM Serving {#ovms_docs_llm_reference}

**THIS IS A PREVIEW FEATURE**

## Overview

With the rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are being introduced to make the best use of the hardware and provide the best generation performance. OpenVINO implements these state-of-the-art methods in its [GenAI Library](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library), such as:
 - Continuous Batching
 - Paged Attention
 - Dynamic Split Fuse
 - *and more...*

It is now integrated into OpenVINO Model Server, providing an efficient way to run generative workloads.

Check out the [quickstart guide](quickstart.md) for a simple example that shows how to use this feature.

## LLM Calculator
As you can see in the quickstart above, a big part of the configuration resides in the `graph.pbtxt` file. That is because model server text generation servables are deployed as MediaPipe graphs with a dedicated LLM calculator that works with the latest [OpenVINO GenAI](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library) solutions. The calculator is designed to run in cycles and return chunks of responses to the client.

On the input it expects an `HttpPayload` struct passed by the Model Server frontend:
```cpp
struct HttpPayload {
    std::string uri;
    std::vector<std::pair<std::string, std::string>> headers;
    std::string body;                 // always set
    rapidjson::Document* parsedJson;  // pre-parsed body (may be null)
};
```
The input JSON content should be compatible with the [chat completions](./model_server_rest_api_chat.md) or [completions](./model_server_rest_api_completions.md) API.
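
For illustration, a chat completion request could be sent to the server as in the sketch below. This is an assumption-laden example: it presumes the server listens on `localhost:8000`, exposes the chat completions endpoint at `/v3/chat/completions` (check the linked API document for the exact path and parameters), and serves a model named `meta-llama/Llama-2-7b-chat-hf`; adjust these values to your deployment.

```python
# Hypothetical client call; host, port, endpoint path and model name are assumptions.
import requests

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",  # name of the deployed servable
    "messages": [
        {"role": "user", "content": "What are the benefits of continuous batching?"}
    ],
    "max_tokens": 128,
}

# The JSON body of this request is what the calculator receives in HttpPayload.
response = requests.post(
    "http://localhost:8000/v3/chat/completions",
    json=payload,
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```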

The input also includes a side packet with a reference to `LLM_NODE_RESOURCES`, which is a shared object representing an LLM engine. It loads the model, runs the generation cycles and reports the generated results to the LLM calculator via a generation handler.

**Every node based on the LLM Calculator MUST have exactly this specification of the side packet:**

`input_side_packet: "LLM_NODE_RESOURCES:llm"`

**If it is modified, the model server will fail to serve the graph with the model.**

On the output, the calculator creates a `std::string` with the JSON content, which is returned to the client as a single response or in chunks when streaming.
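
When the client requests streaming with `"stream": true`, the chunks are delivered as server-sent events in the format described in the linked API documents. Below is a minimal sketch of consuming such a stream, under the same host, endpoint and model name assumptions as in the earlier example:

```python
# Hypothetical streaming client; endpoint details and the SSE chunk format follow
# the OpenAI-compatible API described in the chat completions document.
import json

import requests

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Write a haiku about batching."}],
    "stream": True,
}

with requests.post(
    "http://localhost:8000/v3/chat/completions", json=payload, stream=True, timeout=60
) as response:
    for line in response.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip empty keep-alive lines
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":  # end-of-stream marker
            break
        choices = json.loads(chunk).get("choices") or []
        if choices:
            print(choices[0]["delta"].get("content", ""), end="", flush=True)
print()
```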

Let's have a look at the graph configuration from the quickstart:
```protobuf
input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: 'LOOPBACK:0',
    back_edge: true
  }
  node_options: {
    [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
      models_path: "./"
    }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}
```

The node configuration above should be used as a template, since the user is not expected to change most of its content. The fields that can be safely changed are:
 - `name`
 - `input_stream: "HTTP_REQUEST_PAYLOAD:input"` - in case you want to change the input name
 - `output_stream: "HTTP_RESPONSE_PAYLOAD:output"` - in case you want to change the output name
 - `node_options`

Of these options, only `node_options` really requires user attention, as it specifies the LLM engine parameters. The rest can remain unchanged.

The calculator supports the following `node_options` for tuning the pipeline configuration:
- `required string models_path` - location of the model directory (can be relative);
- `optional uint64 max_num_batched_tokens` - max number of tokens processed in a single iteration [default = 256];
- `optional uint64 cache_size` - memory size in GB for storing the KV cache [default = 8];
- `optional uint64 block_size` - number of tokens for which the KV cache is stored in a single block (Paged Attention related) [default = 32];
- `optional uint64 max_num_seqs` - max number of sequences actively processed by the engine [default = 256];
- `optional bool dynamic_split_fuse` - use Dynamic Split Fuse token scheduling [default = true];
- `optional string device` - device to load the models to. Supported values: "CPU" [default = "CPU"];
- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format as for regular [model configuration](./parameters.md#model-configuration-options) [default = ""]

The value of `cache_size` might have performance implications. It is used for storing LLM model KV cache data. Adjust it based on your environment capabilities, model size and expected level of concurrency.
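
One rough way to reason about this value is to estimate how many tokens of KV cache fit into the configured budget. The numbers below are only an illustration resembling a 7B-class model with standard multi-head attention; the real footprint depends on the model architecture (e.g. GQA), cache precision and engine overhead.

```python
# Back-of-the-envelope KV cache sizing estimate (illustrative numbers,
# not a formula used by the server).
num_layers = 32        # decoder layers
num_kv_heads = 32      # key/value heads (fewer when GQA/MQA is used)
head_dim = 128         # dimension of a single head
bytes_per_value = 2    # fp16/bf16 cache element

# Keys and values are both cached, hence the factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

cache_size_gb = 8      # the cache_size value from node_options
max_cached_tokens = cache_size_gb * 1024**3 // bytes_per_token
print(f"{bytes_per_token // 1024} KiB per token, "
      f"about {max_cached_tokens} tokens fit in {cache_size_gb} GB of cache")
```

Such an estimate helps to balance `cache_size` against the expected number of concurrent sequences and their context lengths.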

## Models Directory

In the node configuration we set `models_path`, which indicates the location of the directory with the files loaded by the LLM engine. The engine expects the following files:

```
├── openvino_detokenizer.bin
├── openvino_detokenizer.xml
├── openvino_model.bin
├── openvino_model.xml
├── openvino_tokenizer.bin
├── openvino_tokenizer.xml
├── tokenizer_config.json
├── template.jinja
```

The main model as well as the tokenizer and detokenizer are loaded from the `.xml` and `.bin` files, and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read the information required for chat template processing. The chat template is used only on the `/chat/completions` endpoint. The template is not applied for calls to `/completions`, so it does not have to exist if you plan to work only with `/completions`.
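
One possible way to produce this directory layout is sketched below, using `optimum-intel` to export the model and `openvino_tokenizers` to convert the tokenizer and detokenizer. This is an assumption-laden example: it presumes those packages are installed and the chosen Hugging Face model ID is accessible, and it does not create `template.jinja` (which is optional and can be added manually).

```python
# Sketch: export an LLM and its tokenizer/detokenizer into the models_path layout.
# The model ID is only an example; adjust it and the output directory as needed.
from pathlib import Path

import openvino as ov
from openvino_tokenizers import convert_tokenizer
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
out_dir = Path("./model")
out_dir.mkdir(parents=True, exist_ok=True)

# openvino_model.xml / openvino_model.bin
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained(out_dir)

# tokenizer_config.json (often carries the chat_template field)
hf_tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_tokenizer.save_pretrained(out_dir)

# openvino_tokenizer.xml/.bin and openvino_detokenizer.xml/.bin
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)
ov.save_model(ov_tokenizer, str(out_dir / "openvino_tokenizer.xml"))
ov.save_model(ov_detokenizer, str(out_dir / "openvino_detokenizer.xml"))
```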

### Chat template

Loading the chat template proceeds as follows:
1. If `template.jinja` is present, try to load the template from it.
2. If there is no `template.jinja` and `tokenizer_config.json` exists, try to read the template from its `chat_template` field. If the field is not present, use the default template.
3. If `tokenizer_config.json` exists, try to read the `eos_token` and `bos_token` fields. If they are not present, both values are set to an empty string.

**Note**: If both the `template.jinja` file and the `chat_template` field from `tokenizer_config.json` are successfully loaded, `template.jinja` takes precedence over `tokenizer_config.json`.

If there are errors in loading or reading the files or fields (they exist but are invalid), no template is loaded and the servable will not respond to `/chat/completions` calls.
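
The precedence described above can be summarized with the following illustrative sketch (not the actual server implementation; `DEFAULT_TEMPLATE` is a simplified stand-in for the default template quoted below):

```python
# Illustrative sketch of the template selection logic described above.
import json
from pathlib import Path

DEFAULT_TEMPLATE = "{{ messages[0]['content'] }}"  # simplified stand-in

def resolve_chat_template(models_path: str) -> str:
    models_dir = Path(models_path)
    template_file = models_dir / "template.jinja"
    if template_file.exists():
        # template.jinja takes precedence over tokenizer_config.json
        return template_file.read_text()
    config_file = models_dir / "tokenizer_config.json"
    if config_file.exists():
        config = json.loads(config_file.read_text())
        return config.get("chat_template", DEFAULT_TEMPLATE)
    return DEFAULT_TEMPLATE
```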

If no chat template has been specified, the default template is applied. It looks as follows:
```
"{% if messages|length > 1 %} {{ raise_exception('This servable accepts only single message requests') }}{% endif %}{{ messages[0]['content'] }}"
```

When the default template is loaded, the servable accepts `/chat/completions` calls only when the `messages` list contains a single element (otherwise it returns an error) and treats the `content` value of that single message as the input prompt for the model.
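
To illustrate this behavior, the default template can be rendered locally with the `jinja2` package (an illustration only; the server performs its own template processing internally):

```python
# Rendering the default template with jinja2 to show the single-message behavior.
from jinja2 import Environment

def raise_exception(message):
    raise ValueError(message)

env = Environment()
env.globals["raise_exception"] = raise_exception

default_template = env.from_string(
    "{% if messages|length > 1 %} "
    "{{ raise_exception('This servable accepts only single message requests') }}"
    "{% endif %}{{ messages[0]['content'] }}"
)

print(default_template.render(messages=[{"role": "user", "content": "Hello"}]))  # -> Hello
# Rendering with more than one message raises ValueError via raise_exception.
```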

## Limitations

As it is in preview, this feature has a set of limitations:

- Limited support for [API parameters](./model_server_rest_api_chat.md#request),
- Only one node with the LLM calculator can be deployed at a time,
- Metrics related to text generation are not available yet - they are planned to be added later,
- Improvements in stability and recovery mechanisms are also expected.

## References:
- [Chat Completions API](./model_server_rest_api_chat.md)
- [Completions API](./model_server_rest_api_completions.md)
- [Demo](./../demos/continuous_batching/)