# Efficient LLM Serving {#ovms_docs_llm_reference}

**THIS IS A PREVIEW FEATURE**

## Overview

With the rapid development of generative AI, new techniques and algorithms for performance optimization and better resource utilization are being introduced to make the best use of the hardware and provide the best generation performance. OpenVINO implements these state-of-the-art methods in its [GenAI Library](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library), such as:
 - Continuous Batching
 - Paged Attention
 - Dynamic Split Fuse
 - *and more...*

It is now integrated into OpenVINO Model Server, providing an efficient way to run generative workloads.

Check out the [quickstart guide](quickstart.md) for a simple example that shows how to use this feature.

## LLM Calculator
As you can see in the quickstart above, a big part of the configuration resides in the `graph.pbtxt` file. That is because model server text generation servables are deployed as MediaPipe graphs with a dedicated LLM calculator that works with the latest [OpenVINO GenAI](https://github.com/ilya-lavrenov/openvino.genai/tree/ct-beam-search/text_generation/causal_lm/cpp/continuous_batching/library) solutions. The calculator is designed to run in cycles and return chunks of responses to the client.

On the input it expects an `HttpPayload` struct passed by the Model Server frontend:
```cpp
struct HttpPayload {
    std::string uri;
    std::vector<std::pair<std::string, std::string>> headers;
    std::string body;                 // always set
    rapidjson::Document* parsedJson;  // pre-parsed body (may be null)
};
```
The input JSON content should be compatible with the [chat completions](./model_server_rest_api_chat.md) or [completions](./model_server_rest_api_completions.md) API.
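
For illustration, a chat completion request could be sent to the server as in the sketch below. This is an assumption-laden example: it presumes the server listens on `localhost:8000`, exposes the chat completions endpoint at `/v3/chat/completions` (check the linked API document for the exact path and parameters), and serves a model named `meta-llama/Llama-2-7b-chat-hf`; adjust these values to your deployment.

```python
# Hypothetical client call; host, port, endpoint path and model name are assumptions.
import requests

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",  # name of the deployed servable
    "messages": [
        {"role": "user", "content": "What are the benefits of continuous batching?"}
    ],
    "max_tokens": 128,
}

# The JSON body of this request is what the calculator receives in HttpPayload.
response = requests.post(
    "http://localhost:8000/v3/chat/completions",
    json=payload,
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```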

The input also includes a side packet with a reference to `LLM_NODE_RESOURCES`, which is a shared object representing an LLM engine. It loads the model, runs the generation cycles and reports the generated results to the LLM calculator via a generation handler.

**Every node based on the LLM Calculator MUST have exactly this specification of the side packet:**

`input_side_packet: "LLM_NODE_RESOURCES:llm"`

**If it is modified, the model server will fail to serve the graph with the model.**

On the output, the calculator creates a `std::string` with the JSON content, which is returned to the client as a single response or in chunks when streaming.
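
When the client requests streaming with `"stream": true`, the chunks are delivered as server-sent events in the format described in the linked API documents. Below is a minimal sketch of consuming such a stream, under the same host, endpoint and model name assumptions as in the earlier example:

```python
# Hypothetical streaming client; endpoint details and the SSE chunk format follow
# the OpenAI-compatible API described in the chat completions document.
import json

import requests

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Write a haiku about batching."}],
    "stream": True,
}

with requests.post(
    "http://localhost:8000/v3/chat/completions", json=payload, stream=True, timeout=60
) as response:
    for line in response.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip empty keep-alive lines
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":  # end-of-stream marker
            break
        choices = json.loads(chunk).get("choices") or []
        if choices:
            print(choices[0]["delta"].get("content", ""), end="", flush=True)
print()
```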

Let's have a look at the graph configuration from the quickstart:
```protobuf
input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: 'LOOPBACK:0',
    back_edge: true
  }
  node_options: {
    [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
      models_path: "./"
    }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}
```

The node configuration above should be used as a template, since the user is not expected to change most of its content. The fields that can be safely changed are:
 - `name`
 - `input_stream: "HTTP_REQUEST_PAYLOAD:input"` - in case you want to change the input name
 - `output_stream: "HTTP_RESPONSE_PAYLOAD:output"` - in case you want to change the output name
 - `node_options`

Of these options, only `node_options` really requires user attention, as it specifies the LLM engine parameters. The rest can remain unchanged.

The calculator supports the following `node_options` for tuning the pipeline configuration:
- `required string models_path` - location of the model directory (can be relative);
- `optional uint64 max_num_batched_tokens` - max number of tokens processed in a single iteration [default = 256];
- `optional uint64 cache_size` - memory size in GB for storing the KV cache [default = 8];
- `optional uint64 block_size` - number of tokens for which the KV cache is stored in a single block (Paged Attention related) [default = 32];
- `optional uint64 max_num_seqs` - max number of sequences actively processed by the engine [default = 256];
- `optional bool dynamic_split_fuse` - use Dynamic Split Fuse token scheduling [default = true];
- `optional string device` - device to load the models to. Supported values: "CPU" [default = "CPU"];
- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format as for regular [model configuration](./parameters.md#model-configuration-options) [default = ""]

The value of `cache_size` might have performance implications. It is used for storing LLM model KV cache data. Adjust it based on your environment capabilities, model size and expected level of concurrency.
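
One rough way to reason about this value is to estimate how many tokens of KV cache fit into the configured budget. The numbers below are only an illustration resembling a 7B-class model with standard multi-head attention; the real footprint depends on the model architecture (e.g. GQA), cache precision and engine overhead.

```python
# Back-of-the-envelope KV cache sizing estimate (illustrative numbers,
# not a formula used by the server).
num_layers = 32        # decoder layers
num_kv_heads = 32      # key/value heads (fewer when GQA/MQA is used)
head_dim = 128         # dimension of a single head
bytes_per_value = 2    # fp16/bf16 cache element

# Keys and values are both cached, hence the factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

cache_size_gb = 8      # the cache_size value from node_options
max_cached_tokens = cache_size_gb * 1024**3 // bytes_per_token
print(f"{bytes_per_token // 1024} KiB per token, "
      f"about {max_cached_tokens} tokens fit in {cache_size_gb} GB of cache")
```

Such an estimate helps to balance `cache_size` against the expected number of concurrent sequences and their context lengths.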

## Models Directory

In the node configuration we set `models_path`, which indicates the location of the directory with the files loaded by the LLM engine. The engine expects the following files:

```
├── openvino_detokenizer.bin
├── openvino_detokenizer.xml
├── openvino_model.bin
├── openvino_model.xml
├── openvino_tokenizer.bin
├── openvino_tokenizer.xml
├── tokenizer_config.json
├── template.jinja
```

The main model as well as the tokenizer and detokenizer are loaded from the `.xml` and `.bin` files, and all of them are required. `tokenizer_config.json` and `template.jinja` are loaded to read the information required for chat template processing. The chat template is used only on the `/chat/completions` endpoint. The template is not applied for calls to `/completions`, so it does not have to exist if you plan to work only with `/completions`.
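
One possible way to produce this directory layout is sketched below, using `optimum-intel` to export the model and `openvino_tokenizers` to convert the tokenizer and detokenizer. This is an assumption-laden example: it presumes those packages are installed and the chosen Hugging Face model ID is accessible, and it does not create `template.jinja` (which is optional and can be added manually).

```python
# Sketch: export an LLM and its tokenizer/detokenizer into the models_path layout.
# The model ID is only an example; adjust it and the output directory as needed.
from pathlib import Path

import openvino as ov
from openvino_tokenizers import convert_tokenizer
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
out_dir = Path("./model")
out_dir.mkdir(parents=True, exist_ok=True)

# openvino_model.xml / openvino_model.bin
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained(out_dir)

# tokenizer_config.json (often carries the chat_template field)
hf_tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_tokenizer.save_pretrained(out_dir)

# openvino_tokenizer.xml/.bin and openvino_detokenizer.xml/.bin
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)
ov.save_model(ov_tokenizer, str(out_dir / "openvino_tokenizer.xml"))
ov.save_model(ov_detokenizer, str(out_dir / "openvino_detokenizer.xml"))
```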

### Chat template

Loading the chat template proceeds as follows:
1. If `template.jinja` is present, try to load the template from it.
2. If there is no `template.jinja` and `tokenizer_config.json` exists, try to read the template from its `chat_template` field. If the field is not present, use the default template.
3. If `tokenizer_config.json` exists, try to read the `eos_token` and `bos_token` fields. If they are not present, both values are set to an empty string.

**Note**: If both the `template.jinja` file and the `chat_template` field from `tokenizer_config.json` are successfully loaded, `template.jinja` takes precedence over `tokenizer_config.json`.

If there are errors in loading or reading the files or fields (they exist but are invalid), no template is loaded and the servable will not respond to `/chat/completions` calls.
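
The precedence described above can be summarized with the following illustrative sketch (not the actual server implementation; `DEFAULT_TEMPLATE` is a simplified stand-in for the default template quoted below):

```python
# Illustrative sketch of the template selection logic described above.
import json
from pathlib import Path

DEFAULT_TEMPLATE = "{{ messages[0]['content'] }}"  # simplified stand-in

def resolve_chat_template(models_path: str) -> str:
    models_dir = Path(models_path)
    template_file = models_dir / "template.jinja"
    if template_file.exists():
        # template.jinja takes precedence over tokenizer_config.json
        return template_file.read_text()
    config_file = models_dir / "tokenizer_config.json"
    if config_file.exists():
        config = json.loads(config_file.read_text())
        return config.get("chat_template", DEFAULT_TEMPLATE)
    return DEFAULT_TEMPLATE
```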

If no chat template has been specified, the default template is applied. It looks as follows:
```
"{% if messages|length > 1 %} {{ raise_exception('This servable accepts only single message requests') }}{% endif %}{{ messages[0]['content'] }}"
```

When the default template is loaded, the servable accepts `/chat/completions` calls only when the `messages` list contains a single element (otherwise it returns an error) and treats the `content` value of that single message as the input prompt for the model.
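
To illustrate this behavior, the default template can be rendered locally with the `jinja2` package (an illustration only; the server performs its own template processing internally):

```python
# Rendering the default template with jinja2 to show the single-message behavior.
from jinja2 import Environment

def raise_exception(message):
    raise ValueError(message)

env = Environment()
env.globals["raise_exception"] = raise_exception

default_template = env.from_string(
    "{% if messages|length > 1 %} "
    "{{ raise_exception('This servable accepts only single message requests') }}"
    "{% endif %}{{ messages[0]['content'] }}"
)

print(default_template.render(messages=[{"role": "user", "content": "Hello"}]))  # -> Hello
# Rendering with more than one message raises ValueError via raise_exception.
```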

## Limitations

As it is in preview, this feature has a set of limitations:

- Limited support for [API parameters](./model_server_rest_api_chat.md#request),
- Only one node with the LLM calculator can be deployed at a time,
- Metrics related to text generation are not available yet - they are planned to be added later,
- Improvements in stability and recovery mechanisms are also expected.

## References:
- [Chat Completions API](./model_server_rest_api_chat.md)
- [Completions API](./model_server_rest_api_completions.md)
- [Demo](./../demos/continuous_batching/)