README.md (+3 −3)

@@ -13,17 +13,17 @@ Model Server hosts models and makes them accessible to software components over
 OpenVINO™ Model Server (OVMS) is a high-performance system for serving models. Implemented in C++ for scalability and optimized for deployment on Intel architectures. It uses the same API as [TensorFlow Serving](https://github.com/tensorflow/serving) and [KServe](https://github.com/kserve/kserve) while applying OpenVINO for inference execution. Inference service is provided via gRPC or REST API, making deploying new algorithms and AI experiments easy.
-In addition, there are included endpoints for generative use cases compatible with [OpenAI API and Cohere API](./clients_genai.md).
+In addition, there are included endpoints for generative use cases compatible with [OpenAI API and Cohere API](./docs/clients_genai.md).

 The models used by the server need to be stored locally or hosted remotely by object storage services. For more details, refer to [Preparing Model Repository](docs/models_repository.md) documentation. Model server works inside [Docker containers](docs/deploying_server.md#deploying-model-server-in-docker-container), on [Bare Metal](docs/deploying_server.md#deploying-model-server-on-baremetal-without-container), and in [Kubernetes environment](docs/deploying_server.md#deploying-model-server-in-kubernetes).
-Start using OpenVINO Model Server with a fast-forward serving example from the [QuickStart guide](docs/ovms_quickstart.md) or [LLM QuickStart guide](./llm/quickstart.md).
+Start using OpenVINO Model Server with a fast-forward serving example from the [QuickStart guide](docs/ovms_quickstart.md) or [LLM QuickStart guide](./docs/llm/quickstart.md).

 Read [release notes](https://github.com/openvinotoolkit/model_server/releases) to find out what’s new.

 ### Key features:
-- **[NEW]** Native Windows support. Check updated [deployment guide](./deploying_server.md)
+- **[NEW]** Native Windows support. Check updated [deployment guide](./docs/deploying_server.md)
 - **[NEW]** [Text Embeddings compatible with OpenAI API](demos/embeddings/README.md)
 - **[NEW]** [Reranking compatible with Cohere API](demos/rerank/README.md)
 - **[NEW]** [Efficient Text Generation via OpenAI API](demos/continuous_batching/README.md)
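
The generative endpoints listed above accept OpenAI-style requests. As a minimal sketch, assuming the server was started with `--rest_port 8000` and serves an LLM under the placeholder name `my-llm`, and that the OpenAI-compatible REST prefix is `/v3` as in recent releases, a chat completion call could look like:

```bash
# Hypothetical request: model name, port and path prefix depend on your deployment.
curl -s http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-llm",
        "messages": [{"role": "user", "content": "What is OpenVINO Model Server?"}],
        "max_tokens": 100
      }'
```

The embeddings and rerank endpoints follow the same pattern with their respective request schemas.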

demos/continuous_batching/speculative_decoding/README.md (+3 −3)

@@ -1,6 +1,6 @@
 # How to serve LLM Models in Speculative Decoding Pipeline{#ovms_demos_continuous_batching_speculative_decoding}

-Following [OpenVINO GenAI docs](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide.html#efficient-text-generation-via-speculative-decoding):
+Following [OpenVINO GenAI docs](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai.html#efficient-text-generation-via-speculative-decoding):
 > Speculative decoding (or assisted-generation) enables faster token generation when an additional smaller draft model is used alongside the main model. This reduces the number of infer requests to the main model, increasing performance.
 >
 > The draft model predicts the next K tokens one by one in an autoregressive manner. The main model validates these predictions and corrects them if necessary - in case of a discrepancy, the main model prediction is used. Then, the draft model acquires this token and runs prediction of the next K tokens, thus repeating the cycle.

@@ -13,7 +13,7 @@ This demo shows how to use speculative decoding in the model serving scenario, b
 **Model preparation**: Python 3.9 or higher with pip and HuggingFace account
-**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)
+**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../../docs/deploying_server_baremetal.md)

 ## Model considerations

@@ -103,7 +103,7 @@ Assuming you have unpacked model server package, make sure to:
 - **On Windows**: run `setupvars` script
 - **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables
-as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server.
+as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server.

 Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `config.json`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server.
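
To make the shell setup above concrete, here is a minimal sketch for a Linux baremetal start; the package layout (`ovms/bin`, `ovms/lib`), the port and the config path are assumptions, while `--rest_port` and `--config_path` are standard server options:

```bash
# Assumed layout of the unpacked OVMS binary package; adjust paths to your environment.
export PATH=$PWD/ovms/bin:$PATH
export LD_LIBRARY_PATH=$PWD/ovms/lib:$LD_LIBRARY_PATH   # on Windows, run the setupvars script instead

# Start the server with the config.json prepared earlier in the demo (path is illustrative).
ovms --rest_port 8000 --config_path ./models/config.json
```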

docs/accelerators.md (+9 −9)

@@ -4,9 +4,9 @@
 Docker engine installed (on Linux and WSL), or ovms binary package installed as described in the [guide](./deploying_server_baremetal.md) (on Linux or Windows).
-Supported HW is documented in [OpenVINO system requirements](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html)
+Supported HW is documented in [OpenVINO system requirements](https://docs.openvino.ai/2025/about-openvino/release-notes-openvino/system-requirements.html)
-Before staring the model server as a binary package, make sure there are installed GPU or/and NPU required drivers like described in [https://docs.openvino.ai/2024/get-started/configurations.html](https://docs.openvino.ai/2024/get-started/configurations.html)
+Before starting the model server as a binary package, make sure the required GPU and/or NPU drivers are installed, as described in [https://docs.openvino.ai/2025/get-started/install-openvino/configurations.html](https://docs.openvino.ai/2025/get-started/install-openvino/configurations.html)

 Additional considerations when deploying with docker container:
 - make sure to use the image version including runtime drivers. The public image has a suffix -gpu like `openvino/model_server:latest-gpu`.

@@ -27,7 +27,7 @@ rm model/1/model.tar.gz
 ## Starting Model Server with Intel GPU
-The [GPU plugin](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) uses the [oneDNN](https://github.com/oneapi-src/oneDNN) and [OpenCL](https://github.com/KhronosGroup/OpenCL-SDK) to infer deep neural networks. For inference execution, it employs Intel® Processor Graphics including
+The [GPU plugin](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) uses [oneDNN](https://github.com/oneapi-src/oneDNN) and [OpenCL](https://github.com/KhronosGroup/OpenCL-SDK) to infer deep neural networks. For inference execution, it employs Intel® Processor Graphics including
 Intel® Arc™ GPU Series, Intel® UHD Graphics, Intel® HD Graphics, Intel® Iris® Graphics, Intel® Iris® Xe Graphics, and Intel® Iris® Xe MAX graphics and Intel® Data Center GPU.

-Starting the server with GPU acceleration requires installation of runtime drivers and ocl-icd-libopencl1 package like described on [configuration guide](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html)
+Starting the server with GPU acceleration requires installation of runtime drivers and the ocl-icd-libopencl1 package, as described in the [configuration guide](https://docs.openvino.ai/2025/get-started/install-openvino/configurations/configurations-intel-gpu.html)

 Start the model server with GPU acceleration using a command:

-OpenVINO Model Server supports using [NPU device](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.html)
+OpenVINO Model Server supports using [NPU device](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.html)

 ### Container
 Example command to run container with NPU:

@@ -82,13 +82,13 @@ Start the model server with NPU accelerations using a command:
-Check more info about the [NPU driver configuration](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-npu.html).
+Check more info about the [NPU driver configuration](https://docs.openvino.ai/2025/get-started/install-openvino/configurations/configurations-intel-npu.html).

 > **NOTE**: NPU device executes models with static input and output shapes only. If your model has dynamic shape, it can be reset to static with parameters `--batch_size` or `--shape`.

 ## Using Heterogeneous Plugin
-The [HETERO plugin](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html) makes it possible to distribute inference load of one model
+The [HETERO plugin](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html) makes it possible to distribute inference load of one model
 among several computing devices. That way different parts of the deep learning network can be executed by devices best suited to their type of calculations.
 OpenVINO automatically divides the network to optimize the process.

-[Auto Device](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/auto-device-selection.html) (or AUTO in short) is a new special “virtual” or “proxy” device in the OpenVINO toolkit, it doesn’t bind to a specific type of HW device.
+[Auto Device](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/auto-device-selection.html) (or AUTO in short) is a special “virtual” or “proxy” device in the OpenVINO toolkit; it doesn’t bind to a specific type of HW device.
 AUTO removes the need for the application to implement its own logic for HW device selection and for deducing the best optimization settings on that device.
 AUTO always chooses the best device; if compiling the model fails on that device, AUTO tries the next best device until one of them succeeds.

-[Auto Batching](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/automatic-batching.html) (or BATCH in short) is a new special “virtual” device
+[Auto Batching](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/automatic-batching.html) (or BATCH in short) is a special “virtual” device
 which explicitly defines the auto batching.

 It performs automatic batching on-the-fly to improve device utilization by grouping inference requests together, without programming effort from the user.

-Leverage the OpenVINO [model caching](https://docs.openvino.ai/2024/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.html) feature to speed up subsequent model loading on a target device.
+Leverage the OpenVINO [model caching](https://docs.openvino.ai/2025/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.html) feature to speed up subsequent model loading on a target device.
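
As a sketch of the GPU deployment this file describes (model name, repository path and port are placeholders), a container is typically started with the render device passed through and the GPU selected as the target device:

```bash
# Illustrative only: the -gpu image variant bundles the runtime drivers mentioned above.
docker run -d --rm \
  --device=/dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) \
  -v $PWD/models:/models -p 9000:9000 \
  openvino/model_server:latest-gpu \
  --model_name resnet --model_path /models/resnet \
  --port 9000 --target_device GPU
```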

docs/build_from_source.md (+1 −1)

@@ -143,7 +143,7 @@ make release_image MEDIAPIPE_DISABLE=1 PYTHON_DISABLE=1
 ### `GPU`
-When set to `1`, OpenVINO&trade Model Server will be built with the drivers required by [GPU plugin](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) support. Default value: `0`.
+When set to `1`, OpenVINO&trade; Model Server will be built with the drivers required by [GPU plugin](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) support. Default value: `0`.

docs/deploying_server_baremetal.md (+1 −1)

@@ -164,7 +164,7 @@ Learn more about model server [starting parameters](parameters.md).
 > **NOTE**:
 > When serving models on [AI accelerators](accelerators.md), some additional steps may be required to install device drivers and dependencies.
-> Learn more in the [Additional Configurations for Hardware](https://docs.openvino.ai/2024/get-started/configurations.html) documentation.
+> Learn more in the [Additional Configurations for Hardware](https://docs.openvino.ai/2025/get-started/install-openvino/configurations.html) documentation.

 - Intel® Core™ processor (6-13th gen.) or Intel® Xeon® processor (1st to 4th gen.)
 - Linux, macOS or Windows via [WSL](https://docs.microsoft.com/en-us/windows/wsl/)
-- (optional) AI accelerators [supported by OpenVINO](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Accelerators are tested only on bare-metal Linux hosts.
+- (optional) AI accelerators [supported by OpenVINO](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes.html). Accelerators are tested only on bare-metal Linux hosts.

 ### Launch Model Server Container

@@ -85,4 +85,4 @@ make release_image GPU=1
 It will create an image called `openvino/model_server:latest`.
 > **Note:** This operation might take 40min or more depending on your build host.
 > **Note:** The `GPU` parameter in the image build command is needed to include dependencies for the GPU device.
-> **Note:** The public image from the last release might be not compatible with models exported using the the latest export script. Check the [demo version from the last release](https://github.com/openvinotoolkit/model_server/tree/releases/2024/4/demos/continuous_batching) to use the public docker image.
+> **Note:** The public image from the last release might not be compatible with models exported using the latest export script. We recommend using the export script and Docker image from the same release to avoid compatibility issues.
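
For reference, launching the container mentioned under "Launch Model Server Container" usually amounts to mounting a model repository and pointing the server at one model; a minimal sketch with placeholder names and ports:

```bash
# Placeholder model name/path; a locally built image can replace the public one.
docker run -d --rm -v $PWD/models:/models -p 9000:9000 -p 8000:8000 \
  openvino/model_server:latest \
  --model_name resnet --model_path /models/resnet \
  --port 9000 --rest_port 8000
```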

docs/dynamic_shape_dynamic_model.md (+1 −1)

@@ -8,7 +8,7 @@ Enable dynamic shape by setting the `shape` parameter to range or undefined:
 - `--shape "(1,3,200:500,200:500)"` when model is supposed to support height and width values in a range of 200-500. Note that any dimension can support range of values, height and width are only examples here.

 > Note that some models do not support dynamic dimensions. Learn more about supported model graph layers including all limitations
-on [Shape Inference Document](https://docs.openvino.ai/2024/openvino-workflow/running-inference/changing-input-shape.html).
+on [Shape Inference Document](https://docs.openvino.ai/2025/openvino-workflow/running-inference/changing-input-shape.html).

 Another option to use dynamic shape feature is to export the model with dynamic dimension using Model Optimizer. OpenVINO Model Server will inherit the dynamic shape and no additional settings are needed.
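
To illustrate the `--shape` range syntax above, a server start accepting heights and widths between 200 and 500 could look like this sketch (image, model name and port are placeholders):

```bash
# Illustrative: single-input model reloaded with a dynamic height/width range.
docker run -d --rm -v $PWD/models:/models -p 9000:9000 \
  openvino/model_server:latest \
  --model_name resnet --model_path /models/resnet \
  --shape "(1,3,200:500,200:500)" --port 9000
```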

docs/home.md (+1 −1)

@@ -58,5 +58,5 @@ Start using OpenVINO Model Server with a fast-forward serving example from the [
 * [RAG building blocks made easy and affordable with OpenVINO Model Server](https://medium.com/openvino-toolkit/rag-building-blocks-made-easy-and-affordable-with-openvino-model-server-e7b03da5012b)
 * [Simplified Deployments with OpenVINO™ Model Server and TensorFlow Serving](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Simplified-Deployments-with-OpenVINO-Model-Server-and-TensorFlow/post/1353218)
 * [Inference Scaling with OpenVINO™ Model Server in Kubernetes and OpenShift Clusters](https://www.intel.com/content/www/us/en/developer/articles/technical/deploy-openvino-in-openshift-and-kubernetes.html)

 - `optional string device` - device to load models to. Supported values: "CPU", "GPU" [default = "CPU"]
-- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format for regular [models configuration](../parameters.md#model-configuration-options) [default = "{}"]
+- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format for regular [models configuration](../parameters.md#model-configuration-options) [default = "{}"]
 - `optional uint32 best_of_limit` - max value of best_of parameter accepted by endpoint [default = 20];
 - `optional uint32 max_tokens_limit` - max value of max_tokens parameter accepted by endpoint [default = 4096];

docs/mediapipe.md (+1 −1)

@@ -54,7 +54,7 @@ Check their [documentation](https://github.com/openvinotoolkit/mediapipe/blob/ma
 ## PyTensorOvTensorConverterCalculator
-`PyTensorOvTensorConverterCalculator` enables conversion between nodes that are run by `PythonExecutorCalculator` and nodes that receive and/or produce [OV Tensors](https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_tensor.html)
+`PyTensorOvTensorConverterCalculator` enables conversion between nodes that are run by `PythonExecutorCalculator` and nodes that receive and/or produce [OV Tensors](https://docs.openvino.ai/2025/api/c_cpp_api/classov_1_1_tensor.html)

 ## How to create the graph for deployment in OpenVINO Model Server

docs/model_cache.md (+1 −1)

@@ -1,7 +1,7 @@
 # Model Cache {#ovms_docs_model_cache}

 ## Overview
-The Model Server can leverage a [OpenVINO™ model cache functionality](https://docs.openvino.ai/2024/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.html), to speed up subsequent model loading on a target device.
+The Model Server can leverage the [OpenVINO™ model cache functionality](https://docs.openvino.ai/2025/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.html) to speed up subsequent model loading on a target device.
 The cached files usually make Model Server initialization faster.
 The boost depends on the model and the target device. The most noticeable improvement will be observed with GPU devices. On other devices, like CPU, you may observe no speed-up or even a slower loading process depending on the model used. Test the setup before final deployment.
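
As a sketch of how the cache is typically enabled (model name, mount paths and the `--cache_dir` location are assumptions to adapt to your deployment):

```bash
# Mount a writable directory and point the model cache at it; the first load populates the cache,
# later restarts reuse the cached blobs (most noticeable on GPU).
docker run -d --rm -v $PWD/models:/models -v $PWD/cache:/opt/cache -p 9000:9000 \
  openvino/model_server:latest \
  --model_name resnet --model_path /models/resnet \
  --cache_dir /opt/cache --port 9000
```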

docs/model_server_c_api.md (+1 −1)

@@ -47,7 +47,7 @@ To execute inference using C API you must follow steps described below.
 Create an inference request using `OVMS_InferenceRequestNew` specifying which servable name and optionally version to use. Then specify input tensors with `OVMS_InferenceRequestAddInput` and set the tensor data using `OVMS_InferenceRequestInputSetData`. Optionally you can also set one or all outputs with `OVMS_InferenceRequestAddOutput` and `OVMS_InferenceRequestOutputSetData`. For asynchronous inference you also have to set a callback with `OVMS_InferenceRequestSetCompletionCallback`.

 #### Using OpenVINO Remote Tensor
-With OpenVINO Model Server C-API you could also leverage the OpenVINO remote tensors support. Check original documentation [here](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device/remote-tensor-api-gpu-plugin.html). In order to use OpenCL buffers you need to first create `cl::Buffer` and then use its pointer in setting input with `OVMS_InferenceRequestInputSetData` or output with `OVMS_InferenceRequestOutputSetData` and buffer type `OVMS_BUFFERTYPE_OPENCL`. In case of VA surfaces you need to create appropriate VA surfaces and then use the same calls with buffer type `OVMS_BUFFERTYPE_VASURFACE_Y` and `OVMS_BUFFERTYPE_VASURFACE_UV`.
+With the OpenVINO Model Server C-API you can also leverage OpenVINO remote tensors support. Check the original documentation [here](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device/remote-tensor-api-gpu-plugin.html). In order to use OpenCL buffers you need to first create a `cl::Buffer` and then use its pointer when setting an input with `OVMS_InferenceRequestInputSetData` or an output with `OVMS_InferenceRequestOutputSetData` and buffer type `OVMS_BUFFERTYPE_OPENCL`. In case of VA surfaces you need to create appropriate VA surfaces and then use the same calls with buffer types `OVMS_BUFFERTYPE_VASURFACE_Y` and `OVMS_BUFFERTYPE_VASURFACE_UV`.

 #### Invoke inference
 Execute inference with OpenVINO Model Server using the `OVMS_Inference` synchronous call. During inference execution you must not modify `OVMS_InferenceRequest` and bound memory buffers.

-OpenVINO Model Server can perform inference using pre-trained models in either [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format/operation-sets.html)
+OpenVINO Model Server can perform inference using pre-trained models in either [OpenVINO IR](https://docs.openvino.ai/2025/documentation/openvino-ir-format/operation-sets.html)
 , [ONNX](https://onnx.ai/), [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) or [TensorFlow](https://www.tensorflow.org/) format. You can get them by:

 - downloading models from [Open Model Zoo](https://storage.openvinotoolkit.org/repositories/open_model_zoo/)
 - generating the model in a training framework and saving it to a supported format: TensorFlow saved_model, ONNX or PaddlePaddle.
 - downloading the models from models hubs like [Kaggle](https://www.kaggle.com/models) or [ONNX models zoo](https://github.com/onnx/models).
-- converting models from any formats using [conversion tool](https://docs.openvino.ai/2024/openvino-workflow/model-preparation/convert-model-to-ir.html)
+- converting models from any format using the [conversion tool](https://docs.openvino.ai/2025/openvino-workflow/model-preparation/convert-model-to-ir.html)

 This guide uses a [Faster R-CNN with Resnet-50 V1 Object Detection model](https://www.kaggle.com/models/tensorflow/faster-rcnn-resnet-v1/tensorFlow2/faster-rcnn-resnet50-v1-640x640/1) in TensorFlow format.
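
For context on the model repository used here, the server expects one directory per model with numbered version subdirectories; a minimal sketch with placeholder names:

```bash
# Illustrative layout: <repository>/<model_name>/<version>/<model files>
mkdir -p models/faster_rcnn/1
mv model.xml model.bin models/faster_rcnn/1/   # OpenVINO IR; a single .onnx file or a TensorFlow saved_model goes in the same version folder
```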

docs/parameters.md (+4 −4)

@@ -7,17 +7,17 @@
 |---|---|---|
 |`"model_name"/"name"`|`string`| Model name exposed over gRPC and REST API.(use `model_name` in command line, `name` in json config) |
 |`"model_path"/"base_path"`|`string`| If using a Google Cloud Storage, Azure Storage or S3 path, see [cloud storage guide](./using_cloud_storage.md). The path may look as follows:<br>`"/opt/ml/models/model"`<br>`"gs://bucket/models/model"`<br>`"s3://bucket/models/model"`<br>`"azure://bucket/models/model"`<br>The path can be also relative to the config.json location<br>(use `model_path` in command line, `base_path` in json config) |
-| `"shape"` | `tuple/json/"auto"` | `shape` is optional and takes precedence over `batch_size`. The `shape` argument changes the model that is enabled in the model server to fit the parameters. `shape` accepts three forms of the values: * `auto` - The model server reloads the model with the shape that matches the input data matrix. * a tuple, such as `(1,3,224,224)` - The tuple defines the shape to use for all incoming requests for models with a single input. * A dictionary of shapes, such as `{"input1":"(1,3,224,224)","input2":"(1,3,50,50)", "input3":"auto"}` - This option defines the shape of every included input in the model.Some models don't support the reshape operation.If the model can't be reshaped, it remains in the original parameters and all requests with incompatible input format result in an error. See the logs for more information about specific errors.Learn more about supported model graph layers including all limitations at [Shape Inference Document](https://docs.openvino.ai/2024/openvino-workflow/running-inference/changing-input-shape.html). |
+| `"shape"` | `tuple/json/"auto"` | `shape` is optional and takes precedence over `batch_size`. The `shape` argument changes the model that is enabled in the model server to fit the parameters. `shape` accepts three forms of the values: * `auto` - The model server reloads the model with the shape that matches the input data matrix. * a tuple, such as `(1,3,224,224)` - The tuple defines the shape to use for all incoming requests for models with a single input. * A dictionary of shapes, such as `{"input1":"(1,3,224,224)","input2":"(1,3,50,50)", "input3":"auto"}` - This option defines the shape of every included input in the model.Some models don't support the reshape operation.If the model can't be reshaped, it remains in the original parameters and all requests with incompatible input format result in an error. See the logs for more information about specific errors.Learn more about supported model graph layers including all limitations at [Shape Inference Document](https://docs.openvino.ai/2025/openvino-workflow/running-inference/changing-input-shape.html). |
 | `"batch_size"` | `integer/"auto"` | Optional. By default, the batch size is derived from the model, defined through the OpenVINO Model Optimizer. `batch_size` is useful for sequential inference requests of the same batch size.Some models, such as object detection, don't work correctly with the `batch_size` parameter. With these models, the output's first dimension doesn't represent the batch size. You can set the batch size for these models by using network reshaping and setting the `shape` parameter appropriately.The default option of using the Model Optimizer to determine the batch size uses the size of the first dimension in the first input for the size. For example, if the input shape is `(1, 3, 225, 225)`, the batch size is set to `1`. If you set `batch_size` to a numerical value, the model batch size is changed when the service starts.`batch_size` also accepts a value of `auto`. If you use `auto`, then the served model batch size is set according to the incoming data at run time. The model is reloaded each time the input data changes the batch size. You might see a delayed response upon the first request. |
 |`"layout" `|`json/string`|`layout` is optional argument which allows to define or change the layout of model input and output tensors. To change the layout (add the transposition step), specify `<target layout>:<source layout>`. Example: `NHWC:NCHW` means that user will send input data in `NHWC` layout while the model is in `NCHW` layout.<br><br>When specified without colon separator, it doesn't add a transposition but can determine the batch dimension. E.g. `--layout CN` makes prediction service treat second dimension as batch size.<br><br>When the model has multiple inputs or the output layout has to be changed, use a json format. Set the mapping, such as: `{"input1":"NHWC:NCHW","input2":"HWN:NHW","output1":"CN:NC"}`.<br><br>If not specified, layout is inherited from model.<br><br> [Read more](shape_batch_size_and_layout.md#changing-model-input-output-layout)|
 |`"model_version_policy"`|`json/string`| Optional. The model version policy lets you decide which versions of a model that the OpenVINO Model Server is to serve. By default, the server serves the latest version. One reason to use this argument is to control the server memory consumption.The accepted format is in json or string. Examples: <br> `{"latest": { "num_versions":2 }` <br> `{"specific": { "versions":[1, 3] } }` <br> `{"all": {} }`|
-|`"plugin_config"`|`json/string`| List of device plugin parameters. For full list refer to [OpenVINO documentation](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) and [performance tuning guide](./performance_tuning.md). Example: <br> `{"PERFORMANCE_HINT": "LATENCY"}`|
+|`"plugin_config"`|`json/string`| List of device plugin parameters. For full list refer to [OpenVINO documentation](https://docs.openvino.ai/2025/documentation/compatibility-and-support/supported-devices.html) and [performance tuning guide](./performance_tuning.md). Example: <br> `{"PERFORMANCE_HINT": "LATENCY"}`|
 |`"nireq"`|`integer`| The size of internal request queue. When set to 0 or no value is set value is calculated automatically based on available resources.|
 |`"target_device"`|`string`| Device name to be used to execute inference operations. Accepted values are: `"CPU"/"GPU"/"MULTI"/"HETERO"`|
 |`"stateful"`|`bool`| If set to true, model is loaded as stateful. |
 |`"idle_sequence_cleanup"`|`bool`| If set to true, model will be subject to periodic sequence cleaner scans. See [idle sequence cleanup](stateful_models.md). |
 |`"max_sequence_number"`|`uint32`| Determines how many sequences can be handled concurrently by a model instance. |
-|`"low_latency_transformation"`|`bool`| If set to true, model server will apply [low latency transformation](https://docs.openvino.ai/2024/openvino-workflow/running-inference/stateful-models/obtaining-stateful-openvino-model.html#lowlatency2-transformation) on model load. |
+|`"low_latency_transformation"`|`bool`| If set to true, model server will apply [low latency transformation](https://docs.openvino.ai/2025/openvino-workflow/running-inference/stateful-models/obtaining-stateful-openvino-model.html#lowlatency2-transformation) on model load. |
 |`"metrics_enable"`|`bool`| Flag enabling [metrics](metrics.md) endpoint on rest_port. |
 |`"metrics_list"`|`string`| Comma separated list of [metrics](metrics.md). If unset, only default metrics will be enabled.|

@@ -44,7 +44,7 @@ Configuration options for the server are defined only via command-line options a
 |`file_system_poll_wait_seconds`|`integer`| Time interval between config and model versions changes detection in seconds. Default value is 1. Zero value disables changes monitoring. |
 |`sequence_cleaner_poll_wait_minutes`|`integer`| Time interval (in minutes) between next sequence cleaner scans. Sequences of the models that are subjects to idle sequence cleanup that have been inactive since the last scan are removed. Zero value disables sequence cleaner. See [idle sequence cleanup](stateful_models.md). It also sets the schedule for releasing free memory from the heap. |
 |`custom_node_resources_cleaner_interval_seconds`|`integer`| Time interval (in seconds) between two consecutive resources cleanup scans. Default is 1. Must be greater than 0. See [custom node development](custom_node_development.md). |
-|`cpu_extension`|`string`| Optional path to a library with [custom layers implementation](https://docs.openvino.ai/2024/documentation/openvino-extensibility.html). |
+|`cpu_extension`|`string`| Optional path to a library with [custom layers implementation](https://docs.openvino.ai/2025/documentation/openvino-extensibility.html). |

 This mode prioritizes low latency, providing short response time for each inference job. It performs best for tasks where inference is required for a single input image, like a medical analysis of an ultrasound scan image. It also fits the tasks of real-time or nearly real-time applications, such as an industrial robot's response to actions in its environment or obstacle avoidance for autonomous vehicles.
-Note that currently the `PERFORMANCE_HINT` property is supported by CPU and GPU devices only. [More information](https://docs.openvino.ai/2024/openvino-workflow/running-inference/optimize-inference/high-level-performance-hints.html#performance-hints-how-it-works).
+Note that currently the `PERFORMANCE_HINT` property is supported by CPU and GPU devices only. [More information](https://docs.openvino.ai/2025/openvino-workflow/running-inference/optimize-inference/high-level-performance-hints.html#performance-hints-how-it-works).

 To enable Performance Hints for your application, use the following command:

@@ -124,7 +124,7 @@ In case of using CPU plugin to run the inference, it might be also beneficial to
 | ENABLE_CPU_PINNING | This property allows CPU threads pinning during inference. |

-> **NOTE:** For additional information about all parameters read about [OpenVINO device properties](https://docs.openvino.ai/2024/api/c_cpp_api/group__ov__runtime__cpp__prop__api.html).
+> **NOTE:** For additional information about all parameters read about [OpenVINO device properties](https://docs.openvino.ai/2025/api/c_cpp_api/group__ov__runtime__cpp__prop__api.html).

 - Example:
 Following docker command will set `NUM_STREAMS` parameter to a value `1`:

@@ -167,7 +167,7 @@ The default value is 1 second which ensures prompt response to creating new mode
 Depending on the device employed to run the inference operation, you can tune the execution behavior with a set of parameters. Each device is handled by its OpenVINO plugin.

-> **NOTE**: For additional information, read [supported configuration parameters for all plugins](https://docs.openvino.ai/2024/api/c_cpp_api/group__ov__runtime__cpp__prop__api.html).
+> **NOTE**: For additional information, read [supported configuration parameters for all plugins](https://docs.openvino.ai/2025/api/c_cpp_api/group__ov__runtime__cpp__prop__api.html).

 Model's plugin configuration is a dictionary of param:value pairs passed to OpenVINO Plugin on network load. It can be set with `plugin_config` parameter.

 Recommended steps to investigate achievable performance and discover bottlenecks:
-1. [Launch OV benchmark app](https://docs.openvino.ai/2024/learn-openvino/openvino-samples/benchmark-tool.html)
+1. [Launch OV benchmark app](https://docs.openvino.ai/2025/get-started/learn-openvino/openvino-samples/benchmark-tool.html)

 **Note:** It is useful to drop plugin configuration from benchmark app using `-dump_config` and then use the same plugin configuration in model loaded into OVMS
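
As a sketch of the `NUM_STREAMS` example mentioned above (image, model name and ports are placeholders), plugin parameters are passed as a JSON string through `--plugin_config`:

```bash
# Illustrative: one inference stream for latency-oriented serving; other properties
# such as PERFORMANCE_HINT can be set in the same JSON string.
docker run -d --rm -v $PWD/models:/models -p 9000:9000 \
  openvino/model_server:latest \
  --model_name resnet --model_path /models/resnet \
  --plugin_config '{"NUM_STREAMS": "1"}' --port 9000
```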

docs/python_support/reference.md (+1 −1)

@@ -947,7 +947,7 @@ That's why converter calculators exists. They work as adapters between nodes and
 #### PyTensorOvTensorConverterCalculator
-OpenVINO Model Server comes with a built-in `PyTensorOvTensorConverterCalculator` that provides conversion between [Python Tensor](#python-tensor) and [OV Tensor](https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_tensor.html).
+OpenVINO Model Server comes with a built-in `PyTensorOvTensorConverterCalculator` that provides conversion between [Python Tensor](#python-tensor) and [OV Tensor](https://docs.openvino.ai/2025/api/c_cpp_api/classov_1_1_tensor.html).

 Currently `PyTensorOvTensorConverterCalculator` works with only one input and one output.
 - The stream that expects Python Tensor **must** have tag `OVMS_PY_TENSOR`

 |`stateful`|`bool`| If set to true, model is loaded as stateful. | false |
 |`idle_sequence_cleanup`|`bool`| If set to true, model will be subject to periodic sequence cleaner scans. <br> See [idle sequence cleanup](#idle-sequence-cleanup). | true |
 |`max_sequence_number`|`uint32`| Determines how many sequences can be handled concurrently by a model instance. | 500 |
-|`low_latency_transformation`|`bool`| If set to true, model server will apply [low latency transformation](https://docs.openvino.ai/2024/openvino-workflow/running-inference/stateful-models.html) on model load. | false |
+|`low_latency_transformation`|`bool`| If set to true, model server will apply [low latency transformation](https://docs.openvino.ai/2025/openvino-workflow/running-inference/stateful-models.html) on model load. | false |

 **Note:** Setting `idle_sequence_cleanup`, `max_sequence_number` and `low_latency_transformation` require setting `stateful` to true.

@@ -305,7 +305,7 @@ If set to `true` sequence cleaner will check that model. Otherwise, sequence cle
 There are limitations for using stateful models with OVMS:

 - Support inference execution only using CPU as the target device.
-- Support Kaldi models with memory layers and non-Kaldi models with Tensor Iterator. See this [docs about stateful networks](https://docs.openvino.ai/2024/openvino-workflow/running-inference/stateful-models.html) to learn about stateful networks representation in OpenVINO.
+- Support Kaldi models with memory layers and non-Kaldi models with Tensor Iterator. See the [docs about stateful networks](https://docs.openvino.ai/2025/openvino-workflow/running-inference/stateful-models.html) to learn about stateful networks representation in OpenVINO.
 - [Auto batch size and shape](shape_batch_size_and_layout.md) are **not** available in stateful models.
 - Stateful model instances **cannot** be used in [DAGs](dag_scheduler.md).
 - Requests ordering is guaranteed only when a single client sends subsequent requests in a synchronous manner. Concurrent interaction with the same sequence might negatively affect the accuracy of the results.
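
To tie the stateful options above together, a config.json entry could look like the sketch below; the model name and path are placeholders, and the three tuning options only take effect with `stateful` set to true:

```bash
# Illustrative stateful entry in config.json (written via a heredoc for convenience).
cat > config.json <<'EOF'
{
  "model_config_list": [
    {
      "config": {
        "name": "stateful_model",
        "base_path": "/models/stateful_model",
        "stateful": true,
        "idle_sequence_cleanup": true,
        "max_sequence_number": 500,
        "low_latency_transformation": true
      }
    }
  ]
}
EOF
```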

docs/tf_model_binary_input.md (+3 −3)

@@ -4,7 +4,7 @@ This guide shows how to convert TensorFlow models and deploy them with the OpenV
 - In this example TensorFlow model [ResNet](https://github.com/tensorflow/models/tree/v2.2.0/official/r1/resnet) will be used.

-- TensorFlow model can be converted into Intermediate Representation format using model_optimizer tool. There are several formats for storing TensorFlow model. In this guide, we present conversion from SavedModel format. More information about conversion process can be found in the [model optimizer guide](https://docs.openvino.ai/2024/openvino-workflow/model-preparation.html).
+- A TensorFlow model can be converted into Intermediate Representation format using the model_optimizer tool. There are several formats for storing a TensorFlow model. In this guide, we present conversion from the SavedModel format. More information about the conversion process can be found in the [model optimizer guide](https://docs.openvino.ai/2025/openvino-workflow/model-preparation.html).

 - Binary input format has several requirements for the model and ovms configuration. More information can be found in [binary inputs documentation](binary_input.md).

 *Note:* Some models might require other parameters such as `--scale` parameter.
 - `--reverse_input_channels` - required for models that are trained with images in RGB order.
-- `--mean_values` , `--scale` - should be provided if input pre-processing operations are not a part of topology- and the pre-processing relies on the application providing input data. They can be determined in several ways described in [conversion parameters guide](https://docs.openvino.ai/2024/openvino-workflow/model-preparation/convert-model-tensorflow.html). In this example [model pre-processing script](https://github.com/tensorflow/models/blob/v2.2.0/official/r1/resnet/imagenet_preprocessing.py) was used to determine them.
+- `--mean_values`, `--scale` - should be provided if input pre-processing operations are not a part of the topology and the pre-processing relies on the application providing input data. They can be determined in several ways described in the [conversion parameters guide](https://docs.openvino.ai/2025/openvino-workflow/model-preparation/convert-model-tensorflow.html). In this example the [model pre-processing script](https://github.com/tensorflow/models/blob/v2.2.0/official/r1/resnet/imagenet_preprocessing.py) was used to determine them.

-*Note:* You can find out more about [TensorFlow Model conversion into Intermediate Representation](https://docs.openvino.ai/2024/openvino-workflow/model-preparation/convert-model-tensorflow.html) if your model is stored in other formats.
+*Note:* You can find out more about [TensorFlow Model conversion into Intermediate Representation](https://docs.openvino.ai/2025/openvino-workflow/model-preparation/convert-model-tensorflow.html) if your model is stored in other formats.

 This operation will create model files in `${PWD}/resnet_v2/models/resnet/1/` folder.
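
As a sketch of the conversion step described above (flags should be verified against the Model Optimizer version in use; the mean values are the ImageNet means applied by the referenced pre-processing script):

```bash
# Illustrative conversion of the ResNet SavedModel into OpenVINO IR with embedded pre-processing.
mo --saved_model_dir resnet_v2 \
   --reverse_input_channels \
   --mean_values "[123.68,116.78,103.94]" \
   --output_dir ${PWD}/resnet_v2/models/resnet/1/
```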

src/custom_nodes/image_transformation/README.md (+3 −3)

@@ -48,9 +48,9 @@ make BASE_OS=redhat NODES=image_transformation
 | target_image_color_order | Output image color order. If specified and differs from original_image_color_order, color order conversion will be performed |`BGR`||
 | original_image_layout | Input image layout. This is required to determine image shape from input shape ||✓|
 | target_image_layout | Output image layout. If specified and differs from original_image_layout, layout conversion will be performed |||
-| scale | All values will be divided by this value. When `scale_values` is specified, this value is ignored. [read more](https://docs.openvino.ai/2024/documentation/legacy-features/transition-legacy-conversion-api/legacy-conversion-api/%5Blegacy%5D-embedding-preprocessing-computation.html#specifying-mean-and-scale-values)|||
-| scale_values | Scale values to be used for the input image per channel. Input data will be divided by those values. Values should be provided in the same order as output image color order. [read more](https://docs.openvino.ai/2024/documentation/legacy-features/transition-legacy-conversion-api/legacy-conversion-api/%5Blegacy%5D-embedding-preprocessing-computation.html#specifying-mean-and-scale-values)|||
-| mean_values | Mean values to be used for the input image per channel. Values will be subtracted from each input image data value. Values should be provided in the same order as output image color order. [read more](https://docs.openvino.ai/2024/documentation/legacy-features/transition-legacy-conversion-api/legacy-conversion-api/%5Blegacy%5D-embedding-preprocessing-computation.html#specifying-mean-and-scale-values)|||
+| scale | All values will be divided by this value. When `scale_values` is specified, this value is ignored. [read more](https://docs.openvino.ai/2024/documentation/legacy-features/transition-legacy-conversion-api.html#scale-values)|||
+| scale_values | Scale values to be used for the input image per channel. Input data will be divided by those values. Values should be provided in the same order as output image color order. [read more](https://docs.openvino.ai/2024/documentation/legacy-features/transition-legacy-conversion-api.html#scale-values)|||
+| mean_values | Mean values to be used for the input image per channel. Values will be subtracted from each input image data value. Values should be provided in the same order as output image color order. [read more](https://docs.openvino.ai/2024/documentation/legacy-features/transition-legacy-conversion-api.html#mean-values)|||
 | debug | Defines if debug messages should be displayed | false ||

 > **_NOTE:_** Subtracting mean values is performed before division by scale values.