Releases: openvinotoolkit/model_server
OpenVINO Model Server 2025.4
Agentic use case improvements
- Tool parsers for new models Qwen3-Coder-30B and Qwen3-30B-A3B-Instruct have been enabled. These models are supported in OpenVINO Runtime as a preview feature and can be evaluated with “tool calling” capabilities.
- Streaming with “tool calling” for phi-4-mini-instruct and mistral-7B-v0.4 models is now supported, just like for the other supported agentic models.
- Tool parsers for mistral and hermes3 have been improved, resolving multiple issues related to complex generated JSON objects and increasing overall response reliability.
- Guided generation now supports all rules from the XGrammar integration. The `response_format` parameter can now accept the XGrammar structural tags format (not part of the OpenAI API). Example: `{"type": "regex", "pattern": "\\w+\\.\\w+@company\\.com"}`.
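A minimal sketch of a request using this extension, assuming the server runs locally on port 8000 and serves a model named `OpenVINO/Qwen3-8B-int4` (both values are illustrative):

```bash
# Constrain the response to match an e-mail-like regex via the XGrammar extension of response_format
# (port and model name are illustrative assumptions).
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenVINO/Qwen3-8B-int4",
    "messages": [{"role": "user", "content": "Provide a support contact e-mail."}],
    "response_format": {"type": "regex", "pattern": "\\w+\\.\\w+@company\\.com"}
  }'
```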
New or updated demos
- Integration with OpenWebUI
- Integration with Visual Studio Code using the Continue extension
- Agentic client demo
- Audio endpoints
- Windows service usage
- GGUF model pulling
Deployment improvements
- GGUF model format can now be deployed directly from Hugging Face Hub for several LLM architectures. Architectures such as Qwen2, Qwen2.5, Qwen3, and Llama3 can be deployed with a single command. See the Loading GGUF models in OVMS demo for details.
- OpenVINO Model Server can be deployed as a service on the Windows operating system. It can be managed by service configuration management, shared by all running applications, and controlled using a simplified CLI to pull, configure, enable, and disable models. LINK
- Pulling models in IR format has been extended beyond the OpenVINO™ organization on Hugging Face* Hub. While OpenVINO org models are validated by Intel, a rapidly growing ecosystem of IR-format models from other publishers can now also be pulled and deployed via the OVMS CLI. Note: the repository needs to be populated by the `optimum-cli export openvino` command and must include the tokenizer model in IR format to be successfully loaded by OpenVINO Model Server.
- CLI simplifications for easier deployment:
  - The `--plugin_config` parameter can now be applied not only to classic models but also to generative pipelines.
  - `--cache_dir` now enables compilation caching for both classic models and generative pipelines.
  - `--enable_prefix_caching` can be used the same way for all target devices.
- `--add_to_config` and `--remove_from_config`, like `--list_models`, are now OVMS CLI directives and no longer expect a value. The configuration values should be passed through the parameters `--config_path`, `--model_repository_path`, `--model_name`, or `--model_path`.
- When a service is deployed, the CLI can be simplified by setting the environment variable `OVMS_MODEL_REPOSITORY_PATH` to point to the models folder. This automatically applies the default parameters for model management, ensuring that `--config_path` and `--model_repository_path` are set correctly. For example:
  `ovms --pull --task text_generation OpenVINO/Qwen3-8B-int4`
  `ovms --list_models`
  `ovms --add_to_models --model_name OpenVINO/Qwen3-8B-int4`
  `ovms --remove_from_models --model_name OpenVINO/Qwen3-8B-int4`
- The `--api_key` parameter is now available, enabling client authorization using an API key (see the sketch after this list).
- Binding parameters are added for both IPv6 and IPv4 addresses for the gRPC and REST interfaces.
- The metrics endpoint is now compatible with Prometheus v3. The output header type has been updated from JSON to plain text.
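A minimal sketch of the API key flow, assuming the key is passed in an OpenAI-style `Authorization: Bearer` header (the header convention, port, model name, and path are illustrative assumptions):

```bash
# Start the server with an API key (illustrative values; adjust flags for your model).
ovms --rest_port 8000 --model_name OpenVINO/Qwen3-8B-int4 --model_path /models/Qwen3-8B-int4 --api_key my_secret_key

# Authorize a client request with the same key, assuming the OpenAI-style Authorization header is honored.
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my_secret_key" \
  -d '{"model": "OpenVINO/Qwen3-8B-int4", "messages": [{"role": "user", "content": "Hello"}]}'
```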
Performance improvements
- First-token generation performance has been significantly improved for LLM models with GPU acceleration and prefix caching. This is particularly beneficial for agentic use cases, where repeated chat history creates very long contexts that can now be processed much faster. Prefix caching can be enabled with the OVMS CLI parameter `--enable_prefix_caching true` (see the combined example after this list).
- A new parameter is introduced to increase the allowed prompt length for LLM and VLM models deployed on NPU. The context can now be extended by adding the CLI parameter `--max_prompt_len`. The default is 1024 tokens and can be extended up to 10k tokens. Set it to the required value to avoid unnecessary memory usage. For VLM models running on both NPU and CPU, use a device-specific configuration to apply the setting only to the NPU device: `--plugin_config '{"DEVICE_PROPERTIES": {"NPU":{"MAX_PROMPT_LEN":2048}}}'`
- Model loading time has been reduced through the compilation cache, with significant improvements on GPU and NPU devices. Enable caching using the `--cache_dir` parameter.
- Improved guided generation performance, including support for tool call guiding.
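A combined sketch of the performance-related flags mentioned above; the model name, path, and target device are illustrative:

```bash
# GPU deployment with prefix caching and a compilation cache directory (illustrative values).
ovms --rest_port 8000 \
     --model_name OpenVINO/Qwen3-8B-int4 \
     --model_path /models/Qwen3-8B-int4 \
     --target_device GPU \
     --enable_prefix_caching true \
     --cache_dir /opt/ovms_cache
```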
Audio endpoints added
- Text to speech endpoint compatible with the OpenAI API: `/audio/speech`
- Speech to text endpoints compatible with the OpenAI API:
  - `/audio/translation` - converts provided audio content to English text
  - `/audio/transcription` - converts provided audio content to text in the original language
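Below is a minimal sketch of calling these endpoints with curl, assuming they are exposed under the same `/v3` prefix as the other OpenAI-compatible endpoints and that models named `tts-model` and `whisper-model` have been deployed (prefix, port, and model names are illustrative assumptions):

```bash
# Text to speech: generate audio for a prompt and save it to a file (all values illustrative).
curl http://localhost:8000/v3/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-model", "input": "Hello from OpenVINO Model Server"}' \
  --output speech.wav

# Speech to text: transcribe a local audio file in its original language (all values illustrative).
curl http://localhost:8000/v3/audio/transcription \
  -F model=whisper-model \
  -F file=@speech.wav
```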
Embeddings endpoints improvements
- A tokenize endpoint has been added to get tokens before sending the input text to the embeddings calculation. This helps assess input length to avoid exceeding the model context.
- Embeddings models now support three pooling options: CLS, LAST, and MEAN, widening model compatibility. See the Text Embeddings Models list for details.
Breaking changes
The old embeddings and reranking calculators were removed and replaced by embeddings_ov and reranking_ov. These new calculators follow the optimum-cli / Hugging Face model structure and support more features. If you use the old calculators, re-export your models and pull the updated versions from Hugging Face. Demo
Bug fixes
- Fixed model phi-4-mini-instruct generating incorrect responses when context exceeded 4k tokens
- Other minor fixes
Discontinued in 2025
- Deprecated OpenVINO Model Server (OVMS) benchmark client in C++ using TensorFlow Serving API.
Deprecated to be removed in the future
- The dedicated OpenVINO operator for Kubernetes and OpenShift is now deprecated in favor of the recommended KServe operator. The OpenVINO operator will continue to function in upcoming OpenVINO Model Server releases but will no longer be actively developed. Since KServe provides broader capabilities, no loss of functionality is expected. On the contrary, more functionality will be accessible, and migration between other serving solutions and OpenVINO Model Server will be much easier.
- TensorFlow Serving (TFS) API support is planned for deprecation. With the increasing adoption of the KServe API for classic models and the OpenAI API for generative workloads, usage of the TFS API has significantly declined. The removal date is to be determined based on feedback, with a tentative target of mid-2026.
- Support for stateful models will be deprecated. This capability was originally introduced for Kaldi audio models, which are no longer relevant. Current audio model support relies on the OpenAI API and pipelines implemented via the OpenVINO GenAI library.
- The Directed Acyclic Graph Scheduler will be deprecated in favor of pipelines managed by the MediaPipe scheduler and will be removed mid-2026. That approach gives more flexibility, includes a wider range of calculators, and supports processing accelerators.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2025.4` - CPU device support with an image based on Ubuntu 24.04
`docker pull openvino/model_server:2025.4-gpu` - GPU, NPU and CPU device support with an image based on Ubuntu 24.04
or use the provided binary packages. Only packages with the suffix _python_on have support for Python.
There is also an additional distribution channel via https://storage.openvinotoolkit.org/repositories/openvino_model_server/packages/2025.4.0/
Check the instructions on how to install the binary package
The prebuilt image is also available on the RedHat Ecosystem Catalog
OpenVINO Model Server 2025.3
2025.3 is a major release that improves the agentic use case, adds official support for image generation endpoints, and simplifies deployment. It also adds support for a range of new generative models.
Agentic use case improvements
- Implemented tool guided generation with the server parameters `--enable_tool_guided_generation` and `--tool_parser`, which turn on model-specific XGrammar configuration to follow the expected response syntax. It uses dynamic rules based on the generated sequence. This increases model accuracy and minimizes invalid response formats for the tools. Link
- Extended the list of supported models with tool handling by adding a tool parser for Mistral-7B-Instruct-v0.3.
- Implemented streamed responses for Qwen3, Hermes3 and Llama3 models; this enables the models to be used with tools in a more interactive manner.
- Separated the implementation and configuration of the tool parser and the reasoning parser: instead of using the parameter `response_parser`, use the separate parameters `tool_parser` and `reasoning_parser`. This adds more flexibility in implementing and configuring the parsers on the server, and the parsers can be shared between models independently. As of now, the supported tool parsers are hermes3, phi4, llama3 and mistral; qwen3 is the only reasoning parser implemented so far.
- Changed the file name of the chat template from `template.jinja` to `chat_template.jinja` if the chat template is not included in `tokenizer_config.json`.
- Structured output is now supported with the addition of JSON schema guided generation via the OpenAI `response_format` field. This parameter can be used to generate a JSON response, which can be applied in applications for automation purposes and to improve the accuracy of the responses (see the sketch after this list). See the documentation for more details: Link. A script testing the accuracy gain is included as well.
- Option to enforce tool call generation using the chat/completions field `tool_choice=required`. It initiates the beginning of the tool sequence to make the model start generating at least one tool response. While it does not guarantee the response will be valid, it can increase response reliability. Link
- Updated demo for using an MCP server with all features included. Link
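A minimal sketch of a structured-output request using the JSON schema form of `response_format` (port, model name, and schema are illustrative assumptions):

```bash
# Ask the model to answer with a JSON object that matches a simple schema (illustrative values).
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenVINO/Qwen3-8B-int4",
    "messages": [{"role": "user", "content": "Extract the city and country from: Paris is the capital of France."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location",
        "schema": {
          "type": "object",
          "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
          "required": ["city", "country"]
        }
      }
    }
  }'
```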
New models and use case supported
- Qwen3-embedding models – added support for embedding models that use the last-token pooling mode. Exporting such a model requires passing the additional parameter `--pooling`; an example can be found here: Link
- Qwen3-reranker models – added support for tomaarsen/Qwen3-Reranker-seq-cl, which is a copy of the Qwen3-Reranker model (the original model is not supported in OVMS) modified as a sequence classification model. This model requires applying a template to the query and documents; here is an example of how to do this: Link
- Cross-encoder reranking models – added support for models with a `token_type_ids` input
- Gemma3 VLM models
- The HETERO plugin now supports multi-GPU deployments with the continuous batching algorithm for LLM models
- Added support for GPU B60 cards
Deployment improvements
- Model pulling from Hugging Face Hub now shows a progress bar. When downloading models from the OpenVINO organization, the user can also observe the status in the logs.
- Documentation on how to build a Docker image with optimum-cli is now available; this enables the Docker image to pull any model from HF and convert it to IR online in one step. Link
- `/models` and `/models/{model}` endpoints (list models and retrieve model) compatible with the OpenAI API have been implemented – the endpoints return the list of available models in the expected OpenAI JSON schema. They have been included to simplify integration with existing applications. Link
- Reduced package size by removing the git and git-lfs dependencies, which reduces the image size by about 15MB. Model files are now pulled from Hugging Face using the libgit2 and curl libraries.
- UTF-8 chat templates are now supported out of the box; there is no longer any need for additional installation steps on Windows.
- Preview functionality for GGUF models for LLM architectures Qwen2, Qwen2.5 and Llama3. Models can be deployed directly from Hugging Face Hub by passing the model_id and the file name. Note that accuracy and performance might be lower than with models in IR format.
Image Generation
Image generation now has production-ready status. It has been extended with an image editing endpoint using image-to-image pipelines. The demo has been extended to illustrate editing capabilities and using accelerators. Link
New or updated demos
- Agentic demo
- Structured output
- Integration with OpenWebUI
- Integration with Visual Studio Code extension - Continue.dev
- Image generation and editing
- RAG with models deployed from OpenVINO organization
Bug fixes
- Prompts exceeding the model length in embeddings can now be truncated or raise an error, depending on the server configuration
- Fixed metrics enablement for deployments with a single pipeline, including generative use cases
- Improved error messages and debug logs for better usability
Known issues and limitations
- The Gemma4-it model needs manual adjustments of the tokenizer config
- The Llava VL model might cause execution errors for certain image resolutions
Breaking changes
- The response parser has been split into a tool call parser and a reasoning parser – because of that change, `graph.pbtxt` files created in 2025.2 with tool support will not be compatible.
- The chat template is read from `chat_template.jinja` (instead of `template.jinja`) in case the chat template is not included in `tokenizer_config.json`.
Deprecated features
ovmsclient – this client was used as a lightweight alternative to the TensorFlow Serving API client library. With the growing usage of generative endpoints and the KServe API, we are deprecating this client, and no more updates are planned.
Embeddings and reranking calculators with model versioning – alternative calculators implementing the embeddings and reranking endpoints have been created, with a model folder structure compatible with optimum-intel and OpenVINO GenAI. The new calculators already have more features and support more models. The old calculators will be dropped in the next release.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2025.3` - CPU device support with an image based on Ubuntu 24.04
`docker pull openvino/model_server:2025.3-gpu` - GPU, NPU and CPU device support with an image based on Ubuntu 24.04
or use the provided binary packages. Only packages with the suffix _python_on have support for Python.
There is also an additional distribution channel via https://storage.openvinotoolkit.org/repositories/openvino_model_server/packages/2025.3.0/
Check the instructions on how to install the binary package
The prebuilt image is also available on the RedHat Ecosystem Catalog
OpenVINO Model Server 2025.2.1
2025.2.1 is a minor release with bug fixes and improvements, mainly in automatic model pulling and image generation.
Improvements:
- Enabled passing `chat_template_kwargs` parameters in the `chat/completions` request. This can be used to turn off model reasoning (see the sketch below).
- Allow setting CORS headers in the HTTP response. This can resolve connectivity problems between OpenWebUI and the model server.
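A minimal sketch of disabling reasoning via `chat_template_kwargs`, assuming a model whose chat template honors an `enable_thinking` flag (such as the Qwen3 family); the flag name, port, and model name are illustrative assumptions:

```bash
# Turn off the model's reasoning/thinking block through a chat-template keyword argument (illustrative values).
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenVINO/Qwen3-8B-int4",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```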
Other changes:
- Changed NPU driver version from 1.17 to 1.19 in docker images
- Security related updates in dependencies
Bug fixes:
- Removed a limitation in image generation - it now supports requesting several output images with the parameter `n`
- The `add_to_config` and `remove_from_config` parameters accept a path to the configuration file in addition to a directory containing the `config.json` file
- Resolved connectivity issues while pulling models from Hugging Face Hub without proxy configuration
- Fixed handling of the `HF_ENDPOINT` environment variable with HTTP addresses, as previously the `https://` prefix was incorrectly added
- Changed the pull feature environment variables `GIT_SERVER_CONNECT_TIMEOUT_MS` to `GIT_OPT_SET_SERVER_CONNECT_TIMEOUT` and `GIT_SERVER_TIMEOUT_MS` to `GIT_OPT_SET_SERVER_TIMEOUT` to unify with the underlying libgit2 implementation
- Fixed handling of relative paths on Windows with MediaPipe graphs/LLMs for the `config_path` parameter
- Fixed the agentic demo not working without a proxy
- Stopped rejecting the `response_format` field in image generation. While the parameter currently accepts only the base64_json value, it allows integration with Open WebUI
- Added the missing `--response_parser` parameter when using OVMS to pull an LLM model and prepare its configuration
- Blocked simultaneous use of the `--list_models` and `--pull` parameters, as they are exclusive
- Fixed accuracy for the Phi4-mini model response parser when using functions with lists as arguments
- Fixed the export_model.py script handling of target_device for embeddings and reranking models
- The stateful text generation pipeline no longer includes usage content - it is not supported for this pipeline type. Previously it returned an incorrect response.
Known issues and limitations
- VLM models QwenVL2, QwenVL2.5, and Phi3_VL have lower accuracy when deployed on CPU in a text generation pipeline with continuous batching. It is recommended to deploy these models in a stateful pipeline which processes the requests sequentially like in the demo
- Using NPU for image generation endpoints is unsupported in this release.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2025.2.1` - CPU device support with an image based on Ubuntu 24.04
`docker pull openvino/model_server:2025.2.1-gpu` - GPU, NPU and CPU device support with an image based on Ubuntu 24.04
or use the provided binary packages. Only packages with the suffix _python_on have support for Python.
Check the instructions on how to install the binary package
The prebuilt image is also available on RedHat Ecosystem Catalog
OpenVINO™ Model Server 2025.2
2025.2 is a major release adding support for image generation, support for AI agents with tool_calls handling, and new features in model management.
Image generation (preview)
Image generation endpoint – this preview feature enables image generation based on text prompts. The endpoint is compatible with the OpenAI API, making it easy to integrate with the existing ecosystem. It supports popular models like Stable Diffusion, Stable Diffusion XL, Stable Diffusion 3 and FLUX.
Check the end-to-end demo
Image generation API reference
Agentic AI (preview)
When generating text with LLM models, you can extend the context using tools. Tools can provide additional context from external sources such as Python functions. AI agents can use OpenVINO Model Server to choose the right tool and generate function parameters. The final agent response can also be created based on the tool response.
It is now possible to pass a specification of tools in the chat/completions endpoint for text generation, and the messages can include tool responses (tool_calls). Such an agentic use case requires specially tuned chat templates and custom response parsers. They are enabled for the popular tool-enabled models.
Check the demo with AI Agent
Model management for generative use cases
This release brings several improvements to the model management and development mechanisms, especially for generative use cases.
It is now possible to pull and deploy generative models in OpenVINO format directly from Hugging Face Hub. All the runtime parameters for the generative pipeline can be set via the Model Server command line interface. The ovms binary can be used to pull a model to the local models repository for reuse in subsequent runs. CLI commands are also included for listing the models in the models repository and for adding or removing models from the list of models enabled in the configuration file.
More details about the CLI usage to pull models and start the server
Check the RAG demo to see how easy it is to deploy 3 models in a single server instance.
Note that the Python script export_models.py can still be used to prepare models from outside of the OpenVINO organization on HF Hub. It has been extended to support the image generation task.
Breaking changes
Until now, the default text generation sampling parameters were static. This release changes the default sampling parameters to be based on generation_config.json from the model folder.
Other changes
VLM models with the chat/completions endpoint now support passing images as a URL or as a path in the local file system. Model Server will download the image and use it as part of the message content. Check the updated API examples.
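A minimal sketch of passing an image by URL, following the OpenAI content-array format; `my-vlm-model`, the port, and the image URL are placeholders for whatever VLM is actually deployed:

```bash
# Send a text question together with an image referenced by URL (illustrative values).
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-vlm-model",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
      ]
    }]
  }'
```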
Python is no longer required to use the LLM chat/completions endpoint. The package version without Python applies the chat templates using the JinjaCpp library. It has, however, limitations: tools usage and system prompts are not supported.
A new version of the embeddings and rerank calculators uses a flat model structure identical to the output of the optimum-intel export and existing OpenVINO models on Hugging Face Hub. The previous calculators supporting model versioning are still present for compatibility with previously exported models. They will be deprecated in a future release. It is recommended to re-export the models using --task rerank_ov or --task embeddings_ov.
Documented use case with long context models and very long prompts
Bug fixes
Correct error status now reported in streaming mode.
Fixed sporadic issue of extra special token at the beginning of prompt when applying chat template.
Security and stability related improvements.
Known issues and limitations
VLM models QwenVL2, QwenVL2.5, and Phi3_VL have lower accuracy when deployed on CPU in a text generation pipeline with continuous batching. It is recommended to deploy these models in a stateful pipeline which processes the requests sequentially like in the demo
Using NPU for image generation endpoints is unsupported in this release.
OVMS on Linux in an environment without a proxy requires setting the env variable GIT_SERVER_TIMEOUT_MS=4000 to be able to pull models from Hugging Face Hub. The default value was set too short.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2025.2` - CPU device support with an image based on Ubuntu 24.04
`docker pull openvino/model_server:2025.2-gpu` - GPU, NPU and CPU device support with an image based on Ubuntu 24.04
or use the provided binary packages. Only packages with the suffix _python_on have support for Python.
Check the instructions on how to install the binary package
The prebuilt image is also available on the RedHat Ecosystem Catalog
OpenVINO™ Model Server 2025.1
2025.1 is a major release adding support for visual language models and enabling text generation on the NPU accelerator.
VLM support
The chat/completions endpoint has been extended to support vision language models. It is now possible to send images in the chat context. Vision language models can be deployed just like LLM models.
Check the end-to-end demo: Link
Updated API reference: Link
Text Generation on NPU
It is now possible to deploy LLM and VLM models on the NPU accelerator. Text generation is exposed over the completions and chat/completions endpoints. From the client perspective it works the same way as with GPU and CPU deployments, but it doesn't support the continuous batching algorithm. NPU is targeted at AI PC use cases with low concurrency.
Check the NPU LLM demo and NPU VLM demo.
Model management improvements
- Option to start MediaPipe graphs and generative endpoints from the CLI without the configuration file. Simply point the `--model_path` CLI argument to a directory with a MediaPipe graph.
- Unification of the JSON configuration file structure for models and graphs under the `models_config_list` section.
Breaking changes
- The gRPC server is now optional. There is no default gRPC port set. The parameter `--port` is mandatory to start the gRPC server. It is possible to start only the REST API server with the `--rest_port` parameter. At least one port number needs to be defined to start OVMS from the CLI (`--port` for gRPC or `--rest_port` for REST). Starting OVMS via C-API does not require any port to be defined.
Other changes
- Updated scalability demonstration using multiple instances: Link
- Increased the allowed number of text generation stop words in a request from 4 to 16
- Enabled and tested OVMS integration with the Continue extension for Visual Studio Code. OpenVINO Model Server can be used as a backend for code completion and a built-in IDE chat assistant. Check out the instructions: Link
- Performance improvements – enhancements in OpenVINO Runtime and in the text sampling generation algorithm, which should increase throughput under high-concurrency load
Bug fixes
- Fixed handling of the LLM context length - OVMS now stops generating text when the model context is exceeded. An error is raised when the prompt is longer than the context or when `max_tokens` plus the input tokens exceed the model context.
- Security and stability improvements
- Fixed cancellation of text generation workloads - clients are allowed to stop the generation in non-streaming scenarios by simply closing the connection
Known issues and limitations
chat/completions API accepts images encoded to base64 format but does not accept URL format.
Qwen Vision models deployed on GPU might experience an execution error when the image resolution is too high. It is recommended to edit the model's preprocessor_config.json and lower the max_pixels parameter. This ensures the images will be resized automatically to a smaller resolution, which avoids the failure on GPU and improves performance. In some cases, accuracy might be impacted, though.
Note that by default, NPU limits the prompt length to 1024 tokens. You can modify that limit by using the --max_prompt_len parameter in the model export script, or by manually modifying the MAX_PROMPT_LEN plugin config param in graph.pbtxt.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2025.1` - CPU device support
`docker pull openvino/model_server:2025.1-gpu` - GPU, NPU and CPU device support
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog
OpenVINO™ Model Server 2025.0
2025.0 is a major release adding support for Windows native deployment and improvements to the generative use cases.
New feature - Windows native server deployment
- This release enables model server deployment on Windows operating systems as a binary application
- Full support for generative endpoints – text generation and embeddings based on the OpenAI API, reranking based on the Cohere API
- Functional parity with the Linux version, with several minor differences: cloud storage, C-API interface, DAG pipelines - read more
- It is targeted at client machines with Windows 11 and at data center environments with Windows Server 2022
- Demos are updated to work on both Linux and Windows. Check the installation guide
Other Changes and Improvements
- Added official support for Battle Mage GPU, Arrow Lake CPU, iGPU and NPU, and Lunar Lake CPU, iGPU and NPU
- Updated base Docker images – added Ubuntu 24 and RedHat UBI 9, dropped Ubuntu 20 and RedHat UBI 8
- Extended the chat/completions API to support the `max_completion_tokens` parameter and message content as an array. These changes keep the API compatible with the OpenAI API.
- Truncate option in the embeddings endpoint – it is now possible to export the embeddings model with an option to truncate the input automatically to match the embeddings context length. By default, an error is raised when too long an input is passed.
- Speculative decoding algorithm added to text generation – check the demo
- Added direct support for models without named outputs – when models don't have named outputs, generic names will be assigned during model initialization with the pattern `out_<index>`
- Added a histogram metric for tracking MediaPipe graph processing duration
- Performance improvements
Breaking changes
- Discontinued support for NVIDIA plugin
Bug fixes
- Corrected the behavior of cancelling text generation for disconnected clients
- Fixed detection of the model context length for the embeddings endpoint
- Security and stability improvements
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2025.0` - CPU device support
`docker pull openvino/model_server:2025.0-gpu` - GPU, NPU and CPU device support
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog
OpenVINO™ Model Server 2024.5
The 2024.5 release comes with support for embedding and rerank endpoints, as well as an experimental Windows version.
Changes and improvements
- The OpenAI API text embedding endpoint has been added, enabling OVMS to be used as a building block for AI applications like RAG.
- The rerank endpoint has been added based on the Cohere API, enabling easy similarity detection between a query and a set of documents. It is one of the building blocks for AI applications like RAG and makes integration with frameworks such as LangChain easy (see the sketch after this list).
- The `echo` sampling parameter, together with `logprobs` in the `completions` endpoint, is now supported.
- Performance increase on both CPU and GPU for LLM text generation.
- LLM dynamic_split_fuse for the GPU target device boosts throughput in high-concurrency scenarios.
- The procedure for LLM service deployment and model repository preparation has been simplified.
- Improvements in LLM test coverage and stability.
- Instructions on how to build an experimental version of a Windows binary package - a native model server for Windows OS - are available. This release includes a set of limitations and has limited test coverage. It is intended for testing, while the production-ready release is expected with 2025.0. All feedback is welcome.
- OpenVINO Model Server C-API now supports asynchronous inference, improves performance with the ability to set outputs, and enables using OpenCL & VA surfaces on both inputs & outputs for the GPU target device.
- The KServe REST API model metadata endpoint can now provide additional model_info references.
- Included support for NPU and iGPU on MTL and LNL platforms.
- Security and stability improvements.
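A minimal sketch of a rerank request in the Cohere-style format, assuming the endpoint is exposed as `/v3/rerank` and a reranking model has already been deployed (port, path, and model name are illustrative assumptions):

```bash
# Rank documents by relevance to a query via the Cohere-compatible rerank endpoint (illustrative values).
curl http://localhost:8000/v3/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-base",
    "query": "What is OpenVINO Model Server?",
    "documents": [
      "OpenVINO Model Server hosts models and makes them accessible over standard network APIs.",
      "Bananas are rich in potassium."
    ]
  }'
```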
Breaking changes
No breaking changes.
Bug fixes:
- Fixed support for URL-encoded model names in the KServe REST API
- OpenAI text generation endpoints now accept requests with both the v3 and v3/v1 path prefixes
- Fixed metrics reporting in the video stream benchmark client
- Fixed a sporadic INVALID_ARGUMENT error on the completions endpoint
- Fixed an incorrect LLM finish reason when stop was expected but length was returned
Discontinuation plans
In future releases, support for the following build options will not be maintained:
- Ubuntu 20 as the base image
- OpenVINO NVIDIA plugin
You can use the OpenVINO Model Server public Docker images based on Ubuntu 22.04 via the following commands:
`docker pull openvino/model_server:2024.5` - CPU device support
`docker pull openvino/model_server:2024.5-gpu` - GPU, NPU and CPU device support
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog
OpenVINO™ Model Server 2024.4
The 2024.4 release brings official support for OpenAI API text generation. It is now recommended for production usage. It comes with a set of added features and improvements.
Changes and improvements
- Significant performance improvements for the multinomial sampling algorithm
- `finish_reason` in the response correctly reports reaching max_tokens (length) or completing the sequence (stop)
- Added automatic cancelling of text generation for disconnected clients
- Included a prefix caching feature which speeds up text generation by caching the prompt evaluation
- Option to compress the KV cache to lower precision – it reduces memory consumption with minimal impact on accuracy
- Added support for the `stop` sampling parameter. It can define a sequence which stops text generation (see the sketch after this list).
- Added support for the `logprobs` sampling parameter. It returns the probabilities of generated tokens.
- Included generic metrics related to the execution of MediaPipe graphs. The metric `ovms_current_graphs` can be used for autoscaling based on the current load and the level of concurrency. Counters like `ovms_requests_accepted` and `ovms_responses` can track the activity of the server.
- Included a demo of text generation horizontal scalability
- Configurable handling of non-UTF-8 responses from the model – the detokenizer can now automatically change them to the Unicode replacement character
- Included support for Llama3.1 models
- Text generation is supported on both CPU and GPU - check the demo
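A minimal sketch of a completions request using the `stop` parameter, assuming a Llama 3.1 model has been deployed under the name shown (port, model name, and prompt are illustrative assumptions):

```bash
# Stop generation as soon as the model emits the given stop sequence (illustrative values).
curl http://localhost:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "List three Intel hardware targets supported by OpenVINO:",
    "max_tokens": 100,
    "stop": ["\n\n"]
  }'
```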
Breaking changes
No breaking changes.
Bug fixes
- Security and stability improvements
- Fixed handling of model templates without bos_token
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2024.4` - CPU device support with the image based on Ubuntu 22.04
`docker pull openvino/model_server:2024.4-gpu` - CPU, GPU and NPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog
OpenVINO™ Model Server 2024.3
The 2024.3 release focuses mostly on improvements in the OpenAI API text generation implementation.
Changes and improvements
A set of improvements in OpenAI API text generation:
- Significantly better performance thanks to numerous improvements in OpenVINO Runtime and sampling algorithms
- Added config parameters `best_of_limit` and `max_tokens_limit` to avoid memory overconsumption caused by invalid requests. Read more
- Added reporting of LLM metrics in the server logs. Read more
- Added extra sampling parameters `diversity_penalty`, `length_penalty`, `repetition_penalty`. Read more
Improvements in documentation and demos:
- Added RAG demo with OpenAI API
- Added K8S deployment demo for text generation scenarios
- Simplified model initialization for a set of demos with MediaPipe graphs using the pose_detection model. TFLite models don't require any conversions. Check demo
Breaking changes
No breaking changes.
Bug fixes
- Resolved issue with sporadic text generation hang via OpenAI API endpoints
- Fixed issue with chat streamer impacting incomplete utf-8 sequences
- Corrected the format of the last streaming event in the `completions` endpoint
- Fixed issue with requests hanging when running out of available cache
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2024.3` - CPU device support with the image based on Ubuntu 22.04
`docker pull openvino/model_server:2024.3-gpu` - GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog
OpenVINO™ Model Server 2024.2
The major new functionality in 2024.2 is a preview feature of an OpenAI-compatible API for text generation, along with state-of-the-art techniques like continuous batching and paged attention for improving the efficiency of generative workloads.
Changes and improvements
- Updated OpenVINO Runtime backend to 2024.2
- OpenVINO Model Server can now be used for text generation use cases via an OpenAI-compatible API
- Added support for the continuous batching and PagedAttention algorithms for text generation, which are fast and efficient under high-concurrency load, especially on Intel Xeon processors. Learn more about it.
- Added an LLM text generation OpenAI API demo.
- Added a notebook showcasing the RAG algorithm with online scope changes delegated to the model server. Link
- Enabled Python 3.12 for Python clients, samples and demos.
- Updated the RedHat UBI base image to 8.10
Breaking changes
No breaking changes.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
`docker pull openvino/model_server:2024.2` - CPU device support with the image based on Ubuntu 22.04
`docker pull openvino/model_server:2024.2-gpu` - GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog