
OpenVINO™ Model Server 2025.2


@dkalinowski released this 18 Jun 09:52
814c4ef

2025.2 is a major release adding support for image generation, support for AI agents with tool_calls handling, and new model management features.

Image generation (preview)

Image generation endpoint – this preview feature enables image generation from text prompts. The endpoint is compatible with the OpenAI API, making it easy to integrate with the existing ecosystem. It supports popular models such as Stable Diffusion, Stable Diffusion XL, Stable Diffusion 3, and FLUX.

Check the end-to-end demo

Image generation API reference
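Since the endpoint follows the OpenAI images API shape, a request can be sketched as plain JSON. This is a minimal, hedged example: the base URL, endpoint path, and model id are assumptions, not confirmed by these notes; adjust them to your deployment.

```python
import json

# Assumed OVMS REST address; change to match your server configuration.
BASE_URL = "http://localhost:8000/v3"

payload = {
    "model": "OpenVINO/stable-diffusion-v1-5-int8-ov",  # hypothetical model id
    "prompt": "a watercolor painting of a lighthouse at dawn",
    "size": "512x512",  # width x height, as in the OpenAI images API
    "n": 1,             # number of images to generate
}
body = json.dumps(payload)

# To actually send it (requires a running server):
#   import urllib.request
#   req = urllib.request.Request(
#       f"{BASE_URL}/images/generations", data=body.encode(),
#       headers={"Content-Type": "application/json"})
#   resp = json.load(urllib.request.urlopen(req))
```

See the linked API reference for the authoritative endpoint path and the full set of supported parameters.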

Agentic AI (preview)

When generating text with LLM models, you can extend the context using tools. Tools can provide additional context from external sources such as Python functions. AI agents can use OpenVINO Model Server to choose the right tool and generate function parameters, and the final agent response can also be created based on the tool response.

The chat/completions endpoint for text generation now accepts a tools specification, and messages can include tool responses (tool_calls). Such agentic use cases require specially tuned chat templates and custom response parsers, which are enabled for popular tool-enabled models.

Check the demo with AI Agent
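The round-trip described above can be sketched with plain message dictionaries: the client advertises a tool, the model may answer with tool_calls, and the tool result goes back as a "tool" role message. The tool name, model id, and values below are illustrative assumptions; the field layout follows the OpenAI chat API.

```python
import json

# A hypothetical tool the agent may call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the weather in Gdansk?"}]
request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed tool-enabled model
    "messages": messages,
    "tools": tools,
}

# Suppose the server responds with a tool call; append it, run the tool,
# and append the tool output so a second request can produce the final answer.
tool_call = {
    "id": "call_0", "type": "function",
    "function": {"name": "get_weather",
                 "arguments": json.dumps({"city": "Gdansk"})},
}
messages.append({"role": "assistant", "tool_calls": [tool_call]})
messages.append({"role": "tool", "tool_call_id": "call_0",
                 "content": json.dumps({"temp_c": 18, "sky": "clear"})})
```

The linked demo shows the full loop against a running server, including how the response parser surfaces tool_calls.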

Model management for generative use cases

This release brings several improvements to the model management and deployment mechanism, especially for generative use cases.

It is now possible to pull and deploy generative models in OpenVINO format directly from Hugging Face Hub. All runtime parameters for the generative pipeline can be set via the Model Server command line interface. The ovms binary can be used to pull a model to the local model repository for reuse in subsequent runs. CLI commands are also included for listing the models in the repository and for adding models to, or removing them from, the list of models enabled in the configuration file.

More details about the CLI usage to pull models and start the server

Check the RAG demo to see how easy it is to deploy 3 models in a single server instance.

Note that the Python script export_models.py can still be used to prepare models from outside of the OpenVINO organization on Hugging Face Hub. It has been extended to support the image generation task.

Breaking changes

Until now, the default text generation sampling parameters were static. This release changes the default sampling parameters to be based on generation_config.json from the model folder.
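For reference, a minimal illustrative generation_config.json fragment (the field names follow the Hugging Face convention; the values here are examples only, not shipped defaults):

```json
{
  "do_sample": true,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 50
}
```

If the model folder provides such a file, its values now take effect when a request does not override them.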

Other changes

VLM models served via the chat/completions endpoint now support passing images as a URL or as a path in the local file system. Model Server will download the image and use it as part of the message content. Check the updated API examples.
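Both variants can be expressed with the same message shape, which follows the OpenAI chat API. A hedged sketch, with placeholder URL and path (a local path must be accessible to the server process):

```python
def image_message(text, image_ref):
    # Build a user message whose content mixes text and one image reference.
    # image_ref may be a remote URL (the server downloads it) or, per the
    # release notes, a path in the server's local file system.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_ref}},
        ],
    }

url_message = image_message("What is in this picture?",
                            "https://example.com/cat.jpg")
path_message = image_message("What is in this picture?",
                             "/models/images/cat.jpg")
```

The updated API examples linked above show the exact accepted forms and any server-side restrictions on local paths.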

Python is no longer required to use the LLM chat/completions endpoint. The package version without Python applies the chat templates using the JinjaCpp library. It has limitations, however: tool usage and system prompts are not supported.

New versions of the embeddings and rerank calculators use a flat model structure identical to the output of optimum-intel export and to the existing OpenVINO models on Hugging Face Hub. The previous calculators, which support model versioning, are still present for compatibility with previously exported models; they will be deprecated in a future release. It is recommended to re-export the models using --task rerank_ov or --task embeddings_ov.

Documented a use case with long context models and very long prompts.

Bug fixes

The correct error status is now reported in streaming mode.

Fixed a sporadic issue where an extra special token appeared at the beginning of the prompt when applying a chat template.

Security and stability related improvements.

Known issues and limitations

The VLM models QwenVL2, QwenVL2.5, and Phi3_VL have lower accuracy when deployed on CPU in a text generation pipeline with continuous batching. It is recommended to deploy these models in a stateful pipeline which processes requests sequentially, as shown in the demo.

Using NPU for image generation endpoints is unsupported in this release.

On Linux, in an environment without a proxy, OVMS requires setting the environment variable GIT_SERVER_TIMEOUT_MS=4000 to be able to pull models from Hugging Face Hub. The default value is too short.

You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:

  • docker pull openvino/model_server:2025.2 - CPU device support with an image based on Ubuntu 24.04
  • docker pull openvino/model_server:2025.2-gpu - GPU, NPU and CPU device support with an image based on Ubuntu 24.04

or use the provided binary packages. Only packages with the suffix _python_on have support for Python.

Check the instructions on how to install the binary package.
The prebuilt image is also available on the Red Hat Ecosystem Catalog.