OpenVINO Model Server 2025.4

Released by @dkalinowski on 01 Dec 14:42 · releases/2025/4 · cf82edc

Agentic use case improvements

  • Tool parsers for new models Qwen3-Coder-30B and Qwen3-30B-A3B-Instruct have been enabled. These models are supported in OpenVINO Runtime as a preview feature and can be evaluated with “tool calling” capabilities.
  • Streaming with “tool calling” is now supported for the phi-4-mini-instruct and mistral-7B-v0.4 models, just like for the other supported agentic models.
  • Tool parsers for mistral and hermes3 have been improved, resolving multiple issues related to complex generated JSON objects and increasing overall response reliability.
  • Guided generation now supports all rules from the XGrammar integration. The response_format parameter can now accept the XGrammar structural tags format (not part of the OpenAI API). Example: {"type": "regex", "pattern": "\\w+\\.\\w+@company\\.com"}.
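    For illustration, a complete request using this rule could look as follows (a sketch only; the port, model name, and the /v3/chat/completions path are assumptions based on the OpenAI-compatible API exposed by OVMS):
    curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" \
        -d '{"model": "OpenVINO/Qwen3-8B-int4", "messages": [{"role": "user", "content": "Share the sales contact e-mail."}], "response_format": {"type": "regex", "pattern": "\\w+\\.\\w+@company\\.com"}}'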

New or updated demos

Deployment improvements

  • Models in the GGUF format can now be deployed directly from Hugging Face Hub for several LLM architectures. Architectures such as Qwen2, Qwen2.5, Qwen3, and Llama3 can be deployed with a single command. See the Loading GGUF models in OVMS demo for details.
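    For example, a single pull-and-serve command might look like this (a sketch only; the container tag, port, mount path, and GGUF repository name are illustrative, and the exact flags are documented in the demo):
    docker run -d --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:2025.4 \
        --rest_port 8000 --source_model "Qwen/Qwen2.5-1.5B-Instruct-GGUF" \
        --model_repository_path /models --task text_generation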

  • OpenVINO Model Server can be deployed as a service on the Windows operating system. It can be managed with the standard Windows service configuration tools, shared by all running applications, and controlled with a simplified CLI to pull, configure, enable, and disable models. LINK

  • Pulling models in IR format has been extended beyond the OpenVINO™ organization on Hugging Face* Hub. While the OpenVINO org models are validated by Intel, a rapidly growing ecosystem of IR-format models from other publishers can now also be pulled and deployed via the OVMS CLI. Note: the repository needs to be populated with the optimum-cli export openvino command and must include the tokenizer model in IR format to be successfully loaded by OpenVINO Model Server.
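    For example, such a repository can be prepared with optimum-cli (a sketch; the model ID, weight format, and output directory are placeholders), which also produces the tokenizer in IR format required by the server:
    optimum-cli export openvino --model meta-llama/Llama-3.1-8B-Instruct --weight-format int4 Llama-3.1-8B-Instruct-ov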

  • CLI simplifications for easier deployment:
    --plugin_config parameter can now be applied not only to classic models but also to generative pipelines.
    --cache_dir now enables compilation caching for both classic models and generative pipelines.
    --enable_prefix_caching can be used the same way for all target devices.

  • --add_to_config and --remove_from_config, like --list_models, are now OVMS CLI directives and no longer expect a value. The configuration values should be passed through the following parameters: --config_path, --model_repository_path, --model_name, or --model_path.

  • When a service is deployed, the CLI can be simplified by setting the environment variable OVMS_MODEL_REPOSITORY_PATH to point to the models folder. This automatically applies the default parameters for model management, ensuring that --config_path and --model_repository_path are set correctly. For example:
    ovms --pull --task text_generation OpenVINO/Qwen3-8B-int4
    ovms --list_models
    ovms --add_to_config --model_name OpenVINO/Qwen3-8B-int4
    ovms --remove_from_config --model_name OpenVINO/Qwen3-8B-int4
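    The variable itself can be set before running the commands above, for example (paths are illustrative):
    set OVMS_MODEL_REPOSITORY_PATH=C:\ovms_models        (Windows)
    export OVMS_MODEL_REPOSITORY_PATH=/opt/ovms_models   (Linux)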

  • The --api_key parameter is now available, enabling client authorization using an API key.

  • Binding parameters have been added for both IPv4 and IPv6 addresses for the gRPC and REST interfaces.

  • The metrics endpoint is now compatible with Prometheus v3. The response Content-Type header has been changed from JSON to plain text.
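    The plain-text output can be verified directly, assuming metrics are enabled (for example with the --metrics_enable parameter) and the REST port is 8000:
    curl http://localhost:8000/metrics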

Performance improvements

  • First-token generation performance has been significantly improved for LLM models with GPU acceleration and prefix caching. This is particularly beneficial for agentic use cases, where repeated chat history creates very long contexts that can now be processed much faster. Prefix caching can be enabled with the OVMS CLI parameter --enable_prefix_caching true.

  • A new parameter is introduced to increase the allowed prompt length for LLM and VLM models deployed on NPU. The context can now be extended by adding the CLI parameter --max_prompt_length. The default is 1024 tokens and can be extended up to 10k tokens. Set it to the required value to avoid unnecessary memory usage. For VLM models running on both NPU and CPU, use a device-specific configuration to apply the setting only to the NPU device: --plugin_config '{"DEVICE_PROPERTIES":{"NPU":{"MAX_PROMPT_LEN":2048}}}'

  • Model loading time has been reduced through compilation cache, with significant improvements on GPU and NPU devices. Enable caching using the --cache_dir parameter.
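    A sketch combining both caching options described above (the model name, paths, and port are illustrative; the exact pull-and-serve flag combination is described in the deployment documentation):
    ovms --rest_port 8000 --source_model OpenVINO/Qwen3-8B-int4 --model_repository_path /opt/ovms_models \
        --task text_generation --enable_prefix_caching true --cache_dir /opt/ovms_cache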

  • Improved guided generation performance, including support for tool call guiding.

Audio endpoints added

  • Text-to-speech endpoint compatible with the OpenAI API - /audio/speech (see the example sketch after this list)
  • Speech-to-text endpoints compatible with the OpenAI API:
    /audio/translation - converts provided audio content to English text
    /audio/transcription - converts provided audio content to text in the original language
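    For illustration, a minimal text-to-speech request could look like this (a sketch; the /v3 base path, port, model name, and payload fields are assumptions to be confirmed in the documentation):
    curl http://localhost:8000/v3/audio/speech -H "Content-Type: application/json" \
        -d '{"model": "<tts_model_name>", "input": "Hello from OpenVINO Model Server"}' --output speech.wav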

Embeddings endpoints improvements

  • A tokenize endpoint has been added to get tokens before sending the input text to embeddings calculation. This helps assess input length to avoid exceeding the model context.
  • The embeddings model now supports three pooling options: CLS, LAST, and MEAN, widening model compatibility. See the Text Embeddings Models list for details.
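    For reference, an embeddings request against the OpenAI-compatible endpoint could look like this (the /v3 base path, port, and model name are assumptions; the pooling mode is chosen when the model is exported and deployed, not per request):
    curl http://localhost:8000/v3/embeddings -H "Content-Type: application/json" \
        -d '{"model": "<embeddings_model_name>", "input": "OpenVINO Model Server"}'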

Breaking changes

The old embeddings calculator was removed and replaced by embeddings_ov. The new calculator follows the optimum-cli / Hugging Face model structure and supports more features. If you use the old calculator, re-export your models or pull the updated versions from Hugging Face. Demo

Bug fixes

  • Fixed the phi-4-mini-instruct model generating incorrect responses when the context exceeded 4k tokens
  • Other minor fixes

Discontinued in 2025

  • Deprecated the OpenVINO Model Server (OVMS) benchmark client in C++ using the TensorFlow Serving API.

Deprecated to be removed in the future

  • The dedicated OpenVINO operator for Kubernetes and OpenShift is now deprecated in favor of the recommended KServe operator. The OpenVINO operator will continue to function in upcoming OpenVINO Model Server releases but will no longer be actively developed. Since KServe provides broader capabilities, no loss of functionality is expected. On the contrary, more functionality will become accessible, and migration between other serving solutions and OpenVINO Model Server will be much easier.
  • TensorFlow Serving (TFS) API support is planned for deprecation. With the increasing adoption of the KServe API for classic models and the OpenAI API for generative workloads, usage of the TFS API has significantly declined. The removal date will be determined based on feedback, with a tentative target of mid-2026.
  • Support for stateful models will be deprecated. This capability was originally introduced for Kaldi audio models, which are no longer relevant. Current audio model support relies on the OpenAI API and pipelines implemented via the OpenVINO GenAI library.
  • The Directed Acyclic Graph Scheduler will be deprecated in favor of pipelines managed by the MediaPipe scheduler and will be removed in mid-2026. That approach gives more flexibility, includes a wider range of calculators, and supports processing accelerators.

You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:

  • docker pull openvino/model_server:2025.4 - CPU device support with an image based on Ubuntu 24.04
  • docker pull openvino/model_server:2025.4-gpu - GPU, NPU and CPU device support with an image based on Ubuntu 24.04

or use the provided binary packages. Only packages with the suffix _python_on have support for Python.

There is also an additional distribution channel via https://storage.openvinotoolkit.org/repositories/openvino_model_server/packages/2025.4.0/

Check the instructions on how to install the binary package.
The prebuilt image is also available in the Red Hat Ecosystem Catalog.