Releases · openvinotoolkit/openvino.genai
2025.4.0.0
What's Changed
- Bump product version 2025.4 by @akladiev in #2620
- Fix CLEANUP_CACHE by @Wovchena in #2617
- [wwb] Add tests for mac/win to ci by @sbalandi in #2603
- xfail embed by @Wovchena in #2618
- Bump py-build-cmake from 0.4.3 to 0.5.0 by @dependabot[bot] in #2624
- Bump actions/upload-pages-artifact from 3.0.1 to 4.0.0 by @dependabot[bot] in #2625
- Bump optimum-intel[nncf] from 1.25.1 to 1.25.2 by @dependabot[bot] in #2613
- Bump optimum-intel from 1.25.1 to 1.25.2 in /tests/python_tests by @dependabot[bot] in #2614
- [GGUF] Fix Q4_1 accuracy by @wine99 in #2563
- [llm bench] Add support of arcee model by @sbalandi in #2636
- Warn about older transformers by @Wovchena in #2634
- Limited max GPU KV-cache considering max allocatable GPU memory size by @popovaan in #2633
- Bump actions/dependency-review-action from 4.7.1 to 4.7.2 by @dependabot[bot] in #2639
- [llm_bench] Override max_length to preserve max_new_tokens by @Wovchena in #2641
- Cache images by @Wovchena in #2629
- Test logging by @Wovchena in #2621
- [CMAKE] Fix samples installation by @mryzhov in #2649
- [JS] Add prettier and align eslint by @almilosz in #2631
- [WWB] Remove use_flash_attention_2 argument for phi4mm by @nikita-savelyevv in #2653
- Implement text embedding pipeline shape fix by @as-suvorov in #2449
- [CMAKE] Solve pybind targets conflict by @mryzhov in #2655
- [GHA] Disable Cacheopt tests on mac by @mryzhov in #2663
- Fix Coverity by @Wovchena in #2601
- Bump peft from 0.17.0 to 0.17.1 in /samples by @dependabot[bot] in #2658
- Bump peft from 0.17.0 to 0.17.1 in /tests/python_tests by @dependabot[bot] in #2660
- downgrade xgrammar version in master by @pavel-esir in #2668
- Extend chat template test models by @yatarkan in #2648
- [NPU]Enable chunk prefill for VLM. by @intelgaoxiong in #2657
- [JS] Add build NodeJS bindings into Manylinux 2_28 by @Retribution98 in #2537
- WWB empty_adapters mode by @likholat in #2671
- Bump actions/dependency-review-action from 4.7.2 to 4.7.3 by @dependabot[bot] in #2674
- Bump aquasecurity/trivy-action from 0.32.0 to 0.33.0 by @dependabot[bot] in #2677
- Bump langchain-core from 0.3.74 to 0.3.75 in /tests/python_tests by @dependabot[bot] in #2673
- Bump actions/download-artifact from 4.3.0 to 5.0.0 by @dependabot[bot] in #2676
- [OV JS] Add perfMetrics grammar getters & update docstrings by @almilosz in #2681
- Bump langchain-community from 0.3.27 to 0.3.29 in /tests/python_tests by @dependabot[bot] in #2680
- Bump actions/checkout from 4.2.2 to 5.0.0 by @dependabot[bot] in #2685
- Fix StructuredOutputConfig pybind11-subgen signatures generation by @pavel-esir in #2669
- [llm_bench] Add start memory info by @sbalandi in #2686
- Updating KVCrush hyperparameters by @gopikrishnajha in #2678
- [Docs] Convert whisper as stateless in the quantization example by @nikita-savelyevv in #2690
- print genai version by @wgzintel in #2684
- Fix initializer for the sparse attention mode by @vshampor in #2689
- Add docs entry about building GenAI with free threaded Python by @p-wysocki in #2679
- Reduce structured output controller mutex locking scope by @mzegla in #2687
- [speculative decoding] Move from ManualTimer to pure metrics by @sbalandi in #2695
- [CI] [GHA] Use custom `actions/download-artifact` action with the fixed retries logic by @akashchi in #2692
- Enable VLM generation on NPU without image input by @AlexanderKalistratov in #2694
- [llm bench] Add possibility to setup cache eviction config for LLM by @sbalandi in #2693
- Remove not supported rerank models from docs by @as-suvorov in #2702
- Tune automatic memory allocation by @popovaan in #2697
- Bump actions/setup-python from 5.6.0 to 6.0.0 by @dependabot[bot] in #2705
- Bump pytest from 8.4.1 to 8.4.2 in /tests/python_tests by @dependabot[bot] in #2710
- [WWB] friendly error message for wrong model type by @isanghao in #2672
- [CI] [GHA] Use smaller runners for image generation samples by @akashchi in #2682
- Add image generation pipeline reuse into README by @JohnLeFeng in #2701
- Align benchmark_vlm.py and cpp by @Wovchena in #2711
- [CI] Fix NodeJS tests for manylinux by @Retribution98 in #2715
- fix checking tokenizers version by @pavel-esir in #2667
- Optimize qwen2vl encoder by @WeldonWangwang in #2630
- Check available memory before allocating KV-cache. by @popovaan in #2683
- Fixed clearing of kv-cache for GPU by @popovaan in #2717
- [GHA] w/a to build ov samples by @mryzhov in #2734
- [OV JS] Initial support for SchedulerConfig by @almilosz in #2696
- Test LLM samples with GGUF models by @Retribution98 in #2464 (a GGUF usage sketch follows this list)
- [llm_bench] LLMPipeline fix negative time by @sbalandi in #2742
- Bump pydantic from 2.11.7 to 2.11.9 in /samples by @dependabot[bot] in #2732
- [llm bench] Add mem info on initial/compilation phase to json/csv by @sbalandi in #2741
- Increase timeouts by @Wovchena in #2743
- Allow additional_params for tokenizer decode in TextStreamer by @dkalinowski in #2729
- Fix attention mask pass for whisper (static) by @eshiryae in #2665
- Support from-onnx parameter by @sstrehlk in #2441
- Use model path property for caching by @praasz in #2720
- Increase GGUF timeouts by @Wovchena in #2756
- [VLM] Fixed measuring of embeddings preparation. by @popovaan in #2752
- Bump timm from 1.0.19 to 1.0.20 by @dependabot[bot] in #2754
- [llm_bench] Fix OpenVINO config not being passed for speech-to-text and Whisper models by @aobolensk in #2763
- OPT & Clean code of openvino_vision_embeddings_merger_model inputs processing by @zhaixuejun1993 in #2726
- Add .github/pull_request_template.md by @Wovchena in #2765
- Update transformers to 4.53.3 by @as-suvorov in https://gi...
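Several entries in the list above extend GGUF support (the Q4_1 accuracy fix, GGUF timeout increases, and the GGUF sample tests). As a rough illustration, here is a minimal Python sketch of loading a GGUF checkpoint straight into `LLMPipeline`; the assumption that a `.gguf` file path can be passed where a converted model directory normally goes follows the GGUF samples, and the file name is hypothetical.

```python
# Minimal sketch: running a GGUF checkpoint through openvino_genai.LLMPipeline.
# Assumption: a .gguf file path is accepted in place of a converted model directory,
# as the GGUF sample tests referenced above suggest. The file name is hypothetical.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("qwen2.5-0.5b-instruct-q4_0.gguf", "CPU")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 64

print(pipe.generate("What is OpenVINO GenAI?", config))
```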
2025.3.0.0
What's Changed
- Bump product version 2025.3 by @akladiev in #2255
- Implement SnapKV by @vshampor in #2067
- [WWB] Additional processing of native phi4mm by @nikita-savelyevv in #2276
- Update ov genai version in samples by @as-suvorov in #2275
- use chat templates in vlm by @eaidova in #2279
- Fix 'Unsupported property' failures when prompt_lookup is set to False by @sbalandi in #2240
- Force the PA implementation in the llm-bench by default by @sbalandi in #2271
- Update Whisper README.md as "--disable-stateful" is no longer required to export models for NPU by @luke-lin-vmc in #2249
- Removed 'slices' from EncodedImage by @popovaan in #2258
- support text embeddings in llm_bench by @eaidova in #2269
- [wwb]: load transformers model first, then only trust_remote_code by @eaidova in #2270
- [GHA] Coverity pipeline fixes by @mryzhov in #2283
- [GHA][DEV] Fixed coverity path creation by @mryzhov in #2285
- [GHA][DEV] Save Coverity tool to cache by @mryzhov in #2286
- [GHA][DEV] Set cache key for coverity tool by @mryzhov in #2288
- Image generation multiconcurrency (#2190) by @dkalinowski in #2284
- [GGUF] Support GGUF format for tokenizers and detokenizers by @rkazants in #2263
- Unskip whisper tests & update optimum-intel by @as-suvorov in #2247
- Update README.md with text-to-speech by @rkazants in #2294
- add new chat template for qwen3 by @eaidova in #2297
- [DOCS] Correct cmd-line for TTS conversion by @rkazants in #2303
- [GHA] Enabled product manifest.yml by @mryzhov in #2281
- Bump the npm_and_yarn group across 3 directories with 2 updates by @dependabot[bot] in #2309
- [GHA] Save artifacts to cloud share by @akladiev in #1943
- [GHA][COVERITY] Added manual trigger by @mryzhov in #2289
- [GHA] Fix missing condition for Extract Artifacts step by @akladiev in #2313
- [llm bench] Turn off PA backend for VLM by @sbalandi in #2312
- [llm_bench] Add setting of max_num_batched_tokens for SchedulerConfig by @sbalandi in #2316
- [GHA] Fix missing condition for LLM & VLM test by @sammysun0711 in #2326
- [Test] Skip gguf test on MacOS due to sporadic failure by @sammysun0711 in #2328
- [GGUF] support Qwen3 architecture by @TianmengChen in #2273
- [llm_bench] Increase max_num_batched_tokens to the largest positive integer by @sbalandi in #2327
- Bump aquasecurity/trivy-action from 0.30.0 to 0.31.0 by @dependabot[bot] in #2310
- Bump actions/download-artifact from 4.1.9 to 4.3.0 by @dependabot[bot] in #2315
- Fix system_message forwarding by @Wovchena in #2325
- Disabled crop of the prompt for minicpmv. by @andreyanufr in #2320
- [llm bench] Fix setting ATTENTION_BACKEND to plugin config in case of fallback to Optimum by @sbalandi in #2332
- Bump brace-expansion from 2.0.1 to 2.0.2 in /.github/actions/install_wheel in the npm_and_yarn group across 1 directory by @dependabot[bot] in #2338
- Fix multinomial sampling for PromptLookupDecoding by @sbalandi in #2331
- [llm bench] Avoid using parameters not supported by beam_search for the beam_search case by @sbalandi in #2336
- Update Export Requirements by @apaniukov in #2342
- [GGUF] Serialize Generated OV Model for Faster LLMPipeline Init by @sammysun0711 in #2218
- Fixed system message in chat mode. by @popovaan in #2343
- Bump librosa from 0.10.2.post1 to 0.11.0 in /samples by @dependabot[bot] in #2346
- [Test][GGUF] Add DeepSeek-R1-Distill-Qwen GGUF in CI by @sammysun0711 in #2329
- [llm_bench] Remove default scheduler config by @sbalandi in #2341
- master: add Phi-4-multimodal-instruct by @Wovchena in #2264
- Fix paths with unicode for tokenizers by @yatarkan in #2337
- [WWB] Add try-except block for processor loading by @nikita-savelyevv in #2352
- [WWB] Bring back eager attention implementation by default by @nikita-savelyevv in #2353
- fix supported models link in TTS samples by @eaidova in #2300
- StaticLLMPipeline: Add tests on caching by @smirnov-alexey in #1905
- [WWB] Fix loading the tokenizer for VLMs by @l-bat in #2351
- Pass Scheduler Config for VLM Pipeline in WhoWhatBenchmark. by @popovaan in #2318
- Remove misspelled CMAKE_CURRENT_SOUCE_DIR by @Wovchena in #2362
- Increase timeout for LLM & VLM by @Wovchena in #2359
- [llm bench] Fix setting ATTENTION_BACKEND to plugin config in case of fallback to Optimum for VLM by @sbalandi in #2361
- Support multi images for vlm benchmarking in samples and llm_bench by @wgzintel in #2197
- CB: Hetero pipeline parallel support by @WeldonWangwang in #2227
- Update conversion instructions by @adrianboguszewski in #2287
- Merge stderr from failed samples by @Wovchena in #2156
- Revert cache folder by @Wovchena in #2372
- Update README in Node.js API by @almilosz in #2374
- [Docs] Rework home page by @yatarkan in #2368
- Align PromptLookupDecoding with greedy when dynamic_split_fuse works by @sbalandi in #2360
- Support collecting latency for transformers v4.52.0 by @wgzintel in #2373
- Bump diffusers from 0.33.1 to 0.34.0 in /samples by @dependabot[bot] in #2381
- Bump diffusers from 0.33.1 to 0.34.0 in /tests/python_tests by @dependabot[bot] in #2380
- Structured Output generation with `XGrammar` by @pavel-esir in #2295 (a structured output sketch follows this list)
- Disable XGrammar on Android by @apaniukov in #2389
- [wwb] Take prompts from different categories for def dataset for VLM by @sbalandi in #2349
- Fix for cloning NPU Image Generation pipelines (#2376) by @dkalinowski in #2393
- Set add_special_tokens=false for image tags in MiniCPM. by @popovaan in #2404
- Fix missing use cases for inpainting models and defining use case with relative path by @sbalandi in #2387
- temporary skip failing whisper tests by @pavel-esir in #2396
- Fix test_vlm_npu_no_exception by @AlexanderKalistratov in #2388
- Bump timm from 1.0.15 to 1.0.16 by @dependabot[bot] in #2390
- Optimize VisionEncoderQwen2VL::encode by @usst...
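The structured output entry above (#2295) adds grammar-constrained generation through XGrammar, and the `StructuredOutputConfig` class name also appears in the 2025.4 list (#2669). A minimal Python sketch of constraining generation to a JSON schema follows; the `json_schema` field and its placement on `GenerationConfig.structured_output_config` are assumptions taken from the structured output samples, not guaranteed by these notes.

```python
# Minimal sketch: JSON-schema constrained generation with the XGrammar backend.
# Assumptions: StructuredOutputConfig takes a json_schema string and is attached via
# GenerationConfig.structured_output_config, as in the structured output samples.
import json
import openvino_genai as ov_genai

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

pipe = ov_genai.LLMPipeline("./llama-3.2-1b-instruct-ov", "CPU")  # hypothetical model dir

config = ov_genai.GenerationConfig()
config.max_new_tokens = 128
config.structured_output_config = ov_genai.StructuredOutputConfig(json_schema=json.dumps(schema))

print(pipe.generate("Return a JSON record for Paris.", config))
```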
2025.2.0.0
What's Changed
- [GHA] Replaced visual_language_chat_sample-ubuntu-minicpm_v2_6 job by @mryzhov in #1909
- [GHA] Replaced cpp-chat_sample-ubuntu pipeline by @mryzhov in #1913
- Add support of Prompt Lookup decoding to llm bench by @sbalandi in #1917
- [GHA] Introduce SDL pipeline by @mryzhov in #1924
- Switch Download OpenVINO step to aks-medium-runner by @ababushk in #1889
- Bump product version 2025.2 by @akladiev in #1920
- [GHA] Replaced cpp-continuous-batching by @mryzhov in #1910
- Update dependencies in samples by @ilya-lavrenov in #1925
- phi3_v: add universal tag by @Wovchena in #1921
- Fix image_id unary error by @rkazants in #1927
- [Docs] Image generation use case by @yatarkan in #1877
- Add perf metrics for CB VLM by @pavel-esir in #1897
- Enhance the flexibility of the c streamer by @apinge in #1941
- add Gemma3 LLM to supported models by @eaidova in #1942
- Added GPTQ/AWQ support with HF Transformers by @AlexKoff88 in #1933
- Add --static_reshape option to llm_bench, to force static reshape + compilation at pipeline creation by @RyanMetcalfeInt8 in #1851
- benchmark_image_gen: Add --reshape option, and ability to specify multiple devices by @RyanMetcalfeInt8 in #1878
- Revert perf regression changes by @dkalinowski in #1949
- Add running greedy_causal_lm for JS to the sample tests by @Retribution98 in #1930
- [Docs] Add VLM use case by @yatarkan in #1907
- Added possibility to generate base text on GPU for text evaluation. by @andreyanufr in #1945
- VLM: change infer to start_async/wait by @dkalinowski in #1948
- [WWB]: Addressed issues with validation on Windows by @AlexKoff88 in #1953
- [GHA] Remove bandit pipeline by @mryzhov in #1956
- Disable MSVC debug assertions, addressing false positives in iterator checking by @apinge in #1952
- [GHA] Replaced genai-tools pipeline by @mryzhov in #1954
- configurable delay by @eaidova in #1963
- Update cast of tensor data pointer for const tensors by @praasz in #1966
- Remove tokens after EOS for draft model for speculative decoding by @sbalandi in #1951
- Add testcase for chat_sample_c by @apinge in #1934
- Skip warm-up iteration during llm_bench results averaging by @nikita-savelyevv in #1972
- Reset pipeline cache usage statistics on each generate call by @vshampor in #1961
- [Docs] Update models, rebuild on push by @yatarkan in #1922
- Updated logic for whether PA backend is explicitly required by @ilya-lavrenov in #1976
- [GHA] [MAC] Use latest_available_commit OV artifacts by @mryzhov in #1977
- [GHA] Set HF_TOKEN by @mryzhov in #1986
- [GHA] Setup ov_cache by @mryzhov in #1962
- [GHA] Changed cleanup runner by @mryzhov in #1995
- Added mutex to methods which use blocks map. by @popovaan in #1975
- Add documentation and sample on KV cache eviction by @vshampor in #1960
- StaticLLMPipeline: Simplify compile_model call logic by @smirnov-alexey in #1915
- Fix reshape in heterogeneous SD samples by @helena-intel in #1994
- Update tokenizers by @mryzhov in #2002
- docs: fix max_new_tokens option description by @tpragasa in #1987
- [Docs] Add speech recognition with whisper use case by @yatarkan in #1971
- Revert "VLM: change infer to start_async/wait " by @ilya-lavrenov in #2004
- Revert "Revert perf regression changes" by @ilya-lavrenov in #2003
- Set xfail to failing tests. by @popovaan in #2006
- [GHA] Use cpack bindings in the samples tests by @mryzhov in #1979
- [Docs]: add Phi3.5MoE to supported models by @eaidova in #2012
- add TensorArt SD3.5 models to supported list by @eaidova in #2013
- Move MiniCPM resampler to vision encoder by @popovaan in #1997
- [GHA] Fix ccache on Win/Mac by @mryzhov in #2008
- samples/python/text_generation/lora.py -> samples/python/text_generation/lora_greedy_causal_lm.py by @Wovchena in #2007
- Whisper timestamp fix by @RyanMetcalfeInt8 in #1918
- Unskip Qwen2-VL-2B-Instruct sample test by @as-suvorov in #1970
- [GHA] Use developer openvino packages by @mryzhov in #2000
- Added NNCF to export-requirements.txt by @ilya-lavrenov in #1974
- Bump py-build-cmake from 0.4.2 to 0.4.3 by @dependabot in #2016
- Use OV_CACHE for python tests by @as-suvorov in #2020
- [GHA] Disable HTTP calls to the Hugging Face Hub by @mryzhov in #2021
- Add python bindings to VLMPipeline for encrypted models by @olpipi in #1916 (a basic VLMPipeline sketch follows this list)
- Bump the npm_and_yarn group across 1 directory with 2 updates by @dependabot in #2017
- CB: auto plugin support by @ilya-lavrenov in #2034
- timeout-minutes: 90 by @Wovchena in #2039
- Bump diffusers from 0.32.2 to 0.33.1 by @dependabot in #2031
- Bump diffusers from 0.32.2 to 0.33.1 in /samples by @dependabot in #2032
- Enable cache and add cache encryption to samples by @olpipi in #1990
- Fix VLM concurrency by @mzegla in #2022
- Move Phi3 vision projection model to vision encoder by @popovaan in #2009
- Fix spelling by @Wovchena in #2025
- [Docs] Enable autogenerated samples docs by @yatarkan in #2029
- Synchronize entire embeddings calculation phase (#1967) by @mzegla in #1993
- Add missing finish reason set when finishing the sequence by @mzegla in #2036
- Bump image-size from 1.2.0 to 1.2.1 in /site in the npm_and_yarn group across 1 directory by @dependabot in #1998
- Add README for C Samples by @apinge in #2040
- Use ov_cache for test_vlm_pipeline by @as-suvorov in #2042
- increase timeouts by @Wovchena in #2041
- [GHA] Use azure runners for python tests by @mryzhov in #1991
- [WWB]: move diffusers imports closer to usage by @eaidova in #2046
- [llm bench] Move calculation of memory consumption to memory_monitor tool by @sbalandi in #1...
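#1916 above adds Python bindings to `VLMPipeline` for encrypted models; the plain (non-encrypted) flow it builds on looks roughly like the sketch below. The model directory, image path, and the `images=` keyword follow the visual-language samples and should be treated as assumptions.

```python
# Minimal sketch: one visual-language generation call with openvino_genai.VLMPipeline.
# Paths are hypothetical; the uint8 RGB tensor layout and the images= keyword mirror
# the visual-language samples rather than anything stated in these release notes.
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image

pipe = ov_genai.VLMPipeline("./minicpm-v-2_6-ov", "CPU")  # hypothetical model dir

image = ov.Tensor(np.array(Image.open("cat.png").convert("RGB"), dtype=np.uint8))

print(pipe.generate("Describe the image.", images=[image], max_new_tokens=100))
```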
2025.1.0.0
What's Changed
- skip failing Chinese prompt on Win by @pavel-esir in #1573
- Bump product version 2025.1 by @akladiev in #1571
- Bump tokenizers submodule by @akladiev in #1575
- [LLM_BENCH] relax md5 checks and allow passing cb config without use_cb by @eaidova in #1570
- [VLM] Add Qwen2VL by @yatarkan in #1553
- Fix links, remind about ABI by @Wovchena in #1585
- Add nightly to instructions similar to requirements by @Wovchena in #1582
- GHA: use nightly from 2025.1.0 by @ilya-lavrenov in #1577
- NPU LLM Pipeline: Switch to STATEFUL by default by @dmatveev in #1561
- Verify not empty rendered chat template by @yatarkan in #1574
- [RTTI] Fix passes rtti definitions by @t-jankowski in #1588
- Test `add_special_tokens` properly by @pavel-esir in #1586
- Add indentation for llm_bench json report dumping by @nikita-savelyevv in #1584
- prioritize config model type under path-based task determination by @eaidova in #1587
- Replace openvino.runtime imports with openvino by @helena-intel in #1579
- Add tests for Whisper static pipeline by @eshiryae in #1250
- CB: removed handle_dropped() misuse by @ilya-lavrenov in #1594
- Bump timm from 1.0.13 to 1.0.14 by @dependabot in #1595
- Update samples readme by @olpipi in #1545
- [ Speculative decoding ][ Prompt lookup ] Enable Perf Metrics for assisting pipelines by @iefode in #1599
- [LLM] [NPU] StaticLLMPipeline: Export blob by @smirnov-alexey in #1601
- [llm_bench] enable prompt permutations to prevent prefix caching and fix vlm image load by @eaidova in #1607
- LLM: use set_output_seq_len instead of WA by @ilya-lavrenov in #1611
- CB: support different number of K and V heads per layer by @ilya-lavrenov in #1610
- LLM: fixed Slice / Gather of last MatMul by @ilya-lavrenov in #1616
- Switch to VS 2022 by @mryzhov in #1598
- Add Phi-3.5-vision-instruct and Phi-3-vision-128k-instruct by @Wovchena in #1609
- Whisper pipeline: apply slice matmul by @as-suvorov in #1623
- GHA: use OV master in mac.yml by @ilya-lavrenov in #1622
- [Image Generation] Image2Image for FLUX by @likholat in #1621
- add missed ignore_eos in generation config by @eaidova in #1625
- Master: increase priority for rt info to fix Phi-3.5-vision-instruct and Phi-3-vision-128k-instruct by @Wovchena in #1626
- Correct model name by @wgzintel in #1624
- Token rotation by @vshampor in #987
- Whisper pipeline: use Sampler by @as-suvorov in #1615
- Fix setting eos_token_id with kwarg by @Wovchena in #1629
- Extract cacheopt E2E tests into separate test matrix field by @vshampor in #1630
- [CB] Split token streaming and generation to different threads for all CB based pipelines by @iefode in #1544
- Don't silence an error if a file can't be opened by @Wovchena in #1620
- [CMAKE]: use different version for macOS arm64 by @ilya-lavrenov in #1632
- Test invalid fields assignment raises in GenerationConfig by @Wovchena in #1633
- do_sample=False for NPU in chat_sample, add NPU to README by @helena-intel in #1637
- [JS] Add GenAI Node.js bindings by @vishniakov-nikolai in #1193
- CB: preparation for relying on KV cache precisions from plugins by @ilya-lavrenov in #1634
- [LLM bench] support providing adapter config mode by @eaidova in #1644
- Automatically apply chat template in non-chat scenarios by @sbalandi in #1533 (a chat-mode sketch follows this list)
- beam_search_causal_lm.cpp: delete wrong comment by @Wovchena in #1639
- [WWB]: Fixed chat template usage in VLM GenAI pipeline by @AlexKoff88 in #1643
- [WWB]: Fixed nano-Llava preprocessor selection by @AlexKoff88 in #1646
- [WWB]: Added config to preprocessor call in VLMs by @AlexKoff88 in #1638
- CB: remove DeviceConfig class by @ilya-lavrenov in #1640
- [WWB]: Added initialization of nano-llava in case of Transformers model by @AlexKoff88 in #1649
- WWB: simplify code around start_chat / use_template by @ilya-lavrenov in #1650
- Tokenizers update by @ilya-lavrenov in #1653
- DOCS: reorganized support models for image generation by @ilya-lavrenov in #1655
- Fix using llm_bench/wwb with a version w/o apply_chat_template by @sbalandi in #1651
- Fix Qwen2VL generation without images by @yatarkan in #1645
- Parallel sampling with threadpool by @mzegla in #1252
- [Coverity] Enabling coverity scan by @akazakov-github in #1657
- [ CB ] Fix streaming in case of empty outputs by @iefode in #1647
- Allow overriding eos_token_id by @Wovchena in #1654
- CB: remove GenerationHandle:back by @ilya-lavrenov in #1662
- Fix tiny-random-llava-next in VLM Pipeline by @yatarkan in #1660
- [CB] Add KVHeadConfig parameters to PagedAttention's rt_info by @sshlyapn in #1666
- Bump py-build-cmake from 0.3.4 to 0.4.0 by @dependabot in #1668
- pin optimum version by @pavel-esir in #1675
- [LLM] Enabled CB by default by @ilya-lavrenov in #1455
- SAMPLER: fixed hang during destruction of ThreadPool by @ilya-lavrenov in #1681
- CB: use optimized scheduler config for cases when user explicitly asked CB backend by @ilya-lavrenov in #1679
- [CB] Return Block manager asserts to destructors by @iefode in #1569
- phi3_v: allow images, remove unused var by @Wovchena in #1670
- [Image Generation] Inpainting for FLUX by @likholat in #1685
- [WWB]: Added support for SchedulerConfig in LLMPipeline by @AlexKoff88 in #1671
- Add LongBench validation by @l-bat in #1220
- Fix Tokenizer for several added special tokens by @pavel-esir in #1659
- Unpin optimum-intel version by @ilya-lavrenov in #1680
- Image generation: proper error message when encode() is used w/o encoder passed to ctor by @ilya-lavrenov in #1683
- Fix excluding stop str from output for some tokenizers by @sbalandi in #1676
- [VLM] Fix chat template fallback in chat mode with defined system message by @yatarkan in https://github.com/openvinotoolkit/openvino.genai/pull/...
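#1533 above makes the pipeline apply the model's chat template automatically even outside chat mode; the explicit chat loop it complements is sketched below. The model path is hypothetical and the API follows the chat samples.

```python
# Minimal sketch: explicit chat mode with openvino_genai.LLMPipeline.
# start_chat()/finish_chat() bracket a multi-turn exchange so the chat template is
# applied and conversation state is kept between generate() calls.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("./tiny-llama-chat-ov", "CPU")  # hypothetical model dir

pipe.start_chat()
print(pipe.generate("What is the capital of France?", max_new_tokens=64))
print(pipe.generate("And roughly how many people live there?", max_new_tokens=64))
pipe.finish_chat()
```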
2025.0.0.0
Please check out the latest documentation pages related to the new openvino_genai package!
2024.6.0.0
Please check out the latest documentation pages related to the new openvino_genai package!
2024.5.0.0
Please check out the latest documentation pages related to the new openvino_genai package!
2024.4.1.0
Please check out the latest documentation pages related to the new openvino_genai package!
What's Changed
- Bump OV version to 2024.4.1 by @akladiev in #894
- Update requirements.txt and add requirements_2024.4.txt by @wgzintel in #893
Full Changelog: 2024.4.0.0...2024.4.1.0
2024.4.0.0
Please check out the latest documentation pages related to the new openvino_genai package!
What's Changed
- Support chat conversation for StaticLLMPipeline by @TolyaTalamanov in #580
- Prefix caching. by @popovaan in #639
- Allow building GenAI with OpenVINO via extra modules by @ilya-lavrenov in #726
- Simplified partial preemption algorithm. by @popovaan in #730
- Add set_chat_template by @Wovchena in #734
- Detect KV cache sequence length axis by @as-suvorov in #744
- Enable u8 KV cache precision for CB by @ilya-lavrenov in #759
- Add test case for native pytorch model by @wgzintel in #722
- Prefix caching improvements by @popovaan in #758
- Add USS metric by @wgzintel in #762
- Prefix caching optimization by @popovaan in #785
- Transition to default int4 compression configs from optimum-intel by @nikita-savelyevv in #689
- Control KV-cache size for StaticLLMPipeline by @TolyaTalamanov in #795
- [2024.4] update optimum intel commit to include mxfp4 conversion by @eaidova in #828
- [2024.4] use perf metrics for genai in llm bench by @eaidova in #830
- Update Pybind to version 13 by @mryzhov in #836
- Introduce stop_strings and stop_token_ids sampling params [2024.4 base] by @mzegla in #817 (a stopping-criteria sketch follows this list)
- StaticLLMPipeline: Handle single element list of prompts by @TolyaTalamanov in #848
- Fix Meta-Llama-3.1-8B-Instruct chat template by @pavel-esir in #846
- Add GPU support for continuous batching [2024.4] by @sshlyapn in #858
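#817 above introduces the stop_strings and stop_token_ids sampling parameters; a minimal sketch of how they are typically set on `GenerationConfig` follows. The set-typed fields and `include_stop_str_in_output` match the current generation-config docs, but treat them as assumptions for this 2024.4-era release.

```python
# Minimal sketch: stopping criteria via stop_strings / stop_token_ids.
# Assumptions: both fields are Python sets on GenerationConfig, and
# include_stop_str_in_output controls whether the matched string is returned.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("./phi-2-ov", "CPU")  # hypothetical model dir

config = ov_genai.GenerationConfig()
config.max_new_tokens = 128
config.stop_strings = {"\nObservation:"}
config.include_stop_str_in_output = False
config.stop_token_ids = {pipe.get_tokenizer().get_eos_token_id()}

print(pipe.generate("List three prime numbers:", config))
```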
Full Changelog: 2024.3.0.0...2024.4.0.0
2024.3.0.0
Please check out the latest documentation pages related to the new openvino_genai package!