[do not review, test only] port to 25.4 RC3 #3076
base: releases/2025/4
Conversation
Pull request overview
This PR ports the codebase to version 25.4 RC3, introducing Eagle3 speculative decoding support. The main changes include adding hidden state handling infrastructure, implementing Eagle3-specific model transformations, and refactoring the generate method to use a common strategy pattern.
Key changes:
- Added Eagle3 speculative decoding implementation with hidden state extraction and model transformation passes
- Refactored speculative decoding to use a common `generate_common` function with a configurable strategy
- Extended data structures to support hidden state tensors for sequence tracking
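The strategy refactor can be illustrated with a minimal, self-contained sketch. This is not the PR's actual code: `ToyImpl`, the hook names, and the `int`-token loop are invented for demonstration, while the real `generate_common` additionally takes input ids, sampling params, and a streamer.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Illustrative strategy: optional hooks run around each decoding step
// (in the PR these would, e.g., import/export hidden states for Eagle3).
struct GenerateStrategy {
    std::function<void(std::vector<int>&)> pre_step;
    std::function<void(std::vector<int>&)> post_step;
};

// Shared decoding loop: the Impl supplies the model-specific pieces,
// the strategy customizes behavior without duplicating the loop.
template <class Impl>
std::vector<int> generate_common(Impl* self,
                                 const std::vector<int>& input_ids,
                                 GenerateStrategy& strategy) {
    std::vector<int> tokens = input_ids;
    while (!self->finished(tokens)) {
        if (strategy.pre_step) strategy.pre_step(tokens);
        tokens.push_back(self->step(tokens));
        if (strategy.post_step) strategy.post_step(tokens);
    }
    return tokens;
}

// Toy "implementation": emits 100, 101, ... and stops at 5 tokens total.
struct ToyImpl {
    int next = 100;
    bool finished(const std::vector<int>& t) const { return t.size() >= 5; }
    int step(const std::vector<int>&) { return next++; }
};
```

The point of the pattern is that the speculative and Eagle3 paths can share one loop and differ only in the hooks they install.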
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| update_request_structs.hpp | Added hidden_states field to GeneratedSequence |
| speculative_decoding_impl.hpp | Added GenerateStrategy struct and generate_common template function |
| speculative_decoding_impl.cpp | Refactored generate method to use strategy pattern via generate_common |
| speculative_decoding_eagle3_impl.hpp | New Eagle3 implementation with transformation passes |
| speculative_decoding_eagle3_impl.cpp | Eagle3 logic including embedding weight sharing and hidden state extraction |
| continuous_batching_for_speculative_decoding_impl.hpp | Added Eagle3DecodingImpl class |
| continuous_batching_for_speculative_decoding_impl.cpp | Hidden state truncation and eagle mode handling |
| sequence_group.hpp | Added hidden state storage to Sequence class |
| sampler.hpp | Added draft-to-target mapping support |
| sampler.cpp | Token index offset computation for draft models |
| pipeline.cpp | Eagle runtime info extraction from model metadata |
| continuous_batching/pipeline.cpp | Eagle3 pipeline instantiation logic |
| model_runner.hpp | Hidden state flags and sequence mapping infrastructure |
| continuous_batching_pipeline.hpp | Forward declarations for Eagle3 classes |
```diff
 GeneratedSequence(const std::vector<int64_t>& generated_token_ids,
-                  const std::vector<float>& generated_log_probs) :
+                  const std::vector<float>& generated_log_probs,
+                  const ov::Tensor generated_hidden_states = {}) :
```
Copilot AI (Nov 26, 2025):
The default parameter generated_hidden_states = {} initializes an empty ov::Tensor. Consider using ov::Tensor() for clarity, or document whether an empty tensor is a valid default state.
```diff
-const ov::Tensor generated_hidden_states = {}) :
+const ov::Tensor generated_hidden_states = ov::Tensor()) :
```
```cpp
template<class Impl>
std::vector<EncodedGenerationResult> generate_common(
    Impl* self,
    const std::vector<ov::Tensor>& input_ids,
    const std::vector<GenerationConfig>& sampling_params,
    const StreamerVariant& streamer,
    std::optional<std::vector<ov::Tensor>> token_type_ids,
    GenerateStrategy& strategy) {
```
The generate_common template function lacks documentation. Add a doc comment explaining its purpose, parameters, and how the strategy pattern is used.
```cpp
void drop_requests();
bool is_requests_empty();
std::vector<SequenceGroup::Ptr> get_awaiting_requests();
```
The init_speculative_models method lacks documentation. Add a comment explaining what it returns (main and draft scheduler configs) and its purpose.
```diff
 std::vector<SequenceGroup::Ptr> get_awaiting_requests();
+/**
+ * Initializes scheduler configurations for the main and draft models used in speculative decoding.
+ *
+ * @param main_model_desc Description of the main model.
+ * @param draft_model_desc Description of the draft model.
+ * @return A pair containing the scheduler configuration for the main model (first) and the draft model (second).
+ */
```
```cpp
void share_embedding_weights(std::shared_ptr<ov::Model>& main_model, std::shared_ptr<ov::Model>& draft_model) {
    // extract embedding weight from main model
    auto find_embedding_gather = [](const std::shared_ptr<ov::Model>& model)
        -> std::shared_ptr<ov::Node> {
```
The magic number 1000 for MIN_VOCAB_SIZE_THRESHOLD should be documented. Add a comment explaining why this threshold was chosen for identifying embedding layers.
```diff
     -> std::shared_ptr<ov::Node> {
+        // Heuristic threshold for identifying embedding layers:
+        // In most NLP models, embedding layers have a vocabulary size (first dimension) much larger than 1000.
+        // This threshold helps distinguish embedding layers from other layers with smaller dimensions.
+        // The value 1000 was chosen as a conservative lower bound based on common vocabulary sizes in language models.
```
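The heuristic being discussed can be sketched without any OpenVINO dependency. `Node` below is a stand-in struct, not the real `ov::Node` API, and the shapes are invented: the idea is simply that the token-embedding table is the Gather whose first dimension is vocabulary-sized.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for a graph node: op type plus the shape of its weight/table input.
struct Node {
    std::string type;
    std::vector<size_t> shape;
};

// Conservative lower bound on vocabulary size; real vocabularies are far larger.
constexpr size_t MIN_VOCAB_SIZE_THRESHOLD = 1000;

// Return the first Gather-like op whose table's first dimension looks
// vocabulary-sized, i.e. the likely token-embedding lookup.
const Node* find_embedding_gather(const std::vector<Node>& nodes) {
    for (const auto& n : nodes) {
        if (n.type == "Gather" && !n.shape.empty() &&
            n.shape[0] >= MIN_VOCAB_SIZE_THRESHOLD) {
            return &n;
        }
    }
    return nullptr;
}
```

A small gather (e.g. a positional lookup with first dimension 8) is skipped, while a 32000-row table passes the threshold.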
```cpp
// If you wish to use different layers, provide the "hidden_layers_list" parameter in the config.
eagle_rt_info.hidden_layers_list = { 2, num_decoder_layers / 2, num_decoder_layers - 3 };
}
OPENVINO_ASSERT(eagle_rt_info.hidden_layers_list.size() == 3, "only exact 3 layer extraction are expected in eagle3");
```
Corrected spelling of 'exact' to 'exactly' in the assertion message.
```diff
-OPENVINO_ASSERT(eagle_rt_info.hidden_layers_list.size() == 3, "only exact 3 layer extraction are expected in eagle3");
+OPENVINO_ASSERT(eagle_rt_info.hidden_layers_list.size() == 3, "exactly 3 hidden layers are expected for extraction in eagle3");
```
```cpp
enum HiddenStateFlags : uint8_t {
    HS_NONE = 0,
    HS_EXPORT = 1 << 0,
    HS_IMPORT = 1 << 1,
    HS_INTERNAL = 1 << 2
```
The HiddenStateFlags enum lacks documentation. Add comments explaining what each flag controls in the context of Eagle3 decoding.
```diff
+/**
+ * @brief Flags to control hidden state handling in Eagle3 decoding.
+ */
 enum HiddenStateFlags : uint8_t {
-    HS_NONE = 0,
-    HS_EXPORT = 1 << 0,
-    HS_IMPORT = 1 << 1,
-    HS_INTERNAL = 1 << 2
+    HS_NONE = 0,          ///< No special handling of hidden states.
+    HS_EXPORT = 1 << 0,   ///< Export hidden states after decoding for use outside the model (e.g., for caching or inspection).
+    HS_IMPORT = 1 << 1,   ///< Import hidden states before decoding, allowing the model to resume from a previous state.
+    HS_INTERNAL = 1 << 2  ///< Use hidden states only internally within the model runner; not exposed for import/export.
```
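For readers unfamiliar with bit-flag enums, the usage pattern is the usual bitwise OR/AND combination. The enum values mirror the PR; the two helper functions are illustrative, not part of the codebase.

```cpp
#include <cassert>
#include <cstdint>

enum HiddenStateFlags : uint8_t {
    HS_NONE     = 0,       // no special hidden state handling
    HS_EXPORT   = 1 << 0,  // export hidden states after decoding
    HS_IMPORT   = 1 << 1,  // import hidden states before decoding
    HS_INTERNAL = 1 << 2   // hidden states used only internally
};

// Combine a flag into a flag set.
inline uint8_t set_flag(uint8_t flags, HiddenStateFlags f) { return flags | f; }

// Test whether a flag set contains a given flag.
inline bool has_flag(uint8_t flags, HiddenStateFlags f) { return (flags & f) != 0; }
```

Because each value occupies a distinct bit, a sequence can, for example, both export and keep internal hidden states at once without ambiguity.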
```cpp
static void _copy_roi_between_tensors(const ov::Tensor& src,
                                      size_t src_start_idx,
                                      size_t copy_length,
                                      const ov::Tensor& dst_base,
                                      size_t dst_first_dim_start) {
```
The _copy_roi_between_tensors helper method lacks documentation. Add a doc comment explaining its purpose and parameters.
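The operation such a helper performs (copying a slice along the first dimension into an offset of a destination buffer) can be sketched with plain vectors. `copy_roi` and `row_size` are illustrative stand-ins: `row_size` plays the role of the product of the tensor's trailing dimensions, and none of this is the actual `ov::Tensor` code.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Copy `row_count` rows starting at src row `src_start_row` into dst,
// beginning at dst row `dst_start_row`. Each row holds `row_size` floats.
void copy_roi(const std::vector<float>& src, size_t src_start_row,
              size_t row_count, std::vector<float>& dst,
              size_t dst_start_row, size_t row_size) {
    std::memcpy(dst.data() + dst_start_row * row_size,
                src.data() + src_start_row * row_size,
                row_count * row_size * sizeof(float));
}
```

In the PR, this kind of copy is what lets per-sequence hidden states be appended into (or truncated out of) a larger batched tensor.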
```cpp
auto d2t = d2t_tensor.data<int64_t>();
sampled_token.m_index = sampled_token.m_index + (d2t? d2t[sampled_token.m_index] : 0);
```
[nitpick] The variable name d2t is unclear. Consider renaming to draft_to_target_mapping or adding a comment explaining that it maps draft token indices to target token indices.
```diff
-auto d2t = d2t_tensor.data<int64_t>();
-sampled_token.m_index = sampled_token.m_index + (d2t? d2t[sampled_token.m_index] : 0);
+// Map from draft token indices to target token indices
+auto draft_to_target_mapping = d2t_tensor.data<int64_t>();
+sampled_token.m_index = sampled_token.m_index + (draft_to_target_mapping ? draft_to_target_mapping[sampled_token.m_index] : 0);
```
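The mapping itself is an offset table: Eagle3 draft models sample over a reduced vocabulary, and `d2t[i]` stores the offset that converts draft index `i` back into the target model's vocabulary index. The function and the table values below are invented for illustration, assuming that interpretation of the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Convert a draft-vocabulary index into a target-vocabulary index.
// A null mapping means the two vocabularies already coincide.
int64_t map_draft_to_target(int64_t draft_index, const int64_t* d2t) {
    return draft_index + (d2t ? d2t[draft_index] : 0);
}
```

Keeping the null check mirrors the PR's ternary: a draft model without a reduced vocabulary simply supplies no mapping tensor.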
Signed-off-by: fishbell <[email protected]>
Description
CVS-###
Fixes #(issue)
Checklist: