
Conversation

@songbell

Description

CVS-###

Fixes #(issue)

Checklist:

  • Tests have been updated or added to cover the new code.
  • This patch fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings November 26, 2025 05:54
@github-actions github-actions bot added the labels category: continuous batching, category: LLM, category: sampling, category: speculative decoding, category: CPP API, and no-match-files on Nov 26, 2025
Contributor

Copilot AI left a comment


Pull request overview

This PR ports the codebase to version 25.4 RC3, introducing Eagle3 speculative decoding support. The main changes include adding hidden state handling infrastructure, implementing Eagle3-specific model transformations, and refactoring the generate method to use a common strategy pattern.

Key changes:

  • Added Eagle3 speculative decoding implementation with hidden state extraction and model transformation passes
  • Refactored speculative decoding to use a common generate_common function with a configurable strategy (see the illustrative sketch after this list)
  • Extended data structures to support hidden state tensors for sequence tracking
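The sketch below is illustrative only: the PR's actual GenerateStrategy members are not shown in this review, so the hook names (on_request_added, on_step_finished) and the DummyImpl type are hypothetical. It only demonstrates the general shape of a template generate_common driven by a strategy object.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical strategy interface: hook names are illustrative, not from the PR.
struct GenerateStrategySketch {
    virtual ~GenerateStrategySketch() = default;
    virtual void on_request_added(std::size_t /*request_id*/) {}  // per-request setup hook
    virtual void on_step_finished(std::size_t /*step*/) {}        // per-step bookkeeping hook
};

// Shared generation loop, parameterized over the pipeline implementation and a strategy.
template <class Impl>
std::vector<int> generate_common_sketch(Impl* self,
                                        const std::vector<int>& inputs,
                                        GenerateStrategySketch& strategy) {
    std::vector<int> results;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        strategy.on_request_added(i);             // strategy customizes request preparation
        results.push_back(self->step(inputs[i])); // the part of the loop shared by all variants
        strategy.on_step_finished(i);             // strategy customizes post-step handling
    }
    return results;
}

// Stand-in pipeline implementation used only to make the sketch runnable.
struct DummyImpl {
    int step(int x) { return x + 1; }
};

int main() {
    DummyImpl impl;
    GenerateStrategySketch default_strategy;
    for (int v : generate_common_sketch(&impl, {1, 2, 3}, default_strategy))
        std::cout << v << ' ';
    std::cout << '\n';
    return 0;
}
```

The value of this pattern here is that the classic and Eagle3 speculative decoding paths can share one generation loop while overriding only the hooks that differ.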

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| update_request_structs.hpp | Added hidden_states field to GeneratedSequence |
| speculative_decoding_impl.hpp | Added GenerateStrategy struct and generate_common template function |
| speculative_decoding_impl.cpp | Refactored generate method to use strategy pattern via generate_common |
| speculative_decoding_eagle3_impl.hpp | New Eagle3 implementation with transformation passes |
| speculative_decoding_eagle3_impl.cpp | Eagle3 logic including embedding weight sharing and hidden state extraction |
| continuous_batching_for_speculative_decoding_impl.hpp | Added Eagle3DecodingImpl class |
| continuous_batching_for_speculative_decoding_impl.cpp | Hidden state truncation and eagle mode handling |
| sequence_group.hpp | Added hidden state storage to Sequence class |
| sampler.hpp | Added draft-to-target mapping support |
| sampler.cpp | Token index offset computation for draft models |
| pipeline.cpp | Eagle runtime info extraction from model metadata |
| continuous_batching/pipeline.cpp | Eagle3 pipeline instantiation logic |
| model_runner.hpp | Hidden state flags and sequence mapping infrastructure |
| continuous_batching_pipeline.hpp | Forward declarations for Eagle3 classes |


  GeneratedSequence(const std::vector<int64_t>& generated_token_ids,
-                   const std::vector<float>& generated_log_probs) :
+                   const std::vector<float>& generated_log_probs,
+                   const ov::Tensor generated_hidden_states = {}) :

Copilot AI Nov 26, 2025


The default parameter generated_hidden_states = {} initializes an empty ov::Tensor. Consider using ov::Tensor() for clarity, or document whether an empty tensor is a valid default state.

Suggested change
- const ov::Tensor generated_hidden_states = {}) :
+ const ov::Tensor generated_hidden_states = ov::Tensor()) :

Comment on lines +28 to +35
template<class Impl>
std::vector<EncodedGenerationResult> generate_common(
Impl* self,
const std::vector<ov::Tensor>& input_ids,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer,
std::optional<std::vector<ov::Tensor>> token_type_ids,
GenerateStrategy& strategy) {

Copilot AI Nov 26, 2025


The generate_common template function lacks documentation. Add a doc comment explaining its purpose, parameters, and how the strategy pattern is used.

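One way to address this is sketched here; the parameter descriptions are inferred from the signature above, so the wording is an assumption rather than text taken from the PR.

```cpp
/**
 * Runs the generation loop shared by the speculative decoding variants.
 *
 * @param self            Pipeline implementation that owns the main/draft models and schedulers.
 * @param input_ids       Tokenized prompts, one tensor per request.
 * @param sampling_params Per-request generation configuration.
 * @param streamer        Optional streamer invoked as tokens are produced.
 * @param token_type_ids  Optional token type ids, one tensor per request.
 * @param strategy        Hooks that specialize the loop for a concrete decoding variant (e.g. Eagle3).
 * @return One EncodedGenerationResult per input request, in the same order as input_ids.
 */
template<class Impl>
std::vector<EncodedGenerationResult> generate_common(
    Impl* self,
    const std::vector<ov::Tensor>& input_ids,
    const std::vector<GenerationConfig>& sampling_params,
    const StreamerVariant& streamer,
    std::optional<std::vector<ov::Tensor>> token_type_ids,
    GenerateStrategy& strategy);
```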

void drop_requests();
bool is_requests_empty();
std::vector<SequenceGroup::Ptr> get_awaiting_requests();

Copilot AI Nov 26, 2025


The init_speculative_models method lacks documentation. Add a comment explaining what it returns (main and draft scheduler configs) and its purpose.

Suggested change
- std::vector<SequenceGroup::Ptr> get_awaiting_requests();
+ std::vector<SequenceGroup::Ptr> get_awaiting_requests();
+ /**
+  * Initializes scheduler configurations for the main and draft models used in speculative decoding.
+  *
+  * @param main_model_desc Description of the main model.
+  * @param draft_model_desc Description of the draft model.
+  * @return A pair containing the scheduler configuration for the main model (first) and the draft model (second).
+  */

void share_embedding_weights(std::shared_ptr<ov::Model>& main_model, std::shared_ptr<ov::Model>& draft_model) {
// extract embedding weight from main model
auto find_embedding_gather = [](const std::shared_ptr<ov::Model>& model)
-> std::shared_ptr<ov::Node> {

Copilot AI Nov 26, 2025


The magic number 1000 for MIN_VOCAB_SIZE_THRESHOLD should be documented. Add a comment explaining why this threshold was chosen for identifying embedding layers.

Suggested change
-     -> std::shared_ptr<ov::Node> {
+     -> std::shared_ptr<ov::Node> {
+         // Heuristic threshold for identifying embedding layers:
+         // In most NLP models, embedding layers have a vocabulary size (first dimension) much larger than 1000.
+         // This threshold helps distinguish embedding layers from other layers with smaller dimensions.
+         // The value 1000 was chosen as a conservative lower bound based on common vocabulary sizes in language models.

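For context on the threshold discussed above, the kind of vocabulary-size heuristic it supports is sketched below against the public OpenVINO graph API. This is not the PR's implementation; the function name find_embedding_gather_sketch and the exact checks are assumptions.

```cpp
#include <cstddef>
#include <memory>
#include <openvino/op/constant.hpp>
#include <openvino/op/gather.hpp>
#include <openvino/openvino.hpp>

// Heuristic lower bound on vocabulary size; mirrors the MIN_VOCAB_SIZE_THRESHOLD
// mentioned in the review comment above.
static constexpr std::size_t MIN_VOCAB_SIZE_THRESHOLD = 1000;

// Sketch only: look for a Gather whose data input is a 2D constant with a large
// first dimension, which is how an embedding table usually appears in the graph.
std::shared_ptr<ov::Node> find_embedding_gather_sketch(const std::shared_ptr<ov::Model>& model) {
    for (const auto& node : model->get_ordered_ops()) {
        if (!ov::as_type_ptr<ov::op::v8::Gather>(node))
            continue;  // embedding lookups are typically lowered to Gather(weights, token_ids, axis)
        auto weights = ov::as_type_ptr<ov::op::v0::Constant>(node->get_input_node_shared_ptr(0));
        if (!weights)
            continue;  // the weights must be a constant for them to be shareable between models
        const auto& shape = weights->get_shape();
        // A [vocab_size, hidden_size] constant whose first dimension exceeds the threshold
        // is far more likely to be the embedding table than any other small lookup.
        if (shape.size() == 2 && shape[0] >= MIN_VOCAB_SIZE_THRESHOLD)
            return node;
    }
    return nullptr;
}
```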
// If you wish to use different layers, provide the "hidden_layers_list" parameter in the config.
eagle_rt_info.hidden_layers_list = { 2, num_decoder_layers / 2, num_decoder_layers - 3 };
}
OPENVINO_ASSERT(eagle_rt_info.hidden_layers_list.size() == 3, "only exact 3 layer extraction are expected in eagle3");

Copilot AI Nov 26, 2025


The assertion message is ungrammatical ("only exact 3 layer extraction are expected in eagle3"); consider rewording it.

Suggested change
- OPENVINO_ASSERT(eagle_rt_info.hidden_layers_list.size() == 3, "only exact 3 layer extraction are expected in eagle3");
+ OPENVINO_ASSERT(eagle_rt_info.hidden_layers_list.size() == 3, "exactly 3 hidden layers are expected for eagle3 extraction");

Comment on lines +27 to +31
enum HiddenStateFlags : uint8_t {
HS_NONE = 0,
HS_EXPORT = 1 << 0,
HS_IMPORT = 1 << 1,
HS_INTERNAL = 1 << 2

Copilot AI Nov 26, 2025


The HiddenStateFlags enum lacks documentation. Add comments explaining what each flag controls in the context of Eagle3 decoding.

Suggested change
- enum HiddenStateFlags : uint8_t {
-     HS_NONE = 0,
-     HS_EXPORT = 1 << 0,
-     HS_IMPORT = 1 << 1,
-     HS_INTERNAL = 1 << 2
+ /**
+  * @brief Flags to control hidden state handling in Eagle3 decoding.
+  * HS_NONE: No special handling of hidden states.
+  * HS_EXPORT: Export hidden states after decoding for use outside the model (e.g., for caching or inspection).
+  * HS_IMPORT: Import hidden states before decoding, allowing the model to resume from a previous state.
+  * HS_INTERNAL: Use hidden states only internally within the model runner; not exposed for import/export.
+  */
+ enum HiddenStateFlags : uint8_t {
+     HS_NONE = 0,          ///< No special handling of hidden states.
+     HS_EXPORT = 1 << 0,   ///< Export hidden states after decoding.
+     HS_IMPORT = 1 << 1,   ///< Import hidden states before decoding.
+     HS_INTERNAL = 1 << 2  ///< Use hidden states only internally.

Comment on lines +711 to +715
static void _copy_roi_between_tensors(const ov::Tensor& src,
size_t src_start_idx,
size_t copy_length,
const ov::Tensor& dst_base,
size_t dst_first_dim_start) {

Copilot AI Nov 26, 2025


The _copy_roi_between_tensors helper method lacks documentation. Add a doc comment explaining its purpose and parameters.

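As a reference point, a doc comment along the following lines could cover the helper above; the parameter meanings are inferred from the names in the signature and should be verified against the implementation.

```cpp
/**
 * Copies a contiguous region along the first dimension from src into dst_base.
 *
 * @param src                 Source tensor to read from.
 * @param src_start_idx       First-dimension index in src at which copying starts.
 * @param copy_length         Number of first-dimension slices to copy.
 * @param dst_base            Destination tensor receiving the slices.
 * @param dst_first_dim_start First-dimension offset in dst_base at which the slices are written.
 */
static void _copy_roi_between_tensors(const ov::Tensor& src,
                                      size_t src_start_idx,
                                      size_t copy_length,
                                      const ov::Tensor& dst_base,
                                      size_t dst_first_dim_start);
```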
Comment on lines +858 to +859
auto d2t = d2t_tensor.data<int64_t>();
sampled_token.m_index = sampled_token.m_index + (d2t? d2t[sampled_token.m_index] : 0);

Copilot AI Nov 26, 2025


[nitpick] The variable name d2t is unclear. Consider renaming to draft_to_target_mapping or adding a comment explaining that it maps draft token indices to target token indices.

Suggested change
- auto d2t = d2t_tensor.data<int64_t>();
- sampled_token.m_index = sampled_token.m_index + (d2t? d2t[sampled_token.m_index] : 0);
+ // Map from draft token indices to target token indices
+ auto draft_to_target_mapping = d2t_tensor.data<int64_t>();
+ sampled_token.m_index = sampled_token.m_index + (draft_to_target_mapping ? draft_to_target_mapping[sampled_token.m_index] : 0);

Signed-off-by: fishbell <[email protected]>
Signed-off-by: fishbell <[email protected]>

Labels

  • category: continuous batching - Continuous batching
  • category: CPP API - Changes in GenAI C++ public headers
  • category: LLM - LLM pipeline (stateful, static)
  • category: sampling - Sampling / Decoding algorithms
  • category: speculative decoding - Speculative decoding
  • no-match-files
