
Conversation

@songbell

Description

CVS-###

Fixes #(issue)

Checklist:

  • Tests have been updated or added to cover the new code.
  • This patch fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings November 26, 2025 05:54
@github-actions github-actions bot added the labels category: continuous batching, category: LLM, category: sampling, category: speculative decoding, category: CPP API, and no-match-files on Nov 26, 2025
Contributor

Copilot AI left a comment


Pull request overview

This PR ports the codebase to version 25.4 RC3, introducing Eagle3 speculative decoding support. The main changes include adding hidden state handling infrastructure, implementing Eagle3-specific model transformations, and refactoring the generate method to use a common strategy pattern.

Key changes:

  • Added Eagle3 speculative decoding implementation with hidden state extraction and model transformation passes
  • Refactored speculative decoding to use a common generate_common function with a configurable strategy (see the illustrative sketch after this list)
  • Extended data structures to support hidden state tensors for sequence tracking
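The sketch below is illustrative only: the PR's actual GenerateStrategy members are not shown in this review, so the hook names (on_request_added, on_step_finished) and the DummyImpl type are hypothetical. It only demonstrates the general shape of a template generate_common driven by a strategy object.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical strategy interface: hook names are illustrative, not from the PR.
struct GenerateStrategySketch {
    virtual ~GenerateStrategySketch() = default;
    virtual void on_request_added(std::size_t /*request_id*/) {}  // per-request setup hook
    virtual void on_step_finished(std::size_t /*step*/) {}        // per-step bookkeeping hook
};

// Shared generation loop, parameterized over the pipeline implementation and a strategy.
template <class Impl>
std::vector<int> generate_common_sketch(Impl* self,
                                        const std::vector<int>& inputs,
                                        GenerateStrategySketch& strategy) {
    std::vector<int> results;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        strategy.on_request_added(i);             // strategy customizes request preparation
        results.push_back(self->step(inputs[i])); // the part of the loop shared by all variants
        strategy.on_step_finished(i);             // strategy customizes post-step handling
    }
    return results;
}

// Stand-in pipeline implementation used only to make the sketch runnable.
struct DummyImpl {
    int step(int x) { return x + 1; }
};

int main() {
    DummyImpl impl;
    GenerateStrategySketch default_strategy;
    for (int v : generate_common_sketch(&impl, {1, 2, 3}, default_strategy))
        std::cout << v << ' ';
    std::cout << '\n';
    return 0;
}
```

The value of this pattern here is that the classic and Eagle3 speculative decoding paths can share one generation loop while overriding only the hooks that differ.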

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| update_request_structs.hpp | Added hidden_states field to GeneratedSequence |
| speculative_decoding_impl.hpp | Added GenerateStrategy struct and generate_common template function |
| speculative_decoding_impl.cpp | Refactored generate method to use strategy pattern via generate_common |
| speculative_decoding_eagle3_impl.hpp | New Eagle3 implementation with transformation passes |
| speculative_decoding_eagle3_impl.cpp | Eagle3 logic including embedding weight sharing and hidden state extraction |
| continuous_batching_for_speculative_decoding_impl.hpp | Added Eagle3DecodingImpl class |
| continuous_batching_for_speculative_decoding_impl.cpp | Hidden state truncation and eagle mode handling |
| sequence_group.hpp | Added hidden state storage to Sequence class |
| sampler.hpp | Added draft-to-target mapping support |
| sampler.cpp | Token index offset computation for draft models |
| pipeline.cpp | Eagle runtime info extraction from model metadata |
| continuous_batching/pipeline.cpp | Eagle3 pipeline instantiation logic |
| model_runner.hpp | Hidden state flags and sequence mapping infrastructure |
| continuous_batching_pipeline.hpp | Forward declarations for Eagle3 classes |


  GeneratedSequence(const std::vector<int64_t>& generated_token_ids,
-                   const std::vector<float>& generated_log_probs) :
+                   const std::vector<float>& generated_log_probs,
+                   const ov::Tensor generated_hidden_states = {}) :

Copilot AI Nov 26, 2025


The default parameter generated_hidden_states = {} initializes an empty ov::Tensor. Consider using ov::Tensor() for clarity, or document whether an empty tensor is a valid default state.

Suggested change
- const ov::Tensor generated_hidden_states = {}) :
+ const ov::Tensor generated_hidden_states = ov::Tensor()) :

Comment on lines +28 to +35
template<class Impl>
std::vector<EncodedGenerationResult> generate_common(
Impl* self,
const std::vector<ov::Tensor>& input_ids,
const std::vector<GenerationConfig>& sampling_params,
const StreamerVariant& streamer,
std::optional<std::vector<ov::Tensor>> token_type_ids,
GenerateStrategy& strategy) {

Copilot AI Nov 26, 2025


The generate_common template function lacks documentation. Add a doc comment explaining its purpose, parameters, and how the strategy pattern is used.

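One way to address this is sketched here; the parameter descriptions are inferred from the signature above, so the wording is an assumption rather than text taken from the PR.

```cpp
/**
 * Runs the generation loop shared by the speculative decoding variants.
 *
 * @param self            Pipeline implementation that owns the main/draft models and schedulers.
 * @param input_ids       Tokenized prompts, one tensor per request.
 * @param sampling_params Per-request generation configuration.
 * @param streamer        Optional streamer invoked as tokens are produced.
 * @param token_type_ids  Optional token type ids, one tensor per request.
 * @param strategy        Hooks that specialize the loop for a concrete decoding variant (e.g. Eagle3).
 * @return One EncodedGenerationResult per input request, in the same order as input_ids.
 */
template<class Impl>
std::vector<EncodedGenerationResult> generate_common(
    Impl* self,
    const std::vector<ov::Tensor>& input_ids,
    const std::vector<GenerationConfig>& sampling_params,
    const StreamerVariant& streamer,
    std::optional<std::vector<ov::Tensor>> token_type_ids,
    GenerateStrategy& strategy);
```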

void drop_requests();
bool is_requests_empty();
std::vector<SequenceGroup::Ptr> get_awaiting_requests();

Copilot AI Nov 26, 2025


The init_speculative_models method lacks documentation. Add a comment explaining what it returns (main and draft scheduler configs) and its purpose.

Suggested change
- std::vector<SequenceGroup::Ptr> get_awaiting_requests();
+ std::vector<SequenceGroup::Ptr> get_awaiting_requests();
+ /**
+  * Initializes scheduler configurations for the main and draft models used in speculative decoding.
+  *
+  * @param main_model_desc Description of the main model.
+  * @param draft_model_desc Description of the draft model.
+  * @return A pair containing the scheduler configuration for the main model (first) and the draft model (second).
+  */

void share_embedding_weights(std::shared_ptr<ov::Model>& main_model, std::shared_ptr<ov::Model>& draft_model) {
// extract embedding weight from main model
auto find_embedding_gather = [](const std::shared_ptr<ov::Model>& model)
-> std::shared_ptr<ov::Node> {

Copilot AI Nov 26, 2025


The magic number 1000 for MIN_VOCAB_SIZE_THRESHOLD should be documented. Add a comment explaining why this threshold was chosen for identifying embedding layers.

Suggested change
-     -> std::shared_ptr<ov::Node> {
+     -> std::shared_ptr<ov::Node> {
+         // Heuristic threshold for identifying embedding layers:
+         // In most NLP models, embedding layers have a vocabulary size (first dimension) much larger than 1000.
+         // This threshold helps distinguish embedding layers from other layers with smaller dimensions.
+         // The value 1000 was chosen as a conservative lower bound based on common vocabulary sizes in language models.

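For context on the threshold discussed above, the kind of vocabulary-size heuristic it supports is sketched below against the public OpenVINO graph API. This is not the PR's implementation; the function name find_embedding_gather_sketch and the exact checks are assumptions.

```cpp
#include <cstddef>
#include <memory>
#include <openvino/op/constant.hpp>
#include <openvino/op/gather.hpp>
#include <openvino/openvino.hpp>

// Heuristic lower bound on vocabulary size; mirrors the MIN_VOCAB_SIZE_THRESHOLD
// mentioned in the review comment above.
static constexpr std::size_t MIN_VOCAB_SIZE_THRESHOLD = 1000;

// Sketch only: look for a Gather whose data input is a 2D constant with a large
// first dimension, which is how an embedding table usually appears in the graph.
std::shared_ptr<ov::Node> find_embedding_gather_sketch(const std::shared_ptr<ov::Model>& model) {
    for (const auto& node : model->get_ordered_ops()) {
        if (!ov::as_type_ptr<ov::op::v8::Gather>(node))
            continue;  // embedding lookups are typically lowered to Gather(weights, token_ids, axis)
        auto weights = ov::as_type_ptr<ov::op::v0::Constant>(node->get_input_node_shared_ptr(0));
        if (!weights)
            continue;  // the weights must be a constant for them to be shareable between models
        const auto& shape = weights->get_shape();
        // A [vocab_size, hidden_size] constant whose first dimension exceeds the threshold
        // is far more likely to be the embedding table than any other small lookup.
        if (shape.size() == 2 && shape[0] >= MIN_VOCAB_SIZE_THRESHOLD)
            return node;
    }
    return nullptr;
}
```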
// If you wish to use different layers, provide the "hidden_layers_list" parameter in the config.
eagle_rt_info.hidden_layers_list = { 2, num_decoder_layers / 2, num_decoder_layers - 3 };
}
OPENVINO_ASSERT(eagle_rt_info.hidden_layers_list.size() == 3, "only exact 3 layer extraction are expected in eagle3");

Copilot AI Nov 26, 2025


The assertion message is ungrammatical ("only exact 3 layer extraction are expected in eagle3"); consider rewording it.

Suggested change
- OPENVINO_ASSERT(eagle_rt_info.hidden_layers_list.size() == 3, "only exact 3 layer extraction are expected in eagle3");
+ OPENVINO_ASSERT(eagle_rt_info.hidden_layers_list.size() == 3, "exactly 3 hidden layers are expected for eagle3 extraction");

Comment on lines +27 to +31
enum HiddenStateFlags : uint8_t {
HS_NONE = 0,
HS_EXPORT = 1 << 0,
HS_IMPORT = 1 << 1,
HS_INTERNAL = 1 << 2

Copilot AI Nov 26, 2025


The HiddenStateFlags enum lacks documentation. Add comments explaining what each flag controls in the context of Eagle3 decoding.

Suggested change
- enum HiddenStateFlags : uint8_t {
-     HS_NONE = 0,
-     HS_EXPORT = 1 << 0,
-     HS_IMPORT = 1 << 1,
-     HS_INTERNAL = 1 << 2
+ /**
+  * @brief Flags to control hidden state handling in Eagle3 decoding.
+  * HS_NONE: No special handling of hidden states.
+  * HS_EXPORT: Export hidden states after decoding for use outside the model (e.g., for caching or inspection).
+  * HS_IMPORT: Import hidden states before decoding, allowing the model to resume from a previous state.
+  * HS_INTERNAL: Use hidden states only internally within the model runner; not exposed for import/export.
+  */
+ enum HiddenStateFlags : uint8_t {
+     HS_NONE = 0,          ///< No special handling of hidden states.
+     HS_EXPORT = 1 << 0,   ///< Export hidden states after decoding.
+     HS_IMPORT = 1 << 1,   ///< Import hidden states before decoding.
+     HS_INTERNAL = 1 << 2  ///< Use hidden states only internally.

Comment on lines +711 to +715
static void _copy_roi_between_tensors(const ov::Tensor& src,
size_t src_start_idx,
size_t copy_length,
const ov::Tensor& dst_base,
size_t dst_first_dim_start) {

Copilot AI Nov 26, 2025


The _copy_roi_between_tensors helper method lacks documentation. Add a doc comment explaining its purpose and parameters.

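As a reference point, a doc comment along the following lines could cover the helper above; the parameter meanings are inferred from the names in the signature and should be verified against the implementation.

```cpp
/**
 * Copies a contiguous region along the first dimension from src into dst_base.
 *
 * @param src                 Source tensor to read from.
 * @param src_start_idx       First-dimension index in src at which copying starts.
 * @param copy_length         Number of first-dimension slices to copy.
 * @param dst_base            Destination tensor receiving the slices.
 * @param dst_first_dim_start First-dimension offset in dst_base at which the slices are written.
 */
static void _copy_roi_between_tensors(const ov::Tensor& src,
                                      size_t src_start_idx,
                                      size_t copy_length,
                                      const ov::Tensor& dst_base,
                                      size_t dst_first_dim_start);
```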
Comment on lines +858 to +859
auto d2t = d2t_tensor.data<int64_t>();
sampled_token.m_index = sampled_token.m_index + (d2t? d2t[sampled_token.m_index] : 0);

Copilot AI Nov 26, 2025


[nitpick] The variable name d2t is unclear. Consider renaming to draft_to_target_mapping or adding a comment explaining that it maps draft token indices to target token indices.

Suggested change
- auto d2t = d2t_tensor.data<int64_t>();
- sampled_token.m_index = sampled_token.m_index + (d2t? d2t[sampled_token.m_index] : 0);
+ // Map from draft token indices to target token indices
+ auto draft_to_target_mapping = d2t_tensor.data<int64_t>();
+ sampled_token.m_index = sampled_token.m_index + (draft_to_target_mapping ? draft_to_target_mapping[sampled_token.m_index] : 0);

Signed-off-by: fishbell <[email protected]>
Signed-off-by: fishbell <[email protected]>

Labels

  • category: continuous batching - Continuous batching
  • category: CPP API - Changes in GenAI C++ public headers
  • category: LLM - LLM pipeline (stateful, static)
  • category: sampling - Sampling / Decoding algorithms
  • category: speculative decoding - Speculative decoding
  • no-match-files
