Conversation

@jade-cho (Contributor)

Description

Optimize the image preprocessing of the Phi-3.5 Vision model on GPU.

  • Create an ov::Model for image preprocessing.
  • Combine it with the vision encoding model as its input stage.

CVS-176393

Copilot AI review requested due to automatic review settings November 24, 2025 14:03
@github-actions github-actions bot added the category: visual language Visual language pipeline label Nov 24, 2025
Copilot AI (Contributor) left a comment

Pull request overview

This PR optimizes image preprocessing for the Phi-3.5 Vision model on GPU by creating an OpenVINO preprocessing model that can be combined with the vision encoding model, moving preprocessing operations from CPU to GPU.

Key changes:

  • Added GPU preprocessing support controlled by device type and environment variable
  • Created create_combined_preprocessing_vision_model() to build preprocessing operations as an OpenVINO model
  • Modified constructor to conditionally compile combined preprocessing+vision model for GPU devices

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| ---- | ----------- |
| src/cpp/src/visual_language/phi3_vision/classes.hpp | Added private members for GPU preprocessing flag and model creation function |
| src/cpp/src/visual_language/phi3_vision/classes.cpp | Implemented GPU preprocessing logic with device detection, model creation, and conditional preprocessing paths |


// global_processed: (1, C, H, W)
// hd_sliced: (num_slices, C, H, W)
// Output: (1 + num_slices, C, H, W)
return std::make_shared<v0::Concat>(NodeVector{global_processed, hd_sliced}, 0);
Collaborator:

It seems the lambda isn't needed. Execute that make_shared directly at the place where create_concatenate_batch is called.

Contributor Author:

Fixed

auto create_pad_to_max_crops = [&](std::shared_ptr<Node> input_nchw, std::shared_ptr<Node> max_crops_param) {
auto create_constant_i64 = [](const std::vector<int64_t>& val) {
return v0::Constant::create(element::i64, Shape{val.size()}, val);
};
Collaborator:

Convert this to a standalone function.

Contributor Author:

Fixed

bool is_gpu_device = (device == "GPU" || device.find("GPU") == 0);

if (!is_gpu_device) {
// Always use CPU preprocessing for non-GPU devices
Collaborator:

Why is the new model block used only for GPU?

Contributor Author:

Modified it following qwen2vl.

Contributor Author:

Modified to also support CPU.
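The device check quoted above can be sketched as a small helper. Note that `device == "GPU"` is already subsumed by the prefix test; `rfind(prefix, 0)` is the idiomatic pre-C++20 way to test a string prefix. The helper name below is illustrative, not code from this PR:

```cpp
#include <string>

// Illustrative helper (not from this PR): treat "GPU" and enumerated
// devices such as "GPU.0"/"GPU.1" as GPU devices, but not names that
// merely contain "GPU" elsewhere. rfind("GPU", 0) only matches at
// index 0, so it is a prefix test and also covers device == "GPU".
inline bool is_gpu_device(const std::string& device) {
    return device.rfind("GPU", 0) == 0;
}
```

With C++20, `device.starts_with("GPU")` expresses the same intent directly.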

bool use_ov_image_preprocess = false; // Default to false, will be set based on device and environment

// GPU preprocessing model creation function
std::shared_ptr<ov::Model> create_combined_preprocessing_vision_model(
Collaborator:

Does create_combined_preprocessing_vision_model() need to be a member function? Convert it to a standalone function in an anonymous namespace.

Contributor Author:

Fixed

Collaborator:

Share perf results

Contributor Author:

1st token latency on 12th Gen Intel(R) Core(TM) i9-12900K

GPU
(1) Optimized CPU image processing + GPU
(2) OV image processing (GPU) + GPU

| iter        | 2025.4rc2 | (1)    | (2)    |
| ----------- | --------- | ------ | ------ |
| warm-up     | 383.75    | 345.73 | 355.34 |
| 1           | 256.18    | 219.62 | 207.13 |
| 2           | 152.82    | 108.02 | 98.35  |
| 3           | 144.49    | 109.55 | 103.41 |
| 4           | 156.77    | 106.60 | 100.19 |
| avg.        | 177.57    | 135.95 | 127.27 |
| avg. w/o #1 | 151.36    | 108.06 | 100.65 |

CPU
(1) Optimized CPU image processing + CPU
(2) OV image processing (CPU) + CPU

| iter    | 2025.4rc2 | (1)     | (2)     |
| ------- | --------- | ------- | ------- |
| warm-up | 5851.68   | 5840.50 | 5772.77 |
| 1       | 1191.25   | 1155.78 | 1144.94 |
| 2       | 1194.99   | 1152.73 | 1145.11 |
| 3       | 1186.30   | 1151.35 | 1139.59 |
| 4       | 1184.42   | 1153.32 | 1141.33 |
| avg.    | 1189.24   | 1153.30 | 1142.74 |

Collaborator:

Can this be implemented in optimum-intel instead? In that case optimum-intel would also get faster, and it would export the model with the new block, so GenAI wouldn't need to carry it.

Contributor Author:

To work on optimum-intel, manager approval is required.

Collaborator:

I think there's a lack of transparency. I created CVS-177192 in an attempt to mitigate that. The problem you are solving here is broader and affects all VLMs. And there seems to be a better and simpler solution involving optimum-intel. It's simpler because Python tends to be simpler than C++.
The reason qwen2vl is implemented that way is that it was specifically optimized for GPU and we didn't think that deeply about this problem back then.

Since you have already done that partially for Phi-3.5, I suggest finalizing it on the optimum-intel side. Let me know if my action is required for the manager approval you mentioned.

You posted perf numbers for CPU and GPU and both improved, so we are good here. For WWB you still need to compare this 0.817479 with the baseline and run the same test for GPU. This can be done using GenAI instead of optimum-intel since you already have the implementation.

Before proceeding, we need @rkazants approval as optimum-intel maintainer. @rkazants, are you OK with the proposal?

Contributor Author:

@Wovchena This 0.817479 is from the GPU test, while using CPU image preprocessing on GPU gives 0.837313. The difference comes from mean scaling, which was also excluded when the CPU image preprocessing was optimized. Therefore, I have some questions:

  1. Should this similarity score never decrease? Is a trade-off not acceptable?
  2. What does “baseline” mean? Is there a fixed version of OV and GenAI for baseline?
  3. Is the WWB similarity value for CPU also required?

Collaborator:

  1. Minor decrease is possible. I hope it becomes irrelevant after resize is aligned and the model absorbs the difference introduced by mean scaling.
  2. Baseline is the similarity score for the same device on GenAI version without this PR
  3. Yes

Collaborator:

  1. CPU with your patch, so you should remove the limitation that it's applied to GPU only, like you did for performance

Collaborator:

@jade-cho, @Wovchena, our team will take this AR to extend optimum-intel.
We need more context about it. Please describe what exactly is needed in optimum-intel. I will also sync with @Wovchena today to ask.

Contributor:

@rkazants am I understanding correctly that you will take over this PR to merge into optimum-intel?

Collaborator:

@jade-cho, we agreed with @Wovchena not to do it in optimum-intel. We agreed to do this optimization on the OpenVINO GenAI side only. Please proceed in this PR.

Collaborator:

Share WWB results

@jade-cho (Contributor Author), Nov 25, 2025:

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
INFO:whowhatbench.model_loaders:Using OpenVINO GenAI API
INFO:whowhatbench.model_loaders:Using OpenVINO GenAI VLMPipeline API
The repository /home/jade/WW44_llm-optimum_2025.4.0-20329/phi-3.5-vision-instruct/pytorch/ov/OV_FP16-4BIT_DEFAULT/ contains custom code which must be executed to correctly load the model. You can inspect the repository content at /home/jade/WW44_llm-optimum_2025.4.0-20329/phi-3.5-vision-instruct/pytorch/ov/OV_FP16-4BIT_DEFAULT .
 You can inspect the repository content at https://hf.co//home/jade/WW44_llm-optimum_2025.4.0-20329/phi-3.5-vision-instruct/pytorch/ov/OV_FP16-4BIT_DEFAULT/.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
Evaluate pipeline: 24it [00:14,  1.62it/s]
...
Similarity evaluation: 24it [00:00, 29.36it/s]
INFO:whowhatbench.wwb:Metrics for model: /home/jade/WW44_llm-optimum_2025.4.0-20329/phi-3.5-vision-instruct/pytorch/ov/OV_FP16-4BIT_DEFAULT/
INFO:whowhatbench.wwb:   similarity
0    0.817479

Contributor Author:

WWB similarity

| Device | 2025.4rc2 | PR       |
| ------ | --------- | -------- |
| CPU    | 0.821846  | 0.821519 |
| GPU    | 0.837313  | 0.817479 |

@jade-cho jade-cho force-pushed the phi3_gpu_preprocessing branch from 95fcbcf to af2e7c4 Compare November 25, 2025 05:41
Copilot AI review requested due to automatic review settings November 27, 2025 00:41
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.



Copilot AI review requested due to automatic review settings November 27, 2025 00:45
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.



Comment on lines +889 to +893
ov::Tensor hd_image = HD_transform(image, config.phi3_v.num_crops);
image_size = ImageSize{hd_image.get_shape().at(2), hd_image.get_shape().at(1)};
Copilot AI, Nov 27, 2025:

The image size extraction uses magic indices (2 and 1) which refer to height and width. Consider adding named constants or comments to clarify these dimension indices match the expected NHWC format.

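The suggestion above can be sketched with named indices. `ImageSizeLike` and the `nhwc` constants are hypothetical stand-ins for the PR's `ImageSize` and the raw `.at(2)`/`.at(1)` calls, assuming the NHWC layout the comment mentions:

```cpp
#include <array>
#include <cstddef>

// Hypothetical named indices for an NHWC-shaped tensor (stand-ins,
// not PR code), so .at(2)/.at(1) read as width/height at the call site.
namespace nhwc {
constexpr std::size_t N = 0, H = 1, W = 2, C = 3;
}

struct ImageSizeLike {  // stand-in for the PR's ImageSize
    std::size_t width;
    std::size_t height;
};

inline ImageSizeLike image_size_from_nhwc(const std::array<std::size_t, 4>& shape) {
    return {shape.at(nhwc::W), shape.at(nhwc::H)};
}
```

The same constants could document the `{2, 1}` indices wherever the shape is queried.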
ov::Tensor hd_image = HD_transform(image, config.phi3_v.num_crops);
image_size = ImageSize{hd_image.get_shape().at(2), hd_image.get_shape().at(1)};

uint64_t global_size[2] = {INPUT_IMAGE_SIZE, INPUT_IMAGE_SIZE};
Copilot AI, Nov 27, 2025:

[nitpick] Using a C-style array for global_size is inconsistent with modern C++ practices. Consider using std::array<uint64_t, 2> for better type safety and clarity.

Comment on lines +895 to +899
int64_t max_crops_value = static_cast<int64_t>(config.phi3_v.num_crops);
ov::Tensor max_crops_tensor(ov::element::i64, ov::Shape{}, &max_crops_value);
Copilot AI, Nov 27, 2025:

Passing a pointer to a local variable max_crops_value to the tensor constructor is unsafe if the tensor doesn't copy the data. The variable goes out of scope after this function returns, which could lead to undefined behavior. Verify that ov::Tensor copies the data or ensure the lifetime of max_crops_value extends beyond tensor usage.

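Whether the original code is safe depends on lifetimes: an `ov::Tensor` constructed over a host pointer wraps that memory without copying, which is fine if the tensor is consumed before the function returns, but not if it escapes. A minimal stand-in (not OpenVINO code; all names here are illustrative) showing the owning alternative:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal stand-in (not ov::Tensor): a scalar holder that copies the
// bytes it is given, so it can safely outlive the local variable.
struct OwningScalarTensor {
    std::vector<std::uint8_t> bytes;
    OwningScalarTensor(const void* data, std::size_t size)
        : bytes(static_cast<const std::uint8_t*>(data),
                static_cast<const std::uint8_t*>(data) + size) {}
    std::int64_t as_i64() const {
        std::int64_t v = 0;
        std::memcpy(&v, bytes.data(), sizeof(v));
        return v;
    }
};

// Safe: the returned tensor owns a copy of the scalar. A non-owning
// wrapper over &max_crops_value would dangle after this returns.
inline OwningScalarTensor make_max_crops_tensor(std::int64_t num_crops) {
    std::int64_t max_crops_value = num_crops;  // local, dies at return
    return OwningScalarTensor(&max_crops_value, sizeof(max_crops_value));
}
```

If the real tensor is only set as an input and inferred synchronously within the same scope, the pointer-wrapping constructor is acceptable.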
Copilot AI review requested due to automatic review settings November 27, 2025 00:47
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



ov::Tensor hd_image = HD_transform(image, config.phi3_v.num_crops);
image_size = ImageSize{hd_image.get_shape().at(2), hd_image.get_shape().at(1)};

int64_t global_size[2] = {INPUT_IMAGE_SIZE, INPUT_IMAGE_SIZE};
Copilot AI, Nov 27, 2025:

[nitpick] The array size 2 is a magic number. Consider defining a named constant (e.g., NUM_DIMENSIONS=2) to clarify that this represents height and width dimensions.

Copilot AI review requested due to automatic review settings November 27, 2025 00:49
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/visual_language/phi3_vision/classes.cpp:1

  • [nitpick] The comment states 'batch=1' but the actual shape uses '{1, -1, -1, 3}' where batch size is fixed at 1. The comment is accurate for this specific usage, but consider clarifying that the batch dimension is fixed to 1 for this preprocessing pipeline.


Copilot AI review requested due to automatic review settings November 27, 2025 00:53
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.



attrs.mode = v11::Interpolate::InterpolateMode::CUBIC;
attrs.shape_calculation_mode = v11::Interpolate::ShapeCalcMode::SIZES;
attrs.coordinate_transformation_mode = v11::Interpolate::CoordinateTransformMode::ASYMMETRIC;
attrs.cube_coeff = -0.5f; // Standard coefficient for bicubic interpolation (Catmull-Rom)
Copilot AI, Nov 27, 2025:

[nitpick] The comment states 'catmull-rom' but the coefficient -0.5 is actually for Catmull-Rom splines, which is a specific case of cubic interpolation. Consider clarifying that this is the standard Catmull-Rom coefficient or simply stating 'bicubic interpolation with Catmull-Rom splines' for precision.

Suggested change
attrs.cube_coeff = -0.5f; // Standard coefficient for bicubic interpolation (Catmull-Rom)
attrs.cube_coeff = -0.5f; // Catmull-Rom spline coefficient (bicubic interpolation with Catmull-Rom splines)

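For reference, the kernel that `cube_coeff = -0.5` selects can be written out directly: it is Keys' cubic convolution kernel with a = -0.5, the Catmull-Rom case. This sketch is self-contained and independent of OpenVINO:

```cpp
#include <cmath>

// Keys' cubic convolution kernel; a = -0.5 is the Catmull-Rom spline
// used by bicubic resampling (matching cube_coeff = -0.5f above).
inline double cubic_kernel(double x, double a = -0.5) {
    x = std::fabs(x);
    if (x <= 1.0)
        return ((a + 2.0) * x - (a + 3.0)) * x * x + 1.0;  // |x| in [0, 1]
    if (x < 2.0)
        return ((a * x - 5.0 * a) * x + 8.0 * a) * x - 4.0 * a;  // |x| in (1, 2)
    return 0.0;  // zero outside a 4-sample support
}
```

The kernel equals 1 at x = 0, vanishes at all other integers, and dips slightly negative between 1 and 2, which is what gives bicubic resizing its mild sharpening relative to bilinear.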
Comment on lines +936 to +941
m_ireq_queue_vision_encoder = std::make_unique<CircularBufferQueue<ov::InferRequest>>(
compiled_model.get_property(ov::optimal_number_of_infer_requests),
[&compiled_model]() -> ov::InferRequest {
return compiled_model.create_infer_request();
Copilot AI, Nov 27, 2025:

The lambda captures compiled_model by reference, but compiled_model is a local variable that will go out of scope after the constructor completes. This will result in undefined behavior when the lambda is invoked. Capture compiled_model by value instead: [compiled_model].

Comment on lines +971 to +976
m_ireq_queue_vision_encoder = std::make_unique<CircularBufferQueue<ov::InferRequest>>(
compiled_model.get_property(ov::optimal_number_of_infer_requests),
[&compiled_model]() -> ov::InferRequest {
return compiled_model.create_infer_request();
Copilot AI, Nov 27, 2025:

The lambda captures compiled_model by reference, but compiled_model is a local variable that will go out of scope after the constructor completes. This will result in undefined behavior when the lambda is invoked. Capture compiled_model by value instead: [compiled_model].

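The pitfall Copilot flags in both snippets can be shown with a small self-contained sketch (not OpenVINO code; `FakeCompiledModel` is a stand-in, since `ov::CompiledModel` behaves like the shared handle below, so copying it into the lambda is cheap and keeps the underlying object alive). Whether the original is actually affected depends on whether the queue invokes the factory only during construction, while `compiled_model` is still in scope:

```cpp
#include <functional>
#include <memory>

// Stand-in for ov::CompiledModel: a cheap, copyable shared handle.
struct FakeCompiledModel {
    std::shared_ptr<int> impl = std::make_shared<int>(42);
    int create_infer_request() const { return *impl; }
};

// Capturing [&compiled_model] would leave a dangling reference once
// the factory outlives this scope; capturing by value copies the
// shared handle, so the underlying object stays alive with it.
inline std::function<int()> make_request_factory() {
    FakeCompiledModel compiled_model;  // local, as in the constructor
    return [compiled_model]() { return compiled_model.create_infer_request(); };
}
```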
@jade-cho jade-cho force-pushed the phi3_gpu_preprocessing branch from 87f9bc8 to 5804ddb Compare November 27, 2025 08:11