[GPU] Optimize Image preprocessing by GPU #3063
base: master
Conversation
Pull request overview
This PR optimizes image preprocessing for the Phi-3.5 Vision model on GPU by creating an OpenVINO preprocessing model that can be combined with the vision encoding model, moving preprocessing operations from CPU to GPU.
Key changes:
- Added GPU preprocessing support controlled by device type and environment variable
- Created `create_combined_preprocessing_vision_model()` to build preprocessing operations as an OpenVINO model
- Modified constructor to conditionally compile the combined preprocessing+vision model for GPU devices
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/cpp/src/visual_language/phi3_vision/classes.hpp | Added private members for GPU preprocessing flag and model creation function |
| src/cpp/src/visual_language/phi3_vision/classes.cpp | Implemented GPU preprocessing logic with device detection, model creation, and conditional preprocessing paths |
```cpp
// global_processed: (1, C, H, W)
// hd_sliced: (num_slices, C, H, W)
// Output: (1 + num_slices, C, H, W)
return std::make_shared<v0::Concat>(NodeVector{global_processed, hd_sliced}, 0);
```
It seems the lambda isn't needed. Execute that `make_shared` directly at the place where `create_concatenate_batch` is called.
Fixed
```cpp
auto create_pad_to_max_crops = [&](std::shared_ptr<Node> input_nchw, std::shared_ptr<Node> max_crops_param) {
    auto create_constant_i64 = [](const std::vector<int64_t>& val) {
        return v0::Constant::create(element::i64, Shape{val.size()}, val);
    };
```
Convert to standalone function
Fixed
```cpp
bool is_gpu_device = (device == "GPU" || device.find("GPU") == 0);

if (!is_gpu_device) {
    // Always use CPU preprocessing for non-GPU devices
```
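The gating in the quoted diff can be sketched in plain C++. The function and environment-variable names below are illustrative, not from the PR; note also that the prefix check `device.find("GPU") == 0` already subsumes the exact `device == "GPU"` comparison.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical sketch of the device/environment gating; names are
// illustrative, not taken from the PR.
bool should_use_ov_image_preprocess(const std::string& device) {
    // Prefix check matches "GPU", "GPU.0", "GPU.1", ... and already
    // covers the exact "GPU" string.
    const bool is_gpu_device = device.rfind("GPU", 0) == 0;
    // A hypothetical override that forces the CPU preprocessing path.
    const char* env = std::getenv("FORCE_CPU_IMAGE_PREPROCESS");
    const bool forced_cpu = (env != nullptr && std::string(env) == "1");
    return is_gpu_device && !forced_cpu;
}
```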
Why is the new model block used only for GPU?
Modified it following qwen2vl.
Modified to also support CPU.
```cpp
bool use_ov_image_preprocess = false; // Default to false, will be set based on device and environment

// GPU preprocessing model creation function
std::shared_ptr<ov::Model> create_combined_preprocessing_vision_model(
```
Does `create_combined_preprocessing_vision_model()` need to be a member function? Convert it to a standalone function in an anonymous namespace.
Fixed
Share perf results
First-token latency on a 12th Gen Intel(R) Core(TM) i9-12900K:
GPU
(1) Optimized CPU image processing + GPU
(2) OV image processing (GPU) + GPU
| iter | 2025.4rc2 | (1) | (2) |
|---|---|---|---|
| warm-up | 383.75 | 345.73 | 355.34 |
| 1 | 256.18 | 219.62 | 207.13 |
| 2 | 152.82 | 108.02 | 98.35 |
| 3 | 144.49 | 109.55 | 103.41 |
| 4 | 156.77 | 106.60 | 100.19 |
| avg. | 177.57 | 135.95 | 127.27 |
| avg. w/o #1 | 151.36 | 108.06 | 100.65 |
CPU
(1) Optimized CPU image processing + CPU
(2) OV image processing (CPU) + CPU
| iter | 2025.4rc2 | (1) | (2) |
|---|---|---|---|
| warm-up | 5851.68 | 5840.50 | 5772.77 |
| 1 | 1191.25 | 1155.78 | 1144.94 |
| 2 | 1194.99 | 1152.73 | 1145.11 |
| 3 | 1186.30 | 1151.35 | 1139.59 |
| 4 | 1184.42 | 1153.32 | 1141.33 |
| avg. | 1189.24 | 1153.30 | 1142.74 |
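A note on how the summary rows of the tables above appear to be computed: `avg.` is the mean of iterations 1-4 (warm-up excluded) and `avg. w/o #1` the mean of iterations 2-4, which a trivial helper confirms against the published numbers.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Arithmetic mean, used to reproduce the summary rows of the tables above.
double mean(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}
```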
Can this be implemented in optimum-intel instead? In that case optimum-intel would also get faster, and it would export the model with the new block, so GenAI wouldn't need to carry it.
To work on optimum-intel, manager approval is required.
I think there's a lack of transparency. I created CVS-177192 in an attempt to mitigate that. The problem you are solving here is broader and affects all VLMs. And there seems to be a better and simpler solution involving optimum-intel. It's simpler because Python tends to be simpler than C++.
The reason qwen2vl is implemented that way is that it was specifically optimized for GPU and we didn't think that deeply about this problem back then.
Since you've already done that partially for Phi-3.5, I suggest finalizing it on the optimum-intel side. Let me know if my action is required for the manager approval you mentioned.
You posted perf numbers for CPU and GPU and both improved, so we are good here. For WWB you still need to compare this 0.817479 with the baseline and run the same test for GPU. This can be done using GenAI instead of optimum-intel since you already have the implementation.
Before proceeding, we need @rkazants approval as optimum-intel maintainer. @rkazants, are you OK with the proposal?
@Wovchena This 0.817479 is from the GPU test, while using CPU image preprocessing on GPU gives 0.837313. The reason is the mean scaling: when the CPU image preprocessing was optimized, this function was also excluded from the optimization. Therefore, I have some questions:
- Should this similarity score never decrease? Is a trade-off not acceptable?
- What does “baseline” mean? Is there a fixed version of OV and GenAI for the baseline?
- Is the WWB similarity value for CPU also required?
- A minor decrease is possible. I hope it becomes irrelevant after resize is aligned and the model absorbs the difference introduced by mean scaling.
- The baseline is the similarity score for the same device on the GenAI version without this PR.
- Yes
- CPU with your patch, so you should remove the limitation that it's applied to GPU only, like you did for performance.
@rkazants am I understanding correctly that you will take over this PR to merge into optimum-intel?
Share WWB results
```
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
INFO:whowhatbench.model_loaders:Using OpenVINO GenAI API
INFO:whowhatbench.model_loaders:Using OpenVINO GenAI VLMPipeline API
The repository /home/jade/WW44_llm-optimum_2025.4.0-20329/phi-3.5-vision-instruct/pytorch/ov/OV_FP16-4BIT_DEFAULT/ contains custom code which must be executed to correctly load the model.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
Do you wish to run the custom code? [y/N] y
Evaluate pipeline: 24it [00:14, 1.62it/s]
...
Similarity evaluation: 24it [00:00, 29.36it/s]
INFO:whowhatbench.wwb:Metrics for model: /home/jade/WW44_llm-optimum_2025.4.0-20329/phi-3.5-vision-instruct/pytorch/ov/OV_FP16-4BIT_DEFAULT/
INFO:whowhatbench.wwb:        similarity
0    0.817479
```
WWB similarity
| Device | 2025.4rc2 | PR |
|---|---|---|
| CPU | 0.821846 | 0.821519 |
| GPU | 0.837313 | 0.817479 |
Force-pushed from 95fcbcf to af2e7c4.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```cpp
ov::Tensor hd_image = HD_transform(image, config.phi3_v.num_crops);
image_size = ImageSize{hd_image.get_shape().at(2), hd_image.get_shape().at(1)};
```
Copilot AI · Nov 27, 2025
The image size extraction uses magic indices (2 and 1) which refer to height and width. Consider adding named constants or comments to clarify these dimension indices match the expected NHWC format.
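One way to address the magic indices is a set of named constants for the NHWC layout. The sketch below is illustrative: the constant names are invented, and `ImageSize` is a simplified stand-in for the type used in the PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Named indices for an NHWC-layout shape (names are not from the PR):
// in NHWC, index 1 is height and index 2 is width.
constexpr std::size_t NHWC_N = 0, NHWC_H = 1, NHWC_W = 2, NHWC_C = 3;

struct ImageSize {
    std::size_t height;
    std::size_t width;
};

// Extracts height/width from an NHWC shape using the named indices.
ImageSize image_size_from_nhwc(const std::array<std::size_t, 4>& shape) {
    return ImageSize{shape[NHWC_H], shape[NHWC_W]};
}
```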
```cpp
ov::Tensor hd_image = HD_transform(image, config.phi3_v.num_crops);
image_size = ImageSize{hd_image.get_shape().at(2), hd_image.get_shape().at(1)};

uint64_t global_size[2] = {INPUT_IMAGE_SIZE, INPUT_IMAGE_SIZE};
```
Copilot AI · Nov 27, 2025
[nitpick] Using a C-style array for global_size is inconsistent with modern C++ practices. Consider using std::array<uint64_t, 2> for better type safety and clarity.
```cpp
int64_t max_crops_value = static_cast<int64_t>(config.phi3_v.num_crops);
ov::Tensor max_crops_tensor(ov::element::i64, ov::Shape{}, &max_crops_value);
```
Copilot AI · Nov 27, 2025
Passing a pointer to a local variable max_crops_value to the tensor constructor is unsafe if the tensor doesn't copy the data. The variable goes out of scope after this function returns, which could lead to undefined behavior. Verify that ov::Tensor copies the data or ensure the lifetime of max_crops_value extends beyond tensor usage.
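For context, the `ov::Tensor(type, shape, ptr)` constructor does wrap the caller's memory without copying, so the concern applies whenever the tensor outlives the pointed-to variable. A pattern that avoids the question entirely is to let the tensor own its storage and write the value in afterwards. The types below are plain-C++ stand-ins, not the OpenVINO API:

```cpp
#include <cassert>
#include <memory>

// View-style tensor: wraps caller-owned memory without copying, so the
// pointed-to variable must outlive every use of the view.
struct ViewTensor {
    const long long* data;  // non-owning pointer
};

// Safer pattern: the tensor owns its storage and the value is written in
// afterwards (analogous to constructing an ov::Tensor without a pointer
// and filling the buffer via data()).
struct OwningTensor {
    std::shared_ptr<long long> storage = std::make_shared<long long>(0);
    long long& data() { return *storage; }
};

OwningTensor make_max_crops_tensor(long long max_crops) {
    OwningTensor t;
    t.data() = max_crops;  // safe: storage lives as long as the tensor
    return t;
}
```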
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```cpp
ov::Tensor hd_image = HD_transform(image, config.phi3_v.num_crops);
image_size = ImageSize{hd_image.get_shape().at(2), hd_image.get_shape().at(1)};

int64_t global_size[2] = {INPUT_IMAGE_SIZE, INPUT_IMAGE_SIZE};
```
Copilot AI · Nov 27, 2025
[nitpick] The array size 2 is a magic number. Consider defining a named constant (e.g., NUM_DIMENSIONS=2) to clarify that this represents height and width dimensions.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/visual_language/phi3_vision/classes.cpp:1
- [nitpick] The comment states 'batch=1' but the actual shape uses '{1, -1, -1, 3}' where batch size is fixed at 1. The comment is accurate for this specific usage, but consider clarifying that the batch dimension is fixed to 1 for this preprocessing pipeline.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```cpp
attrs.mode = v11::Interpolate::InterpolateMode::CUBIC;
attrs.shape_calculation_mode = v11::Interpolate::ShapeCalcMode::SIZES;
attrs.coordinate_transformation_mode = v11::Interpolate::CoordinateTransformMode::ASYMMETRIC;
attrs.cube_coeff = -0.5f; // Standard coefficient for bicubic interpolation (Catmull-Rom)
```
Copilot AI · Nov 27, 2025
[nitpick] The comment states 'catmull-rom' but the coefficient -0.5 is actually for Catmull-Rom splines, which is a specific case of cubic interpolation. Consider clarifying that this is the standard Catmull-Rom coefficient or simply stating 'bicubic interpolation with Catmull-Rom splines' for precision.
```diff
- attrs.cube_coeff = -0.5f; // Standard coefficient for bicubic interpolation (Catmull-Rom)
+ attrs.cube_coeff = -0.5f; // Catmull-Rom spline coefficient (bicubic interpolation with Catmull-Rom splines)
```
```cpp
m_ireq_queue_vision_encoder = std::make_unique<CircularBufferQueue<ov::InferRequest>>(
    compiled_model.get_property(ov::optimal_number_of_infer_requests),
    [&compiled_model]() -> ov::InferRequest {
        return compiled_model.create_infer_request();
```
Copilot AI · Nov 27, 2025
The lambda captures compiled_model by reference, but compiled_model is a local variable that will go out of scope after the constructor completes. This will result in undefined behavior when the lambda is invoked. Capture compiled_model by value instead: [compiled_model].
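The pitfall can be demonstrated in plain C++. A `std::shared_ptr` stands in for `ov::CompiledModel`, which is itself a reference-counted handle, so capturing it by value is cheap and keeps the underlying object alive:

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Demonstrates the dangling-capture pitfall flagged above. The shared_ptr
// is a stand-in for ov::CompiledModel.
std::function<int()> make_request_factory() {
    auto compiled_model = std::make_shared<int>(42);  // local handle
    // Capturing by reference ([&compiled_model]) would dangle once this
    // function returns; capturing by value keeps the handle alive.
    return [compiled_model]() { return *compiled_model; };
}
```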
Force-pushed from 87f9bc8 to 5804ddb.
Description
Optimize the image preprocessing of the Phi-3.5 Vision model on GPU.
CVS-176393