[GPU] Optimize Image preprocessing by GPU #3063
base: master
Conversation
Pull request overview
This PR optimizes image preprocessing for the Phi-3.5 Vision model on GPU by creating an OpenVINO preprocessing model that can be combined with the vision encoding model, moving preprocessing operations from CPU to GPU.
Key changes:
- Added GPU preprocessing support controlled by device type and environment variable
- Created `create_combined_preprocessing_vision_model()` to build preprocessing operations as an OpenVINO model
- Modified constructor to conditionally compile the combined preprocessing+vision model for GPU devices
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/cpp/src/visual_language/phi3_vision/classes.hpp | Added private members for GPU preprocessing flag and model creation function |
| src/cpp/src/visual_language/phi3_vision/classes.cpp | Implemented GPU preprocessing logic with device detection, model creation, and conditional preprocessing paths |
```cpp
// global_processed: (1, C, H, W)
// hd_sliced: (num_slices, C, H, W)
// Output: (1 + num_slices, C, H, W)
return std::make_shared<v0::Concat>(NodeVector{global_processed, hd_sliced}, 0);
```
It seems the lambda isn't needed. Execute that `make_shared` directly at the place where `create_concatenate_batch` is called.
Fixed
```cpp
auto create_pad_to_max_crops = [&](std::shared_ptr<Node> input_nchw, std::shared_ptr<Node> max_crops_param) {
    auto create_constant_i64 = [](const std::vector<int64_t>& val) {
        return v0::Constant::create(element::i64, Shape{val.size()}, val);
    };
```
Convert to standalone function
Fixed
```cpp
bool is_gpu_device = (device == "GPU" || device.find("GPU") == 0);

if (!is_gpu_device) {
    // Always use CPU preprocessing for non-GPU devices
```
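The gating in the quoted diff can be sketched in plain C++. The function and environment-variable names below are illustrative, not from the PR; note also that the prefix check `device.find("GPU") == 0` already subsumes the exact `device == "GPU"` comparison.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical sketch of the device/environment gating; names are
// illustrative, not taken from the PR.
bool should_use_ov_image_preprocess(const std::string& device) {
    // Prefix check matches "GPU", "GPU.0", "GPU.1", ... and already
    // covers the exact "GPU" string.
    const bool is_gpu_device = device.rfind("GPU", 0) == 0;
    // A hypothetical override that forces the CPU preprocessing path.
    const char* env = std::getenv("FORCE_CPU_IMAGE_PREPROCESS");
    const bool forced_cpu = (env != nullptr && std::string(env) == "1");
    return is_gpu_device && !forced_cpu;
}
```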
Why is the new model block used only for GPU?
Modified it following qwen2vl.
Modified to also support CPU.
```cpp
bool use_ov_image_preprocess = false; // Default to false, will be set based on device and environment

// GPU preprocessing model creation function
std::shared_ptr<ov::Model> create_combined_preprocessing_vision_model(
```
Does `create_combined_preprocessing_vision_model()` need to be a member function? Convert it to a standalone function in an anonymous namespace.
Fixed
Share perf results
First-token latency on a 12th Gen Intel(R) Core(TM) i9-12900K:
GPU
(1) Optimized CPU image processing + GPU
(2) OV image processing (GPU) + GPU
| iter | 2025.4rc2 | (1) | (2) |
|---|---|---|---|
| warm-up | 383.75 | 345.73 | 355.34 |
| 1 | 256.18 | 219.62 | 207.13 |
| 2 | 152.82 | 108.02 | 98.35 |
| 3 | 144.49 | 109.55 | 103.41 |
| 4 | 156.77 | 106.60 | 100.19 |
| avg. | 177.57 | 135.95 | 127.27 |
| avg. w/o #1 | 151.36 | 108.06 | 100.65 |
CPU
(1) Optimized CPU image processing + CPU
(2) OV image processing (CPU) + CPU
| iter | 2025.4rc2 | (1) | (2) |
|---|---|---|---|
| warm-up | 5851.68 | 5840.50 | 5772.77 |
| 1 | 1191.25 | 1155.78 | 1144.94 |
| 2 | 1194.99 | 1152.73 | 1145.11 |
| 3 | 1186.30 | 1151.35 | 1139.59 |
| 4 | 1184.42 | 1153.32 | 1141.33 |
| avg. | 1189.24 | 1153.30 | 1142.74 |
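A note on how the summary rows of the tables above appear to be computed: `avg.` is the mean of iterations 1-4 (warm-up excluded) and `avg. w/o #1` the mean of iterations 2-4, which a trivial helper confirms against the published numbers.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Arithmetic mean, used to reproduce the summary rows of the tables above.
double mean(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}
```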
Can this be implemented in optimum-intel instead? In that case optimum-intel would also get faster, and it would export the model with the new block, so GenAI wouldn't need to carry it.
To work on optimum-intel, manager approval is required.
I think there's a lack of transparency. I created CVS-177192 in an attempt to mitigate that. The problem you are solving here is broader and affects all VLMs. And there seems to be a better and simpler solution involving optimum-intel. It's simpler because Python tends to be simpler than C++.
The reason qwen2vl is implemented that way is that it was specifically optimized for GPU and we didn't think that deeply about this problem back then.
Since you've already done that partially for Phi-3.5, I suggest finalizing it on the optimum-intel side. Let me know if my action is required for the manager approval you mentioned.
You posted perf numbers for CPU and GPU and both improved, so we are good here. For WWB you still need to compare this 0.817479 with the baseline and run the same test for GPU. This can be done using GenAI instead of optimum-intel since you already have the implementation.
Before proceeding, we need @rkazants approval as optimum-intel maintainer. @rkazants, are you OK with the proposal?
@Wovchena This 0.817479 is from the GPU test, while using CPU image preprocessing on GPU gives 0.837313. The reason is the mean scaling: when the CPU image preprocessing was optimized, this function was also excluded from the optimization. Therefore, I have some questions:
- Should this similarity score never decrease? Is a trade-off not acceptable?
- What does “baseline” mean? Is there a fixed version of OV and GenAI for the baseline?
- Is the WWB similarity value for CPU also required?
- A minor decrease is possible. I hope it becomes irrelevant after resize is aligned and the model absorbs the difference introduced by mean scaling.
- The baseline is the similarity score for the same device on the GenAI version without this PR.
- Yes
- CPU with your patch, so you should remove the limitation that it's applied to GPU only, like you did for performance.
@rkazants am I understanding correctly that you will take over this PR to merge into optimum-intel?
Share WWB results
```
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
INFO:whowhatbench.model_loaders:Using OpenVINO GenAI API
INFO:whowhatbench.model_loaders:Using OpenVINO GenAI VLMPipeline API
The repository /home/jade/WW44_llm-optimum_2025.4.0-20329/phi-3.5-vision-instruct/pytorch/ov/OV_FP16-4BIT_DEFAULT/ contains custom code which must be executed to correctly load the model.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
Do you wish to run the custom code? [y/N] y
Evaluate pipeline: 24it [00:14, 1.62it/s]
...
Similarity evaluation: 24it [00:00, 29.36it/s]
INFO:whowhatbench.wwb:Metrics for model: /home/jade/WW44_llm-optimum_2025.4.0-20329/phi-3.5-vision-instruct/pytorch/ov/OV_FP16-4BIT_DEFAULT/
INFO:whowhatbench.wwb:        similarity
0    0.817479
```
WWB similarity
| Device | 2025.4rc2 | PR |
|---|---|---|
| CPU | 0.821846 | 0.821519 |
| GPU | 0.837313 | 0.817479 |
Force-pushed from 95fcbcf to af2e7c4.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```cpp
ov::Tensor hd_image = HD_transform(image, config.phi3_v.num_crops);
image_size = ImageSize{hd_image.get_shape().at(2), hd_image.get_shape().at(1)};
```
Copilot AI · Nov 27, 2025
The image size extraction uses magic indices (2 and 1) which refer to height and width. Consider adding named constants or comments to clarify these dimension indices match the expected NHWC format.
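One way to address the magic indices is a set of named constants for the NHWC layout. The sketch below is illustrative: the constant names are invented, and `ImageSize` is a simplified stand-in for the type used in the PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Named indices for an NHWC-layout shape (names are not from the PR):
// in NHWC, index 1 is height and index 2 is width.
constexpr std::size_t NHWC_N = 0, NHWC_H = 1, NHWC_W = 2, NHWC_C = 3;

struct ImageSize {
    std::size_t height;
    std::size_t width;
};

// Extracts height/width from an NHWC shape using the named indices.
ImageSize image_size_from_nhwc(const std::array<std::size_t, 4>& shape) {
    return ImageSize{shape[NHWC_H], shape[NHWC_W]};
}
```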
```cpp
ov::Tensor hd_image = HD_transform(image, config.phi3_v.num_crops);
image_size = ImageSize{hd_image.get_shape().at(2), hd_image.get_shape().at(1)};

uint64_t global_size[2] = {INPUT_IMAGE_SIZE, INPUT_IMAGE_SIZE};
```
Copilot AI · Nov 27, 2025
[nitpick] Using a C-style array for global_size is inconsistent with modern C++ practices. Consider using std::array<uint64_t, 2> for better type safety and clarity.
```cpp
int64_t max_crops_value = static_cast<int64_t>(config.phi3_v.num_crops);
ov::Tensor max_crops_tensor(ov::element::i64, ov::Shape{}, &max_crops_value);
```
Copilot AI · Nov 27, 2025
Passing a pointer to a local variable max_crops_value to the tensor constructor is unsafe if the tensor doesn't copy the data. The variable goes out of scope after this function returns, which could lead to undefined behavior. Verify that ov::Tensor copies the data or ensure the lifetime of max_crops_value extends beyond tensor usage.
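For context, the `ov::Tensor(type, shape, ptr)` constructor does wrap the caller's memory without copying, so the concern applies whenever the tensor outlives the pointed-to variable. A pattern that avoids the question entirely is to let the tensor own its storage and write the value in afterwards. The types below are plain-C++ stand-ins, not the OpenVINO API:

```cpp
#include <cassert>
#include <memory>

// View-style tensor: wraps caller-owned memory without copying, so the
// pointed-to variable must outlive every use of the view.
struct ViewTensor {
    const long long* data;  // non-owning pointer
};

// Safer pattern: the tensor owns its storage and the value is written in
// afterwards (analogous to constructing an ov::Tensor without a pointer
// and filling the buffer via data()).
struct OwningTensor {
    std::shared_ptr<long long> storage = std::make_shared<long long>(0);
    long long& data() { return *storage; }
};

OwningTensor make_max_crops_tensor(long long max_crops) {
    OwningTensor t;
    t.data() = max_crops;  // safe: storage lives as long as the tensor
    return t;
}
```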
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
```cpp
ov::Tensor hd_image = HD_transform(image, config.phi3_v.num_crops);
image_size = ImageSize{hd_image.get_shape().at(2), hd_image.get_shape().at(1)};

int64_t global_size[2] = {INPUT_IMAGE_SIZE, INPUT_IMAGE_SIZE};
```
Copilot AI · Nov 27, 2025
[nitpick] The array size 2 is a magic number. Consider defining a named constant (e.g., NUM_DIMENSIONS=2) to clarify that this represents height and width dimensions.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/visual_language/phi3_vision/classes.cpp:1
- [nitpick] The comment states 'batch=1' but the actual shape uses '{1, -1, -1, 3}' where batch size is fixed at 1. The comment is accurate for this specific usage, but consider clarifying that the batch dimension is fixed to 1 for this preprocessing pipeline.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
```cpp
attrs.mode = v11::Interpolate::InterpolateMode::CUBIC;
attrs.shape_calculation_mode = v11::Interpolate::ShapeCalcMode::SIZES;
attrs.coordinate_transformation_mode = v11::Interpolate::CoordinateTransformMode::ASYMMETRIC;
attrs.cube_coeff = -0.5f; // Standard coefficient for bicubic interpolation (Catmull-Rom)
```
Copilot AI · Nov 27, 2025
[nitpick] The comment states 'catmull-rom' but the coefficient -0.5 is actually for Catmull-Rom splines, which is a specific case of cubic interpolation. Consider clarifying that this is the standard Catmull-Rom coefficient or simply stating 'bicubic interpolation with Catmull-Rom splines' for precision.
```diff
- attrs.cube_coeff = -0.5f; // Standard coefficient for bicubic interpolation (Catmull-Rom)
+ attrs.cube_coeff = -0.5f; // Catmull-Rom spline coefficient (bicubic interpolation with Catmull-Rom splines)
```
```cpp
m_ireq_queue_vision_encoder = std::make_unique<CircularBufferQueue<ov::InferRequest>>(
    compiled_model.get_property(ov::optimal_number_of_infer_requests),
    [&compiled_model]() -> ov::InferRequest {
        return compiled_model.create_infer_request();
```
Copilot AI · Nov 27, 2025
The lambda captures compiled_model by reference, but compiled_model is a local variable that will go out of scope after the constructor completes. This will result in undefined behavior when the lambda is invoked. Capture compiled_model by value instead: [compiled_model].
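The pitfall can be demonstrated in plain C++. A `std::shared_ptr` stands in for `ov::CompiledModel`, which is itself a reference-counted handle, so capturing it by value is cheap and keeps the underlying object alive:

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Demonstrates the dangling-capture pitfall flagged above. The shared_ptr
// is a stand-in for ov::CompiledModel.
std::function<int()> make_request_factory() {
    auto compiled_model = std::make_shared<int>(42);  // local handle
    // Capturing by reference ([&compiled_model]) would dangle once this
    // function returns; capturing by value keeps the handle alive.
    return [compiled_model]() { return *compiled_model; };
}
```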
Force-pushed from 87f9bc8 to 5804ddb.
Description
Optimize the image preprocessing of the Phi-3.5 Vision model on GPU.
CVS-176393