
Commit 6b705eb

evkotov and mryzhov authored
Fix memory consumption issue with quantized Gemini Nano2 models on CPU (#32149)
### Details:

Problem: Quantized models (i8/fp16 weights) were consuming excessive memory (up to 90 GB) because the ConstantFolding transformation converted compressed weights to fp32.

Root cause:
1. EinsumDecomposition was not called in the CPU pipeline before MarkDequantization.
2. MarkDequantization could not recognize decompression patterns containing Einsum operations.
3. DisableDecompressionConvertConstantFolding was disabled, allowing unwanted conversions.

Solution:
1. Add EinsumDecomposition to decompression_handling_manager before MarkDequantization. This allows proper pattern recognition for Einsum operations.
2. Keep DisableDecompressionConvertConstantFolding enabled (comment out the disable line). This preserves the protection against unwanted constant folding.

Transformation pipeline flow:
- Before fix: MarkDequantization -> [Einsum blocks pattern] -> ConstantFolding converts to fp32
- After fix: EinsumDecomposition -> MarkDequantization -> [Pattern recognized] -> Constants preserved

Test results on einsum_model_with_fp16_i8:
- Before: constants converted to fp32 (4x memory increase for i8)
- After: constants remain in i8 format (1057 MB memory usage)

Both changes are required; applying only one results in incorrect behavior.

### Tickets:
- 165827

Co-authored-by: Mikhail Ryzhov <[email protected]>
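As a rough illustration of the memory blow-up described above (a minimal sketch with a hypothetical weight count, not the actual Gemini Nano2 model size): folding i8 constants into fp32 multiplies their footprint by 4, since each weight grows from 1 byte to 4 bytes.

```python
# Hypothetical tensor: 1B int8 weights (illustrative, not taken from the model)
num_weights = 1_000_000_000

i8_mb = num_weights * 1 / 2**20    # int8 weights: 1 byte per element
fp32_mb = num_weights * 4 / 2**20  # same weights folded to fp32: 4 bytes each

print(f"i8: {i8_mb:.0f} MB, fp32: {fp32_mb:.0f} MB, "
      f"ratio: {fp32_mb / i8_mb:.0f}x")
```

The same 4x factor applies per tensor across the whole model, which is how a compressed model can balloon to tens of gigabytes once constant folding converts its weights.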
1 parent: bbf3f96 · commit: 6b705eb

File tree: 1 file changed (+5, -2 lines)


src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp

Lines changed: 5 additions & 2 deletions
```diff
@@ -113,6 +113,7 @@
 #include "transformations/op_conversions/convert_topk11_downgrade.hpp"
 #include "transformations/op_conversions/detection_output_downgrade.hpp"
 #include "transformations/op_conversions/detection_output_upgrade.hpp"
+#include "transformations/op_conversions/einsum_decomposition.hpp"
 #include "transformations/op_conversions/eye_decomposition.hpp"
 #include "transformations/op_conversions/fake_convert_decomposition.hpp"
 #include "transformations/op_conversions/fq_decomposition.hpp"
@@ -462,6 +463,10 @@ void Transformations::PreLpt(const std::vector<ov::element::Type>& defaultPrecis
     const bool useLpt = !defaultPrecisions.empty();
     CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::CompressedGatherTransformation);
     CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::MarkShapeOfSubgraphs);
+
+    // Decompose Einsum before marking dequantization to ensure the pattern can be recognized
+    CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::EinsumDecomposition);
+
     // We need to fuse Transpose to MatMul to have a simpler callback for the next transformation
     CPU_REGISTER_PASS_X64(decompression_handling_manager, ov::pass::TransposeMatMul);
     CPU_REGISTER_PASS_ARM(decompression_handling_manager, ov::pass::TransposeMatMul);
@@ -830,8 +835,6 @@ void Transformations::PreLpt(const std::vector<ov::element::Type>& defaultPrecis
 
     // List of enabled/disabled transformations
 
-    // Allow FP16 Converts to be folded and FP16 constants to be upgraded to FP32 data type
-    CPU_DISABLE_PASS_COMMON(manager, ov::pass::DisableDecompressionConvertConstantFolding);
     CPU_DISABLE_PASS_COMMON(manager, ov::pass::ConvertCompressedOnlyToLegacy);
     CPU_DISABLE_PASS_COMMON(manager, ov::pass::EyeDecomposition);
     CPU_DISABLE_PASS_COMMON(manager, ov::pass::ConvertGELU);
```
