
Commit 6b705eb

evkotov and mryzhov authored
Fix memory consumption issue with quantized Gemini Nano2 models on CPU (#32149)
### Details:

Problem: Quantized models (i8/fp16 weights) were consuming excessive memory (up to 90 GB) because the ConstantFolding transformation converted compressed weights to fp32.

Root cause:
1. EinsumDecomposition was not called in the CPU pipeline before MarkDequantization.
2. MarkDequantization could not recognize decompression patterns containing Einsum operations.
3. DisableDecompressionConvertConstantFolding was disabled, allowing unwanted conversions.

Solution:
1. Add EinsumDecomposition to decompression_handling_manager before MarkDequantization. This allows proper pattern recognition for Einsum operations.
2. Keep DisableDecompressionConvertConstantFolding enabled (comment out the disable line). This preserves the protection against unwanted constant folding.

Transformation pipeline flow:
- Before fix: MarkDequantization -> [Einsum blocks pattern] -> ConstantFolding converts to fp32
- After fix: EinsumDecomposition -> MarkDequantization -> [Pattern recognized] -> Constants preserved

Test results on einsum_model_with_fp16_i8:
- Before: constants converted to fp32 (4x memory increase for i8)
- After: constants remain in i8 format (1057 MB memory usage)

Both changes are required; applying only one results in incorrect behavior.

### Tickets:
- 165827

Co-authored-by: Mikhail Ryzhov <[email protected]>
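As a rough illustration of the memory blow-up described above (a minimal sketch with a hypothetical weight count, not the actual Gemini Nano2 model size): folding i8 constants into fp32 multiplies their footprint by 4, since each weight grows from 1 byte to 4 bytes.

```python
# Hypothetical tensor: 1B int8 weights (illustrative, not taken from the model)
num_weights = 1_000_000_000

i8_mb = num_weights * 1 / 2**20    # int8 weights: 1 byte per element
fp32_mb = num_weights * 4 / 2**20  # same weights folded to fp32: 4 bytes each

print(f"i8: {i8_mb:.0f} MB, fp32: {fp32_mb:.0f} MB, "
      f"ratio: {fp32_mb / i8_mb:.0f}x")
```

The same 4x factor applies per tensor across the whole model, which is how a compressed model can balloon to tens of gigabytes once constant folding converts its weights.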
1 parent: bbf3f96 · commit: 6b705eb

File tree: 1 file changed (+5, -2 lines)


src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp

Lines changed: 5 additions & 2 deletions
```diff
@@ -113,6 +113,7 @@
 #include "transformations/op_conversions/convert_topk11_downgrade.hpp"
 #include "transformations/op_conversions/detection_output_downgrade.hpp"
 #include "transformations/op_conversions/detection_output_upgrade.hpp"
+#include "transformations/op_conversions/einsum_decomposition.hpp"
 #include "transformations/op_conversions/eye_decomposition.hpp"
 #include "transformations/op_conversions/fake_convert_decomposition.hpp"
 #include "transformations/op_conversions/fq_decomposition.hpp"
@@ -462,6 +463,10 @@ void Transformations::PreLpt(const std::vector<ov::element::Type>& defaultPrecis
     const bool useLpt = !defaultPrecisions.empty();
     CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::CompressedGatherTransformation);
     CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::MarkShapeOfSubgraphs);
+
+    // Decompose Einsum before marking dequantization to ensure the pattern can be recognized
+    CPU_REGISTER_PASS_COMMON(decompression_handling_manager, ov::pass::EinsumDecomposition);
+
     // We need to fuse Transpose to MatMul to have a simpler callback for the next transformation
     CPU_REGISTER_PASS_X64(decompression_handling_manager, ov::pass::TransposeMatMul);
     CPU_REGISTER_PASS_ARM(decompression_handling_manager, ov::pass::TransposeMatMul);
@@ -830,8 +835,6 @@ void Transformations::PreLpt(const std::vector<ov::element::Type>& defaultPrecis
 
     // List of enabled/disabled transformations
 
-    // Allow FP16 Converts to be folded and FP16 constants to be upgraded to FP32 data type
-    CPU_DISABLE_PASS_COMMON(manager, ov::pass::DisableDecompressionConvertConstantFolding);
     CPU_DISABLE_PASS_COMMON(manager, ov::pass::ConvertCompressedOnlyToLegacy);
     CPU_DISABLE_PASS_COMMON(manager, ov::pass::EyeDecomposition);
     CPU_DISABLE_PASS_COMMON(manager, ov::pass::ConvertGELU);
```
