
Commit 866347a

zzhx1 and wangxiyuan authored
DeepSeek MTP model uses the lm_head and embedding from the main model (#2790)
### What this PR does / why we need it?

The DeepSeek technical report notes that the MTP layer's embedding and lm_head are shared with the main model, but the current implementation independently loads complete copies of both. In the DeepSeek-R1 model, each of these weights is 129280 × 7168 in fp16, i.e. about 1.72 GB. This PR makes the MTP layer reuse the main model's lm_head and embedding, saving about 3.45 GB of GPU memory in the pure DP scenario.

The loading process still first creates temporary space for the MTP layer's embedding and lm_head, then calls torch.equal to check whether each matrix is identical to the main model's. If it is, the main model's module is reused and the temporary tensor is released.

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: zzhx1 <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
1 parent 9fbcfa3 commit 866347a
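
To make the compare-then-rebind mechanism concrete, here is a minimal standalone sketch of the pattern the description outlines. It is an illustration only: the function name, module handles, and shrunken dimensions are hypothetical, not the actual vllm-ascend code paths.

```python
import torch
import torch.nn as nn


def share_if_identical(draft: nn.Embedding, main: nn.Embedding) -> nn.Embedding:
    """Return `main` if `draft` holds a bitwise-identical copy of its weight.

    Rebinding the caller's reference to `main` drops the duplicate tensor,
    letting the allocator reclaim its memory; that is the entire saving.
    """
    if torch.equal(draft.weight, main.weight):
        return main
    return draft


# Shrunken dimensions for the example; the real DeepSeek-R1 matrices are
# 129280 x 7168 in fp16 (~1.72 GB each).
main_embed = nn.Embedding(1000, 64, dtype=torch.float16)
draft_embed = nn.Embedding(1000, 64, dtype=torch.float16)
draft_embed.weight.data.copy_(main_embed.weight.data)  # simulate identical checkpoints

draft_embed = share_if_identical(draft_embed, main_embed)
assert draft_embed.weight is main_embed.weight  # now one tensor, not two
```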

File tree

1 file changed: +10 -0

vllm_ascend/spec_decode/mtp_proposer.py

Lines changed: 10 additions & 0 deletions
```diff
@@ -200,6 +200,16 @@ def load_model(self, model) -> None:
         process_weights_after_loading(self.model, draft_model_config,
                                       target_device)
 
+        # Check if the MTP model can use the main model's embedding and lm_head
+        main_model = model
+        if torch.equal(self.model.model.embed_tokens.weight,
+                       main_model.model.embed_tokens.weight):
+            self.model.model.embed_tokens = main_model.model.embed_tokens
+        for _, layer_module in self.model.model.layers.items():
+            if torch.equal(layer_module.shared_head.head.weight,
+                           main_model.lm_head.weight):
+                layer_module.shared_head.head = main_model.lm_head
+
         if self.vllm_config.compilation_config.cudagraph_mode.has_full_cudagraphs(
         ):
             self.update_stream: torch.npu.Stream = torch.npu.Stream()
```
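
One way to sanity-check that load_model actually established the sharing is to assert that the draft modules alias the main model's objects. A hedged sketch using the attribute paths from the diff above; the `proposer` handle is an assumption for whatever object owns load_model:

```python
# `proposer` is assumed to be the object whose load_model(model) just ran;
# `model` is the main model passed in. Attribute paths follow the diff.
draft = proposer.model

# The embedding should now be the very same module object.
assert draft.model.embed_tokens is model.model.embed_tokens

# Each MTP layer's shared head should alias the main model's lm_head.
for _, layer in draft.model.layers.items():
    assert layer.shared_head.head is model.lm_head
```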
