
Conversation


@lhp-deep lhp-deep commented Dec 2, 2025

What this PR does / why we need it?

In reinforcement learning scenarios, the current inference path applies a transpose operation to the MoE weights during weight loading. For a cleaner architecture, this weight transpose logic has been moved into the wake_up method.
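A minimal sketch of the flow this targets, assuming vLLM's sleep-mode API (enable_sleep_mode, sleep, wake_up); the model name is a placeholder and exact signatures may differ on vllm-ascend:

from vllm import LLM

# Placeholder model name; any MoE checkpoint applies here.
llm = LLM(model="some/moe-model", enable_sleep_mode=True)

outputs = llm.generate("hello")  # rollout / inference phase

llm.sleep(level=1)   # free accelerator memory between rollouts
# ... the trainer updates the weights externally ...
llm.wake_up()        # with this PR, the MoE weight transpose now happens here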

Does this PR introduce any user-facing change?

How was this patch tested?


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the weight transposition logic for MoE models, moving it from the weight loading process into the wake_up method. This is intended to support Reinforcement Learning scenarios where weights are updated dynamically. The changes involve modifying how MoE weights are handled in fused_moe.py and worker_v1.py, and updating example and test files accordingly. My review identifies a critical bug in the weight identification logic within the wake_up method and suggests an improvement for correctness and code quality.

Comment on lines +195 to +198
for name, param in model.named_parameters():
    if 'w2_weight' in name and param.shape[2] == hidden_size:
        parts = name.split('.')
        param_name = parts[-1]
        parent_module = model.get_submodule(".".join(parts[:-1]))

        w2_data = param.transpose(1, 2)
        w2_data = torch.nn.Parameter(w2_data, requires_grad=False)
        setattr(parent_module, param_name, w2_data)
    elif 'w13_weight' in name and param.shape[1] == hidden_size:
        parts = name.split('.')
        param_name = parts[-1]
        parent_module = model.get_submodule(".".join(parts[:-1]))

        w13_data = param.transpose(1, 2)
        w13_data = torch.nn.Parameter(w13_data, requires_grad=False)
        setattr(parent_module, param_name, w13_data)

critical

There are a few issues in this block of code:

  1. [Critical] The condition to identify the w2_weight parameter is incorrect. The shape of w2_weight at this point is (num_experts, hidden_size, intermediate_size). The condition param.shape[2] == hidden_size compares intermediate_size with hidden_size, which is not always true and will cause this logic to fail for many models. It should be param.shape[1] == hidden_size to correctly identify the parameter by its hidden dimension size.

  2. [High] After transposing a tensor, it's good practice to call .contiguous() to ensure the memory layout is contiguous. This can prevent potential errors and performance issues in subsequent operations that expect a contiguous tensor. The load_weights method, which is called after this, might rely on it.

  3. [Medium] The code for transposing w2_weight and w13_weight is very similar. This duplication can be avoided by refactoring it into a helper function to improve readability and maintainability.

I've provided a suggestion that fixes the critical bug, adds .contiguous(), and refactors the duplicated logic; a small standalone sketch illustrating points 1 and 2 follows the suggestion below.

Suggested change
for name, param in model.named_parameters():
    # The shape of w2_weight is (num_experts, hidden_size, intermediate_size)
    # The shape of w13_weight is (num_experts, hidden_size, 2 * intermediate_size)
    if ('w2_weight' in name or 'w13_weight' in name) and len(param.shape) == 3 and param.shape[1] == hidden_size:
        parts = name.split('.')
        param_name = parts[-1]
        parent_module = model.get_submodule(".".join(parts[:-1]))
        # Transpose back to training format and ensure contiguity
        new_data = param.transpose(1, 2).contiguous()
        new_param = torch.nn.Parameter(new_data, requires_grad=False)
        setattr(parent_module, param_name, new_param)
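
For illustration only, a minimal standalone sketch of points 1 and 2, using made-up dimensions (num_experts=4, hidden_size=8, intermediate_size=16; the real values come from the model config):

import torch

# Made-up dimensions for illustration; real shapes come from the model config.
num_experts, hidden_size, intermediate_size = 4, 8, 16

# w2_weight as laid out at this point: (num_experts, hidden_size, intermediate_size)
w2_weight = torch.randn(num_experts, hidden_size, intermediate_size)

# Original condition: shape[2] is intermediate_size, so it only matches when
# intermediate_size happens to equal hidden_size.
print(w2_weight.shape[2] == hidden_size)  # False for these dimensions
# Fixed condition: shape[1] is the hidden dimension.
print(w2_weight.shape[1] == hidden_size)  # True

# transpose() returns a view with swapped strides, not a contiguous tensor.
transposed = w2_weight.transpose(1, 2)
print(transposed.is_contiguous())               # False
print(transposed.contiguous().is_contiguous())  # True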


github-actions bot commented Dec 2, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Dec 3, 2025