
Conversation

@wind-all

@wind-all wind-all commented Nov 28, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions github-actions bot added the `documentation` (Improvements or additions to documentation) label on Nov 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a new tutorial for running Qwen3 dense models on multiple NPUs. The tutorial is comprehensive, covering optimizations, environment setup, deployment, and evaluation. My review focuses on ensuring the correctness and clarity of the provided instructions. I've identified a critical issue in a shell script that would cause it to fail, a misleading description of an optimization, and an inconsistent parameter in an example script that could lead to confusion. I have provided suggestions to address these points.

Comment on lines 165 to 180
# Performance optimization of memory management
# if os is Ubuntu
apt update
apt install libjemalloc2
#if os is openEuler, add `sslverify=0` to each warehouse paragraph in openEuler.repo
cp /etc/yum.repos.d/openEuler.repo /etc/yum.repos.d/openEuler.repo.bak
sed -i '/^name.*$/a sslverify=0' /etc/yum.repos.d/openEuler.repo
yum install -y jemalloc
# Add the LD_PRELOAD environment variable
if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then
# On Ubuntu, first install with `apt install libjemalloc2`
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
elif [ -f /usr/lib64/libjemalloc.so.2 ]; then
# On openEuler, first install with `yum install jemalloc`
export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
fi
Contributor


critical

The shell script for setting up environment variables has a logical error. It unconditionally executes commands for both Ubuntu (apt) and openEuler (yum). This will cause errors when run inside a container that only has one of these package managers. The script should be structured with conditional logic (e.g., an if/else block checking for the OS type) to execute the appropriate commands for the environment.

Suggested change
# Performance optimization of memory management
# if os is Ubuntu
apt update
apt install libjemalloc2
#if os is openEuler, add `sslverify=0` to each warehouse paragraph in openEuler.repo
cp /etc/yum.repos.d/openEuler.repo /etc/yum.repos.d/openEuler.repo.bak
sed -i '/^name.*$/a sslverify=0' /etc/yum.repos.d/openEuler.repo
yum install -y jemalloc
# Add the LD_PRELOAD environment variable
if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then
# On Ubuntu, first install with `apt install libjemalloc2`
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
elif [ -f /usr/lib64/libjemalloc.so.2 ]; then
# On openEuler, first install with `yum install jemalloc`
export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
fi
# Performance optimization of memory management
if command -v apt-get >/dev/null; then
    # On Ubuntu, first install with `apt install libjemalloc2`
    apt update
    apt install -y libjemalloc2
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
elif command -v yum >/dev/null; then
    # On openEuler, first install with `yum install jemalloc`
    # add `sslverify=0` to each warehouse paragraph in openEuler.repo
    cp /etc/yum.repos.d/openEuler.repo /etc/yum.repos.d/openEuler.repo.bak
    sed -i '/^name.*$/a sslverify=0' /etc/yum.repos.d/openEuler.repo
    yum install -y jemalloc
    export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
fi
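
As a usage note (not part of the suggestion above), one quick way to confirm the preload took effect before launching the server is shown below; the paths simply mirror the two install locations the script checks:

```bash
# Illustrative sanity check: confirm jemalloc is present and will be preloaded.
echo "$LD_PRELOAD"
ls -l /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 /usr/lib64/libjemalloc.so.2 2>/dev/null
```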

This optimization is enabled by default and does not require any additional environment variables to be set.

### 2. AddRMSNormQuant Fusion
AddRMSNormQuant fusion merges the Address-wise Multi-Scale Normalization and Quantization operations, allowing for more efficient memory access and computation, thereby enhancing throughput.
Contributor


high

The description for AddRMSNormQuant fusion appears to be incorrect. It states it merges "Address-wise Multi-Scale Normalization and Quantization operations". RMSNorm stands for Root Mean Square Normalization. This description is misleading and should be corrected to accurately reflect the function of this optimization.

Suggested change
AddRMSNormQuant fusion merges the Address-wise Multi-Scale Normalization and Quantization operations, allowing for more efficient memory access and computation, thereby enhancing throughput.
AddRMSNormQuant fusion merges the Root Mean Square Normalization (RMSNorm) and Quantization operations, allowing for more efficient memory access and computation, thereby enhancing throughput.
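
For context, RMSNorm (Root Mean Square Normalization) computes the following for a hidden vector x of dimension d; this is the standard textbook definition, added here for reference rather than quoted from the PR:

```math
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}} \cdot \gamma_i
```

The fused kernel applies the residual add, this normalization, and the quantization step in a single pass, which is where the memory-access savings described above come from.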

trust_remote_code=True,
distributed_executor_backend="mp",
max_model_len=5500,
max_num_batched_tokens=5500,
Contributor


high

In the offline inference example, max_num_batched_tokens is set to 5500, which is the same value as max_model_len. This is inconsistent with the online inference example where max-num-batched-tokens is 40960. Using the same value as max_model_len for max_num_batched_tokens is likely a typo and could be confusing or lead to suboptimal performance or OOM errors for users adapting this script. Please use a value that is consistent with the online serving example or provide a justification for this specific value.

Suggested change
max_num_batched_tokens=5500,
max_num_batched_tokens=40960,
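
For comparison, a sketch of how the same knobs appear in an online-serving invocation; the 5500 and 40960 figures come from the snippets discussed above, while the model path and parallel size are placeholders rather than values from this PR:

```bash
# Illustrative only: adjust model path and tensor-parallel size to your setup.
vllm serve /path/to/Qwen3-32B-W8A8 \
    --tensor-parallel-size 4 \
    --max-model-len 5500 \
    --max-num-batched-tokens 40960 \
    --trust-remote-code
```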

multi_npu_qwen3_next
multi_npu
multi_npu_moge
multi_npu_qwen3_dense
Contributor


The name needs to be uniformly changed to Qwen3-Dense

Author


Modified

@@ -0,0 +1,353 @@
# Multi-NPU (Qwen3-32B-W8A8)
Contributor


Modify the title to Qwen3-Dense, and you could take Qwen3-32B as an example

Author


Modified


`docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2-a3`

Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
Contributor


please provide docker run command directly
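
For illustration, the kind of command being requested might look like the sketch below; device indices, mount paths, and the port are assumptions based on a typical Ascend container setup, not values taken from this PR:

```bash
# Illustrative only: adjust devices, mounts, image tag, and port to your node.
docker run --rm -it --name vllm-ascend \
    --device /dev/davinci0 --device /dev/davinci1 \
    --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    quay.io/ascend/vllm-ascend:v0.11.0rc2-a3 bash
```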

Author


Modified

Comment on lines 98 to 100
`docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2`

`docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2-a3`
Contributor


please do not hardcode the vllm-ascend image version; it's better to use |vllm_ascend_version| below
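
As a sketch of what the reviewer is asking for (the substitution token is the one named above; whether it can be used directly inside this code block is an assumption about the docs tooling):

```bash
docker pull quay.io/ascend/vllm-ascend:|vllm_ascend_version|
```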


Modified

@1092626063
Contributor

/lgtm

- `QWEN3-4B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800I A2 (64G × 8) node. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-4B)
- `QWEN3-8B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800I A2 (64G × 8) node. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-8B)
- `QWEN3-14B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) node or 2 Atlas 800I A2 (64G × 8) node. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-14B)
- `QWEN3-32B`(BF16 version): require 4 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800I A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-32B)
Contributor


QWEN3->Qwen3, please modify it.


Modified

### Installation

You can using our official docker image for supporting Qwen3 Dense models.
Currently, we provide the all-in-one images `quay.io/ascend/vllm-ascend:v0.11.0rc2``quay.io/ascend/vllm-ascend:v0.11.0rc2-a3` and so on.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)
Contributor


Suggested change
Currently, we provide the all-in-one images `quay.io/ascend/vllm-ascend:v0.11.0rc2``quay.io/ascend/vllm-ascend:v0.11.0rc2-a3` and so on.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)
Currently, we provide the all-in-one images.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)


Modified


```{code-block} bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc2-a3
Contributor


You should modify it also.


Modified


```bash
# Set vLLM to Engine V1
export VLLM_USE_V1=1
Contributor


No need to export VLLM_USE_V1=1 now. Delete it.


Modified


# Performance optimization of memory management
# if os is Ubuntu
apt update
Contributor


This is too much installation guidance for jemalloc2; refer to https://github.com/vllm-project/vllm-ascend/pull/4399/files


Modified

@menogrey
Contributor

menogrey commented Dec 4, 2025

LGTM, thanks for your contribution.

