
Conversation

@wind-all

@wind-all wind-all commented Nov 28, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions github-actions bot added the `documentation` (Improvements or additions to documentation) label on Nov 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a new tutorial for running Qwen3 dense models on multiple NPUs. The tutorial is comprehensive, covering optimizations, environment setup, deployment, and evaluation. My review focuses on ensuring the correctness and clarity of the provided instructions. I've identified a critical issue in a shell script that would cause it to fail, a misleading description of an optimization, and an inconsistent parameter in an example script that could lead to confusion. I have provided suggestions to address these points.

Comment on lines 165 to 180
# Performance optimization of memory management
# if os is Ubuntu
apt update
apt install libjemalloc2
#if os is openEuler, add `sslverify=0` to each warehouse paragraph in openEuler.repo
cp /etc/yum.repos.d/openEuler.repo /etc/yum.repos.d/openEuler.repo.bak
sed -i '/^name.*$/a sslverify=0' /etc/yum.repos.d/openEuler.repo
yum install -y jemalloc
# Add the LD_PRELOAD environment variable
if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then
# On Ubuntu, first install with `apt install libjemalloc2`
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
elif [ -f /usr/lib64/libjemalloc.so.2 ]; then
# On openEuler, first install with `yum install jemalloc`
export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
fi
Contributor


critical

The shell script for setting up environment variables has a logical error. It unconditionally executes commands for both Ubuntu (apt) and openEuler (yum). This will cause errors when run inside a container that only has one of these package managers. The script should be structured with conditional logic (e.g., an if/else block checking for the OS type) to execute the appropriate commands for the environment.

Suggested change
# Performance optimization of memory management
# if os is Ubuntu
apt update
apt install libjemalloc2
#if os is openEuler, add `sslverify=0` to each warehouse paragraph in openEuler.repo
cp /etc/yum.repos.d/openEuler.repo /etc/yum.repos.d/openEuler.repo.bak
sed -i '/^name.*$/a sslverify=0' /etc/yum.repos.d/openEuler.repo
yum install -y jemalloc
# Add the LD_PRELOAD environment variable
if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then
# On Ubuntu, first install with `apt install libjemalloc2`
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
elif [ -f /usr/lib64/libjemalloc.so.2 ]; then
# On openEuler, first install with `yum install jemalloc`
export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
fi
# Performance optimization of memory management
if command -v apt-get >/dev/null; then
    # On Ubuntu, first install with `apt install libjemalloc2`
    apt update
    apt install -y libjemalloc2
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
elif command -v yum >/dev/null; then
    # On openEuler, first install with `yum install jemalloc`
    # add `sslverify=0` to each warehouse paragraph in openEuler.repo
    cp /etc/yum.repos.d/openEuler.repo /etc/yum.repos.d/openEuler.repo.bak
    sed -i '/^name.*$/a sslverify=0' /etc/yum.repos.d/openEuler.repo
    yum install -y jemalloc
    export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
fi
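
As a usage note (not part of the suggestion above), one quick way to confirm the preload took effect before launching the server is shown below; the paths simply mirror the two install locations the script checks:

```bash
# Illustrative sanity check: confirm jemalloc is present and will be preloaded.
echo "$LD_PRELOAD"
ls -l /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 /usr/lib64/libjemalloc.so.2 2>/dev/null
```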

This optimization is enabled by default and does not require any additional environment variables to be set.

### 2. AddRMSNormQuant Fusion
AddRMSNormQuant fusion merges the Address-wise Multi-Scale Normalization and Quantization operations, allowing for more efficient memory access and computation, thereby enhancing throughput.
Contributor


high

The description for AddRMSNormQuant fusion appears to be incorrect. It states it merges "Address-wise Multi-Scale Normalization and Quantization operations". RMSNorm stands for Root Mean Square Normalization. This description is misleading and should be corrected to accurately reflect the function of this optimization.

Suggested change
AddRMSNormQuant fusion merges the Address-wise Multi-Scale Normalization and Quantization operations, allowing for more efficient memory access and computation, thereby enhancing throughput.
AddRMSNormQuant fusion merges the Root Mean Square Normalization (RMSNorm) and Quantization operations, allowing for more efficient memory access and computation, thereby enhancing throughput.
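
For context, RMSNorm (Root Mean Square Normalization) computes the following for a hidden vector x of dimension d; this is the standard textbook definition, added here for reference rather than quoted from the PR:

```math
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}} \cdot \gamma_i
```

The fused kernel applies the residual add, this normalization, and the quantization step in a single pass, which is where the memory-access savings described above come from.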

trust_remote_code=True,
distributed_executor_backend="mp",
max_model_len=5500,
max_num_batched_tokens=5500,
Contributor


high

In the offline inference example, max_num_batched_tokens is set to 5500, which is the same value as max_model_len. This is inconsistent with the online inference example where max-num-batched-tokens is 40960. Using the same value as max_model_len for max_num_batched_tokens is likely a typo and could be confusing or lead to suboptimal performance or OOM errors for users adapting this script. Please use a value that is consistent with the online serving example or provide a justification for this specific value.

Suggested change
max_num_batched_tokens=5500,
max_num_batched_tokens=40960,
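
For comparison, a sketch of how the same knobs appear in an online-serving invocation; the 5500 and 40960 figures come from the snippets discussed above, while the model path and parallel size are placeholders rather than values from this PR:

```bash
# Illustrative only: adjust model path and tensor-parallel size to your setup.
vllm serve /path/to/Qwen3-32B-W8A8 \
    --tensor-parallel-size 4 \
    --max-model-len 5500 \
    --max-num-batched-tokens 40960 \
    --trust-remote-code
```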

multi_npu_qwen3_next
multi_npu
multi_npu_moge
multi_npu_qwen3_dense
Contributor


The name needs to be uniformly changed to Qwen3-Dense

Author


Modified

@@ -0,0 +1,353 @@
# Multi-NPU (Qwen3-32B-W8A8)
Contributor


Modify the title to Qwen3-Dense, and you could take Qwen3-32B as an example

Author


Modified


`docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2-a3`

Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
Contributor


please provide docker run command directly
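
For illustration, the kind of command being requested might look like the sketch below; device indices, mount paths, and the port are assumptions based on a typical Ascend container setup, not values taken from this PR:

```bash
# Illustrative only: adjust devices, mounts, image tag, and port to your node.
docker run --rm -it --name vllm-ascend \
    --device /dev/davinci0 --device /dev/davinci1 \
    --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    quay.io/ascend/vllm-ascend:v0.11.0rc2-a3 bash
```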

Author


Modified

Comment on lines 98 to 100
`docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2`

`docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2-a3`
Contributor


please do not hardcode the vllm-ascend image version; it's better to use |vllm_ascend_version| below
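
As a sketch of what the reviewer is asking for (the substitution token is the one named above; whether it can be used directly inside this code block is an assumption about the docs tooling):

```bash
docker pull quay.io/ascend/vllm-ascend:|vllm_ascend_version|
```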


Modified

@1092626063
Contributor

/lgtm

- `QWEN3-4B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800I A2 (64G × 8) node. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-4B)
- `QWEN3-8B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800I A2 (64G × 8) node. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-8B)
- `QWEN3-14B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) node or 2 Atlas 800I A2 (64G × 8) node. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-14B)
- `QWEN3-32B`(BF16 version): require 4 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800I A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-32B)
Contributor


QWEN3->Qwen3, please modify it.


Modified

### Installation

You can using our official docker image for supporting Qwen3 Dense models.
Currently, we provide the all-in-one images `quay.io/ascend/vllm-ascend:v0.11.0rc2``quay.io/ascend/vllm-ascend:v0.11.0rc2-a3` and so on.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)
Contributor


Suggested change
Currently, we provide the all-in-one images `quay.io/ascend/vllm-ascend:v0.11.0rc2``quay.io/ascend/vllm-ascend:v0.11.0rc2-a3` and so on.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)
Currently, we provide the all-in-one images.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags)


Modified


```{code-block} bash
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc2-a3
Contributor


You should modify it also.


Modified


```bash
# Set vLLM to Engine V1
export VLLM_USE_V1=1
Contributor


No need to export VLLM_USE_V1=1 now. Delete it.


Modified


# Performance optimization of memory management
# if os is Ubuntu
apt update
Contributor


This is too much installation guidance for jemalloc2; refer to https://github.com/vllm-project/vllm-ascend/pull/4399/files


Modified

@menogrey
Contributor

menogrey commented Dec 4, 2025

LGTM, thanks for your contribution.

