-
Notifications
You must be signed in to change notification settings - Fork 629
add multi_npu_qwen3_dense tutorials #4543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: wind-all <[email protected]>
Signed-off-by: wind-all <[email protected]>
Signed-off-by: wind-all <[email protected]>
Signed-off-by: wind-all <[email protected]>
Signed-off-by: wind-all <[email protected]>
Signed-off-by: wind-all <[email protected]>
Signed-off-by: wind-all <[email protected]>
Signed-off-by: wind-all <[email protected]>
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds a new tutorial for running Qwen3 dense models on multiple NPUs. The tutorial is comprehensive, covering optimizations, environment setup, deployment, and evaluation. My review focuses on ensuring the correctness and clarity of the provided instructions. I've identified a critical issue in a shell script that would cause it to fail, a misleading description of an optimization, and an inconsistent parameter in an example script that could lead to confusion. I have provided suggestions to address these points.
docs/source/tutorials/Qwen3-Dense.md
Outdated
| # Performance optimization of memory management | ||
| # if os is Ubuntu | ||
| apt update | ||
| apt install libjemalloc2 | ||
| #if os is openEuler, add `sslverify=0` to each warehouse paragraph in openEuler.repo | ||
| cp /etc/yum.repos.d/openEuler.repo /etc/yum.repos.d/openEuler.repo.bak | ||
| sed -i '/^name.*$/a sslverify=0' /etc/yum.repos.d/openEuler.repo | ||
| yum install -y jemalloc | ||
| # Add the LD_PRELOAD environment variable | ||
| if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then | ||
| # On Ubuntu, first install with `apt install libjemalloc2` | ||
| export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD | ||
| elif [ -f /usr/lib64/libjemalloc.so.2 ]; then | ||
| # On openEuler, first install with `yum install jemalloc` | ||
| export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shell script for setting up environment variables has a logical error. It unconditionally executes commands for both Ubuntu (apt) and openEuler (yum). This will cause errors when run inside a container that only has one of these package managers. The script should be structured with conditional logic (e.g., an if/else block checking for the OS type) to execute the appropriate commands for the environment.
| # Performance optimization of memory management | |
| # if os is Ubuntu | |
| apt update | |
| apt install libjemalloc2 | |
| #if os is openEuler, add `sslverify=0` to each warehouse paragraph in openEuler.repo | |
| cp /etc/yum.repos.d/openEuler.repo /etc/yum.repos.d/openEuler.repo.bak | |
| sed -i '/^name.*$/a sslverify=0' /etc/yum.repos.d/openEuler.repo | |
| yum install -y jemalloc | |
| # Add the LD_PRELOAD environment variable | |
| if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then | |
| # On Ubuntu, first install with `apt install libjemalloc2` | |
| export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD | |
| elif [ -f /usr/lib64/libjemalloc.so.2 ]; then | |
| # On openEuler, first install with `yum install jemalloc` | |
| export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD | |
| fi | |
| # Performance optimization of memory management | |
| if command -v apt-get >/dev/null; then | |
| # On Ubuntu, first install with `apt install libjemalloc2` | |
| apt update | |
| apt install -y libjemalloc2 | |
| export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD | |
| elif command -v yum >/dev/null; then | |
| # On openEuler, first install with `yum install jemalloc` | |
| # add `sslverify=0` to each warehouse paragraph in openEuler.repo | |
| cp /etc/yum.repos.d/openEuler.repo /etc/yum.repos.d/openEuler.repo.bak | |
| sed -i '/^name.*$/a sslverify=0' /etc/yum.repos.d/openEuler.repo | |
| yum install -y jemalloc | |
| export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD | |
| fi |
docs/source/tutorials/Qwen3-Dense.md
Outdated
| This optimization is enabled by default and does not require any additional environment variables to be set. | ||
|
|
||
| ### 2. AddRMSNormQuant Fusion | ||
| AddRMSNormQuant fusion merges the Address-wise Multi-Scale Normalization and Quantization operations, allowing for more efficient memory access and computation, thereby enhancing throughput. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description for AddRMSNormQuant fusion appears to be incorrect. It states it merges "Address-wise Multi-Scale Normalization and Quantization operations". RMSNorm stands for Root Mean Square Normalization. This description is misleading and should be corrected to accurately reflect the function of this optimization.
| AddRMSNormQuant fusion merges the Address-wise Multi-Scale Normalization and Quantization operations, allowing for more efficient memory access and computation, thereby enhancing throughput. | |
| AddRMSNormQuant fusion merges the Root Mean Square Normalization (RMSNorm) and Quantization operations, allowing for more efficient memory access and computation, thereby enhancing throughput. |
| trust_remote_code=True, | ||
| distributed_executor_backend="mp", | ||
| max_model_len=5500, | ||
| max_num_batched_tokens=5500, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the offline inference example, max_num_batched_tokens is set to 5500, which is the same value as max_model_len. This is inconsistent with the online inference example where max-num-batched-tokens is 40960. Using the same value as max_model_len for max_num_batched_tokens is likely a typo and could be confusing or lead to suboptimal performance or OOM errors for users adapting this script. Please use a value that is consistent with the online serving example or provide a justification for this specific value.
| max_num_batched_tokens=5500, | |
| max_num_batched_tokens=40960, |
docs/source/tutorials/index.md
Outdated
| multi_npu_qwen3_next | ||
| multi_npu | ||
| multi_npu_moge | ||
| multi_npu_qwen3_dense |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name needs to be uniformly changed to Qwen3-Dense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified
| @@ -0,0 +1,353 @@ | |||
| # Multi-NPU (Qwen3-32B-W8A8) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modify the title to Qwen3-Dense, and you could take Qwen3-32B as an example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified
|
|
||
| `docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2-a3` | ||
|
|
||
| Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please provide docker run command directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified
Signed-off-by: wind-all <[email protected]>
Signed-off-by: wind-all <[email protected]>
Signed-off-by: wind-all <[email protected]>
docs/source/tutorials/Qwen3-Dense.md
Outdated
| `docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2` | ||
|
|
||
| `docker pull quay.io/ascend/vllm-ascend:v0.11.0rc2-a3` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do not hardcode vllm-ascend image version,it's better to use |vllm_ascend_version| bellow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified
|
/lgtm |
Signed-off-by: wind-all <[email protected]>
docs/source/tutorials/Qwen3-Dense.md
Outdated
| - `QWEN3-4B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800I A2 (64G × 8) node. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-4B) | ||
| - `QWEN3-8B`(BF16 version): require 1 Atlas 800 A3 (64G × 16) node or 1 Atlas 800I A2 (64G × 8) node. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-8B) | ||
| - `QWEN3-14B`(BF16 version): require 2 Atlas 800 A3 (64G × 16) node or 2 Atlas 800I A2 (64G × 8) node. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-14B) | ||
| - `QWEN3-32B`(BF16 version): require 4 Atlas 800 A3 (64G × 16) nodes or 4 Atlas 800I A2 (64G × 8) nodes. [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-32B) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QWEN3->Qwen3, please modify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified
docs/source/tutorials/Qwen3-Dense.md
Outdated
| ### Installation | ||
|
|
||
| You can using our official docker image for supporting Qwen3 Dense models. | ||
| Currently, we provide the all-in-one images `quay.io/ascend/vllm-ascend:v0.11.0rc2`、`quay.io/ascend/vllm-ascend:v0.11.0rc2-a3` and so on.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Currently, we provide the all-in-one images `quay.io/ascend/vllm-ascend:v0.11.0rc2`、`quay.io/ascend/vllm-ascend:v0.11.0rc2-a3` and so on.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags) | |
| Currently, we provide the all-in-one images.[Download images](https://quay.io/repository/ascend/vllm-ascend?tab=tags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified
docs/source/tutorials/Qwen3-Dense.md
Outdated
|
|
||
| ```{code-block} bash | ||
| # Update the vllm-ascend image | ||
| export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc2-a3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should modify it also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified
docs/source/tutorials/Qwen3-Dense.md
Outdated
|
|
||
| ```bash | ||
| # Set vLLM to Engine V1 | ||
| export VLLM_USE_V1=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to export VLLM_USE_V1=1 now. Delete it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified
docs/source/tutorials/Qwen3-Dense.md
Outdated
|
|
||
| # Performance optimization of memory management | ||
| # if os is Ubuntu | ||
| apt update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too much installation guide for jemalloc2, refer to https://github.com/vllm-project/vllm-ascend/pull/4399/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified
Signed-off-by: wind-all <[email protected]>
|
LGTM, thanks for your contribution. |
Signed-off-by: wind-all <[email protected]>
Signed-off-by: wind-all <[email protected]>
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?