From 2befd9f98466bf34f4baa494f18419a4220f9283 Mon Sep 17 00:00:00 2001 From: wangxiyuan Date: Tue, 2 Dec 2025 17:35:47 +0800 Subject: [PATCH 1/8] clean up model module (#4611) Model module is useless now. Let't remove it totally. - vLLM version: v0.11.2 Signed-off-by: yangshihao6 --- docs/source/tutorials/Qwen2.5-7b.md | 358 ++++++++++++++++++++++++++++ setup.py | 1 - vllm_ascend/__init__.py | 5 - vllm_ascend/models/__init__.py | 10 - 4 files changed, 358 insertions(+), 16 deletions(-) create mode 100644 docs/source/tutorials/Qwen2.5-7b.md delete mode 100644 vllm_ascend/models/__init__.py diff --git a/docs/source/tutorials/Qwen2.5-7b.md b/docs/source/tutorials/Qwen2.5-7b.md new file mode 100644 index 00000000000..6741b426a9f --- /dev/null +++ b/docs/source/tutorials/Qwen2.5-7b.md @@ -0,0 +1,358 @@ +# Qwen2.5-7B-Instruct Deployment and Verification Guide + +## Introduction + +Qwen2.5-7B-Instruct is a 7-billion-parameter large language model pre-trained on 18 trillion tokens. It supports a maximum context window of 128K, enables generation of up to 8K tokens, and delivers enhanced capabilities in multilingual processing, instruction following, programming, mathematical computation, and structured data handling. + +This document details the complete deployment and verification workflow for the model, including supported features, environment preparation, single-node deployment, functional verification, accuracy and performance evaluation, and troubleshooting of common issues. It is designed to help users quickly complete model deployment and validation. + +## Supported Features + +Qwen2.5-7B-Instruct offers the following core capabilities: +- **Multilingual Support**: Compatible with over 29 languages (Chinese, English, French, Spanish, Russian, Japanese, etc.). +- **Instruction Following**: Optimized through instruction tuning to accurately understand and execute user commands. +- **Programming & Mathematical Proficiency**: Delivers excellent performance on benchmarks such as HumanEval (programming) and MATH (mathematics). +- **Structured Data Handling**: Enhanced ability to process and generate structured data (e.g., tables, JSON formats). +- **Long Context Processing**: Supports a maximum context length of 128K for efficient handling of ultra-long text sequences. + +## Environment Preparation + +### Model Weight + +Qwen2.5-7B-Instruct model weights can be downloaded from the official ModelScope repository (Note: Corrected from VL version to the correct language model link): +- [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct) + +It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment. 
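If the ModelScope command-line tool is available in your environment, the weights can be pulled straight into that directory. This is only a sketch; adjust the target path to your own layout:

```bash
# Install the ModelScope client, then download the weights into the directory used in this guide
pip install modelscope
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct/
```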
+ +### Hardware and System Requirements + +| Component | Specification | +|-----------|---------------| +| Hardware Platform | 910B4 (8 cards × 32GB) | +| Operating System | Ubuntu 22.04 (Corrected from non-official 22.03 version) | +| Driver Version | 25.0.rc1.1 | +| Python Version | 3.12 | + +### Software Dependencies + +| Component | Version Requirement | Notes | +|-----------|---------------------|-------| +| CANN | 8.2.RC1 | Ascend Computing Architecture Dependency | +| PyTorch | 2.5.1.post0 | Base Deep Learning Framework | +| torch-npu | 2.7.1rc1 | Ascend-adapted version | +| vLLM | 0.9.1 | Must match vLLM-Ascend version | +| vLLM-Ascend | 0.9.1-dev | Ascend-optimized version | + +### Environment Check and Verification + +Verify hardware status and network connectivity before installation: +```bash +# Check NPU device status +npu-smi info + +# Verify network interface and connectivity +for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done +for i in {0..15}; do hccn_tool -i $i -link -g; done +for i in {0..15}; do hccn_tool -i $i -net_health -g; done + +# Check IP configuration +for i in {0..15}; do hccn_tool -i $i -ip -g; done +``` + +### Container Environment Setup + +Create a privileged container to isolate the deployment environment: +```bash +docker run -it --privileged --name=test_vllm_Qwen_2.5_7B --net=host --shm-size=500g \ +--device=/dev/davinci{0..15} \ +-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ +-v /home/:/home \ +-w /home/ \ +mindie:dev-2.2.RC1.B070-800I-A2-py312-ubuntu22.03-x86_64 \ +/bin/bash +``` +Replace `` with your actual username. + +### Installation + +Install the required software dependencies in the container following these steps: + +#### Step 1: Install CANN Toolkit +```bash +# Execute the CANN installation package (adjust path to match local file) +./Ascend-cann-toolkit_8.2.RC1_linux-x86_64.run --install --install-path=/home//cmc/cann_8.2.rc1 + +# Configure CANN environment variables (New: Ensure dependencies take effect) +echo "source /home//cmc/cann_8.2.rc1/set_env.sh" >> ~/.bashrc +source ~/.bashrc +``` + +#### Step 2: Configure PyTorch Environment +```bash +# Set up pip mirror sources for faster installation +pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi" + +# Install PyTorch and torch-npu (Fixed version compatibility) +pip install torch==2.5.1.post0 torchvision==0.18.0 torch-npu==2.7.1rc1 +``` + +#### Step 3: Install vLLM and vLLM-Ascend +```bash +# Install dependency packages (New: Avoid compilation failures) +pip install cmake ninja sentencepiece transformers + +# Install vLLM (v0.9.1) +git clone https://github.com/vllm-project/vllm.git +cd vllm && git checkout releases/v0.9.1 +VLLM_TARGET_DEVICE=empty pip install -v -e . +cd .. + +# Install vLLM-Ascend (v0.9.1-dev, Ascend-optimized version) +git clone https://github.com/vllm-project/vllm-ascend.git +cd vllm-ascend && git checkout v0.9.1-dev +pip install -v -e . +cd .. +``` + +#### Step 4: Install Accuracy Evaluation Tool (AISBench) +The AISBench tool is used for model accuracy and performance evaluation. Follow these installation steps: + +:::{note} +The server may be in a restricted network zone (Yellow Zone) and require a Green Zone proxy tool for internet access. Download the proxy tool from the internal repository, run `PortMapping.exe` to obtain the proxy IP, and update `ip_addr` in `portproxy_remote.sh` before executing the script. 
+::: + +```bash +# Clone the AISBench repository +git clone https://gitee.com/aisbench/benchmark.git +cd benchmark/ + +# Install core dependencies +pip3 install -e ./ --use-pep517 + +# Install dependencies for service-oriented model evaluation (vLLM/Triton) +pip3 install -r requirements/api.txt +pip3 install -r requirements/extra.txt + +# Install BFCL evaluation dependencies +pip3 install -r requirements/bfcl_dependencies.txt --no-deps + +# Disable proxy after installation +unset https_proxy +unset http_proxy +``` + +For detailed installation instructions, refer to the [AISBench Official Documentation](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/get_started/install.html). + +## Deployment + +### Single-node Deployment + +Qwen2.5-7B-Instruct supports single-node single-card deployment on the 910B4 platform. Follow these steps to start the inference service: + +1. Prepare model weights: Ensure the downloaded model weights are stored in the `./Qwen2.5-7B-Instruct/` directory. +2. Download the gsm8k dataset (for evaluation): [gsm8k.zip](https://vision-file-storage/api/file/download/attachment-v2/WIKI202511118986704/32978033/20251111T144846Z_9658c67a0fb349f9be081ab9ab9fd2bc.zip?attachment_id=32978033) +3. Create and execute the deployment script (save as `deploy.sh`): + +```shell +#!/bin/sh +# Set environment variables for Ascend optimization +export VLLM_USE_V1=1 +export TASK_QUEUE_ENABLE=1 +export HCCL_OP_EXPANSION_MODE="AIV" +export PAGED_ATTENTION_MASK_LEN=max_seq_len +export VLLM_ASCEND_ENABLE_FLASHCOMM=1 +export VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1 + +# Start vLLM inference service +vllm serve ./Qwen2.5-7B-Instruct/ \ + --host \ # Replace with server IP (e.g., 0.0.0.0 for all interfaces) + --port \ # Replace with available port (e.g., 8080) + --served-model-name qwen-2.5-7b-instruct \ # Standardized model name for consistency + --trust-remote-code \ + --dtype bfloat16 \ + --max-model-len 32768 \ # Maximum context length (adjust based on requirements) + --tensor-parallel-size 1 \ # Single-card deployment + --disable-log-requests \ + --enforce-eager + +# Execution command: chmod +x deploy.sh && ./deploy.sh +``` + +### Multi-node Deployment + +This document currently focuses on single-node deployment. For multi-node deployment, refer to the [vLLM-Ascend Multi-node Guide](https://github.com/vllm-project/vllm-ascend) and ensure consistent environment configuration across all nodes. + +### Prefill-Decode Disaggregation + +This feature is not supported at this time. + +## Functional Verification + +After starting the service, verify functionality using a `curl` request: + +```bash +curl http://:/v1/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "qwen-2.5-7b-instruct", # Must match --served-model-name from deployment + "prompt": "Beijing is a", + "max_tokens": 5, + "temperature": 0 + }' +``` + +A valid response (e.g., `"Beijing is a vibrant and historic capital city"`) indicates successful deployment. 
+ +### Supplementary Verification Method (New) +If `curl` verification fails, use this Python script: +```python +import requests + +url = "http://:/v1/completions" +headers = {"Content-Type": "application/json"} +data = { + "model": "qwen-2.5-7b-instruct", + "prompt": "Explain quantum computing in simple terms", + "max_tokens": 100, + "temperature": 0.7 +} + +response = requests.post(url, headers=headers, json=data) +print(response.json()) +``` + +## Accuracy Evaluation + +Two accuracy evaluation methods are provided: AISBench (recommended) and manual testing with standard datasets. + +### Using AISBench + +#### Prerequisites +1. Extract the gsm8k dataset to `benchmark/datasets/gsm8k/` (download from the link above). +2. Configure model evaluation parameters. + +#### Configuration Steps +1. Locate the AISBench configuration file: +```bash +cd benchmark/ +ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search +``` +2. Modify the configuration file (e.g., `vllm_api_general_chat.py`) to match the deployed service: +```Python +from ais_bench.benchmark.models import VLLMCustomAPIChat + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr='vllm-api-general-chat', + path="", + model="qwen-2.5-7b-instruct", # Must match --served-model-name from deployment + request_rate=0, + retry=2, + host_ip="", # Deployment server IP + host_port=, # Deployment server port + max_out_len=512, + batch_size=1, + generation_kwargs=dict( + temperature=0.5, + top_k=10, + top_p=0.95, + seed=None, + repetition_penalty=1.03, + ) + ) +] +``` + +#### Execution Command +```bash +# Specify visible NPU cards (adjust based on available hardware) +export ASCEND_RT_VISIBLE_DEVICES=0 + +# Run evaluation (debug logs recommended for first execution) +ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug + +# Generate summary report +ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example +``` + +#### Evaluation Results +Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy report is shown below: +![Accuracy Evaluation Result](https://wiki.huawei.com/vision-file-storage/api/file/download/upload-v2/WIKI202511118986704/32976454/30bf146f86ab472697430f8efae66c1a.png) + +### Pure Model Accuracy Evaluation +For local model evaluation (without service deployment), modify `attr="local"` in the AISBench configuration file: +```Python +dict( + attr="local", # Change from "service" to "local" + type=VLLMCustomAPIChat, + abbr='vllm-api-general-chat', + path="./Qwen2.5-7B-Instruct/", # Path to local model weights + model="qwen-2.5-7b-instruct", + # ... (other parameters remain unchanged) +) +``` + +## Performance Evaluation + +### Using AISBench +Add `--mode perf` to the accuracy evaluation command to run performance testing: +```bash +ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example --mode perf +``` + +#### Performance Metrics +Key metrics include throughput (tokens/sec), latency (ms), and NPU utilization. 
A sample result is shown below: +![Performance Evaluation Result](https://wiki.huawei.com/vision-file-storage/api/file/download/upload-v2/WIKI202511118986704/32976455/2b68624624a2436db5959e51aebaa106.png) + +For detailed metric explanations, refer to the [AISBench Performance Documentation](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/base_tutorials/results_intro/performance_metric.html#). + +### Using vLLM Benchmark +vLLM includes a built-in benchmark tool for evaluating throughput and latency. Example command for online serving performance testing: +```bash +export VLLM_USE_MODELSCOPE=true +vllm bench serve \ + --model ./Qwen2.5-7B-Instruct/ \ + --dataset-name random \ + --random-input 200 \ + --num-prompt 200 \ + --request-rate 1 \ + --save-result \ + --result-dir ./perf_results/ +``` + +For more details, refer to the [vLLM Benchmark Documentation](https://docs.vllm.ai/en/latest/contributing/benchmarks.html). + +## Common Issues and Solutions + +### How to Check Service Status and Metrics? +- **Enable Monitoring**: Add the `--metrics` parameter to the deployment command. Access metrics via `http://:/metrics` (Prometheus-compatible) to view NPU utilization, queue length, and inference latency. +- **Debug Logs**: Add the `--log-level debug` parameter to the deployment command to output detailed logs for troubleshooting. + +### Deployment Verification Failure +- **Issue**: `curl` request returns an error or no response. +- **Solutions**: + 1. Verify the server IP and port are correct (use `netstat -tuln | grep ` to check port occupancy). + 2. Ensure the model weight path is correct and the model is fully loaded (look for "Model loaded successfully" in logs). + 3. Confirm firewall rules allow traffic on the deployment port (use `ufw status` to check firewall status). + 4. Verify dependency version compatibility (especially vLLM and vLLM-Ascend must match). + +### Multi-Card Load Imbalance +- **Symptom**: Uneven memory usage across NPU cards. +- **Solutions**: + 1. Ensure `--tensor-parallel-size` matches the number of cards (e.g., set to 8 for 8-card deployment). + 2. For large models, adjust the `--gpu-memory-utilization` parameter (e.g., 0.9) to optimize memory allocation. + 3. Enable Ascend-specific optimizations (e.g., `VLLM_ASCEND_ENABLE_FLASHCOMM=1`). + +### Network Restriction Issues +- **Issue**: Failed dependency downloads (restricted network environment). +- **Solution**: Configure the proxy using the Green Zone proxy tool as described in the [Installation](#step-4-install-accuracy-evaluation-tool-aisbench) section, then disable the proxy after installation. + +### Compilation Failure When Installing vLLM +- **Issue**: Compilation errors occur when executing `pip install -v -e .`. +- **Solutions**: + 1. Ensure dependency packages are installed: `pip install cmake ninja sentencepiece`. + 2. Verify Python version is 3.12 (lower versions are not supported). + 3. Clean cache and reinstall: `rm -rf build/ dist/ *.egg-info && pip install -v -e .`. 
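A quick way to confirm that the core components were installed consistently is to import them and print their versions; a small sanity-check sketch (package names as used in this guide):

```bash
pip list | grep -E "torch|vllm"
python3 -c "import torch, torch_npu, vllm; print(torch.__version__, torch_npu.__version__, vllm.__version__)"
```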
+ diff --git a/setup.py b/setup.py index 3e88affaf79..890b5228e57 100644 --- a/setup.py +++ b/setup.py @@ -534,7 +534,6 @@ def _read_requirements(filename: str) -> List[str]: entry_points={ "vllm.platform_plugins": ["ascend = vllm_ascend:register"], "vllm.general_plugins": [ - "ascend_enhanced_model = vllm_ascend:register_model", "ascend_kv_connector = vllm_ascend:register_connector", "ascend_model_loader = vllm_ascend:register_model_loader", "ascend_service_profiling = vllm_ascend:register_service_profiling" diff --git a/vllm_ascend/__init__.py b/vllm_ascend/__init__.py index 8c8bd514770..117859dff7b 100644 --- a/vllm_ascend/__init__.py +++ b/vllm_ascend/__init__.py @@ -22,11 +22,6 @@ def register(): return "vllm_ascend.platform.NPUPlatform" -def register_model(): - from .models import register_model - register_model() - - def register_connector(): from vllm_ascend.distributed import register_connector register_connector() diff --git a/vllm_ascend/models/__init__.py b/vllm_ascend/models/__init__.py deleted file mode 100644 index b1957fe8f04..00000000000 --- a/vllm_ascend/models/__init__.py +++ /dev/null @@ -1,10 +0,0 @@ -from vllm import ModelRegistry - - -def register_model(): - # There is no PanguProMoEForCausalLM in vLLM, so we should register it before vLLM config initialization - # to make sure the model can be loaded correctly. This register step can be removed once vLLM support PanguProMoEForCausalLM. - ModelRegistry.register_model( - "PanguProMoEForCausalLM", - "vllm_ascend.torchair.models.torchair_pangu_moe:PanguProMoEForCausalLM" - ) From 17152994156ec5ac3b2d0ba52c33994db1a1a74b Mon Sep 17 00:00:00 2001 From: yangshihao6 Date: Tue, 2 Dec 2025 19:18:23 +0800 Subject: [PATCH 2/8] clean up model module (#4611) Model module is useless now. Let't remove it totally. - vLLM version: v0.11.2 Signed-off-by: yangshihao6 --- docs/source/tutorials/index.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md index 7897fdf33dd..5ff1c03d505 100644 --- a/docs/source/tutorials/index.md +++ b/docs/source/tutorials/index.md @@ -10,6 +10,7 @@ single_npu_qwen3_embedding single_npu_qwen3_quantization single_npu_qwen3_w4a4 single_node_pd_disaggregation_llmdatadist +Qwen2.5-7b multi_npu_qwen3_next multi_npu multi_npu_moge From 0fbb025ade91760455f1134956303846a3f4a431 Mon Sep 17 00:00:00 2001 From: yangshihao6 Date: Wed, 3 Dec 2025 17:25:41 +0800 Subject: [PATCH 3/8] clean up model module (#4611) Model module is useless now. Let't remove it totally. - vLLM version: v0.11.2 Signed-off-by: yangshihao6 --- .../tutorials/{Qwen2.5-7b.md => Qwen2.5.md} | 82 ++++--------------- docs/source/tutorials/index.md | 2 +- 2 files changed, 17 insertions(+), 67 deletions(-) rename docs/source/tutorials/{Qwen2.5-7b.md => Qwen2.5.md} (77%) diff --git a/docs/source/tutorials/Qwen2.5-7b.md b/docs/source/tutorials/Qwen2.5.md similarity index 77% rename from docs/source/tutorials/Qwen2.5-7b.md rename to docs/source/tutorials/Qwen2.5.md index 6741b426a9f..5b1e7e730ed 100644 --- a/docs/source/tutorials/Qwen2.5-7b.md +++ b/docs/source/tutorials/Qwen2.5.md @@ -1,77 +1,27 @@ -# Qwen2.5-7B-Instruct Deployment and Verification Guide +# Qwen2.5-Instruct Deployment and Verification Guide ## Introduction -Qwen2.5-7B-Instruct is a 7-billion-parameter large language model pre-trained on 18 trillion tokens. 
It supports a maximum context window of 128K, enables generation of up to 8K tokens, and delivers enhanced capabilities in multilingual processing, instruction following, programming, mathematical computation, and structured data handling. +Qwen2.5-Instruct is the flagship instruction-tuned variant of Alibaba Cloud’s Qwen 2.5 LLM series. It supports a maximum context window of 128K, enables generation of up to 8K tokens, and delivers enhanced capabilities in multilingual processing, instruction following, programming, mathematical computation, and structured data handling. This document details the complete deployment and verification workflow for the model, including supported features, environment preparation, single-node deployment, functional verification, accuracy and performance evaluation, and troubleshooting of common issues. It is designed to help users quickly complete model deployment and validation. ## Supported Features -Qwen2.5-7B-Instruct offers the following core capabilities: -- **Multilingual Support**: Compatible with over 29 languages (Chinese, English, French, Spanish, Russian, Japanese, etc.). -- **Instruction Following**: Optimized through instruction tuning to accurately understand and execute user commands. -- **Programming & Mathematical Proficiency**: Delivers excellent performance on benchmarks such as HumanEval (programming) and MATH (mathematics). -- **Structured Data Handling**: Enhanced ability to process and generate structured data (e.g., tables, JSON formats). -- **Long Context Processing**: Supports a maximum context length of 128K for efficient handling of ultra-long text sequences. +Refer to [supported features](../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix. + +Refer to [feature guide](../user_guide/feature_guide/index.md) to get the feature's configuration. ## Environment Preparation ### Model Weight +- `Qwen2.5-Instruct`(BF16 version): require 2 910B4 (32G × 2) nodes. [Qwen2.5-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-Instruct) -Qwen2.5-7B-Instruct model weights can be downloaded from the official ModelScope repository (Note: Corrected from VL version to the correct language model link): -- [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct) - -It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment. - -### Hardware and System Requirements - -| Component | Specification | -|-----------|---------------| -| Hardware Platform | 910B4 (8 cards × 32GB) | -| Operating System | Ubuntu 22.04 (Corrected from non-official 22.03 version) | -| Driver Version | 25.0.rc1.1 | -| Python Version | 3.12 | - -### Software Dependencies - -| Component | Version Requirement | Notes | -|-----------|---------------------|-------| -| CANN | 8.2.RC1 | Ascend Computing Architecture Dependency | -| PyTorch | 2.5.1.post0 | Base Deep Learning Framework | -| torch-npu | 2.7.1rc1 | Ascend-adapted version | -| vLLM | 0.9.1 | Must match vLLM-Ascend version | -| vLLM-Ascend | 0.9.1-dev | Ascend-optimized version | - -### Environment Check and Verification +It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-Instruct/`) for quick access during deployment. 
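Before continuing, it helps to confirm that the download completed; a quick check (the exact file list depends on the repository, but a typical layout includes `config.json`, tokenizer files and one or more `*.safetensors` shards):

```bash
ls ./Qwen2.5-Instruct/
du -sh ./Qwen2.5-Instruct/
```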
-Verify hardware status and network connectivity before installation: -```bash -# Check NPU device status -npu-smi info - -# Verify network interface and connectivity -for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done -for i in {0..15}; do hccn_tool -i $i -link -g; done -for i in {0..15}; do hccn_tool -i $i -net_health -g; done - -# Check IP configuration -for i in {0..15}; do hccn_tool -i $i -ip -g; done -``` +### Verify Multi-node Communication(Optional) -### Container Environment Setup - -Create a privileged container to isolate the deployment environment: -```bash -docker run -it --privileged --name=test_vllm_Qwen_2.5_7B --net=host --shm-size=500g \ ---device=/dev/davinci{0..15} \ --v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ --v /home/:/home \ --w /home/ \ -mindie:dev-2.2.RC1.B070-800I-A2-py312-ubuntu22.03-x86_64 \ -/bin/bash -``` -Replace `` with your actual username. +If you want to deploy multi-node environment, you need to verify multi-node communication according to [verify multi-node communication environment](../installation.md#verify-multi-node-communication). ### Installation @@ -147,9 +97,9 @@ For detailed installation instructions, refer to the [AISBench Official Document ### Single-node Deployment -Qwen2.5-7B-Instruct supports single-node single-card deployment on the 910B4 platform. Follow these steps to start the inference service: +Qwen2.5-Instruct supports single-node single-card deployment on the 910B4 platform. Follow these steps to start the inference service: -1. Prepare model weights: Ensure the downloaded model weights are stored in the `./Qwen2.5-7B-Instruct/` directory. +1. Prepare model weights: Ensure the downloaded model weights are stored in the `./Qwen2.5-Instruct/` directory. 2. Download the gsm8k dataset (for evaluation): [gsm8k.zip](https://vision-file-storage/api/file/download/attachment-v2/WIKI202511118986704/32978033/20251111T144846Z_9658c67a0fb349f9be081ab9ab9fd2bc.zip?attachment_id=32978033) 3. Create and execute the deployment script (save as `deploy.sh`): @@ -164,7 +114,7 @@ export VLLM_ASCEND_ENABLE_FLASHCOMM=1 export VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1 # Start vLLM inference service -vllm serve ./Qwen2.5-7B-Instruct/ \ +vllm serve ./Qwen2.5-Instruct/ \ --host \ # Replace with server IP (e.g., 0.0.0.0 for all interfaces) --port \ # Replace with available port (e.g., 8080) --served-model-name qwen-2.5-7b-instruct \ # Standardized model name for consistency @@ -184,7 +134,7 @@ This document currently focuses on single-node deployment. For multi-node deploy ### Prefill-Decode Disaggregation -This feature is not supported at this time. +Not supported yet. ## Functional Verification @@ -203,7 +153,7 @@ curl http://:/v1/completions \ A valid response (e.g., `"Beijing is a vibrant and historic capital city"`) indicates successful deployment. -### Supplementary Verification Method (New) +### Supplementary Verification Method If `curl` verification fails, use this Python script: ```python import requests @@ -288,7 +238,7 @@ dict( attr="local", # Change from "service" to "local" type=VLLMCustomAPIChat, abbr='vllm-api-general-chat', - path="./Qwen2.5-7B-Instruct/", # Path to local model weights + path="./Qwen2.5-Instruct/", # Path to local model weights model="qwen-2.5-7b-instruct", # ... (other parameters remain unchanged) ) @@ -313,7 +263,7 @@ vLLM includes a built-in benchmark tool for evaluating throughput and latency. 
E ```bash export VLLM_USE_MODELSCOPE=true vllm bench serve \ - --model ./Qwen2.5-7B-Instruct/ \ + --model ./Qwen2.5-Instruct/ \ --dataset-name random \ --random-input 200 \ --num-prompt 200 \ diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md index 5ff1c03d505..315c26be1d2 100644 --- a/docs/source/tutorials/index.md +++ b/docs/source/tutorials/index.md @@ -10,7 +10,7 @@ single_npu_qwen3_embedding single_npu_qwen3_quantization single_npu_qwen3_w4a4 single_node_pd_disaggregation_llmdatadist -Qwen2.5-7b +Qwen2.5 multi_npu_qwen3_next multi_npu multi_npu_moge From 646154cdbfdf9d9228d00794a6b19cb826ce0597 Mon Sep 17 00:00:00 2001 From: yangshihao6 Date: Wed, 3 Dec 2025 17:25:41 +0800 Subject: [PATCH 4/8] clean up model module (#4611) Model module is useless now. Let't remove it totally. - vLLM version: v0.11.2 Signed-off-by: yangshihao6 --- docs/source/tutorials/Qwen2.5-7b.md | 358 ---------------------------- docs/source/tutorials/Qwen2.5.md | 269 +++++++++++++++++++++ docs/source/tutorials/index.md | 2 +- 3 files changed, 270 insertions(+), 359 deletions(-) delete mode 100644 docs/source/tutorials/Qwen2.5-7b.md create mode 100644 docs/source/tutorials/Qwen2.5.md diff --git a/docs/source/tutorials/Qwen2.5-7b.md b/docs/source/tutorials/Qwen2.5-7b.md deleted file mode 100644 index 6741b426a9f..00000000000 --- a/docs/source/tutorials/Qwen2.5-7b.md +++ /dev/null @@ -1,358 +0,0 @@ -# Qwen2.5-7B-Instruct Deployment and Verification Guide - -## Introduction - -Qwen2.5-7B-Instruct is a 7-billion-parameter large language model pre-trained on 18 trillion tokens. It supports a maximum context window of 128K, enables generation of up to 8K tokens, and delivers enhanced capabilities in multilingual processing, instruction following, programming, mathematical computation, and structured data handling. - -This document details the complete deployment and verification workflow for the model, including supported features, environment preparation, single-node deployment, functional verification, accuracy and performance evaluation, and troubleshooting of common issues. It is designed to help users quickly complete model deployment and validation. - -## Supported Features - -Qwen2.5-7B-Instruct offers the following core capabilities: -- **Multilingual Support**: Compatible with over 29 languages (Chinese, English, French, Spanish, Russian, Japanese, etc.). -- **Instruction Following**: Optimized through instruction tuning to accurately understand and execute user commands. -- **Programming & Mathematical Proficiency**: Delivers excellent performance on benchmarks such as HumanEval (programming) and MATH (mathematics). -- **Structured Data Handling**: Enhanced ability to process and generate structured data (e.g., tables, JSON formats). -- **Long Context Processing**: Supports a maximum context length of 128K for efficient handling of ultra-long text sequences. - -## Environment Preparation - -### Model Weight - -Qwen2.5-7B-Instruct model weights can be downloaded from the official ModelScope repository (Note: Corrected from VL version to the correct language model link): -- [Qwen2.5-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct) - -It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-7B-Instruct/`) for quick access during deployment. 
- -### Hardware and System Requirements - -| Component | Specification | -|-----------|---------------| -| Hardware Platform | 910B4 (8 cards × 32GB) | -| Operating System | Ubuntu 22.04 (Corrected from non-official 22.03 version) | -| Driver Version | 25.0.rc1.1 | -| Python Version | 3.12 | - -### Software Dependencies - -| Component | Version Requirement | Notes | -|-----------|---------------------|-------| -| CANN | 8.2.RC1 | Ascend Computing Architecture Dependency | -| PyTorch | 2.5.1.post0 | Base Deep Learning Framework | -| torch-npu | 2.7.1rc1 | Ascend-adapted version | -| vLLM | 0.9.1 | Must match vLLM-Ascend version | -| vLLM-Ascend | 0.9.1-dev | Ascend-optimized version | - -### Environment Check and Verification - -Verify hardware status and network connectivity before installation: -```bash -# Check NPU device status -npu-smi info - -# Verify network interface and connectivity -for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done -for i in {0..15}; do hccn_tool -i $i -link -g; done -for i in {0..15}; do hccn_tool -i $i -net_health -g; done - -# Check IP configuration -for i in {0..15}; do hccn_tool -i $i -ip -g; done -``` - -### Container Environment Setup - -Create a privileged container to isolate the deployment environment: -```bash -docker run -it --privileged --name=test_vllm_Qwen_2.5_7B --net=host --shm-size=500g \ ---device=/dev/davinci{0..15} \ --v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ --v /home/:/home \ --w /home/ \ -mindie:dev-2.2.RC1.B070-800I-A2-py312-ubuntu22.03-x86_64 \ -/bin/bash -``` -Replace `` with your actual username. - -### Installation - -Install the required software dependencies in the container following these steps: - -#### Step 1: Install CANN Toolkit -```bash -# Execute the CANN installation package (adjust path to match local file) -./Ascend-cann-toolkit_8.2.RC1_linux-x86_64.run --install --install-path=/home//cmc/cann_8.2.rc1 - -# Configure CANN environment variables (New: Ensure dependencies take effect) -echo "source /home//cmc/cann_8.2.rc1/set_env.sh" >> ~/.bashrc -source ~/.bashrc -``` - -#### Step 2: Configure PyTorch Environment -```bash -# Set up pip mirror sources for faster installation -pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi" - -# Install PyTorch and torch-npu (Fixed version compatibility) -pip install torch==2.5.1.post0 torchvision==0.18.0 torch-npu==2.7.1rc1 -``` - -#### Step 3: Install vLLM and vLLM-Ascend -```bash -# Install dependency packages (New: Avoid compilation failures) -pip install cmake ninja sentencepiece transformers - -# Install vLLM (v0.9.1) -git clone https://github.com/vllm-project/vllm.git -cd vllm && git checkout releases/v0.9.1 -VLLM_TARGET_DEVICE=empty pip install -v -e . -cd .. - -# Install vLLM-Ascend (v0.9.1-dev, Ascend-optimized version) -git clone https://github.com/vllm-project/vllm-ascend.git -cd vllm-ascend && git checkout v0.9.1-dev -pip install -v -e . -cd .. -``` - -#### Step 4: Install Accuracy Evaluation Tool (AISBench) -The AISBench tool is used for model accuracy and performance evaluation. Follow these installation steps: - -:::{note} -The server may be in a restricted network zone (Yellow Zone) and require a Green Zone proxy tool for internet access. Download the proxy tool from the internal repository, run `PortMapping.exe` to obtain the proxy IP, and update `ip_addr` in `portproxy_remote.sh` before executing the script. 
-::: - -```bash -# Clone the AISBench repository -git clone https://gitee.com/aisbench/benchmark.git -cd benchmark/ - -# Install core dependencies -pip3 install -e ./ --use-pep517 - -# Install dependencies for service-oriented model evaluation (vLLM/Triton) -pip3 install -r requirements/api.txt -pip3 install -r requirements/extra.txt - -# Install BFCL evaluation dependencies -pip3 install -r requirements/bfcl_dependencies.txt --no-deps - -# Disable proxy after installation -unset https_proxy -unset http_proxy -``` - -For detailed installation instructions, refer to the [AISBench Official Documentation](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/get_started/install.html). - -## Deployment - -### Single-node Deployment - -Qwen2.5-7B-Instruct supports single-node single-card deployment on the 910B4 platform. Follow these steps to start the inference service: - -1. Prepare model weights: Ensure the downloaded model weights are stored in the `./Qwen2.5-7B-Instruct/` directory. -2. Download the gsm8k dataset (for evaluation): [gsm8k.zip](https://vision-file-storage/api/file/download/attachment-v2/WIKI202511118986704/32978033/20251111T144846Z_9658c67a0fb349f9be081ab9ab9fd2bc.zip?attachment_id=32978033) -3. Create and execute the deployment script (save as `deploy.sh`): - -```shell -#!/bin/sh -# Set environment variables for Ascend optimization -export VLLM_USE_V1=1 -export TASK_QUEUE_ENABLE=1 -export HCCL_OP_EXPANSION_MODE="AIV" -export PAGED_ATTENTION_MASK_LEN=max_seq_len -export VLLM_ASCEND_ENABLE_FLASHCOMM=1 -export VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1 - -# Start vLLM inference service -vllm serve ./Qwen2.5-7B-Instruct/ \ - --host \ # Replace with server IP (e.g., 0.0.0.0 for all interfaces) - --port \ # Replace with available port (e.g., 8080) - --served-model-name qwen-2.5-7b-instruct \ # Standardized model name for consistency - --trust-remote-code \ - --dtype bfloat16 \ - --max-model-len 32768 \ # Maximum context length (adjust based on requirements) - --tensor-parallel-size 1 \ # Single-card deployment - --disable-log-requests \ - --enforce-eager - -# Execution command: chmod +x deploy.sh && ./deploy.sh -``` - -### Multi-node Deployment - -This document currently focuses on single-node deployment. For multi-node deployment, refer to the [vLLM-Ascend Multi-node Guide](https://github.com/vllm-project/vllm-ascend) and ensure consistent environment configuration across all nodes. - -### Prefill-Decode Disaggregation - -This feature is not supported at this time. - -## Functional Verification - -After starting the service, verify functionality using a `curl` request: - -```bash -curl http://:/v1/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "qwen-2.5-7b-instruct", # Must match --served-model-name from deployment - "prompt": "Beijing is a", - "max_tokens": 5, - "temperature": 0 - }' -``` - -A valid response (e.g., `"Beijing is a vibrant and historic capital city"`) indicates successful deployment. 
- -### Supplementary Verification Method (New) -If `curl` verification fails, use this Python script: -```python -import requests - -url = "http://:/v1/completions" -headers = {"Content-Type": "application/json"} -data = { - "model": "qwen-2.5-7b-instruct", - "prompt": "Explain quantum computing in simple terms", - "max_tokens": 100, - "temperature": 0.7 -} - -response = requests.post(url, headers=headers, json=data) -print(response.json()) -``` - -## Accuracy Evaluation - -Two accuracy evaluation methods are provided: AISBench (recommended) and manual testing with standard datasets. - -### Using AISBench - -#### Prerequisites -1. Extract the gsm8k dataset to `benchmark/datasets/gsm8k/` (download from the link above). -2. Configure model evaluation parameters. - -#### Configuration Steps -1. Locate the AISBench configuration file: -```bash -cd benchmark/ -ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search -``` -2. Modify the configuration file (e.g., `vllm_api_general_chat.py`) to match the deployed service: -```Python -from ais_bench.benchmark.models import VLLMCustomAPIChat - -models = [ - dict( - attr="service", - type=VLLMCustomAPIChat, - abbr='vllm-api-general-chat', - path="", - model="qwen-2.5-7b-instruct", # Must match --served-model-name from deployment - request_rate=0, - retry=2, - host_ip="", # Deployment server IP - host_port=, # Deployment server port - max_out_len=512, - batch_size=1, - generation_kwargs=dict( - temperature=0.5, - top_k=10, - top_p=0.95, - seed=None, - repetition_penalty=1.03, - ) - ) -] -``` - -#### Execution Command -```bash -# Specify visible NPU cards (adjust based on available hardware) -export ASCEND_RT_VISIBLE_DEVICES=0 - -# Run evaluation (debug logs recommended for first execution) -ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug - -# Generate summary report -ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example -``` - -#### Evaluation Results -Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy report is shown below: -![Accuracy Evaluation Result](https://wiki.huawei.com/vision-file-storage/api/file/download/upload-v2/WIKI202511118986704/32976454/30bf146f86ab472697430f8efae66c1a.png) - -### Pure Model Accuracy Evaluation -For local model evaluation (without service deployment), modify `attr="local"` in the AISBench configuration file: -```Python -dict( - attr="local", # Change from "service" to "local" - type=VLLMCustomAPIChat, - abbr='vllm-api-general-chat', - path="./Qwen2.5-7B-Instruct/", # Path to local model weights - model="qwen-2.5-7b-instruct", - # ... (other parameters remain unchanged) -) -``` - -## Performance Evaluation - -### Using AISBench -Add `--mode perf` to the accuracy evaluation command to run performance testing: -```bash -ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example --mode perf -``` - -#### Performance Metrics -Key metrics include throughput (tokens/sec), latency (ms), and NPU utilization. 
A sample result is shown below: -![Performance Evaluation Result](https://wiki.huawei.com/vision-file-storage/api/file/download/upload-v2/WIKI202511118986704/32976455/2b68624624a2436db5959e51aebaa106.png) - -For detailed metric explanations, refer to the [AISBench Performance Documentation](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/base_tutorials/results_intro/performance_metric.html#). - -### Using vLLM Benchmark -vLLM includes a built-in benchmark tool for evaluating throughput and latency. Example command for online serving performance testing: -```bash -export VLLM_USE_MODELSCOPE=true -vllm bench serve \ - --model ./Qwen2.5-7B-Instruct/ \ - --dataset-name random \ - --random-input 200 \ - --num-prompt 200 \ - --request-rate 1 \ - --save-result \ - --result-dir ./perf_results/ -``` - -For more details, refer to the [vLLM Benchmark Documentation](https://docs.vllm.ai/en/latest/contributing/benchmarks.html). - -## Common Issues and Solutions - -### How to Check Service Status and Metrics? -- **Enable Monitoring**: Add the `--metrics` parameter to the deployment command. Access metrics via `http://:/metrics` (Prometheus-compatible) to view NPU utilization, queue length, and inference latency. -- **Debug Logs**: Add the `--log-level debug` parameter to the deployment command to output detailed logs for troubleshooting. - -### Deployment Verification Failure -- **Issue**: `curl` request returns an error or no response. -- **Solutions**: - 1. Verify the server IP and port are correct (use `netstat -tuln | grep ` to check port occupancy). - 2. Ensure the model weight path is correct and the model is fully loaded (look for "Model loaded successfully" in logs). - 3. Confirm firewall rules allow traffic on the deployment port (use `ufw status` to check firewall status). - 4. Verify dependency version compatibility (especially vLLM and vLLM-Ascend must match). - -### Multi-Card Load Imbalance -- **Symptom**: Uneven memory usage across NPU cards. -- **Solutions**: - 1. Ensure `--tensor-parallel-size` matches the number of cards (e.g., set to 8 for 8-card deployment). - 2. For large models, adjust the `--gpu-memory-utilization` parameter (e.g., 0.9) to optimize memory allocation. - 3. Enable Ascend-specific optimizations (e.g., `VLLM_ASCEND_ENABLE_FLASHCOMM=1`). - -### Network Restriction Issues -- **Issue**: Failed dependency downloads (restricted network environment). -- **Solution**: Configure the proxy using the Green Zone proxy tool as described in the [Installation](#step-4-install-accuracy-evaluation-tool-aisbench) section, then disable the proxy after installation. - -### Compilation Failure When Installing vLLM -- **Issue**: Compilation errors occur when executing `pip install -v -e .`. -- **Solutions**: - 1. Ensure dependency packages are installed: `pip install cmake ninja sentencepiece`. - 2. Verify Python version is 3.12 (lower versions are not supported). - 3. Clean cache and reinstall: `rm -rf build/ dist/ *.egg-info && pip install -v -e .`. - diff --git a/docs/source/tutorials/Qwen2.5.md b/docs/source/tutorials/Qwen2.5.md new file mode 100644 index 00000000000..a7be8521e5e --- /dev/null +++ b/docs/source/tutorials/Qwen2.5.md @@ -0,0 +1,269 @@ +# Qwen2.5-Instruct Deployment and Verification Guide + +## Introduction + +Qwen2.5-Instruct is the flagship instruction-tuned variant of Alibaba Cloud’s Qwen 2.5 LLM series. 
It supports a maximum context window of 128K, enables generation of up to 8K tokens, and delivers enhanced capabilities in multilingual processing, instruction following, programming, mathematical computation, and structured data handling. + +This document details the complete deployment and verification workflow for the model, including supported features, environment preparation, single-node deployment, functional verification, accuracy and performance evaluation, and troubleshooting of common issues. It is designed to help users quickly complete model deployment and validation. + +## Supported Features + +Refer to [supported features](../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix. + +Refer to [feature guide](../user_guide/feature_guide/index.md) to get the feature's configuration. + +## Environment Preparation + +### Model Weight +- `Qwen2.5-Instruct`(BF16 version): require 2 910B4 (32G × 2) nodes. [Qwen2.5-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-Instruct) +- `Qwen2.5-7B-quantized.w8a8`(Quantized version): require 1 910B4 (32G × 1) node. [Qwen2.5-7B-quantized.w8a8](https://modelscope.cn/models/neuralmagic/Qwen2.5-7B-quantized.w8a8) + +It is recommended to download the model weights to a local directory (e.g., `./Qwen2.5-Instruct/`) for quick access during deployment. + +### Verify Multi-node Communication(Optional) + +If you want to deploy multi-node environment, you need to verify multi-node communication according to [verify multi-node communication environment](../installation.md#verify-multi-node-communication). + +### Installation + +You can using our official docker image and install extra operator for supporting `Qwen2.5-Instruct`. + +:::{note} +Only AArch64 architecture are supported currently due to extra operator's installation limitations. +::: + +:::::{tab-set} +:sync-group: install + +::::{tab-item} A3 series +:sync: A3 + +1. Start the docker image on your each node. + +```{code-block} bash + :substitutions: + +export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3 +docker run --rm \ + --name vllm-ascend \ + --shm-size=1g \ + --net=host \ + --device /dev/davinci0 \ + --device /dev/davinci1 \ + --device /dev/davinci2 \ + --device /dev/davinci3 \ + --device /dev/davinci4 \ + --device /dev/davinci5 \ + --device /dev/davinci6 \ + --device /dev/davinci7 \ + --device /dev/davinci8 \ + --device /dev/davinci9 \ + --device /dev/davinci10 \ + --device /dev/davinci11 \ + --device /dev/davinci12 \ + --device /dev/davinci13 \ + --device /dev/davinci14 \ + --device /dev/davinci15 \ + --device /dev/davinci_manager \ + --device /dev/devmm_svm \ + --device /dev/hisi_hdc \ + -v /usr/local/dcmi:/usr/local/dcmi \ + -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \ + -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ + -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ + -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ + -v /etc/ascend_install.info:/etc/ascend_install.info \ + -v /root/.cache:/root/.cache \ + -it $IMAGE bash +``` + +2. Install the package `custom-ops` to make the kernels available. 
+ +```shell +wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/a3/CANN-custom_ops-sfa-linux.aarch64.run +chmod +x ./CANN-custom_ops-sfa-linux.aarch64.run +./CANN-custom_ops-sfa-linux.aarch64.run --quiet +export ASCEND_CUSTOM_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize:${ASCEND_CUSTOM_OPP_PATH} +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH} +wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/a3/custom_ops-1.0-cp311-cp311-linux_aarch64.whl +pip install custom_ops-1.0-cp311-cp311-linux_aarch64.whl +``` + +:::: +::::{tab-item} A2 series +:sync: A2 + +1. Start the docker image on your each node. + +```{code-block} bash + :substitutions: + +export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| +docker run --rm \ + --name vllm-ascend \ + --shm-size=1g \ + --net=host \ + --device /dev/davinci0 \ + --device /dev/davinci1 \ + --device /dev/davinci2 \ + --device /dev/davinci3 \ + --device /dev/davinci4 \ + --device /dev/davinci5 \ + --device /dev/davinci6 \ + --device /dev/davinci7 \ + --device /dev/davinci_manager \ + --device /dev/devmm_svm \ + --device /dev/hisi_hdc \ + -v /usr/local/dcmi:/usr/local/dcmi \ + -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \ + -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ + -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ + -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ + -v /etc/ascend_install.info:/etc/ascend_install.info \ + -v /root/.cache:/root/.cache \ + -it $IMAGE bash +``` + +2. Install the package `custom-ops` to make the kernels available. + +```shell +wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/a2/CANN-custom_ops-sfa-linux.aarch64.run +chmod +x ./CANN-custom_ops-sfa-linux.aarch64.run +./CANN-custom_ops-sfa-linux.aarch64.run --quiet +export ASCEND_CUSTOM_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize:${ASCEND_CUSTOM_OPP_PATH} +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH} +wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/a2/custom_ops-1.0-cp311-cp311-linux_aarch64.whl +pip install custom_ops-1.0-cp311-cp311-linux_aarch64.whl +``` + +:::: +::::: + +In addition, if you don't want to use the docker image as above, you can also build all from source: + +- Install `vllm-ascend` from source, refer to [installation](../installation.md). + +- Install extra operator for supporting `DeepSeek-V3.2-Exp`, refer to the above tab. + +If you want to deploy multi-node environment, you need to set up environment on each node. + +## Deployment + +### Single-node Deployment + +Qwen2.5-Instruct supports single-node single-card deployment on the 910B4 platform. Follow these steps to start the inference service: + +1. Prepare model weights: Ensure the downloaded model weights are stored in the `./Qwen2.5-Instruct/` directory. +2. Download the gsm8k dataset (for evaluation): [gsm8k.zip](https://vision-file-storage/api/file/download/attachment-v2/WIKI202511118986704/32978033/20251111T144846Z_9658c67a0fb349f9be081ab9ab9fd2bc.zip?attachment_id=32978033) +3. 
Create and execute the deployment script (save as `deploy.sh`):

```shell
#!/bin/sh
# Set environment variables for Ascend optimization
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export PAGED_ATTENTION_MASK_LEN=max_seq_len
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
export VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1

# Start the vLLM inference service.
# --host:                 server IP (e.g. 0.0.0.0 to listen on all interfaces)
# --port:                 an available port (e.g. 8080)
# --served-model-name:    standardized model name used by clients
# --max-model-len:        maximum context length (adjust based on requirements)
# --tensor-parallel-size: 1 for single-card deployment
vllm serve ./Qwen2.5-Instruct/ \
  --host <host_ip> \
  --port <port> \
  --served-model-name qwen-2.5-7b-instruct \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --disable-log-requests \
  --enforce-eager

# Run with: chmod +x deploy.sh && ./deploy.sh
```

### Multi-node Deployment

This document currently focuses on single-node deployment. For multi-node deployment, refer to the [vLLM-Ascend Multi-node Guide](https://github.com/vllm-project/vllm-ascend) and ensure consistent environment configuration across all nodes.

### Prefill-Decode Disaggregation

Not supported yet.

## Functional Verification

After starting the service, verify functionality with a `curl` request. The `model` field must match the `--served-model-name` used at deployment:

```bash
curl http://<host_ip>:<port>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-2.5-7b-instruct",
    "prompt": "Beijing is a",
    "max_tokens": 5,
    "temperature": 0
  }'
```

A valid response (e.g., `"Beijing is a vibrant and historic capital city"`) indicates successful deployment.

## Accuracy Evaluation

Two accuracy evaluation methods are provided: AISBench (recommended) and manual testing with standard datasets.

### Using AISBench

1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the result, here is the result of `Qwen2.5-Instruct` in `vllm-ascend:0.11.0rc0` for reference only.

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- |--------------|
| gsm8k | - | accuracy | gen | 75.00 |


#### Execution Command
```bash
# Specify visible NPU cards (adjust based on available hardware)
export ASCEND_RT_VISIBLE_DEVICES=0

# Run evaluation (debug logs recommended for first execution)
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug

# Generate summary report
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example
```

#### Evaluation Results
Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy report is shown below:

## Performance

### Using AISBench

Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.

### Using vLLM Benchmark
Run performance evaluation of `Qwen2.5-Instruct` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

Take the `serve` subcommand as an example and run the code as follows.
+ +```bash +export VLLM_USE_MODELSCOPE=true +vllm bench serve \ + --model ./Qwen2.5-Instruct/ \ + --dataset-name random \ + --random-input 200 \ + --num-prompt 200 \ + --request-rate 1 \ + --save-result \ + --result-dir ./perf_results/ +``` + +After about several minutes, you can get the performance evaluation result. diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md index 5ff1c03d505..315c26be1d2 100644 --- a/docs/source/tutorials/index.md +++ b/docs/source/tutorials/index.md @@ -10,7 +10,7 @@ single_npu_qwen3_embedding single_npu_qwen3_quantization single_npu_qwen3_w4a4 single_node_pd_disaggregation_llmdatadist -Qwen2.5-7b +Qwen2.5 multi_npu_qwen3_next multi_npu multi_npu_moge From a01ee7af0493e215eef3cc9e052cb4b275ed3196 Mon Sep 17 00:00:00 2001 From: yangshihao6 Date: Thu, 4 Dec 2025 11:50:47 +0800 Subject: [PATCH 5/8] clean up model module (#4611) Model module is useless now. Let't remove it totally. - vLLM version: v0.11.2 Signed-off-by: yangshihao6 --- docs/source/tutorials/Qwen2.5.md | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/docs/source/tutorials/Qwen2.5.md b/docs/source/tutorials/Qwen2.5.md index a7be8521e5e..8c066855882 100644 --- a/docs/source/tutorials/Qwen2.5.md +++ b/docs/source/tutorials/Qwen2.5.md @@ -220,21 +220,28 @@ Two accuracy evaluation methods are provided: AISBench (recommended) and manual |----- | ----- | ----- | ----- |--------------| | gsm8k | - | accuracy | gen | 75.00 | +### Using Language Model Evaluation Harness -#### Execution Command -```bash -# Specify visible NPU cards (adjust based on available hardware) -export ASCEND_RT_VISIBLE_DEVICES=0 +As an example, take the `gsm8k` dataset as a test dataset, and run accuracy evaluation of `DeepSeek-V3.2-Exp-W8A8` in online mode. + +1. Refer to [Using lm_eval](../developer_guide/evaluation/using_lm_eval.md) for `lm_eval` installation. -# Run evaluation (debug logs recommended for first execution) -ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug +2. Run `lm_eval` to execute the accuracy evaluation. -# Generate summary report -ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example +```shell +lm_eval \ + --model local-completions \ + --model_args model=/root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \ + --tasks gsm8k \ + --output_path ./ ``` -#### Evaluation Results -Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy report is shown below: +3. After execution, you can get the result, here is the result of `DeepSeek-V3.2-Exp-W8A8` in `vllm-ascend:0.11.0rc0` for reference only. + +|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| +|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9591|± |0.0055| +|gsm8k| 3|strict-match | 5|exact_match|↑ |0.9583|± |0.0055| ## Performance From 9db6a7578ca9985b223d672078d5f0854f4ea322 Mon Sep 17 00:00:00 2001 From: yangshihao6 Date: Thu, 4 Dec 2025 19:17:22 +0800 Subject: [PATCH 6/8] clean up model module (#4611) Model module is useless now. Let't remove it totally. 
- vLLM version: v0.11.2 Signed-off-by: yangshihao6 --- docs/source/tutorials/Qwen2.5.md | 75 ++++++++++++++++++++++---------- 1 file changed, 51 insertions(+), 24 deletions(-) diff --git a/docs/source/tutorials/Qwen2.5.md b/docs/source/tutorials/Qwen2.5.md index 8c066855882..3578e4c67c5 100644 --- a/docs/source/tutorials/Qwen2.5.md +++ b/docs/source/tutorials/Qwen2.5.md @@ -213,35 +213,57 @@ Two accuracy evaluation methods are provided: AISBench (recommended) and manual ### Using AISBench -1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details. -2. After execution, you can get the result, here is the result of `Qwen2.5-Instruct` in `vllm-ascend:0.11.0rc0` for reference only. +Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details. -| dataset | version | metric | mode | vllm-api-general-chat | -|----- | ----- | ----- | ----- |--------------| -| gsm8k | - | accuracy | gen | 75.00 | - -### Using Language Model Evaluation Harness - -As an example, take the `gsm8k` dataset as a test dataset, and run accuracy evaluation of `DeepSeek-V3.2-Exp-W8A8` in online mode. - -1. Refer to [Using lm_eval](../developer_guide/evaluation/using_lm_eval.md) for `lm_eval` installation. +#### Configuration Steps +1. Locate the AISBench configuration file: +```bash +cd benchmark/ +ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search +``` +2. Modify the configuration file (e.g., `vllm_api_general_chat.py`) to match the deployed service: +```Python +from ais_bench.benchmark.models import VLLMCustomAPIChat + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr='vllm-api-general-chat', + path="", + model="qwen-2.5-7b-instruct", # Must match --served-model-name from deployment + request_rate=0, + retry=2, + host_ip="", # Deployment server IP + host_port=, # Deployment server port + max_out_len=512, + batch_size=1, + generation_kwargs=dict( + temperature=0.5, + top_k=10, + top_p=0.95, + seed=None, + repetition_penalty=1.03, + ) + ) +] +``` +#### Execution Command +```bash +# Specify visible NPU cards (adjust based on available hardware) +export ASCEND_RT_VISIBLE_DEVICES=0 -2. Run `lm_eval` to execute the accuracy evaluation. +# Run evaluation (debug logs recommended for first execution) +ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug -```shell -lm_eval \ - --model local-completions \ - --model_args model=/root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8,base_url=http://127.0.0.1:8000/v1/completions,tokenized_requests=False,trust_remote_code=True \ - --tasks gsm8k \ - --output_path ./ +# Generate summary report +ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example ``` +Results and logs are saved to `benchmark/outputs/default/`. A sample accuracy report is shown below: -3. After execution, you can get the result, here is the result of `DeepSeek-V3.2-Exp-W8A8` in `vllm-ascend:0.11.0rc0` for reference only. 
- -|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| -|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| -|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9591|± |0.0055| -|gsm8k| 3|strict-match | 5|exact_match|↑ |0.9583|± |0.0055| +| dataset | version | metric | mode | vllm-api-general-chat | +|----- | ----- | ----- | ----- |--------------| +| gsm8k | - | accuracy | gen | 75.00 | ## Performance @@ -249,6 +271,11 @@ lm_eval \ Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details. +Add `--mode perf` to the accuracy evaluation command to run performance testing: +```bash +ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example --mode perf +``` + ### Using vLLM Benchmark Run performance evaluation of `Qwen2.5-Instruct` as an example. From db4de4481affe2fe3c5f0eeaaea5b80b488f388e Mon Sep 17 00:00:00 2001 From: yangshihao6 Date: Thu, 4 Dec 2025 19:51:14 +0800 Subject: [PATCH 7/8] clean up model module (#4611) Model module is useless now. Let't remove it totally. - vLLM version: v0.11.2 Signed-off-by: yangshihao6 --- docs/source/tutorials/Qwen2.5.md | 33 -------------------------------- 1 file changed, 33 deletions(-) diff --git a/docs/source/tutorials/Qwen2.5.md b/docs/source/tutorials/Qwen2.5.md index 3578e4c67c5..c3c742d149e 100644 --- a/docs/source/tutorials/Qwen2.5.md +++ b/docs/source/tutorials/Qwen2.5.md @@ -215,39 +215,6 @@ Two accuracy evaluation methods are provided: AISBench (recommended) and manual Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details. -#### Configuration Steps -1. Locate the AISBench configuration file: -```bash -cd benchmark/ -ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search -``` -2. Modify the configuration file (e.g., `vllm_api_general_chat.py`) to match the deployed service: -```Python -from ais_bench.benchmark.models import VLLMCustomAPIChat - -models = [ - dict( - attr="service", - type=VLLMCustomAPIChat, - abbr='vllm-api-general-chat', - path="", - model="qwen-2.5-7b-instruct", # Must match --served-model-name from deployment - request_rate=0, - retry=2, - host_ip="", # Deployment server IP - host_port=, # Deployment server port - max_out_len=512, - batch_size=1, - generation_kwargs=dict( - temperature=0.5, - top_k=10, - top_p=0.95, - seed=None, - repetition_penalty=1.03, - ) - ) -] -``` #### Execution Command ```bash # Specify visible NPU cards (adjust based on available hardware) From c37b2ec2aaebebf016d0352011825fc667bbec2e Mon Sep 17 00:00:00 2001 From: yangshihao6 Date: Thu, 4 Dec 2025 23:56:10 +0800 Subject: [PATCH 8/8] clean up model module (#4611) Model module is useless now. Let't remove it totally. 
- vLLM version: v0.11.2 Signed-off-by: yangshihao6 --- docs/source/tutorials/Qwen2.5.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/docs/source/tutorials/Qwen2.5.md b/docs/source/tutorials/Qwen2.5.md index c3c742d149e..b6025a45223 100644 --- a/docs/source/tutorials/Qwen2.5.md +++ b/docs/source/tutorials/Qwen2.5.md @@ -217,9 +217,6 @@ Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for #### Execution Command ```bash -# Specify visible NPU cards (adjust based on available hardware) -export ASCEND_RT_VISIBLE_DEVICES=0 - # Run evaluation (debug logs recommended for first execution) ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug