
Conversation

@lianyiibo (Contributor) commented Sep 23, 2025

What this PR does / why we need it?

Support pooling models (like bge-reranker-v2-m3) in vllm-ascend. This PR covers the three embedding pooling types (cls_token, mean_token, lasttoken).
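
For reference, embedding-style pooling models use the same offline API; a minimal sketch (the model choice and prompts here are only illustrative assumptions, not part of this PR's test plan):

from vllm import LLM

# Minimal embedding sketch; BAAI/bge-m3 and the prompts are illustrative only.
prompts = [
    "The capital of France is Paris.",
    "Ascend NPUs can now serve pooling models.",
]

llm = LLM(model="BAAI/bge-m3", task="embed", enforce_eager=True)
outputs = llm.embed(prompts)

for prompt, output in zip(prompts, outputs):
    embedding = output.outputs.embedding  # list of floats, one per hidden dim
    print(f"{prompt!r} -> embedding of length {len(embedding)}")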

Upstream vLLM already provides support for pooling models on the v1 engine; this PR adds the corresponding adaptations on the vllm-ascend side.

Fixes #1960

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Taking BAAI/bge-reranker-v2-m3 as an example.

from argparse import Namespace

from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser

def parse_args():
    parser = FlexibleArgumentParser()
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="BAAI/bge-reranker-v2-m3", task="score", enforce_eager=True
    )
    return parser.parse_args()

def main(args: Namespace):
    # Sample prompts.
    text_1 = "What is the capital of France?"
    texts_2 = [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
    ]

    # Create an LLM.
    # You should pass task="score" for cross-encoder models
    llm = LLM(**vars(args))

    # Generate scores. The output is a list of ScoringRequestOutputs.
    outputs = llm.score(text_1, texts_2)

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for text_2, output in zip(texts_2, outputs):
        score = output.outputs.score
        print(f"Pair: {[text_1, text_2]!r} \nScore: {score}")
        print("-" * 60)

if __name__ == "__main__":
    args = parse_args()
    main(args)
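
The same model can also be exercised through the online serving path; a rough sketch (the port, request payload, and response fields follow the Jina-style rerank schema and are assumptions, not something verified by this PR):

import requests

# Assumes a server started as in the perf-test section below, e.g.:
#   python3 -m vllm.entrypoints.openai.api_server --model BAAI/bge-reranker-v2-m3
# The /v1/rerank endpoint appears in the benchmark command; the exact request
# and response fields here are assumptions.
response = requests.post(
    "http://localhost:8000/v1/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is the capital of France?",
        "documents": [
            "The capital of Brazil is Brasilia.",
            "The capital of France is Paris.",
        ],
    },
)
response.raise_for_status()
for result in response.json()["results"]:
    print(result["index"], result["relevance_score"])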

Perf test

Scripts

ASCEND_RT_VISIBLE_DEVICES=1 VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model BAAI/bge-m3
vllm bench serve --model BAAI/bge-m3 --backend vllm-rerank --endpoint /v1/rerank --dataset-name random-rerank --tokenizer BAAI/bge-m3 --random-input-len 512

Results

Before this PR

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  6.08      
Total input tokens:                      0         
Request throughput (req/s):              164.43    
Total Token throughput (tok/s):          0.00      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4029.80   
Median E2EL (ms):                        4024.80   
P99 E2EL (ms):                           5717.67   
==================================================

After this PR

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  6.37      
Total input tokens:                      0         
Request throughput (req/s):              157.04    
Total Token throughput (tok/s):          0.00      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          3990.05   
Median E2EL (ms):                        3894.46   
P99 E2EL (ms):                           5869.24   
==================================================

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@lianyiibo lianyiibo changed the title from "[Model] Support pooling models." to "[Model] Support pooling models" on Sep 23, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request adds support for pooling models in vllm-ascend by adapting changes from the upstream vllm project. The changes include modifying attention mask generation for non-causal masks, adding a new attention path for encoder-only models, and updating the model runner to handle pooling models and encoder-only attention specifications. The implementation is largely correct, but I've identified a critical issue in vllm_ascend/worker/model_runner_v1.py where a method call is missing a required argument, which would cause a runtime error.
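
For context, the causal mask used for decoding versus the non-causal mask required by encoder-only pooling models can be illustrated with a generic sketch (this is not the actual mask-generation code in vllm_ascend):

import torch

seq_len = 4

# Decoder-style causal mask: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Encoder-only (non-causal) mask: every token attends to every other token in
# the sequence, which is what pooling models such as bge-reranker-v2-m3 need.
non_causal_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask)
print(non_causal_mask)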

@github-actions bot: This pull request has conflicts, please resolve those before we can evaluate the pull request.

@lianyiibo lianyiibo marked this pull request as draft September 23, 2025 08:15
@lianyiibo lianyiibo marked this pull request as ready for review September 23, 2025 08:36
@lianyiibo lianyiibo force-pushed the pooling_support branch 2 times, most recently from 95f54d7 to 27d9a92 Compare September 23, 2025 08:58
@Potabk (Collaborator) commented Sep 23, 2025

Can we add some e2e tests and an example for the model?

@lianyiibo (Contributor, Author)

> Can we add some e2e tests and an example for the model?

OK, but since I have no prior experience with this, I would like to ask which test scenarios this feature needs to cover. At the moment I think the test cases should go in the e2e/singlecard directory. Is that correct?

@Potabk (Collaborator) commented Sep 24, 2025

You can refer to https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_embedding.py
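
A rough sketch of what such an e2e case could look like (the model name, fixture, and assertion are hypothetical and do not reproduce the actual contents of test_embedding.py):

import pytest
from vllm import LLM

# Hypothetical e2e sketch for a scoring (cross-encoder) model.
MODEL = "BAAI/bge-reranker-v2-m3"


@pytest.fixture(scope="module")
def score_llm():
    llm = LLM(model=MODEL, task="score", enforce_eager=True)
    yield llm
    del llm


def test_relevant_document_scores_higher(score_llm):
    query = "What is the capital of France?"
    docs = [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
    ]
    outputs = score_llm.score(query, docs)
    scores = [o.outputs.score for o in outputs]
    # The document that actually answers the query should score higher.
    assert scores[1] > scores[0]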

@lianyiibo lianyiibo marked this pull request as draft September 25, 2025 01:44
@github-actions bot: This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions bot commented Dec 3, 2025: This pull request has conflicts, please resolve those before we can evaluate the pull request.

@MengqingCao MengqingCao force-pushed the pooling_support branch 2 times, most recently from 5b65b0b to 14d4196 Compare December 3, 2025 12:01
@github-actions bot commented Dec 3, 2025: This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions bot commented Dec 4, 2025: This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions bot commented Dec 6, 2025: This pull request has conflicts, please resolve those before we can evaluate the pull request.

lianyiibo and others added 9 commits December 6, 2025 07:29
  * move pooling model in the first if branch in forward
  * refactor the pooling-params and pooling-attnmetadata to align with vllm

Signed-off-by: lianyibo <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
@github-actions bot commented Dec 6, 2025: This pull request has conflicts, please resolve those before we can evaluate the pull request.


Labels

documentation (Improvements or additions to documentation), module:tests, ready (read for review), ready-for-test (start test by label for PR)


Development

Successfully merging this pull request may close these issues.

[Bug]: BAAI/bge-reranker-v2-m3 failed to start in graph and eager mode due to Text-only XLMRobertaForSequenceClassification not be supported

5 participants