
Conversation

@lianyiibo (Contributor) commented Sep 23, 2025

What this PR does / why we need it?

Support pooling models (like bge-reranker-v2-m3) in vllm-ascend. This PR covers the three embedding pooling types (cls_token, mean_token, lasttoken).
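
For reference, embedding-style pooling models use the same offline API; a minimal sketch (the model choice and prompts here are only illustrative assumptions, not part of this PR's test plan):

from vllm import LLM

# Minimal embedding sketch; BAAI/bge-m3 and the prompts are illustrative only.
prompts = [
    "The capital of France is Paris.",
    "Ascend NPUs can now serve pooling models.",
]

llm = LLM(model="BAAI/bge-m3", task="embed", enforce_eager=True)
outputs = llm.embed(prompts)

for prompt, output in zip(prompts, outputs):
    embedding = output.outputs.embedding  # list of floats, one per hidden dim
    print(f"{prompt!r} -> embedding of length {len(embedding)}")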

Upstream vLLM already provides support for pooling models on the v1 engine; this PR adds the corresponding adaptations on the vllm-ascend side.

Fixes #1960

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Taking BAAI/bge-reranker-v2-m3 as an example.

from argparse import Namespace

from vllm import LLM, EngineArgs
from vllm.utils import FlexibleArgumentParser

def parse_args():
    parser = FlexibleArgumentParser()
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="BAAI/bge-reranker-v2-m3", task="score", enforce_eager=True
    )
    return parser.parse_args()

def main(args: Namespace):
    # Sample prompts.
    text_1 = "What is the capital of France?"
    texts_2 = [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
    ]

    # Create an LLM.
    # You should pass task="score" for cross-encoder models
    llm = LLM(**vars(args))

    # Generate scores. The output is a list of ScoringRequestOutputs.
    outputs = llm.score(text_1, texts_2)

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for text_2, output in zip(texts_2, outputs):
        score = output.outputs.score
        print(f"Pair: {[text_1, text_2]!r} \nScore: {score}")
        print("-" * 60)

if __name__ == "__main__":
    args = parse_args()
    main(args)
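
The same model can also be exercised through the online serving path; a rough sketch (the port, request payload, and response fields follow the Jina-style rerank schema and are assumptions, not something verified by this PR):

import requests

# Assumes a server started as in the perf-test section below, e.g.:
#   python3 -m vllm.entrypoints.openai.api_server --model BAAI/bge-reranker-v2-m3
# The /v1/rerank endpoint appears in the benchmark command; the exact request
# and response fields here are assumptions.
response = requests.post(
    "http://localhost:8000/v1/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is the capital of France?",
        "documents": [
            "The capital of Brazil is Brasilia.",
            "The capital of France is Paris.",
        ],
    },
)
response.raise_for_status()
for result in response.json()["results"]:
    print(result["index"], result["relevance_score"])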

Perf test

Scripts

ASCEND_RT_VISIBLE_DEVICES=1 VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model BAAI/bge-m3
vllm bench serve --model BAAI/bge-m3 --backend vllm-rerank --endpoint /v1/rerank --dataset-name random-rerank --tokenizer BAAI/bge-m3 --random-input-len 512

Results

Before this PR

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  6.08      
Total input tokens:                      0         
Request throughput (req/s):              164.43    
Total Token throughput (tok/s):          0.00      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4029.80   
Median E2EL (ms):                        4024.80   
P99 E2EL (ms):                           5717.67   
==================================================

After this PR

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  6.37      
Total input tokens:                      0         
Request throughput (req/s):              157.04    
Total Token throughput (tok/s):          0.00      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          3990.05   
Median E2EL (ms):                        3894.46   
P99 E2EL (ms):                           5869.24   
==================================================

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@lianyiibo lianyiibo changed the title from "[Model] Support pooling models." to "[Model] Support pooling models" on Sep 23, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request adds support for pooling models in vllm-ascend by adapting changes from the upstream vllm project. The changes include modifying attention mask generation for non-causal masks, adding a new attention path for encoder-only models, and updating the model runner to handle pooling models and encoder-only attention specifications. The implementation is largely correct, but I've identified a critical issue in vllm_ascend/worker/model_runner_v1.py where a method call is missing a required argument, which would cause a runtime error.
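
For context, the causal mask used for decoding versus the non-causal mask required by encoder-only pooling models can be illustrated with a generic sketch (this is not the actual mask-generation code in vllm_ascend):

import torch

seq_len = 4

# Decoder-style causal mask: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Encoder-only (non-causal) mask: every token attends to every other token in
# the sequence, which is what pooling models such as bge-reranker-v2-m3 need.
non_causal_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask)
print(non_causal_mask)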

@github-actions bot: This pull request has conflicts, please resolve those before we can evaluate the pull request.

@lianyiibo lianyiibo marked this pull request as draft September 23, 2025 08:15
@lianyiibo lianyiibo marked this pull request as ready for review September 23, 2025 08:36
@lianyiibo lianyiibo force-pushed the pooling_support branch 2 times, most recently from 95f54d7 to 27d9a92 Compare September 23, 2025 08:58
@Potabk (Collaborator) commented Sep 23, 2025

Can we add some e2e tests and an example for the model?

@lianyiibo (Contributor, Author)

> Can we add some e2e tests and an example for the model?

OK, but since I have no prior experience with this, I would like to ask which test scenarios this feature needs to cover. At the moment I think the test cases should go in the e2e/singlecard directory. Is that correct?

@Potabk (Collaborator) commented Sep 24, 2025

You can refer to https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/singlecard/test_embedding.py
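
A rough sketch of what such an e2e case could look like (the model name, fixture, and assertion are hypothetical and do not reproduce the actual contents of test_embedding.py):

import pytest
from vllm import LLM

# Hypothetical e2e sketch for a scoring (cross-encoder) model.
MODEL = "BAAI/bge-reranker-v2-m3"


@pytest.fixture(scope="module")
def score_llm():
    llm = LLM(model=MODEL, task="score", enforce_eager=True)
    yield llm
    del llm


def test_relevant_document_scores_higher(score_llm):
    query = "What is the capital of France?"
    docs = [
        "The capital of Brazil is Brasilia.",
        "The capital of France is Paris.",
    ]
    outputs = score_llm.score(query, docs)
    scores = [o.outputs.score for o in outputs]
    # The document that actually answers the query should score higher.
    assert scores[1] > scores[0]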

@lianyiibo lianyiibo marked this pull request as draft September 25, 2025 01:44
@github-actions bot: This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions bot commented Dec 3, 2025: This pull request has conflicts, please resolve those before we can evaluate the pull request.

@MengqingCao MengqingCao force-pushed the pooling_support branch 2 times, most recently from 5b65b0b to 14d4196 Compare December 3, 2025 12:01
@github-actions bot commented Dec 3, 2025: This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions bot commented Dec 4, 2025: This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions bot commented Dec 6, 2025: This pull request has conflicts, please resolve those before we can evaluate the pull request.

lianyiibo and others added 9 commits December 6, 2025 07:29
  * move pooling model in the first if branch in forward
  * refactor the pooling-params and pooling-attnmetadata to align with vllm

Signed-off-by: lianyibo <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
@github-actions bot commented Dec 6, 2025: This pull request has conflicts, please resolve those before we can evaluate the pull request.


Labels

documentation (Improvements or additions to documentation), module:tests, ready (read for review), ready-for-test (start test by label for PR)


Development

Successfully merging this pull request may close these issues.

[Bug]: BAAI/bge-reranker-v2-m3 failed to start in graph and eager mode due to Text-only XLMRobertaForSequenceClassification not be supported

5 participants