[Bugfix] fix dp parallel + tp > 1 offline inference port conflict #4539
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request addresses a port conflict that occurs during offline inference in distributed environments using both data parallelism and tensor parallelism (tp > 1). The change removes a monkey patch of ParallelConfig.get_next_dp_init_port that was causing the conflict.

The removed code attempted to manage data-parallel ports by incrementing a counter on the ParallelConfig instance. That approach is not safe across multiple processes: each process holds its own instance and its own counter, leading to race conditions and port collisions. By deleting the faulty patch, the code now relies on the default port-allocation mechanism of the underlying vLLM framework, which correctly handles multi-process scenarios. This is a clean and effective fix for the described problem.
…lm-project#4539)

### What this PR does / why we need it?
fix dp parallel + tp > 1 offline inference port conflict issue

import PR: vllm-project#429

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

Signed-off-by: leo-pony <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
What this PR does / why we need it?
Fixes the port conflict that occurs in offline inference when data parallelism is combined with tensor parallelism (tp > 1).
Imports PR: #429
Does this PR introduce any user-facing change?
How was this patch tested?