
Conversation

@Tsundoku958 (Contributor) commented Nov 14, 2025

Thanks for your contribution and we appreciate it a lot. The following instructions will help your pull request stay healthy and receive feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

I noticed that during _schedule_decoding, when preemption of running requests is required, the default order is to evict the earliest-arriving request first (because during _schedule_prefill, requests are appended to the running queue in order of arrival). Under the First-Come-First-Served (FCFS) principle, wouldn't it be better to evict the latest-arriving request first?
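In other words, when a victim must be chosen, the latest arrival should go first. A minimal sketch of that rule, assuming a hypothetical `arrive_time` attribute on each sequence (not necessarily lmdeploy's actual field name):

```python
def pick_victim(running):
    """Under FCFS, preempt the request that arrived last, not first.

    `running` is the list of running sequences; `arrive_time` is an
    illustrative attribute name.
    """
    return max(running, key=lambda seq: seq.arrive_time)
```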

Reproduction:

GPU: 4090

Command:

```bash
lmdeploy serve api_server ../vllm_build/qwen/Qwen-7B-Chat/ --backend pytorch --max-batch-size 128
python benchmark/profile_restful_api.py --backend lmdeploy --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 500 --request-rate 5
```

The benchmark was run under high request load and GPU memory pressure.

Before commit: [benchmark screenshot]

After: [benchmark screenshot]

Modification

Evict the sequence with the latest arrival time first.

BC-breaking (Optional)

Does the modification introduce changes that break backward compatibility for downstream repositories?
If so, please describe how it breaks compatibility and how downstream projects should modify their code to stay compatible with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.
@lvhan028 @grimoire

"""Schedule decoding."""

running = self.running
def _reorder_running():
Collaborator commented:

I think we can just sort running in reversed order, so we don't need nested loops.
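A sketch of that suggestion (a minimal fragment; `arrive_time` is an assumed attribute name, not necessarily lmdeploy's actual field):

```python
# Inside _schedule_decoding: sort once, newest arrivals first, so the
# eviction pass (which follows traversal order) preempts the latest
# request without a nested loop.
running = sorted(self.running, key=lambda seq: seq.arrive_time, reverse=True)
```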

@Tsundoku958 (Contributor Author) replied Nov 17, 2025:
The reason will be displayed to describe this comment to others. Learn more.

I noticed that in the original logic, when traversing the sequences, the eviction priority order is the same as the block-allocation request order. Consider the following two scenarios:

  1. When the sequence is sorted in descending order of timestamps, if there are still a certain number of free blocks during scheduling, the latest requests will be allocated space first. This may cause the earliest arriving requests to fail to obtain GPU blocks.
  2. When the sequence is sorted in ascending order of timestamps, if there are almost no free blocks left during scheduling, the earliest arriving requests will be evicted first. This violates the First-Come-First-Served (FCFS) principle.

Therefore, I think adding a nested loop could address both of these situations.
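A self-contained toy model of the nested-loop scheme (every name here is illustrative rather than lmdeploy's actual API): the outer loop grants blocks oldest-first, and the inner loop evicts newest-first only when no block is free, which handles both scenarios above.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # identity-based equality so `in` checks are per-object
class Seq:
    arrive_time: float  # when the request arrived
    blocks_held: int    # KV-cache blocks it currently occupies


def schedule_decoding(running, free_blocks):
    """Toy scheduler: allocate oldest-first, evict newest-first.

    Assumes each running seq needs one extra block per decode step and
    that preempting a victim reclaims all blocks it holds.
    """
    by_age = sorted(running, key=lambda s: s.arrive_time)  # oldest first
    scheduled, evicted = [], []
    for seq in by_age:
        if seq in evicted:
            continue
        # Inner loop: free space by preempting the latest arrivals first.
        while free_blocks < 1:
            victim = next((s for s in reversed(by_age)
                           if s is not seq
                           and s not in evicted
                           and s not in scheduled), None)
            if victim is None:
                break
            evicted.append(victim)
            free_blocks += victim.blocks_held
        if free_blocks >= 1:
            free_blocks -= 1
            scheduled.append(seq)
    return scheduled, evicted
```

With this ordering, the earliest arrivals keep their blocks under memory pressure, and preemption always starts from the newest request in the queue.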

@grimoire (Collaborator) commented:

FCFS might be acceptable, but it is not a necessity. The order of service does not significantly impact throughput. Moreover, if a preempted request has an earlier arrival time, it will be rescheduled for computation sooner.
I think increasing the complexity of the scheduler just for the sake of FCFS is not worth it. The focus of optimization should perhaps be on how to avoid eviction in the first place.

@Tsundoku958 (Contributor Author) replied:

> FCFS might be acceptable, but it is not a necessity. The order of service does not significantly impact throughput. Moreover, if a preempted request has an earlier arrival time, it will be rescheduled for computation sooner. I think increasing the complexity of the scheduler just for the sake of FCFS is not worth it. The focus of optimization should perhaps be on how to avoid eviction in the first place.

  1. Although FCFS has little impact on throughput, it can reduce TTFT, which I think is also a relatively important metric.
  2. If we want to increase throughput, I think we could introduce a new strategy that sorts requests by maximum generation length (`seq.sampling_param.max_new_tokens`), as sketched after this list. However, this would still require first adding the nested loops implemented in the current commit, so I believe the current commit is useful.
  3. Avoiding eviction is certainly important, but a key prerequisite for minimizing eviction is having sufficient GPU memory. However, keeping abundant GPU memory free may lead to insufficient GPU parallelism. Under different hardware environments, it seems challenging to balance parallelism and GPU memory availability perfectly. (I'm not entirely sure about this part, but I think sorting the running queue seems like a straightforward and feasible strategy.)
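For item 2, a one-line sketch of that alternative ordering (only `seq.sampling_param.max_new_tokens` is named in the discussion; the sort itself is illustrative):

```python
# Shortest-generation-budget-first: schedule seqs with small
# max_new_tokens ahead of long-running ones to finish them sooner.
running = sorted(running, key=lambda seq: seq.sampling_param.max_new_tokens)
```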

@grimoire (Collaborator) replied:

TTFT is primarily influenced by prefill and is, in theory, only loosely related to schedule_decoding. The differences in the benchmark results look more like margin of error. While FCFS is certainly good, these changes are more of a trade-off than an optimization. Increasing code complexity raises maintenance costs and potential risks. Please give me some time to evaluate the value of this.
