
Conversation

@Tsundoku958 (Contributor) commented Nov 14, 2025

Thanks for your contribution and we appreciate it a lot. The following instructions will help your pull request stay healthy and receive feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

I noticed that during _schedule_decoding, when preemption of running requests is required, the default order is to evict the earliest-arriving request first (because during _schedule_prefill, requests are appended to the running queue in order of arrival). Under the First-Come-First-Served (FCFS) principle, wouldn't it be better to evict the latest-arriving request first?
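In other words, when a victim must be chosen, the latest arrival should go first. A minimal sketch of that rule, assuming a hypothetical `arrive_time` attribute on each sequence (not necessarily lmdeploy's actual field name):

```python
def pick_victim(running):
    """Under FCFS, preempt the request that arrived last, not first.

    `running` is the list of running sequences; `arrive_time` is an
    illustrative attribute name.
    """
    return max(running, key=lambda seq: seq.arrive_time)
```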

Reproduction:

GPU: 4090

Command:

```bash
lmdeploy serve api_server ../vllm_build/qwen/Qwen-7B-Chat/ --backend pytorch --max-batch-size 128
python benchmark/profile_restful_api.py --backend lmdeploy --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 500 --request-rate 5
```

The benchmark was run under high request load and GPU memory pressure.

Before commit: [benchmark screenshot]

After: [benchmark screenshot]

Modification

Evict the sequence with the latest arrival time first.

BC-breaking (Optional)

Does the modification introduce changes that break backward compatibility for downstream repositories?
If so, please describe how it breaks compatibility and how downstream projects should modify their code to stay compatible with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.
@lvhan028 @grimoire

"""Schedule decoding."""

running = self.running
def _reorder_running():
Collaborator commented:

I think we can just sort running in reversed order, so we don't need nested loops.
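A sketch of that suggestion (a minimal fragment; `arrive_time` is an assumed attribute name, not necessarily lmdeploy's actual field):

```python
# Inside _schedule_decoding: sort once, newest arrivals first, so the
# eviction pass (which follows traversal order) preempts the latest
# request without a nested loop.
running = sorted(self.running, key=lambda seq: seq.arrive_time, reverse=True)
```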

@Tsundoku958 (Contributor Author) replied Nov 17, 2025:
The reason will be displayed to describe this comment to others. Learn more.

I noticed that in the original logic, when traversing the sequences, the eviction priority order is the same as the block-allocation request order. Consider the following two scenarios:

  1. When the sequence is sorted in descending order of timestamps, if there are still a certain number of free blocks during scheduling, the latest requests will be allocated space first. This may cause the earliest arriving requests to fail to obtain GPU blocks.
  2. When the sequence is sorted in ascending order of timestamps, if there are almost no free blocks left during scheduling, the earliest arriving requests will be evicted first. This violates the First-Come-First-Served (FCFS) principle.

Therefore, I think adding a nested loop could address both of these situations.
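A self-contained toy model of the nested-loop scheme (every name here is illustrative rather than lmdeploy's actual API): the outer loop grants blocks oldest-first, and the inner loop evicts newest-first only when no block is free, which handles both scenarios above.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # identity-based equality so `in` checks are per-object
class Seq:
    arrive_time: float  # when the request arrived
    blocks_held: int    # KV-cache blocks it currently occupies


def schedule_decoding(running, free_blocks):
    """Toy scheduler: allocate oldest-first, evict newest-first.

    Assumes each running seq needs one extra block per decode step and
    that preempting a victim reclaims all blocks it holds.
    """
    by_age = sorted(running, key=lambda s: s.arrive_time)  # oldest first
    scheduled, evicted = [], []
    for seq in by_age:
        if seq in evicted:
            continue
        # Inner loop: free space by preempting the latest arrivals first.
        while free_blocks < 1:
            victim = next((s for s in reversed(by_age)
                           if s is not seq
                           and s not in evicted
                           and s not in scheduled), None)
            if victim is None:
                break
            evicted.append(victim)
            free_blocks += victim.blocks_held
        if free_blocks >= 1:
            free_blocks -= 1
            scheduled.append(seq)
    return scheduled, evicted
```

With this ordering, the earliest arrivals keep their blocks under memory pressure, and preemption always starts from the newest request in the queue.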

@grimoire (Collaborator) commented:

FCFS might be acceptable, but it is not a necessity. The order of service does not significantly impact throughput. Moreover, if a preempted request has an earlier arrival time, it will be rescheduled for computation sooner.
I think increasing the complexity of the scheduler just for the sake of FCFS is not worth it. The focus of optimization should perhaps be on how to avoid eviction in the first place.

@Tsundoku958 (Contributor Author) replied:

> FCFS might be acceptable, but it is not a necessity. The order of service does not significantly impact throughput. Moreover, if a preempted request has an earlier arrival time, it will be rescheduled for computation sooner. I think increasing the complexity of the scheduler just for the sake of FCFS is not worth it. The focus of optimization should perhaps be on how to avoid eviction in the first place.

  1. Although FCFS has little impact on throughput, it can reduce TTFT, which I think is also a relatively important metric.
  2. If we want to increase throughput, I think we could introduce a new strategy that sorts requests by maximum generation length (`seq.sampling_param.max_new_tokens`), as sketched after this list. However, this would still require first adding the nested loops implemented in the current commit, so I believe the current commit is useful.
  3. Avoiding eviction is certainly important, but a key prerequisite for minimizing eviction is having sufficient GPU memory. However, keeping abundant GPU memory free may lead to insufficient GPU parallelism. Under different hardware environments, it seems challenging to balance parallelism and GPU memory availability perfectly. (I'm not entirely sure about this part, but I think sorting the running queue seems like a straightforward and feasible strategy.)
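For item 2, a one-line sketch of that alternative ordering (only `seq.sampling_param.max_new_tokens` is named in the discussion; the sort itself is illustrative):

```python
# Shortest-generation-budget-first: schedule seqs with small
# max_new_tokens ahead of long-running ones to finish them sooner.
running = sorted(running, key=lambda seq: seq.sampling_param.max_new_tokens)
```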

@grimoire (Collaborator) replied:

TTFT is primarily influenced by prefill and is, in theory, only loosely related to schedule_decoding. The differences in the benchmark results look more like margin of error. While FCFS is certainly good, these changes are more of a trade-off than an optimization. Increasing code complexity raises maintenance costs and potential risks. Please give me some time to evaluate the value of this.
