-
Notifications
You must be signed in to change notification settings - Fork 625
fix evict policy #4127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix evict policy #4127
Conversation
| """Schedule decoding.""" | ||
|
|
||
| running = self.running | ||
| def _reorder_running(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just sort running in reversed order, so we don't need nested loops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that in the original logic, when traversing the sequence, the order of eviction priority is the same as the order of block allocation requests. Consider the following two scenarios:
- When the sequence is sorted in descending order of timestamps, if there are still a certain number of free blocks during scheduling, the latest requests will be allocated space first. This may cause the earliest arriving requests to fail to obtain GPU blocks.
- When the sequence is sorted in ascending order of timestamps, if there are almost no free blocks left during scheduling, the earliest arriving requests will be evicted first. This violates the First-Come-First-Served (FCFS) principle.
Therefore, I think adding a nested loop could address both of these situations.
|
FCFS might be acceptable, but it is not a necessity. The order of service does not significantly impact throughput. Moreover, if a preempted request has an earlier arrival time, it will be rescheduled for computation sooner. |
|
|
TTFT is primarily influenced by prefill and is theoretically less related to schedule_decoding. The differences in the benchmark results seem more like margin of error. While FCFS is certainly good, these changes are more of a trade-off than an optimization. Increasing code complexity will raise maintenance costs and potential risks. Please give me some time to evaluate the value of this. |
Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily receiving feedbacks. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
I noticed that during _schedule_decoding, when preemption of running requests is required, the default order is to evict the earliest arriving request first (because during _schedule_prefill, requests are appended to the running queue in the order of their arrival time). According to the First-Come-First-Served (FCFS) principle, wouldn’t it be better to evict the latest arriving request first?
Reproduction:
GPU: 4090
Command:
Under conditions of high request load and GPU memory pressure.
before commit

after:

Modification
first evict latest arrival time seq
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist
@lvhan028 @grimoire