-
Notifications
You must be signed in to change notification settings - Fork 450
Fix preemption logic to only check borrowing constraints for resources requiring preemption #7392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: zhifei92 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @gabesaba
I don't know enough to say if this change is good or bad.
One thing I would suggest is unit/integration tests confirming this fixes the bug you reported.
|
or maybe @PBundyra ? |
|
cc @pajakd |
|
@zhifei92 please rebase |
Done, The current Additionally, I included a test case—"preemption scenario: CPU needs borrowing(been borrowed from the cohort) but not preemption, memory needs preemption"—to verify the changes made in this update. |
…ources requiring preemption
5ca401b to
5a7a743
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test which covers (a simplified version of?) the scenario described in #7392 (comment), either as a unit or integration test. It would be good to make sure the fix is working E2E
edit:
I see that @kannon92 suggested this too.
One thing I would suggest is unit/integration tests confirming this fixes the bug you reported.
To clarify: this should test a broader part of the scheduling cycle, not just the workloadFits function (though I'm happy leaving those new tests in as well)
|
@zhifei92 please also answer "Does this PR introduce a user-facing change?". this section will go to release notes for the next patch releases |
Done |
| util.ExpectWorkloadsToBePreempted(ctx, k8sClient, betaWl1) | ||
| util.FinishEvictionForWorkloads(ctx, k8sClient, betaWl1) | ||
|
|
||
| // Verify the high priority workload is admitted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, Turn the comments into ginkgo.By steps. It is still one line, but helps to correlate sometimes logs with steps when a failure happens. Here and below. The above would be long as a single step, maybe split.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
lint fix
d613f03 to
447fdab
Compare
|
/retest |
|
@zhifei92: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/hold |
|
Flaky test, with the recently preempted workload scheduling infront of the preempting workload. The test flakiness can be addressed by setting the priority of the preempting workload as higher than the workload being preempted, so it deterministically get sorted first. I am concerned that this change could introduce a preemption loop (or a situation like we encountered in this test, which is not a loop, but will result in multiple preemptions before hitting a stable state). I will think on this a bit and comment again. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
The
workloadFitsfunction incorrectly fails preemption when any resource requires borrowing, even if the resources that actually need preemption (specified infrsNeedPreemption) are available after preemption.Actual issue encountered:
The following queue fails to reclaim
vcuda-memoryandvcuda-ratioresources because CPU and memory have borrowed resources from other queues. This causes the reclamation ofvcuda-memoryandvcuda-ratioto fail, which is undesirable—since reclaiming onlyvcuda-memoryandvcuda-ratioshould be sufficient for our job to be handled by the admitRoot Cause
Solution
Only check borrowing constraints for resources that actually require preemption:
Which issue(s) this PR fixes:
Fixes #7393
Special notes for your reviewer:
Does this PR introduce a user-facing change?