
Conversation


@zhifei92 commented Oct 27, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

The workloadFits function incorrectly fails a preemption attempt when any resource would require borrowing, even if the resources that actually need preemption (specified in frsNeedPreemption) would be available after preemption.

Actual issue encountered:

The following ClusterQueue fails to reclaim vcuda-memory and vcuda-ratio resources because cpu and memory are borrowing from other queues. The borrowing on cpu and memory causes the reclamation of vcuda-memory and vcuda-ratio to fail, which is undesirable: reclaiming only vcuda-memory and vcuda-ratio would be sufficient for our job to be admitted.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  finalizers:
  - kueue.x-k8s.io/resource-in-use
  generation: 5
  name: queue-3e
spec:
  cohort: group-60
  flavorFungibility:
    whenCanBorrow: Borrow
    whenCanPreempt: Preempt
  preemption:
    borrowWithinCohort:
      policy: Never
    reclaimWithinCohort: Any
    withinClusterQueue: Never
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - pods
    - storage
    - rdma_devices
    flavors:
    - name: group-60-normal
      resources:
      - borrowingLimit: "40"
        name: cpu
        nominalQuota: "660"
      - borrowingLimit: 250Gi
        name: memory
        nominalQuota: 3450Gi
      - borrowingLimit: "2147483647"
        name: pods
        nominalQuota: "2147483647"
      - borrowingLimit: "35183298347008"
        name: storage
        nominalQuota: "35183298347008"
      - borrowingLimit: "35183298347008"
        name: rdma_devices
        nominalQuota: "35183298347008"
  - coveredResources:
    - vcuda-core
    - vcuda-ratio
    - vcuda-memory
    flavors:
    - name: group-60-b
      resources:
      - borrowingLimit: "0"
        name: vcuda-core
        nominalQuota: "560"
      - borrowingLimit: "0"
        name: vcuda-ratio
        nominalQuota: "3500"
      - borrowingLimit: "0"
        name: vcuda-memory
        nominalQuota: "6265"
  stopPolicy: None

The corresponding status shows that only cpu and memory are borrowing; the vcuda resources are not:

flavorsUsage:
    - name: group-60-normal
      resources:
      - borrowed: 8900m
        name: cpu
        total: 668900m
      - borrowed: "0"
        name: rdma_devices
        total: "0"
      - borrowed: 79700Mi
        name: memory
        total: 3612500Mi
      - borrowed: "0"
        name: pods
        total: "34"
      - borrowed: "0"
        name: storage
        total: "0"

    - name: group-60-b
      resources:
      - borrowed: "0"
        name: vcuda-core
        total: "34"
      - borrowed: "0"
        name: vcuda-memory
        total: "5186"
      - borrowed: "0"
        name: vcuda-ratio
        total: "2900"

Root Cause

In workloadFits, the borrowing check runs over every flavor/resource in the workload's usage, so borrowing on any resource vetoes the whole fit:

// Before fix - checks ALL resources
for fr, v := range preemptionCtx.workloadUsage.Quota {
    if !allowBorrowing && preemptionCtx.preemptorCQ.BorrowingWith(fr, v) {
        return false  // ❌ Fails if ANY resource needs borrowing
    }
}

Solution

Only check the borrowing constraint for resources that actually require preemption:

// After fix - only checks resources needing preemption
for fr, v := range preemptionCtx.workloadUsage.Quota {
    _, needsPreemption := preemptionCtx.frsNeedPreemption[fr]
    if needsPreemption && !allowBorrowing && preemptionCtx.preemptorCQ.BorrowingWith(fr, v) {
        return false  // ✅ Only fails if preemption-needed resources need borrowing
    }
}
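
To make the behavioral difference concrete, here is a minimal, self-contained sketch (illustrative only, not kueue code; flavorResource, borrowingWith, and the maps below are simplified stand-ins for the real types in the preemption package):

package main

import "fmt"

type flavorResource struct{ flavor, resource string }

func main() {
	// Workload usage per flavor/resource, mirroring the scenario above:
	// cpu is borrowed from the cohort, vcuda-memory needs preemption.
	usage := map[flavorResource]int64{
		{flavor: "group-60-normal", resource: "cpu"}:     8,
		{flavor: "group-60-b", resource: "vcuda-memory"}: 100,
	}
	frsNeedPreemption := map[flavorResource]struct{}{
		{flavor: "group-60-b", resource: "vcuda-memory"}: {},
	}
	borrowingWith := func(fr flavorResource, _ int64) bool {
		return fr.resource == "cpu" // only cpu is currently borrowing
	}
	allowBorrowing := false

	fits := func(onlyPreempted bool) bool {
		for fr, v := range usage {
			if _, needs := frsNeedPreemption[fr]; onlyPreempted && !needs {
				continue // after the fix: skip resources that need no preemption
			}
			if !allowBorrowing && borrowingWith(fr, v) {
				return false
			}
		}
		return true
	}

	fmt.Println("before fix:", fits(false)) // false: borrowed cpu blocks the reclaim
	fmt.Println("after fix:", fits(true))   // true: only vcuda-memory is checked
}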

Which issue(s) this PR fixes:

Fixes #7393

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed an issue where reclaiming resources from one resource group would incorrectly fail due to borrowed usage in a different resource group. Reclamation now validates only the resources being reclaimed, enabling jobs to be admitted as soon as sufficient quota is freed. No user action required.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Oct 27, 2025
@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhifei92
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from PBundyra October 27, 2025 08:41
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 27, 2025
netlify bot commented Oct 27, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 447fdab
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6901da05eeb0330008c452fa

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 27, 2025

@kannon92 left a comment


/cc @gabesaba

I don't know enough to say if this change is good or bad.

One thing I would suggest is unit/integration tests confirming this fixes the bug you reported.

@k8s-ci-robot k8s-ci-robot requested a review from gabesaba October 27, 2025 19:12
@kannon92 commented:

or maybe @PBundyra ?

@mimowo commented Oct 28, 2025

cc @pajakd

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 28, 2025
@mimowo commented Oct 28, 2025

@zhifei92 please rebase

@zhifei92 (Author) commented:

> One thing I would suggest is unit/integration tests confirming this fixes the bug you reported.

Done. The current workloadFits didn't have any unit tests yet, so I've added a comprehensive set.

Additionally, I included a test case, "preemption scenario: CPU needs borrowing(been borrowed from the cohort) but not preemption, memory needs preemption", to verify the change made in this update.
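
For context, a table-driven unit test over this predicate could look roughly like the sketch below; the fits helper is a simplified stand-in kept here only so the sketch is self-contained, not the actual workloadFits signature:

package preemption

import "testing"

// fits is a simplified stand-in for workloadFits: a resource fails the check
// when it is borrowing, and (with onlyPreempted) only resources that need
// preemption are checked at all.
func fits(borrowing, needsPreemption map[string]bool, onlyPreempted bool) bool {
	for r, b := range borrowing {
		if onlyPreempted && !needsPreemption[r] {
			continue
		}
		if b {
			return false
		}
	}
	return true
}

func TestWorkloadFits(t *testing.T) {
	borrowing := map[string]bool{"cpu": true, "vcuda-memory": false}
	needs := map[string]bool{"vcuda-memory": true}
	cases := map[string]struct {
		onlyPreempted bool
		want          bool
	}{
		"old behavior: any borrowed resource fails the fit":  {onlyPreempted: false, want: false},
		"new behavior: only preempted resources are checked": {onlyPreempted: true, want: true},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			if got := fits(borrowing, needs, tc.onlyPreempted); got != tc.want {
				t.Errorf("fits() = %v, want %v", got, tc.want)
			}
		})
	}
}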

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2025
@gabesaba left a comment


Please add a test which covers (a simplified version of?) the scenario described in #7392 (comment), either as a unit or integration test. It would be good to make sure the fix works E2E.

edit:
I see that @kannon92 suggested this too.

> One thing I would suggest is unit/integration tests confirming this fixes the bug you reported.

To clarify: this should test a broader part of the scheduling cycle, not just the workloadFits function (though I'm happy leaving those new tests in as well).
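
(Illustrative skeleton of such a broader test, reusing helpers that appear in the review thread below; the step text and workload setup are assumptions, not the PR's actual test:)

ginkgo.It("admits a workload that only needs reclaimable resources", func() {
	ginkgo.By("admitting a low priority workload in beta-cq that consumes vcuda quota")
	// ... create betaWl1 and wait for its admission

	ginkgo.By("creating a preempting workload in alpha-cq while cpu/memory are borrowed")
	// ... create the preemptor requesting only vcuda resources

	ginkgo.By("verifying reclaim succeeds despite borrowing in the other resource group")
	util.ExpectWorkloadsToBePreempted(ctx, k8sClient, betaWl1)
	util.FinishEvictionForWorkloads(ctx, k8sClient, betaWl1)
})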

@mimowo commented Oct 28, 2025

@zhifei92 please also answer "Does this PR introduce a user-facing change?". This section will go into the release notes for the next patch releases.

@k8s-ci-robot k8s-ci-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Oct 29, 2025
@zhifei92 (Author) commented:

> please also answer "Does this PR introduce a user-facing change?". this section will go to release notes for the next patch releases

Done

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 29, 2025
Review thread on the new integration test:

util.ExpectWorkloadsToBePreempted(ctx, k8sClient, betaWl1)
util.FinishEvictionForWorkloads(ctx, k8sClient, betaWl1)

// Verify the high priority workload is admitted
@mimowo commented Oct 29, 2025


nit: turn the comments into ginkgo.By steps. It is still one line, but it sometimes helps to correlate logs with steps when a failure happens. Here and below. The above would be long as a single step; maybe split it.
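
(For illustration, the conversion could look like this; the step text is made up:)

ginkgo.By("finishing eviction of the preempted workload", func() {
	util.ExpectWorkloadsToBePreempted(ctx, k8sClient, betaWl1)
	util.FinishEvictionForWorkloads(ctx, k8sClient, betaWl1)
})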

@zhifei92 (Author) replied:

Done

@zhifei92 (Author) commented:

/retest

@k8s-ci-robot commented:

@zhifei92: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kueue-test-integration-baseline-main
Commit: 447fdab
Required: true
Rerun command: /test pull-kueue-test-integration-baseline-main

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@mimowo commented Oct 29, 2025

/hold
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 29, 2025
@gabesaba commented:

Flaky test, with the recently preempted workload scheduling in front of the preempting workload. The flakiness can be addressed by setting the priority of the preempting workload higher than that of the workload being preempted, so it deterministically gets sorted first.

I am concerned that this change could introduce a preemption loop (or a situation like the one we encountered in this test, which is not a loop, but results in multiple preemptions before reaching a stable state). I will think on this a bit and comment again.

INTEGRATION_NPROCS=1 INTEGRATION_TARGET=./test/integration/singlecluster/scheduler  GINKGO_ARGS="--focus='When Queue A already has workloads borrowing resources from Queue B'" make test-integration

  STEP: Verifying the workload requiring preemption has been admitted. @ 10/31/25 08:28:48.967
  2025-10-31T08:28:48.968042706Z        LEVEL(-2)       workload-reconciler     core/workload_controller.go:813 Workload update event   {"workload": {"name":"beta-wl1","namespace":"preemption-z5ncl"}, "queue": "beta-lq", "status": "pending", "prevStatus": "admitted", "prevClusterQueue": "beta-cq"}
  2025-10-31T08:28:48.968354706Z        LEVEL(-2)       workload-reconciler     queue/manager.go:619    Attempting to move workloads    {"workload": {"name":"beta-wl1","namespace":"preemption-z5ncl"}, "queue": "beta-lq", "status": "pending", "cohort": "gpu-cohort", "root": "gpu-cohort"}
  2025-10-31T08:28:48.968724046Z        LEVEL(-2)       core/workload_controller.go:162 Reconcile Workload      {"controller": "workload_controller", "namespace": "preemption-z5ncl", "name": "beta-wl1", "reconcileID": "cf4ae6e2-2aa6-4191-bc6c-06e5b7085802"}
  2025-10-31T08:28:49.035782843Z        LEVEL(-3)       scheduler       queue/manager.go:661    Obtained ClusterQueue heads     {"schedulingCycle": 13, "count": 2}
  2025-10-31T08:28:49.036762344Z        LEVEL(-2)       scheduler       scheduler/scheduler.go:252      Attempting to schedule workload {"schedulingCycle": 13, "workload": {"name":"beta-wl1","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"beta-cq"}, "parentCohort": {"name":"gpu-cohort"}, "rootCohort": {"name":"gpu-cohort"}}
  2025-10-31T08:28:49.037076534Z        LEVEL(-2)       scheduler       scheduler/scheduler.go:630      Workload assumed in the cache   {"schedulingCycle": 13, "workload": {"name":"beta-wl1","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"beta-cq"}, "parentCohort": {"name":"gpu-cohort"}, "rootCohort": {"name":"gpu-cohort"}}
  2025-10-31T08:28:49.037345064Z        LEVEL(-2)       scheduler       scheduler/scheduler.go:252      Attempting to schedule workload {"schedulingCycle": 13, "workload": {"name":"alpha-wl-preempting","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"alpha-cq"}, "parentCohort": {"name":"gpu-cohort"}, "rootCohort": {"name":"gpu-cohort"}}
  2025-10-31T08:28:49.037501294Z        LEVEL(-3)       scheduler       scheduler/logging.go:42 Workload evaluated for admission        {"schedulingCycle": 13, "workload": {"name":"beta-wl1","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"beta-cq"}, "status": "assumed", "reason": ""}
  2025-10-31T08:28:49.037653614Z        LEVEL(-3)       scheduler       scheduler/logging.go:42 Workload evaluated for admission        {"schedulingCycle": 13, "workload": {"name":"alpha-wl-preempting","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"alpha-cq"}, "status": "skipped", "reason": "Workload no longer fits after processing another workload"}
  2025-10-31T08:28:49.037906394Z        LEVEL(-2)       scheduler       scheduler/scheduler.go:778      Workload re-queued      {"schedulingCycle": 13, "workload": {"name":"alpha-wl-preempting","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"alpha-cq"}, "queue": {"name":"alpha-lq","namespace":"preemption-z5ncl"}, "requeueReason": "FailedAfterNomination", "added": true, "status": "skipped"}
  2025-10-31T08:28:49.062841739Z        DEBUG   events  recorder/recorder.go:104        Workload no longer fits after processing another workload       {"type": "Warning", "object": {"kind":"Workload","namespace":"preemption-z5ncl","name":"alpha-wl-preempting","uid":"b87cf2e8-b314-4c01-921b-a03e2f97f92c","apiVersion":"kueue.x-k8s.io/v1beta2","resourceVersion":"265"}, "reason": "Pending"}
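
A minimal sketch of the suggested de-flaking, assuming kueue's workload test wrapper (utiltesting.MakeWorkload and its Priority helper; the names come from the logs above, the rest is illustrative):

// import utiltesting "sigs.k8s.io/kueue/pkg/util/testing"
// Give the preempting workload a strictly higher priority so the scheduler
// deterministically sorts it ahead of the workload it just preempted.
betaWl1 := utiltesting.MakeWorkload("beta-wl1", ns.Name).
	Queue("beta-lq").
	Priority(0).
	Obj()
preemptingWl := utiltesting.MakeWorkload("alpha-wl-preempting", ns.Name).
	Queue("alpha-lq").
	Priority(100). // higher than betaWl1, so it sorts first after requeue
	Obj()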


Labels

cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
do-not-merge/hold - Indicates that a PR should not merge because someone has issued a /hold command.
kind/bug - Categorizes issue or PR as related to a bug.
release-note-action-required - Denotes a PR that introduces potentially breaking changes that require user action.
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

A bug in the preemption logic
