
Conversation


@zhifei92 commented Oct 27, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

The workloadFits function incorrectly fails a preemption attempt when any resource would require borrowing, even if the resources that actually need preemption (specified in frsNeedPreemption) would be available after preemption.

Actual issue encountered:

The following ClusterQueue fails to reclaim vcuda-memory and vcuda-ratio resources because cpu and memory are borrowing from other queues. The borrowing on cpu and memory causes the reclamation of vcuda-memory and vcuda-ratio to fail, which is undesirable: reclaiming only vcuda-memory and vcuda-ratio would be sufficient for our job to be admitted.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  finalizers:
  - kueue.x-k8s.io/resource-in-use
  generation: 5
  name: queue-3e
spec:
  cohort: group-60
  flavorFungibility:
    whenCanBorrow: Borrow
    whenCanPreempt: Preempt
  preemption:
    borrowWithinCohort:
      policy: Never
    reclaimWithinCohort: Any
    withinClusterQueue: Never
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - pods
    - storage
    - rdma_devices
    flavors:
    - name: group-60-normal
      resources:
      - borrowingLimit: "40"
        name: cpu
        nominalQuota: "660"
      - borrowingLimit: 250Gi
        name: memory
        nominalQuota: 3450Gi
      - borrowingLimit: "2147483647"
        name: pods
        nominalQuota: "2147483647"
      - borrowingLimit: "35183298347008"
        name: storage
        nominalQuota: "35183298347008"
      - borrowingLimit: "35183298347008"
        name: rdma_devices
        nominalQuota: "35183298347008"
  - coveredResources:
    - vcuda-core
    - vcuda-ratio
    - vcuda-memory
    flavors:
    - name: group-60-b
      resources:
      - borrowingLimit: "0"
        name: vcuda-core
        nominalQuota: "560"
      - borrowingLimit: "0"
        name: vcuda-ratio
        nominalQuota: "3500"
      - borrowingLimit: "0"
        name: vcuda-memory
        nominalQuota: "6265"
  stopPolicy: None

The corresponding status shows that only cpu and memory are borrowing; the vcuda resources are not:

flavorsUsage:
    - name: group-60-normal
      resources:
      - borrowed: 8900m
        name: cpu
        total: 668900m
      - borrowed: "0"
        name: rdma_devices
        total: "0"
      - borrowed: 79700Mi
        name: memory
        total: 3612500Mi
      - borrowed: "0"
        name: pods
        total: "34"
      - borrowed: "0"
        name: storage
        total: "0"

    - name: group-60-b
      resources:
      - borrowed: "0"
        name: vcuda-core
        total: "34"
      - borrowed: "0"
        name: vcuda-memory
        total: "5186"
      - borrowed: "0"
        name: vcuda-ratio
        total: "2900"

Root Cause

In workloadFits, the borrowing check runs over every flavor/resource in the workload's usage, so borrowing on any resource vetoes the whole fit:

// Before fix - checks ALL resources
for fr, v := range preemptionCtx.workloadUsage.Quota {
    if !allowBorrowing && preemptionCtx.preemptorCQ.BorrowingWith(fr, v) {
        return false  // ❌ Fails if ANY resource needs borrowing
    }
}

Solution

Only check the borrowing constraint for resources that actually require preemption:

// After fix - only checks resources needing preemption
for fr, v := range preemptionCtx.workloadUsage.Quota {
    _, needsPreemption := preemptionCtx.frsNeedPreemption[fr]
    if needsPreemption && !allowBorrowing && preemptionCtx.preemptorCQ.BorrowingWith(fr, v) {
        return false  // ✅ Only fails if preemption-needed resources need borrowing
    }
}
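
To make the behavioral difference concrete, here is a minimal, self-contained sketch (illustrative only, not kueue code; flavorResource, borrowingWith, and the maps below are simplified stand-ins for the real types in the preemption package):

package main

import "fmt"

type flavorResource struct{ flavor, resource string }

func main() {
	// Workload usage per flavor/resource, mirroring the scenario above:
	// cpu is borrowed from the cohort, vcuda-memory needs preemption.
	usage := map[flavorResource]int64{
		{flavor: "group-60-normal", resource: "cpu"}:     8,
		{flavor: "group-60-b", resource: "vcuda-memory"}: 100,
	}
	frsNeedPreemption := map[flavorResource]struct{}{
		{flavor: "group-60-b", resource: "vcuda-memory"}: {},
	}
	borrowingWith := func(fr flavorResource, _ int64) bool {
		return fr.resource == "cpu" // only cpu is currently borrowing
	}
	allowBorrowing := false

	fits := func(onlyPreempted bool) bool {
		for fr, v := range usage {
			if _, needs := frsNeedPreemption[fr]; onlyPreempted && !needs {
				continue // after the fix: skip resources that need no preemption
			}
			if !allowBorrowing && borrowingWith(fr, v) {
				return false
			}
		}
		return true
	}

	fmt.Println("before fix:", fits(false)) // false: borrowed cpu blocks the reclaim
	fmt.Println("after fix:", fits(true))   // true: only vcuda-memory is checked
}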

Which issue(s) this PR fixes:

Fixes #7393

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed an issue where reclaiming resources from one resource group would incorrectly fail due to borrowed usage in a different resource group. Reclamation now validates only the resources being reclaimed, enabling jobs to be admitted as soon as sufficient quota is freed. No user action required.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Oct 27, 2025
@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhifei92
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from PBundyra October 27, 2025 08:41
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 27, 2025
netlify bot commented Oct 27, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 447fdab
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6901da05eeb0330008c452fa

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 27, 2025

@kannon92 left a comment


/cc @gabesaba

I don't know enough to say if this change is good or bad.

One thing I would suggest is unit/integration tests confirming this fixes the bug you reported.

@k8s-ci-robot k8s-ci-robot requested a review from gabesaba October 27, 2025 19:12
@kannon92 commented:

or maybe @PBundyra ?

@mimowo commented Oct 28, 2025

cc @pajakd

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 28, 2025
@mimowo commented Oct 28, 2025

@zhifei92 please rebase

@zhifei92 (Author) commented:

> One thing I would suggest is unit/integration tests confirming this fixes the bug you reported.

Done. The current workloadFits didn't have any unit tests yet, so I've added a comprehensive set.

Additionally, I included a test case, "preemption scenario: CPU needs borrowing(been borrowed from the cohort) but not preemption, memory needs preemption", to verify the change made in this update.
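
For context, a table-driven unit test over this predicate could look roughly like the sketch below; the fits helper is a simplified stand-in kept here only so the sketch is self-contained, not the actual workloadFits signature:

package preemption

import "testing"

// fits is a simplified stand-in for workloadFits: a resource fails the check
// when it is borrowing, and (with onlyPreempted) only resources that need
// preemption are checked at all.
func fits(borrowing, needsPreemption map[string]bool, onlyPreempted bool) bool {
	for r, b := range borrowing {
		if onlyPreempted && !needsPreemption[r] {
			continue
		}
		if b {
			return false
		}
	}
	return true
}

func TestWorkloadFits(t *testing.T) {
	borrowing := map[string]bool{"cpu": true, "vcuda-memory": false}
	needs := map[string]bool{"vcuda-memory": true}
	cases := map[string]struct {
		onlyPreempted bool
		want          bool
	}{
		"old behavior: any borrowed resource fails the fit":  {onlyPreempted: false, want: false},
		"new behavior: only preempted resources are checked": {onlyPreempted: true, want: true},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			if got := fits(borrowing, needs, tc.onlyPreempted); got != tc.want {
				t.Errorf("fits() = %v, want %v", got, tc.want)
			}
		})
	}
}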

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2025
@gabesaba left a comment


Please add a test which covers (a simplified version of?) the scenario described in #7392 (comment), either as a unit or integration test. It would be good to make sure the fix works E2E.

edit:
I see that @kannon92 suggested this too.

> One thing I would suggest is unit/integration tests confirming this fixes the bug you reported.

To clarify: this should test a broader part of the scheduling cycle, not just the workloadFits function (though I'm happy leaving those new tests in as well).
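
(Illustrative skeleton of such a broader test, reusing helpers that appear in the review thread below; the step text and workload setup are assumptions, not the PR's actual test:)

ginkgo.It("admits a workload that only needs reclaimable resources", func() {
	ginkgo.By("admitting a low priority workload in beta-cq that consumes vcuda quota")
	// ... create betaWl1 and wait for its admission

	ginkgo.By("creating a preempting workload in alpha-cq while cpu/memory are borrowed")
	// ... create the preemptor requesting only vcuda resources

	ginkgo.By("verifying reclaim succeeds despite borrowing in the other resource group")
	util.ExpectWorkloadsToBePreempted(ctx, k8sClient, betaWl1)
	util.FinishEvictionForWorkloads(ctx, k8sClient, betaWl1)
})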

@mimowo commented Oct 28, 2025

@zhifei92 please also answer "Does this PR introduce a user-facing change?". This section will go into the release notes for the next patch releases.

@k8s-ci-robot k8s-ci-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Oct 29, 2025
@zhifei92 (Author) commented:

> please also answer "Does this PR introduce a user-facing change?". this section will go to release notes for the next patch releases

Done

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 29, 2025
Review thread on the new integration test:

util.ExpectWorkloadsToBePreempted(ctx, k8sClient, betaWl1)
util.FinishEvictionForWorkloads(ctx, k8sClient, betaWl1)

// Verify the high priority workload is admitted
@mimowo commented Oct 29, 2025


nit: turn the comments into ginkgo.By steps. It is still one line, but it sometimes helps to correlate logs with steps when a failure happens. Here and below. The above would be long as a single step; maybe split it.
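
(For illustration, the conversion could look like this; the step text is made up:)

ginkgo.By("finishing eviction of the preempted workload", func() {
	util.ExpectWorkloadsToBePreempted(ctx, k8sClient, betaWl1)
	util.FinishEvictionForWorkloads(ctx, k8sClient, betaWl1)
})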

@zhifei92 (Author) replied:

Done

@zhifei92 (Author) commented:

/retest

@k8s-ci-robot commented:

@zhifei92: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kueue-test-integration-baseline-main
Commit: 447fdab
Required: true
Rerun command: /test pull-kueue-test-integration-baseline-main

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@mimowo commented Oct 29, 2025

/hold
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 29, 2025
@gabesaba commented:

Flaky test, with the recently preempted workload scheduling in front of the preempting workload. The flakiness can be addressed by setting the priority of the preempting workload higher than that of the workload being preempted, so it deterministically gets sorted first.

I am concerned that this change could introduce a preemption loop (or a situation like the one we encountered in this test, which is not a loop, but results in multiple preemptions before reaching a stable state). I will think on this a bit and comment again.

INTEGRATION_NPROCS=1 INTEGRATION_TARGET=./test/integration/singlecluster/scheduler  GINKGO_ARGS="--focus='When Queue A already has workloads borrowing resources from Queue B'" make test-integration

  STEP: Verifying the workload requiring preemption has been admitted. @ 10/31/25 08:28:48.967
  2025-10-31T08:28:48.968042706Z        LEVEL(-2)       workload-reconciler     core/workload_controller.go:813 Workload update event   {"workload": {"name":"beta-wl1","namespace":"preemption-z5ncl"}, "queue": "beta-lq", "status": "pending", "prevStatus": "admitted", "prevClusterQueue": "beta-cq"}
  2025-10-31T08:28:48.968354706Z        LEVEL(-2)       workload-reconciler     queue/manager.go:619    Attempting to move workloads    {"workload": {"name":"beta-wl1","namespace":"preemption-z5ncl"}, "queue": "beta-lq", "status": "pending", "cohort": "gpu-cohort", "root": "gpu-cohort"}
  2025-10-31T08:28:48.968724046Z        LEVEL(-2)       core/workload_controller.go:162 Reconcile Workload      {"controller": "workload_controller", "namespace": "preemption-z5ncl", "name": "beta-wl1", "reconcileID": "cf4ae6e2-2aa6-4191-bc6c-06e5b7085802"}
  2025-10-31T08:28:49.035782843Z        LEVEL(-3)       scheduler       queue/manager.go:661    Obtained ClusterQueue heads     {"schedulingCycle": 13, "count": 2}
  2025-10-31T08:28:49.036762344Z        LEVEL(-2)       scheduler       scheduler/scheduler.go:252      Attempting to schedule workload {"schedulingCycle": 13, "workload": {"name":"beta-wl1","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"beta-cq"}, "parentCohort": {"name":"gpu-cohort"}, "rootCohort": {"name":"gpu-cohort"}}
  2025-10-31T08:28:49.037076534Z        LEVEL(-2)       scheduler       scheduler/scheduler.go:630      Workload assumed in the cache   {"schedulingCycle": 13, "workload": {"name":"beta-wl1","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"beta-cq"}, "parentCohort": {"name":"gpu-cohort"}, "rootCohort": {"name":"gpu-cohort"}}
  2025-10-31T08:28:49.037345064Z        LEVEL(-2)       scheduler       scheduler/scheduler.go:252      Attempting to schedule workload {"schedulingCycle": 13, "workload": {"name":"alpha-wl-preempting","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"alpha-cq"}, "parentCohort": {"name":"gpu-cohort"}, "rootCohort": {"name":"gpu-cohort"}}
  2025-10-31T08:28:49.037501294Z        LEVEL(-3)       scheduler       scheduler/logging.go:42 Workload evaluated for admission        {"schedulingCycle": 13, "workload": {"name":"beta-wl1","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"beta-cq"}, "status": "assumed", "reason": ""}
  2025-10-31T08:28:49.037653614Z        LEVEL(-3)       scheduler       scheduler/logging.go:42 Workload evaluated for admission        {"schedulingCycle": 13, "workload": {"name":"alpha-wl-preempting","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"alpha-cq"}, "status": "skipped", "reason": "Workload no longer fits after processing another workload"}
  2025-10-31T08:28:49.037906394Z        LEVEL(-2)       scheduler       scheduler/scheduler.go:778      Workload re-queued      {"schedulingCycle": 13, "workload": {"name":"alpha-wl-preempting","namespace":"preemption-z5ncl"}, "clusterQueue": {"name":"alpha-cq"}, "queue": {"name":"alpha-lq","namespace":"preemption-z5ncl"}, "requeueReason": "FailedAfterNomination", "added": true, "status": "skipped"}
  2025-10-31T08:28:49.062841739Z        DEBUG   events  recorder/recorder.go:104        Workload no longer fits after processing another workload       {"type": "Warning", "object": {"kind":"Workload","namespace":"preemption-z5ncl","name":"alpha-wl-preempting","uid":"b87cf2e8-b314-4c01-921b-a03e2f97f92c","apiVersion":"kueue.x-k8s.io/v1beta2","resourceVersion":"265"}, "reason": "Pending"}
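
A minimal sketch of the suggested de-flaking, assuming kueue's workload test wrapper (utiltesting.MakeWorkload and its Priority helper; the names come from the logs above, the rest is illustrative):

// import utiltesting "sigs.k8s.io/kueue/pkg/util/testing"
// Give the preempting workload a strictly higher priority so the scheduler
// deterministically sorts it ahead of the workload it just preempted.
betaWl1 := utiltesting.MakeWorkload("beta-wl1", ns.Name).
	Queue("beta-lq").
	Priority(0).
	Obj()
preemptingWl := utiltesting.MakeWorkload("alpha-wl-preempting", ns.Name).
	Queue("alpha-lq").
	Priority(100). // higher than betaWl1, so it sorts first after requeue
	Obj()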


Labels

cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
do-not-merge/hold - Indicates that a PR should not merge because someone has issued a /hold command.
kind/bug - Categorizes issue or PR as related to a bug.
release-note-action-required - Denotes a PR that introduces potentially breaking changes that require user action.
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

A bug in the preemption logic
