A bug in the preemption logic #7393

@zhifei92

Description

What happened:
The following queue fails to reclaim vcuda-memory and vcuda-ratio resources because its CPU and memory usage is borrowed from other queues in the cohort. This causes the reclamation of vcuda-memory and vcuda-ratio to fail, which is undesirable: reclaiming only vcuda-memory and vcuda-ratio would be sufficient for our job to be admitted.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  finalizers:
  - kueue.x-k8s.io/resource-in-use
  generation: 5
  name: queue-3e
spec:
  cohort: group-60
  flavorFungibility:
    whenCanBorrow: Borrow
    whenCanPreempt: Preempt
  preemption:
    borrowWithinCohort:
      policy: Never
    reclaimWithinCohort: Any
    withinClusterQueue: Never
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources:
    - cpu
    - memory
    - pods
    - storage
    - rdma_devices
    flavors:
    - name: group-60-normal
      resources:
      - borrowingLimit: "40"
        name: cpu
        nominalQuota: "660"
      - borrowingLimit: 250Gi
        name: memory
        nominalQuota: 3450Gi
      - borrowingLimit: "2147483647"
        name: pods
        nominalQuota: "2147483647"
      - borrowingLimit: "35183298347008"
        name: storage
        nominalQuota: "35183298347008"
      - borrowingLimit: "35183298347008"
        name: rdma_devices
        nominalQuota: "35183298347008"
  - coveredResources:
    - vcuda-core
    - vcuda-ratio
    - vcuda-memory
    flavors:
    - name: group-60-b
      resources:
      - borrowingLimit: "0"
        name: vcuda-core
        nominalQuota: "560"
      - borrowingLimit: "0"
        name: vcuda-ratio
        nominalQuota: "3500"
      - borrowingLimit: "0"
        name: vcuda-memory
        nominalQuota: "6265"
  stopPolicy: None

status:
  flavorsUsage:
  - name: group-60-normal
    resources:
    - borrowed: 8900m
      name: cpu
      total: 668900m
    - borrowed: "0"
      name: rdma_devices
      total: "0"
    - borrowed: 79700Mi
      name: memory
      total: 3612500Mi
    - borrowed: "0"
      name: pods
      total: "34"
    - borrowed: "0"
      name: storage
      total: "0"
  - name: group-60-b
    resources:
    - borrowed: "0"
      name: vcuda-core
      total: "34"
    - borrowed: "0"
      name: vcuda-memory
      total: "5186"
    - borrowed: "0"
      name: vcuda-ratio
      total: "2900"
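To make the situation concrete, here is a minimal arithmetic check (assuming the usual unit conversions: cpu in millicores, memory in Mi, so 660 CPU = 660000m and 3450Gi = 3532800Mi) that reproduces the borrowed amounts reported in the status above:

```python
# borrowed = max(0, total - nominalQuota), per resource, from the status above.
nominal = {"cpu": 660_000, "memory": 3_532_800,      # cpu in m, memory in Mi
           "vcuda-memory": 6_265, "vcuda-ratio": 3_500}
total = {"cpu": 668_900, "memory": 3_612_500,
         "vcuda-memory": 5_186, "vcuda-ratio": 2_900}

borrowed = {r: max(0, total[r] - nominal[r]) for r in nominal}
print(borrowed)
# cpu (8900m) and memory (79700Mi) are borrowed, matching the status;
# both vcuda resources are within nominal quota.
```

Only cpu and memory are borrowed; the vcuda resources are entirely within nominal quota, so in principle reclaiming vcuda capacity alone should be enough.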

What you expected to happen:
If CPU and memory do not need to be reclaimed, and only vcuda-memory and vcuda-ratio do, then CPU and memory should not be validated during preemption. Once sufficient vcuda-memory and vcuda-ratio have been reclaimed, the job should be admitted successfully.
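The expected behavior can be sketched as follows. This is hypothetical pseudologic, not Kueue's actual API: the function name, the flat request/free maps, and the example numbers (a job requesting 2000 vcuda-memory and 1000 vcuda-ratio against the free quota implied by the status above) are all illustrative assumptions. The point is that only resources the pending workload actually requests should drive the reclaim decision:

```python
# Hypothetical sketch (not Kueue's real code): restrict the preemption
# feasibility check to the resources the pending workload requests, so
# borrowed cpu/memory cannot block a job that asks only for vcuda resources.

def resources_needing_reclaim(requests: dict, free: dict) -> set:
    """Return the resources whose free quota cannot cover the request."""
    return {r for r, need in requests.items() if need > free.get(r, 0)}

# A pending job that asks only for vcuda resources (illustrative numbers):
requests = {"vcuda-memory": 2000, "vcuda-ratio": 1000}
# Free nominal quota (nominal - used), derived from the status above:
free = {"cpu": 0, "memory": 0,
        "vcuda-memory": 6265 - 5186,   # 1079 free
        "vcuda-ratio": 3500 - 2900}    # 600 free

# cpu/memory never enter the decision because the job does not request them.
assert resources_needing_reclaim(requests, free) == {"vcuda-memory", "vcuda-ratio"}
```

Under this logic, the borrowed cpu and memory would not be validated at all, and reclaiming enough vcuda-memory and vcuda-ratio from the cohort would let the job be admitted.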

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:


Labels

kind/bug: Categorizes issue or PR as related to a bug.
