4 changes: 4 additions & 0 deletions charts/kueue/README.md
@@ -93,6 +93,9 @@ The following table lists the configurable parameters of the kueue chart and the

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| autoKueue.tasLevels | list | `[{name: cloud.provider.com/topology-block}]` | Defines the TAS levels |
| autoKueue.nodeLabel | object | `{cloud.provider.com/node-group: "tas-group"}` | Sets the ResourceFlavor node label |
| autoKueue.clusterQueueName | string | `cq` | The name of the cluster queue that will be created |
Contributor commented:

Suggested change
| autoKueue.clusterQueueName | string | `cq` | The name of the cluster queue that will be created |
| autoKueue.clusterQueueName | string | `default` | The name of the cluster queue that will be created |

wdyt?

| controllerManager.featureGates | list | `[]` | ControllerManager's feature gates |
| controllerManager.imagePullSecrets | list | `[]` | ControllerManager's imagePullSecrets |
| controllerManager.livenessProbe.failureThreshold | int | `3` | ControllerManager's livenessProbe failureThreshold |
@@ -119,6 +122,7 @@ The following table lists the configurable parameters of the kueue chart and the
| controllerManager.replicas | int | `1` | ControllerManager's replicas count |
| controllerManager.tolerations | list | `[]` | ControllerManager's tolerations |
| controllerManager.topologySpreadConstraints | list | `[]` | ControllerManager's topologySpreadConstraints |
| enableAutoKueue | bool | `false` | Enable AutoKueue for automated TAS deployment |
| enableCertManager | bool | `false` | Enable x509 automated certificate management using cert-manager (cert-manager.io) |
| enableKueueViz | bool | `false` | Enable KueueViz dashboard |
| enablePrometheus | bool | `false` | Enable Prometheus |
131 changes: 131 additions & 0 deletions charts/kueue/templates/hooks/autokueue.yaml
@@ -0,0 +1,131 @@
# 1. ConfigMap containing Kueue resource definitions
# This resource will be created in the 'kueue-system' namespace.
{{- if .Values.enableAutoKueue }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-kueue-resources
  namespace: kueue-system
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation
data:
  resources.yaml: |-
    apiVersion: kueue.x-k8s.io/v1beta1
Contributor commented:

Suggested change
apiVersion: kueue.x-k8s.io/v1beta1
apiVersion: kueue.x-k8s.io/v1beta2

nit: I think it should already work, and it seems better to use new versions going forward.

    kind: Topology
    metadata:
      name: "default"
    spec:
      levels:
        {{- range .Values.autoKueue.tasLevels }}
        - nodeLabel: {{ .name | quote }}
        {{- end }}
    ---
    kind: ResourceFlavor
    apiVersion: kueue.x-k8s.io/v1beta1
    metadata:
      name: "tas-flavor"
Contributor commented:

Suggested change
name: "tas-flavor"
name: "tas-gpu-default"

since we may need other flavors for other accelerators.
wdyt?
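
To illustrate the rename: whichever name is chosen, the ClusterQueue's `flavors[].name` has to change with it, because the ClusterQueue references the ResourceFlavor by name. A minimal sketch, assuming the suggested name `tas-gpu-default` and the chart's documented defaults (`cq`, `cloud.provider.com/node-group: "tas-group"`); it is not the rendered output of this PR:

```yaml
# Sketch only: names follow the review suggestion, values follow the chart defaults.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "tas-gpu-default"
spec:
  nodeLabels:
    cloud.provider.com/node-group: "tas-group"
  topologyName: "default"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: "tas-gpu-default" # must match the ResourceFlavor name above
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 100000000
```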

    spec:
      {{- with .Values.autoKueue.nodeLabel }}
      nodeLabels:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      topologyName: "default"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: NoSchedule
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: {{ .Values.autoKueue.clusterQueueName | quote }}
    spec:
      namespaceSelector: {} # match all.
      resourceGroups:
        - coveredResources: ["nvidia.com/gpu"]
          flavors:
            - name: "tas-flavor"
              resources:
                - name: "nvidia.com/gpu"
                  nominalQuota: 100000000
---
# 2. Service Account for the Job
# This resource will be created in the 'kueue-system' namespace.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ .Release.Name }}-kueue-hook-sa
  namespace: kueue-system
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation

---
# 3. ClusterRole with required permissions
# This defines permissions to get resources and create Kueue resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: {{ .Release.Name }}-kueue-hook-clusterrole
Contributor commented:

Suggested change
name: {{ .Release.Name }}-kueue-hook-clusterrole
name: {{ .Release.Name }}-autokueue-hook-clusterrole

wdyt?

  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
rules:
  - apiGroups: ["kueue.x-k8s.io"]
    resources: ["topologies", "resourceflavors", "clusterqueues", "localqueues"]
    verbs: ["create", "get", "list", "patch", "update"]
  - apiGroups: [""] # Core API group
    resources: ["configmaps", "endpoints"]
    verbs: ["get"]

---
# 4. A ClusterRoleBinding to grant the permissions cluster-wide
# This is required for managing cluster-scoped resources like Topologies.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: {{ .Release.Name }}-kueue-hook-crb
Contributor commented:

Suggested change
name: {{ .Release.Name }}-kueue-hook-crb
name: {{ .Release.Name }}-autokueue-hook-crb

And for the role name, I would suggest -autokueue-hook-clusterrole.

  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: {{ .Release.Name }}-kueue-hook-clusterrole
subjects:
  - kind: ServiceAccount
    name: {{ .Release.Name }}-kueue-hook-sa
    namespace: kueue-system # The namespace where the ServiceAccount lives
---
# 5. The Job that waits and applies the resources
# This Job will be created and run in the 'kueue-system' namespace.
apiVersion: batch/v1
kind: Job
metadata:
  name: "{{ .Release.Name }}-create-kueue-resources-job"
  namespace: kueue-system
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "5"
    "helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
spec:
  template:
    spec:
      serviceAccountName: {{ .Release.Name }}-kueue-hook-sa
      containers:
        - name: kubectl-apply
          image: bitnami/kubectl:latest
Contributor commented:

Can we use an image hosted on the k8s registry?

Please explore https://explore.ggcr.dev/?repo=registry.k8s.io

There is a registry.k8s.io/kubectl image hosted there.
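
A sketch of what that swap might look like in the Job's container spec. The tag is an assumption for illustration and should track the cluster's Kubernetes version. It is also worth checking that the image ships `/bin/sh`, since the hook's command relies on it:

```yaml
containers:
  - name: kubectl-apply
    # Assumption: illustrative tag, not taken from this PR.
    image: registry.k8s.io/kubectl:v1.33.0
```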

          command:
            - /bin/sh
            - -c
            - |
              set -ex
              kubectl get configmap {{ .Release.Name }}-kueue-resources -n kueue-system -o=jsonpath='{.data.resources\.yaml}' | kubectl apply -f -
              echo "🚀 All Kueue resources applied successfully."
      restartPolicy: Never
  backoffLimit: 5
{{- end }}
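
For reference, a sketch of the first document the Job would apply when the chart is rendered with the default `autoKueue.tasLevels` from this PR. Exact whitespace may differ from the real rendering:

```yaml
# Rendered from the Topology section of resources.yaml with default values (sketch).
apiVersion: kueue.x-k8s.io/v1beta1
kind: Topology
metadata:
  name: "default"
spec:
  levels:
    - nodeLabel: "cloud.provider.com/topology-block"
```

Note that the ResourceFlavor's `nodeLabels` block sits inside `{{- with .Values.autoKueue.nodeLabel }}`, so it is rendered only when that exact key is set and non-empty in the values.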
8 changes: 8 additions & 0 deletions charts/kueue/values.yaml
@@ -14,6 +14,8 @@ enableCertManager: false
enableVisibilityAPF: false
# -- Enable KueueViz dashboard
enableKueueViz: false
# -- Enable AutoKueue for automated TAS deployment
enableAutoKueue: false
# -- Kubernetes cluster's domain
kubernetesClusterDomain: cluster.local
controllerManager:
@@ -286,3 +288,9 @@ metrics:
# -- ServiceMonitor's tlsConfig
tlsConfig:
insecureSkipVerify: true
autoKueue:
  clusterQueueName: cq
  tasLevels:
    - name: cloud.provider.com/topology-block
  nodeLabel:
    cloud.provider.com/node-group: "tas-group"
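
A minimal override file for trying the feature, assuming the keys documented in the README hunk (`enableAutoKueue`, `autoKueue.clusterQueueName`, `autoKueue.tasLevels`, `autoKueue.nodeLabel`), which are also the keys the template reads. The file name is a placeholder, and the label keys and values are the chart's example defaults, so they would need to match labels actually present on the cluster's nodes:

```yaml
# my-values.yaml (hypothetical file name)
enableAutoKueue: true
autoKueue:
  clusterQueueName: cq
  tasLevels:
    - name: cloud.provider.com/topology-block
  nodeLabel:
    cloud.provider.com/node-group: "tas-group"
```

Such a file could then be passed to Helm with `-f my-values.yaml` on `helm install` or `helm upgrade`.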