jobframework JobReconciler doesn't update the PodsReady condition promptly when a status update fails #7363

@olderTaoist

Description

What happened:
When the waitForPodsReady feature is enabled and a batch Job is submitted, the jobframework JobReconciler does not update the PodsReady condition promptly if the workload status update fails. The error message is as follows:

{"level":"error","ts":"2025-10-23T09:54:18.123509681Z","caller":"jobframework/reconciler.go:523","msg":"Updating workload status","controller":"job","controllerGroup":"batch","controllerKind":"Job","Job":{"name":"k8s-job-20m-mxy","namespace":"default"},"namespace":"default","name":"k8s-job-20m-mxy","reconcileID":"23c12b4b-786f-42f5-bd58-a10213f56e19","job":"default/k8s-job-20m-mxy","gvk":"batch/v1, Kind=Job","error":"Internal error occurred: failed calling webhook \"vworkload.kb.io\": Post \"https://kueue-webhook-service.kueue-system.svc:443/validate-kueue-x-k8s-io-v1beta1-workload?timeout=10s\": dial tcp 169.169.109.130:443: connect: connection refused","stacktrace":"sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob\n\t/Users/didi/ml/kueue/pkg/controller/jobframework/reconciler.go:523\nsigs.k8s.io/kueue/pkg/controller/jobframework.(*genericReconciler).Reconcile\n\t/Users/didi/ml/kueue/pkg/controller/jobframework/reconciler.go:1522\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/Users/didi/ml/kueue/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/Users/didi/ml/kueue/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:340\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/Users/didi/ml/kueue/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/Users/didi/ml/kueue/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:202"}

What you expected to happen:
The PodsReady condition should be set to True promptly, even after a status update has failed.

How to reproduce it (as minimally and precisely as possible):

For example, reproduce the situation above, in which the webhook is only intermittently reachable: requests to it sometimes succeed and sometimes fail.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version: master
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Labels

kind/bug: Categorizes issue or PR as related to a bug.
