feat(operator): add pod deletion policy when node is down #57
Conversation
- Reformat imports to follow standard order
- Make the `cleanup_stuck_terminating_pods_on_down_nodes` call single-line
- Reformat `assert!` statements in tests to multi-line for better readability
Hey @shahab96, thanks a lot. I submitted this bug report (https://github.com/rustfs/rustfs/issues/1193) to Gemini, and here are its suggested fixes.
Totally feasible and useful changes here! Just some minor comments I had. The one for the if statement can be totally ignored, honestly. I think all we'd need here is to emit a log when the node is detected to be down. That's all, actually. Thank you for helping us out with the operator!
No problem. Thank you for helping to make the RustFS operator better! |
Hey @shahab96, update submitted.
Link
rustfs/rustfs#1193
Summary
This PR adds a configurable "Pod Deletion Policy When Node Is Down" option to the RustFS operator, inspired by Longhorn's setting. It automates cleanup of terminating Pods stuck on an unreachable node, enabling controllers (especially StatefulSets) to recreate replacement Pods without a manual `kubectl delete --force`.

Motivation / Problem
When a node becomes `NotReady`/`Unknown` (or disappears), Pods may get stuck in `Terminating` and never complete graceful shutdown. In StatefulSet-based workloads, this can prevent timely recovery because the replacement Pod may not be created until the old terminating Pod object is removed. Operators/admins currently must manually force-delete the stuck Pod to unblock recovery.
What this PR does
New Tenant configuration
Adds a new Tenant spec field:
`spec.podDeletionPolicyWhenNodeIsDown`

Supported values:

- `DoNothing` (default)
- `Delete` (normal delete)
- `ForceDelete` (delete with `gracePeriodSeconds=0`)
- `DeleteStatefulSetPod`
- `DeleteDeploymentPod` (Deployment pods are typically owned by a `ReplicaSet`)
- `DeleteBothStatefulSetAndDeploymentPod`

Reconcile behavior
During reconciliation, if the policy is enabled (not `DoNothing`), the operator:

- lists the tenant's Pods (labeled `rustfs.tenant=<tenant-name>`)
- filters to Pods that are `Terminating` (`metadata.deletionTimestamp != None`)
- looks up each Pod's node (`pod.spec.nodeName`) and treats it as "down" when the node's `Ready` condition is not `True` (`False`/`Unknown`), or the Node object no longer exists
- deletes the matching Pods according to the configured policy
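For context, the "down" check inspects the Node's status conditions. On a node that has gone unreachable, the node controller typically reports something like the excerpt below (illustrative, not taken from this PR):

```yaml
# Illustrative Node status excerpt for an unreachable node
# (roughly what `kubectl get node <name> -o yaml` shows).
status:
  conditions:
    - type: Ready
      status: "Unknown"     # anything other than "True" is treated as down
      reason: NodeStatusUnknown
      message: Kubelet stopped posting node status.
```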
RBAC changes

The operator ClusterRole is extended with read access to Nodes (`nodes`: `get`, `list`, `watch`). This is required to check Node readiness.
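As a sketch, the added rule would look like the following in the ClusterRole manifest (the surrounding ClusterRole name and metadata follow the operator's existing manifests and are omitted here):

```yaml
# Sketch of the new ClusterRole rule granting read access to Nodes.
# Nodes live in the core ("") API group.
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
```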
Example YAML
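The example manifest did not survive page extraction; below is a minimal sketch. The field name and values come from this PR, while the `apiVersion` and metadata are assumptions to be verified against the installed CRD:

```yaml
# Minimal Tenant sketch enabling the new policy.
# apiVersion and metadata are assumed; verify against your installed CRD.
apiVersion: rustfs.com/v1alpha1
kind: Tenant
metadata:
  name: my-tenant
  namespace: rustfs
spec:
  # New field from this PR; omitting it defaults to DoNothing.
  podDeletionPolicyWhenNodeIsDown: ForceDelete
```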
Safety / Caveats
- The feature is opt-in; the default policy is `DoNothing`.
- Only Pods that are already `Terminating` are considered, to keep behavior conservative and avoid deleting healthy running Pods.

Tests
Unit tests added/updated to validate:
- node "down" detection via the `Ready` condition (`True` vs `False`/`Unknown`)
- owner handling (`StatefulSet` vs `ReplicaSet` ownership)

All tests pass:
`cargo test`

Implementation notes (files changed)
Checklist
- `cargo test` passes