Cluster update should abort after the first failed load balancer node deregistration #17649

@jauru

Description

/kind bug

1. What kops version are you running? The command kops version will display
this information.

Client version: 1.33.0 (git-v1.33.0)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Server Version: v1.33.4

3. What cloud provider are you using?
OpenStack

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster --validate-count 2 --bastion-interval 2m --instance-group bastions,master-az1,master-az2,master-az3,nodes-az1,nodes-az2,nodes-az3 --validation-timeout 20m --yes

5. What happened after the commands executed?
The cluster update ran while the load balancer API was unstable. Load balancer deregistration failed for every node, yet the rolling update continued to the next instance group instead of stopping.

6. What did you expect to happen?
The cluster update should abort after the first unsuccessful node load balancer deregistration.
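For illustration only, a minimal Go sketch of the expected control flow, assuming a simple per-instance-group loop; the InstanceGroup type and the rollInstanceGroup/rollingUpdate helpers are hypothetical names, not kops source:

package main

import (
	"errors"
	"fmt"
)

type InstanceGroup struct{ Name string }

// rollInstanceGroup stands in for draining and replacing the instances of one
// group; here it fails for the first group to show the expected control flow.
func rollInstanceGroup(ig InstanceGroup) error {
	if ig.Name == "nodes-az1" {
		return errors.New("failed to deregister instance from load balancers: timed out waiting for the condition")
	}
	return nil
}

// rollingUpdate returns on the first failure instead of continuing with the
// remaining instance groups.
func rollingUpdate(groups []InstanceGroup) error {
	for _, ig := range groups {
		if err := rollInstanceGroup(ig); err != nil {
			return fmt.Errorf("failed to roll InstanceGroup %q: %w; aborting rolling update", ig.Name, err)
		}
	}
	return nil
}

func main() {
	groups := []InstanceGroup{{"nodes-az1"}, {"nodes-az2"}, {"nodes-az3"}}
	if err := rollingUpdate(groups); err != nil {
		fmt.Println(err)
	}
}

With this behavior, the failure on nodes-az1 would end the run before nodes-az2 and nodes-az3 are touched, rather than draining them against a load balancer that is still rejecting deregistrations.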

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

Kops command output:

SDK 2025/08/20 05:15:33 WARN Response has no supported checksum. Not validating response payload.
I0820 05:15:33.126370     227 create_kubecfg.go:151] unable to get user: user: Current requires cgo or $USER set in environment
I0820 05:15:36.234944     227 instancegroups.go:508] Validating the cluster.
NAME		STATUS		NEEDUPDATE	READY	MIN	TARGET	MAX	NODES
bastions	Ready		0		1	1	1	1	0
master-az1	Ready		0		1	1	1	1	1
master-az2	Ready		0		1	1	1	1	1
master-az3	Ready		0		1	1	1	1	1
nodes-az1	NeedsUpdate	1		0	1	1	1	1
nodes-az2	NeedsUpdate	1		0	1	1	1	1
nodes-az3	NeedsUpdate	1		0	1	1	1	1
I0820 05:15:38.235764     227 instancegroups.go:544] Cluster validated.
I0820 05:15:38.235793     227 instancegroups.go:342] Tainting 1 node in "nodes-az1" instancegroup.
I0820 05:15:38.274219     227 instancegroups.go:431] Draining the node: "nodes-az1-qfemgf".
I0820 05:15:40.390737     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:15:41.106014     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:15:42.291045     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:15:43.776610     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:15:46.092009     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:15:50.547771     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:15:59.236905     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:16:15.888793     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:16:49.254575     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:17:56.091020     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:20:13.179493     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:24:54.455283     227 loadbalancer.go:384] got error 409 retrying...
E0820 05:24:54.455360     227 rollingupdate.go:219] failed to roll InstanceGroup "nodes-az1": failed to drain node "nodes-az1-qfemgf": error deregistering instance "152db94f-fafa-471d-a219-d5d5f30fce25", node "nodes-az1-qfemgf": failed to deregister instance from load balancers: timed out waiting for the condition
I0820 05:24:54.455372     227 instancegroups.go:508] Validating the cluster.
I0820 05:24:56.627653     227 instancegroups.go:544] Cluster validated.
I0820 05:24:56.627685     227 instancegroups.go:342] Tainting 1 node in "nodes-az2" instancegroup.
I0820 05:24:56.668566     227 instancegroups.go:431] Draining the node: "nodes-az2-4xpav3".
I0820 05:24:57.730326     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:24:58.327227     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:24:58.807064     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:24:58.966839     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:24:59.681057     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:25:01.280384     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:25:02.136278     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:25:05.687178     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:25:06.581115     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:25:14.251399     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:25:15.305006     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:25:30.650995     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:25:32.402023     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:26:04.099553     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:26:07.273007     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:27:13.447574     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:27:16.637422     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:29:26.550514     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:29:31.983180     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:00.567880     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:06.194263     227 loadbalancer.go:384] got error 409 retrying...
E0820 05:34:06.194332     227 rollingupdate.go:219] failed to roll InstanceGroup "nodes-az2": failed to drain node "nodes-az2-4xpav3": error deregistering instance "e0a666cf-296f-4ca0-acb5-2371eab1b8a5", node "nodes-az2-4xpav3": failed to deregister instance from load balancers: timed out waiting for the condition
I0820 05:34:06.194347     227 instancegroups.go:508] Validating the cluster.
I0820 05:34:08.426196     227 instancegroups.go:544] Cluster validated.
I0820 05:34:08.426244     227 instancegroups.go:342] Tainting 1 node in "nodes-az3" instancegroup.
I0820 05:34:08.467404     227 instancegroups.go:431] Draining the node: "nodes-az3-zlclye".
I0820 05:34:09.576486     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:09.802331     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:10.417448     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:10.834617     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:11.273166     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:12.303061     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:13.178655     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:13.665537     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:17.468349     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:18.277489     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:26.212449     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:27.186908     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:43.705441     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:34:43.822280     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:35:16.946923     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:35:18.115243     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:36:22.425695     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:36:25.513390     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:38:36.309690     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:38:36.976682     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:43:04.507581     227 loadbalancer.go:384] got error 409 retrying...
I0820 05:43:10.285685     227 loadbalancer.go:384] got error 409 retrying...
E0820 05:43:10.285760     227 rollingupdate.go:219] failed to roll InstanceGroup "nodes-az3": failed to drain node "nodes-az3-zlclye": error deregistering instance "ccec663f-7e11-42a5-9969-1c6bae93eb34", node "nodes-az3-zlclye": failed to deregister instance from load balancers: timed out waiting for the condition
I0820 05:43:10.285776     227 rollingupdate.go:236] Completed rolling update for cluster "******.k8s.local" instance groups [bastions master-az1 master-az2 master-az3 nodes-az1 nodes-az2 nodes-az3]
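For context on the retry pattern above (409 responses retried at roughly doubling intervals, ending in "timed out waiting for the condition"), this is consistent with the exponential-backoff helper in k8s.io/apimachinery/pkg/util/wait. A minimal sketch of that pattern, assuming such a retry loop is used; deregisterMember and the backoff parameters are illustrative placeholders, not the actual kops implementation:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// deregisterMember stands in for the OpenStack load balancer member delete
// call; in this sketch it always reports HTTP 409 (load balancer busy).
func deregisterMember() (int, error) {
	return 409, nil
}

func main() {
	backoff := wait.Backoff{
		Duration: 1 * time.Second, // initial retry delay
		Factor:   2.0,             // intervals roughly double, as in the log above
		Steps:    12,              // after which the whole operation gives up
	}
	err := wait.ExponentialBackoff(backoff, func() (done bool, err error) {
		status, err := deregisterMember()
		if err != nil {
			return false, err // hard failure: stop retrying
		}
		if status == 409 {
			fmt.Println("got error 409 retrying...")
			return false, nil // transient conflict: retry after the next backoff step
		}
		return true, nil // member deregistered
	})
	if err != nil {
		// wait.ErrWaitTimeout renders as "timed out waiting for the condition".
		fmt.Println("failed to deregister instance from load balancers:", err)
	}
}

In the run above, each instance group burned through its full backoff budget (roughly 9-10 minutes of 409 retries), the deregistration error was logged, and the rolling update still reported "Completed rolling update" at the end.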
