Description
Calico failed to create an IP route entry for a few pods that have a WEP entry. This happens randomly on different clusters/nodes after a reboot.
Expected Behavior
Calico should recreate all IP routes for pods after a reboot.
Current Behavior
After a reboot, a few pods on random clusters are stuck in the CrashLoopBackOff state due to a network reachability issue. This is caused by the absence of an IP route for the pod IP and its cali* interface.
Pod has IP assigned
core@cop-node-88-65:~$ lspodwide | grep pvos-switch-state-processor-1
gravity pvos-switch-state-processor-1 0/2 CrashLoopBackOff 1073 (2m24s ago) 13d 172.16.111.99 cop-node-88-65.arubacorp.net <none> <none>
Calicoctl WEP list shows the IP and interface
core@cop-node-88-65:~$ calicoctl get wep -n gravity | grep pvos-switch-state-processor-1
gravity pvos-switch-state-processor-1 cop-node-88-65.arubacorp.net 172.16.111.99/32 cali8536608cf3f
ifconfig output
core@cop-node-88-65:~$ ifconfig cali8536608cf3f
cali8536608cf3f: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
ether ee:ee:ee:ee:ee:ee txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ping output
core@cop-node-88-65:~$ ping 172.16.111.99
ping: connect: Invalid argument
core@cop-node-88-65:~$ ping 172.16.111.99
ping: connect: Invalid argument
ip route output
core@cop-node-88-65:~$ ip route | grep 172.16.111.99
core@cop-node-88-65:~$
core@cop-node-88-65:~$ ip route | grep cali8536608cf3f
core@cop-node-88-65:~$
The iptables refresh interval has not been modified; everything is at its default.
2025-10-09 15:24:32.953 [INFO][81] felix/int_dataplane.go 2180: Will refresh IP sets on timer interval=1m30s
2025-10-09 15:24:32.953 [INFO][81] felix/int_dataplane.go 2180: Will refresh routes on timer interval=1m30s
2025-10-09 15:24:32.953 [INFO][81] felix/int_dataplane.go 2180: Will refresh XDP state on timer interval=1m30s
Possible Solution
Felix should be able to reconcile with the pod interfaces in the WEPs and create the missing routes. However, it does not recreate them even 12 hours after boot.
Steps to Reproduce (for bugs)
- Create a Kubernetes cluster with Calico (3/5/7 nodes)
- Bring up enough pods (300-400 pods per node)
- Reboot the cluster
- After a few minutes, find a few pods in CrashLoopBackOff with the route missing
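The check in the last step can be scripted instead of grepping pod by pod. The following is a minimal sketch, not part of Calico: `find_missing_routes` is a hypothetical helper name, and it assumes the `calicoctl get wep` column layout shown above (namespace, workload, node, network, interface) with a single IPv4 address per WEP. It must run on the affected node, since it compares against that node's routing table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report WEPs that have no matching host route.
# Reads `calicoctl get wep` rows on stdin and takes a file containing
# `ip route show` output as $1.
# Prints "<workload> <ip> <iface>" for every WEP without a route.
find_missing_routes() {
  local routes_file=$1
  local ns workload node network iface ip
  while read -r ns workload node network iface; do
    # Skip the header row and anything else without a CIDR in column 4.
    case "$network" in
      */*) ;;
      *) continue ;;
    esac
    ip=${network%%/*}
    # A healthy pod has a route line starting "<ip> dev <iface>".
    if ! awk -v ip="$ip" -v dev="$iface" \
        '$1 == ip && $2 == "dev" && $3 == dev { found = 1 } END { exit !found }' \
        "$routes_file"; then
      echo "$workload $ip $iface"
    fi
  done
}

# Example usage on the affected node (namespace taken from this report):
#   ip route show > /tmp/routes.txt
#   calicoctl get wep -n gravity | grep "$(hostname)" | find_missing_routes /tmp/routes.txt
```

For the pod in this report, the expected line would name `pvos-switch-state-processor-1` with `172.16.111.99` and `cali8536608cf3f`, since neither `ip route | grep` above returned a match.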
One observation is that if the calico-node pod is restarted, the route is reconciled with the WEP and the entry is created in the IP route table, which recovers the pod to the Running state.
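For reference, the restart workaround can be limited to the affected node. This is a sketch under assumptions: `calico_node_restart_cmd` is a hypothetical helper that only prints the command for review, the `k8s-app=calico-node` label is the one used by the standard Calico manifests, and the namespace defaults to `kube-system` (operator-based installs typically use `calico-system`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the kubectl command that deletes
# the calico-node pod on a single node so its DaemonSet recreates it.
# $1 = node name, $2 = namespace (assumed default: kube-system).
calico_node_restart_cmd() {
  local node=$1
  local namespace=${2:-kube-system}
  printf 'kubectl -n %s delete pod -l k8s-app=calico-node --field-selector spec.nodeName=%s\n' \
    "$namespace" "$node"
}

# Example: review the command for the node in this report, then run it by hand.
#   calico_node_restart_cmd cop-node-88-65.arubacorp.net
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; this only works around the symptom and does not explain why Felix misses the route during its periodic refresh.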
Context
The cluster is unstable after a reboot because pods stuck in this state never recover. We would like to know whether any specific parameter can be tuned to fix this issue.
Your Environment
- Calico version: v3.30.0
- Calico dataplane (bpf, nftables, iptables, windows etc.): iptables
- Orchestrator version (e.g. kubernetes, openshift, etc.): Kubernetes v1.32.5
- Operating System and version: Ubuntu 22.04.5 LTS