calico fails to create ip route while wep is present for the pod #11161

@ajayudayagiri-hpe

Description

Calico failed to create an IP route entry for a few pods that have a WEP (WorkloadEndpoint) entry. This happens randomly on different clusters and nodes after a reboot.

Expected Behavior

Calico should recreate all pod IP routes after a reboot.

Current Behavior

After a reboot, a few pods on random clusters are stuck in CrashLoopBackOff because of a network reachability issue. The cause is a missing IP route for the pod IP via its cali* interface.

Pod has IP assigned

core@cop-node-88-65:~$ lspodwide | grep pvos-switch-state-processor-1
gravity            pvos-switch-state-processor-1                                                                                    0/2     CrashLoopBackOff        1073 (2m24s ago)   13d     172.16.111.99    cop-node-88-65.arubacorp.net   <none>           <none>

Calicoctl WEP list shows the IP and Interface

core@cop-node-88-65:~$ calicoctl get wep -n gravity | grep pvos-switch-state-processor-1
gravity     pvos-switch-state-processor-1                                                           cop-node-88-65.arubacorp.net   172.16.111.99/32    cali8536608cf3f

Ifconfig output

core@cop-node-88-65:~$ ifconfig cali8536608cf3f
cali8536608cf3f: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        ether ee:ee:ee:ee:ee:ee  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ping output

core@cop-node-88-65:~$ ping 172.16.111.99
ping: connect: Invalid argument
core@cop-node-88-65:~$ ping 172.16.111.99
ping: connect: Invalid argument

ip route output

core@cop-node-88-65:~$ ip route | grep 172.16.111.99
core@cop-node-88-65:~$
core@cop-node-88-65:~$ ip route | grep cali8536608cf3f
core@cop-node-88-65:~$
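
To check every WEP on a node at once instead of one pod at a time, something like the following could be used. It is only a sketch: `find_missing_routes` is a hypothetical helper name, and it assumes the five-column `calicoctl get wep` layout shown above (namespace, workload, node, network, interface) and that a healthy pod has an `ip route` line of the form `<pod-ip> dev <cali-interface> ...`.

```shell
#!/usr/bin/env bash
# find_missing_routes WEP_FILE ROUTE_FILE
#   WEP_FILE:   saved output of `calicoctl get wep -n <namespace>`
#               (assumed columns: namespace workload node network interface)
#   ROUTE_FILE: saved output of `ip route`
# Prints "namespace workload ip interface" for each WEP with no matching route.
find_missing_routes() {
  local wep_file=$1 route_file=$2
  awk '
    # First file: remember every (destination, device) pair in the route table.
    FILENAME == ARGV[1] {
      for (i = 2; i < NF; i++)
        if ($i == "dev") have[$1 " " $(i + 1)] = 1
      next
    }
    # Second file: report WEPs whose IP has no route via their cali* interface.
    NF >= 5 && $5 ~ /^cali/ {
      ip = $4
      sub(/\/32$/, "", ip)   # ip route prints host routes without the /32
      if (!((ip " " $5) in have)) print $1, $2, ip, $5
    }
  ' "$route_file" "$wep_file"
}

# Example (run on the affected node):
#   calicoctl get wep -n gravity > /tmp/wep.txt
#   ip route > /tmp/routes.txt
#   find_missing_routes /tmp/wep.txt /tmp/routes.txt
```

Working from saved files keeps the comparison itself independent of the cluster, so the two outputs can also be attached to a report and compared later.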

No refresh interval has been modified; everything is at its default, as the Felix startup log shows:

2025-10-09 15:24:32.953 [INFO][81] felix/int_dataplane.go 2180: Will refresh IP sets on timer interval=1m30s
2025-10-09 15:24:32.953 [INFO][81] felix/int_dataplane.go 2180: Will refresh routes on timer interval=1m30s
2025-10-09 15:24:32.953 [INFO][81] felix/int_dataplane.go 2180: Will refresh XDP state on timer interval=1m30s

Possible Solution

Felix should be able to reconcile the pod interfaces against the WEPs and create any missing routes. However, it has not recreated them even 12 hours after boot.

Steps to Reproduce (for bugs)

  1. Create a Kubernetes cluster with Calico (3, 5, or 7 nodes)
  2. Bring up enough pods (300-400 pods per node)
  3. Reboot the cluster
  4. After a few minutes, find a few pods in CrashLoopBackOff with their routes missing

One observation: if the calico-node pod is restarted, the route is reconciled with the WEP and an entry is created in the IP route table, which returns the pod to the Running state.
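
For reference, the restart that clears the condition could be scripted roughly as below. This is a sketch rather than an official procedure: `restart_calico_node` is a hypothetical helper name, and the `k8s-app=calico-node` label and the `kube-system` namespace are assumptions about a typical manifest-based install (operator-based installs usually place the pods in `calico-system`).

```shell
#!/usr/bin/env bash
# restart_calico_node NODE [NAMESPACE]
# Deletes the calico-node pod scheduled on NODE so that its DaemonSet
# recreates it.
# Assumptions: pods are labelled k8s-app=calico-node; NAMESPACE defaults
# to kube-system.
restart_calico_node() {
  local node=$1 ns=${2:-kube-system}
  kubectl delete pod -n "$ns" -l k8s-app=calico-node \
    --field-selector "spec.nodeName=$node"
}

# Example:
#   restart_calico_node cop-node-88-65.arubacorp.net
```

This only works around the symptom; it does not explain why the periodic route refresh leaves the route missing.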

Context

The cluster is unstable after a reboot because pods stuck in this state never recover. I would like to know whether any specific parameter can be tuned to fix this issue.

Your Environment

  • Calico version: v3.30.0
  • Calico dataplane (bpf, nftables, iptables, windows etc.): iptables
  • Orchestrator version (e.g. kubernetes, openshift, etc.): Kubernetes v1.32.5
  • Operating System and version: Ubuntu 22.04.5 LTS
