We are running into network connectivity issues in our EKS cluster, and would appreciate your suggestions or tuning recommendations.
Cluster Overview:
EKS cluster size: 1000 worker nodes
Node networking: Worker nodes are launched in a secondary, non-routable private CIDR
Calico deployment: Installed via the Tigera operator, with provider: eks and VXLAN mode set to Always
Workload Pattern:
We have a specific deployment/service with:
Replicas: 575 pods (one pod per node)
Pod churn: Every 15 minutes, ~30% of the pods (about 172) are terminated and replaced with new pods, completing a full rotation of all 575 pods within the 15-minute window.
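To put numbers on the churn, here is a rough Python sketch of the arithmetic. The function name `churn_summary` is made up for illustration. It assumes replacements are spread evenly across the window and that each window replaces a different set of pods. The description above can be read two ways (30% per window, or a full rotation per window), so both are computed.

```python
import math


def churn_summary(replicas: int, percent: int, window_minutes: float) -> dict:
    """Estimate pod replacement rates for a given churn percentage per window.

    Hypothetical helper: assumes replacements are spread evenly over the
    window and that each window replaces a distinct set of pods.
    """
    replaced = replicas * percent / 100
    return {
        "pods_replaced_per_window": replaced,
        "pods_replaced_per_minute": replaced / window_minutes,
        # Number of windows needed until every pod has been replaced once.
        "windows_for_full_rotation": math.ceil(100 / percent),
    }


# Reading 1: ~30% of the 575 pods are replaced every 15 minutes.
partial = churn_summary(replicas=575, percent=30, window_minutes=15)

# Reading 2: all 575 pods are replaced within each 15-minute window.
full = churn_summary(replicas=575, percent=100, window_minutes=15)

print(partial)
print(full)
```

Under the first reading this works out to roughly 11.5 pod replacements per minute; under the second, roughly 38 per minute. Each replacement is an endpoint removal plus an endpoint addition that the dataplane on every node has to process.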
Issue Observed:
We expose the NGINX ingress controller via NodePort and the Istio ingress via LoadBalancer, and we see frequent network connection issues:
Most EC2 nodes are marked unhealthy by the load balancer; only a few remain healthy at any given time.
This results in significant connectivity and availability problems for our service.
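To quantify how many nodes are healthy at a given moment, a small Python sketch like the following could tally target states. It assumes the load balancer is an NLB or ALB with a target group (not confirmed above) and that the input is the JSON printed by `aws elbv2 describe-target-health --output json`. The function name `summarize_target_health` and the instance IDs in the sample are hypothetical.

```python
import json
from collections import Counter


def summarize_target_health(raw_json: str) -> Counter:
    """Count targets per health state from describe-target-health JSON."""
    data = json.loads(raw_json)
    descriptions = data.get("TargetHealthDescriptions", [])
    return Counter(d["TargetHealth"]["State"] for d in descriptions)


# Hypothetical sample in the shape returned by the AWS CLI; the instance IDs
# and port are placeholders, not values from our cluster.
sample = json.dumps({
    "TargetHealthDescriptions": [
        {"Target": {"Id": "i-aaaa", "Port": 30080},
         "TargetHealth": {"State": "healthy"}},
        {"Target": {"Id": "i-bbbb", "Port": 30080},
         "TargetHealth": {"State": "unhealthy"}},
        {"Target": {"Id": "i-cccc", "Port": 30080},
         "TargetHealth": {"State": "unhealthy"}},
    ]
})

counts = summarize_target_health(sample)
print(dict(counts))
```

Running this periodically against the real CLI output and comparing the counts with the pod rotation schedule would show whether the unhealthy targets track the churn.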
Questions/Request for Suggestions:
- Are there Calico or Kubernetes configurations we can tune to improve network stability and performance under these conditions?
- Are there known limitations or best practices for Calico when running at this scale and with such high pod churn?
- Are there any VXLAN-specific tuning options or EKS-specific considerations that might help?
Additional Context:
EKS version: 1.31
Calico version: v3.29.1
Tigera operator version: v3.29.1
AMI: EKS Optimized Amazon Linux 2023