Network Connectivity Issues with NodePort/LoadBalancer Services in Large EKS Cluster Using Calico (Tigera Operator, VXLAN Mode) #11243

@ranga-sabbasani

We are running into network connectivity issues in our EKS cluster, and would appreciate your suggestions or tuning recommendations.

Cluster Overview:

  • EKS cluster size: 1000 worker nodes
  • Node networking: Worker nodes are launched in a secondary, non-routable private CIDR
  • Calico deployment: Installed via the Tigera operator, with provider: eks and VXLAN mode always enabled

Workload Pattern:

We have a specific deployment/service with:
  • Replicas: 575 pods (one pod per node)
  • Pod churn: Every 15 minutes, ~30% of the pods (about 172) are terminated and replaced with new pods, completing a full rotation of all 575 pods within the 15-minute window.
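
For reference, here is a rough sketch of the churn arithmetic behind these numbers. It assumes the ~30% figure applies to each 15-minute cycle and that the count is rounded down. The function and variable names are illustrative only and are not part of any Calico or Kubernetes tooling.

```python
import math

# Figures taken from the workload description above.
REPLICAS = 575          # pods in the deployment (one per node)
CHURN_FRACTION = 0.30   # share of pods replaced per cycle (assumed per-cycle)
CYCLE_MINUTES = 15      # length of one churn cycle


def churn_summary(replicas: int, fraction: float, cycle_minutes: int) -> dict:
    """Estimate how many pods are replaced per cycle and per minute."""
    # Round down, matching the "about 172" figure in the description.
    replaced_per_cycle = math.floor(replicas * fraction)
    return {
        "replaced_per_cycle": replaced_per_cycle,
        # Average rate only; real replacements may arrive in bursts.
        "replacements_per_minute": replaced_per_cycle / cycle_minutes,
    }


summary = churn_summary(REPLICAS, CHURN_FRACTION, CYCLE_MINUTES)
print(summary)
```

Under these assumptions this works out to 172 replacements per cycle, which averages to between 11 and 12 pod replacements per minute behind the service.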

Issue Observed:

We expose the NGINX ingress controller through a NodePort service and the Istio ingress through a LoadBalancer service, and we see frequent connection failures:
  • Most EC2 nodes are marked unhealthy by the load balancer; only a few remain healthy at any given time.
  • This results in significant connectivity and availability problems for our service.

Questions/Request for Suggestions:

  • Are there Calico or Kubernetes configurations we can tune to improve network stability and performance under these conditions?
  • Are there known limitations or best practices for Calico when running at this scale and with such high pod churn?
  • Any VXLAN-specific tuning or EKS-specific considerations that might help?

Additional Context:

  • EKS version: 1.31
  • Calico version: v3.29.1
  • Tigera operator version: v3.29.1
  • AMI: EKS Optimized Amazon Linux 2023
