Performance Tuning
Optimize ExaBGP for scale, performance, and reliability
ExaBGP is lightweight by design - most deployments run with minimal tuning
- Overview
- Performance Baseline
- BGP Session Tuning
- Route Scale Optimization
- Memory Management
- CPU Optimization
- Network Performance
- API Process Optimization
- Benchmarking
- Capacity Planning
- Best Practices
ExaBGP is designed for efficiency:
Typical resource usage (single BGP session, 100 routes):
- CPU: 2-5% on modern hardware
- Memory: 50-100 MB
- Network: < 1 Mbps
- Startup time: < 1 second
Scaling characteristics:
- BGP sessions: 100+ per instance
- Routes announced: 10,000+ per session
- Route updates: 1,000+ per second
- API processes: 10+ concurrent processes
Optimize when:
- Managing 50+ BGP sessions
- Announcing 1,000+ routes per session
- Receiving full Internet tables (800k+ routes)
- High route churn (100+ updates/second)
- Resource-constrained environments (embedded systems)
Don't optimize prematurely:
- Most deployments need no tuning
- Default settings work for 95% of use cases
- Measure first, optimize second
Before optimizing, measure current performance:
#!/bin/bash
# ExaBGP performance baseline script
echo "=== ExaBGP Performance Baseline ==="
echo
# Process info
echo "Process:"
ps aux | grep exabgp | grep -v grep | awk '{print "PID: "$2" CPU: "$3"% MEM: "$4"% RSS: "$6" KB"}'
echo
# BGP sessions
echo "BGP Sessions:"
grep "neighbor.*up" /var/log/exabgp.log | tail -5
echo
# Route counts
echo "Routes Announced:"
grep "announce route" /var/log/exabgp.log | wc -l
echo
# Memory usage
echo "Memory:"
pmap $(pgrep -f exabgp) | tail -1
echo
# Network connections
echo "Network:"
ss -tan | grep :179
Run baseline test:
./baseline.sh > baseline-$(date +%Y%m%d).txt
Default values:
# ExaBGP defaults (seconds)
hold-time 180
keepalive 60
For stable networks (optimize for efficiency):
neighbor 192.168.1.1 {
router-id 192.168.1.2;
local-as 65001;
peer-as 65000;
# Longer timers reduce overhead
hold-time 240; # 4 minutes (default: 180)
# keepalive = hold-time / 3 = 80 seconds
}
For unstable networks (optimize for fast detection):
neighbor 192.168.1.1 {
# Shorter timers detect failures faster
hold-time 90; # 1.5 minutes
# keepalive = 30 seconds
}
Trade-offs (a worked example follows this list):
- Longer timers: less CPU/network overhead, slower failure detection
- Shorter timers: faster failure detection, more overhead, higher risk of false positives
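To put numbers on the trade-off, the small sketch below computes both sides for a few hold-time values. It assumes keepalive = hold-time / 3 (as in the configurations above) and uses 19 bytes as the size of a BGP KEEPALIVE (the bare BGP header); the per-day figures are per session:
#!/usr/bin/env python3
"""Worked example: failure detection time vs. keepalive overhead per hold-time."""
KEEPALIVE_SIZE = 19  # bytes - a BGP KEEPALIVE is just the 19-byte header
for hold_time in (90, 180, 240):
    keepalive = hold_time // 3         # keepalive interval = hold-time / 3
    msgs_per_day = 86400 // keepalive  # keepalives sent per session per day
    kib_per_day = msgs_per_day * KEEPALIVE_SIZE / 1024
    print(f"hold-time {hold_time:3d}s: worst-case detection {hold_time}s, "
          f"keepalive every {keepalive}s, ~{msgs_per_day} msgs/day (~{kib_per_day:.0f} KiB/day)")
The output makes the point plainly: even the shortest timer costs only a few dozen KiB per day per session, so the real deciding factor is how quickly you need to detect a dead peer versus how tolerant you are of false positives.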
System-level TCP tuning:
# /etc/sysctl.d/99-exabgp.conf
# Increase TCP buffer sizes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Enable TCP window scaling
net.ipv4.tcp_window_scaling = 1
# Reduce TIME_WAIT sockets
net.ipv4.tcp_tw_reuse = 1
# Increase connection backlog
net.core.somaxconn = 1024
# Apply changes
sysctl -p /etc/sysctl.d/99-exabgp.conf
Configure connection retry behavior:
# Environment variables
export exabgp_tcp_once=false # Keep retrying (default: true)
export exabgp_tcp_delay=5 # Retry delay in seconds (default: 5)
For flaky networks:
# Be patient with retries
export exabgp_tcp_delay=10
Configuration for large route announcements:
neighbor 192.168.1.1 {
router-id 192.168.1.2;
local-as 65001;
peer-as 65000;
family {
ipv4 unicast;
}
# Optimize for bulk route announcements
capability {
graceful-restart; # RFC 4724 - smooth restarts
add-path send/receive; # RFC 7911 - multiple paths
}
}
Batch route announcements:
✅ BEST: Use bulk announcements (ExaBGP 4.0+):
#!/usr/bin/env python3
"""
Optimal: Use bulk announcements for same attributes
"""
import sys
# Generate 10,000 routes
routes = [f"100.{i//256}.{i%256}.0/24" for i in range(10000)]
# Option 1: Announce ALL in one command (fastest)
# Good for routes with same attributes
sys.stdout.write(f"announce attributes next-hop self nlri {' '.join(routes)}\n")
sys.stdout.flush()
# Option 2: Announce in batches (if command line too long)
# Use batches of 1000 routes
batch_size = 1000
for i in range(0, len(routes), batch_size):
    batch = routes[i:i+batch_size]
    sys.stdout.write(f"announce attributes next-hop self nlri {' '.join(batch)}\n")
    sys.stdout.flush()
Performance: 10-100x faster than individual announce route commands.
Legacy method (ExaBGP 3.x or if attributes differ per route):
#!/usr/bin/env python3
"""
Legacy: Individual route announcements
"""
import sys
# Generate 10,000 routes
routes = [f"100.{i//256}.{i%256}.0/24" for i in range(10000)]
# Announce in batches
batch_size = 100
for i in range(0, len(routes), batch_size):
    batch = routes[i:i+batch_size]
    # Send batch
    for route in batch:
        sys.stdout.write(f"announce route {route} next-hop self\n")
    # Flush after batch
    sys.stdout.flush()
Filter unnecessary routes at ExaBGP:
neighbor 192.168.1.1 {
# ... neighbor config ...
api {
processes [filter];
receive {
parsed; # Receive parsed routes
update; # Receive updates
}
}
}
process filter {
run python3 /etc/exabgp/filter.py;
encoder text;
}
Filter script:
#!/usr/bin/env python3
"""
Filter incoming routes to reduce memory
"""
import sys
for line in sys.stdin:
    # Only process specific prefixes
    if 'route' in line:
        if any(prefix in line for prefix in ['10.0.0.0/8', '192.168.0.0/16']):
            # Log accepted routes
            sys.stderr.write(f"ACCEPT: {line}")
        else:
            # Drop others (saves memory)
            sys.stderr.write(f"DROP: {line}")
            continue
    # Forward accepted lines
    sys.stdout.write(line)
    sys.stdout.flush()
ExaBGP memory breakdown:
Component Memory Usage
-----------------------------------------
Base process ~30 MB
Per BGP session ~1-2 MB
Per announced route ~100 bytes
Per received route ~200 bytes
API process overhead ~5-10 MB each
Example calculation:
5 BGP sessions = 5-10 MB
1,000 announced routes = 0.1 MB
10,000 received routes = 2 MB
2 API processes = 10-20 MB
-----------------------------------------
Total estimate = ~50 MB
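The same arithmetic as a small helper you can adapt; the per-component numbers are the rough figures from the breakdown above (midpoints used for the ranges), not measured values, so treat the result as a ballpark:
#!/usr/bin/env python3
"""Rough ExaBGP memory estimate based on the per-component figures above."""

def estimate_memory_mb(sessions, routes_announced, routes_received, api_processes):
    base = 30.0                                    # base process, MB
    per_session = sessions * 1.5                   # ~1-2 MB per BGP session
    announced = routes_announced * 100 / 1024**2   # ~100 bytes per announced route
    received = routes_received * 200 / 1024**2     # ~200 bytes per received route
    api = api_processes * 7.5                      # ~5-10 MB per API process
    return base + per_session + announced + received + api

if __name__ == "__main__":
    # The example above: 5 sessions, 1,000 announced, 10,000 received, 2 API processes
    # Midpoint figures land a little above the ~50 MB estimate, which is expected.
    print(f"Estimated: ~{estimate_memory_mb(5, 1_000, 10_000, 2):.0f} MB")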
1. Disable unnecessary features:
neighbor 192.168.1.1 {
# ... neighbor config ...
# Don't receive routes if not needed
api {
processes [announcer];
receive {
# parsed; # Disable if not processing received routes
# update; # Disable if not needed
}
}
}
2. Use JSON compact format (if using JSON API):
process announcer {
run python3 /etc/exabgp/announce.py;
encoder json; # More compact than text for large routes
}
3. Limit route storage:
# Don't store routes in memory if not needed
import sys

def process_route(line):
    # Placeholder: react to the route here (update state, call an API, ...)
    sys.stderr.write(f"processing: {line}")

# Process routes immediately without storing
for line in sys.stdin:
    # Process immediately
    process_route(line)
    # Don't store in list/dict
Monitor memory usage:
#!/bin/bash
# Monitor ExaBGP memory over time
while true; do
pid=$(pgrep -f exabgp | head -1)
if [ -n "$pid" ]; then
mem=$(ps -p $pid -o rss= | awk '{printf "%.1f", $1/1024}')
echo "$(date +%H:%M:%S) ExaBGP Memory: ${mem} MB"
fi
sleep 60
done
What consumes CPU:
Activity CPU Impact
-----------------------------------------
BGP keepalives Very low
Route announcements Low-moderate
Route withdrawals Low-moderate
Route processing (API) Moderate-high
JSON parsing High
Large route updates High
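JSON parsing is the most expensive item in the table above when you run encoder json. One mitigation is a cheap substring check before json.loads, so only the messages you actually care about pay the full parsing cost. A minimal sketch - it assumes ExaBGP's JSON messages carry a top-level "type" field, and the exact spacing of that substring depends on how your ExaBGP version serialises its output, so verify it against real messages first:
#!/usr/bin/env python3
"""Cheap pre-filter before JSON parsing to cut CPU on busy sessions."""
import json
import sys

updates = 0
for line in sys.stdin:
    # Substring check is far cheaper than json.loads(); only parse what we need.
    # Assumption: messages expose a top-level "type" field with this spacing.
    if '"type": "update"' not in line:
        continue
    try:
        message = json.loads(line)  # the expensive step, now only for updates
    except json.JSONDecodeError:
        continue  # ignore malformed or partial lines
    updates += 1
    sys.stderr.write(f"parsed update #{updates} ({len(message)} top-level keys)\n")
The techniques below reduce CPU further inside the API process itself.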
1. Reduce API process CPU:
#!/usr/bin/env python3
"""
Efficient health check with minimal CPU
"""
import sys
import time
import subprocess
# Cache results to avoid repeated checks
last_check_time = 0
last_check_result = False
check_interval = 5 # Only check every 5 seconds
def check_service():
    """Cached health check"""
    global last_check_time, last_check_result
    now = time.time()
    if now - last_check_time < check_interval:
        # Return cached result
        return last_check_result
    # Perform check
    try:
        result = subprocess.run(
            ['curl', '-sf', '-m', '2', 'http://localhost/health'],
            capture_output=True,
            timeout=3
        )
        last_check_result = result.returncode == 0
    except Exception:
        last_check_result = False
    last_check_time = now
    return last_check_result

while True:
    if check_service():
        sys.stdout.write("announce route 100.10.0.100/32 next-hop self\n")
    else:
        sys.stdout.write("withdraw route 100.10.0.100/32\n")
    sys.stdout.flush()
    # Sleep to reduce CPU (critical!)
    time.sleep(10)
2. Use efficient data structures:
# ❌ SLOW: List lookup
routes = ['10.0.0.0/8', '192.168.0.0/16', ...]
if route in routes:  # O(n) lookup
    process()

# ✅ FAST: Set lookup
routes = {'10.0.0.0/8', '192.168.0.0/16', ...}
if route in routes:  # O(1) lookup
    process()
3. Avoid tight loops:
# ❌ BAD: No sleep = 100% CPU
while True:
    check_service()

# ✅ GOOD: Sleep between checks
while True:
    check_service()
    time.sleep(5)
Profile Python API processes:
# cProfile ships with the Python standard library (no install needed)
# Run with profiling
python3 -m cProfile -o profile.stats /etc/exabgp/healthcheck.py
# Analyze results
python3 -m pstats profile.stats
> sort cumulative
> stats 10
1. Interface MTU:
# Check current MTU
ip link show eth0
# Increase MTU if possible (reduces packet overhead)
ip link set eth0 mtu 9000 # Jumbo frames (if supported)
2. Disable unnecessary services on BGP interface:
# Dedicated BGP interface
# Disable unnecessary protocols
ethtool -K eth0 tso off # TCP segmentation offload
ethtool -K eth0 gso off # Generic segmentation offload
3. Use direct routing:
# Ensure BGP peers are directly connected or via L2
# Avoid routing BGP over complex topologies
ip route get 192.168.1.1
Track BGP network metrics:
#!/bin/bash
# Monitor BGP network traffic
interface="eth0"
while true; do
# BGP traffic counters on port 179 (requires iptables accounting rules
# matching dpt:179 / spt:179; without them these values will be empty)
rx=$(iptables -L -v -n -x | grep "dpt:179" | awk '{print $2}')
tx=$(iptables -L -v -n -x | grep "spt:179" | awk '{print $2}')
echo "$(date +%H:%M:%S) BGP RX: $rx TX: $tx"
sleep 60
done
Best practices for API scripts:
1. Minimize I/O operations:
# β
GOOD: Batch output
messages = []
for route in routes:
messages.append(f"announce route {route} next-hop self")
sys.stdout.write('\n'.join(messages) + '\n')
sys.stdout.flush()
# β BAD: Flush after every route
for route in routes:
sys.stdout.write(f"announce route {route} next-hop self\n")
sys.stdout.flush() # Too many flushes!2. Use bulk announcements (ExaBGP 4.0+):
# β
BEST: Use bulk announcements for same attributes
# Announce 1000 routes with same attributes in ONE command
prefixes = [f"10.{i//256}.{i%256}.0/24" for i in range(1000)]
sys.stdout.write(f"announce attributes next-hop self nlri {' '.join(prefixes)}\n")
sys.stdout.flush()
# β SLOW: Individual announcements
# 1000 separate commands = 10-100x slower
for i in range(1000):
sys.stdout.write(f"announce route 10.{i//256}.{i%256}.0/24 next-hop self\n")
sys.stdout.flush()Performance gain: 10-100x faster for bulk operations. Single command reduces parsing overhead significantly.
See also: Bulk Announcements Documentation
3. Use buffered I/O:
import sys
import io
# Use buffered stdout
output = io.TextIOWrapper(sys.stdout.buffer, line_buffering=False)
# Write messages
output.write("announce route 10.0.0.0/8 next-hop self\n")
output.flush() # Flush when ready
4. Avoid external commands:
# ❌ SLOW: External commands
import subprocess
result = subprocess.run(['curl', 'http://localhost'])

# ✅ FAST: Python libraries
import urllib.request
result = urllib.request.urlopen('http://localhost')
Limit concurrent API processes:
# Don't run too many API processes
neighbor 192.168.1.1 {
api {
# Limit to essential processes only
processes [healthcheck]; # Not 10 different processes
}
}
Create comprehensive benchmark:
#!/bin/bash
# ExaBGP benchmark suite
echo "=== ExaBGP Performance Benchmark ==="
echo "Started: $(date)"
echo
# Test 1: Startup time
echo "Test 1: Startup time"
start=$(date +%s.%N)
exabgp --test /etc/exabgp/exabgp.conf > /dev/null 2>&1
end=$(date +%s.%N)
echo "Config parse time: $(echo "$end - $start" | bc) seconds"
echo
# Test 2: Route announcement rate (individual commands)
echo "Test 2: Route announcement rate - Individual commands (1000 routes)"
cat > /tmp/bench-announce-individual.py <<'EOF'
#!/usr/bin/env python3
import sys
for i in range(1000):
    sys.stdout.write(f"announce route 100.{i//256}.{i%256}.0/24 next-hop self\n")
sys.stdout.flush()
EOF
chmod +x /tmp/bench-announce-individual.py
start=$(date +%s.%N)
/tmp/bench-announce-individual.py
end=$(date +%s.%N)
duration=$(echo "$end - $start" | bc)
rate=$(echo "1000 / $duration" | bc)
echo "Duration: $duration seconds"
echo "Rate: $rate routes/second"
echo
# Test 3: Route announcement rate (bulk announcements)
echo "Test 3: Route announcement rate - Bulk announcements (1000 routes)"
cat > /tmp/bench-announce-bulk.py <<'EOF'
#!/usr/bin/env python3
import sys
prefixes = [f"100.{i//256}.{i%256}.0/24" for i in range(1000)]
sys.stdout.write(f"announce attributes next-hop self nlri {' '.join(prefixes)}\n")
sys.stdout.flush()
EOF
chmod +x /tmp/bench-announce-bulk.py
start=$(date +%s.%N)
/tmp/bench-announce-bulk.py
end=$(date +%s.%N)
duration=$(echo "$end - $start" | bc)
rate=$(echo "1000 / $duration" | bc)
echo "Duration: $duration seconds"
echo "Rate: $rate routes/second"
echo "Note: Bulk announcements (ExaBGP 4.0+) significantly faster"
echo
# Test 4: Memory usage under load
echo "Test 4: Memory usage"
pid=$(pgrep -f exabgp | head -1)
if [ -n "$pid" ]; then
echo "Current RSS: $(ps -p $pid -o rss= | awk '{print $1/1024}') MB"
echo "Current VSZ: $(ps -p $pid -o vsz= | awk '{print $1/1024}') MB"
fi
echo
# Test 5: CPU usage
echo "Test 5: CPU usage (ps lifetime average)"
if [ -n "$pid" ]; then
cpu=$(ps -p $pid -o %cpu= | head -1)
echo "CPU: $cpu%"
fi
echo
echo "Completed: $(date)"Simulate high route churn:
#!/usr/bin/env python3
"""
Load test: Rapid route announcements/withdrawals
"""
import sys
import time
routes = [f"100.{i//256}.{i%256}.0/24" for i in range(1000)]
# Measure announcement speed
start = time.time()
for i in range(10):  # 10 iterations
    # Announce all
    for route in routes:
        sys.stdout.write(f"announce route {route} next-hop self\n")
    sys.stdout.flush()
    time.sleep(1)
    # Withdraw all
    for route in routes:
        sys.stdout.write(f"withdraw route {route}\n")
    sys.stdout.flush()
    time.sleep(1)
end = time.time()
duration = end - start
total_updates = 1000 * 10 * 2 # routes × iterations × (announce+withdraw)
rate = total_updates / duration
sys.stderr.write(f"Updates: {total_updates}\n")
sys.stderr.write(f"Duration: {duration:.2f}s\n")
sys.stderr.write(f"Rate: {rate:.0f} updates/sec\n")Benchmark different configurations:
#!/bin/bash
# Compare text vs JSON API
echo "=== Text API ==="
time exabgp /etc/exabgp/config-text.conf &
PID1=$!
sleep 30
kill $PID1
echo "=== JSON API ==="
time exabgp /etc/exabgp/config-json.conf &
PID2=$!
sleep 30
kill $PID2
# Compare memory/CPU
Resource requirements by deployment size:
Small deployment (1-5 BGP sessions, <100 routes):
CPU: 1 core @ 1 GHz
Memory: 256 MB
Network: 1 Mbps
Disk: 100 MB
Medium deployment (5-20 BGP sessions, <1,000 routes):
CPU: 2 cores @ 2 GHz
Memory: 512 MB
Network: 10 Mbps
Disk: 500 MB
Large deployment (20-50 BGP sessions, <10,000 routes):
CPU: 4 cores @ 2.5 GHz
Memory: 2 GB
Network: 100 Mbps
Disk: 1 GB
Extra-large deployment (50+ BGP sessions, receiving full tables):
CPU: 8 cores @ 3 GHz
Memory: 8 GB
Network: 1 Gbps
Disk: 5 GB
Plan for growth:
#!/usr/bin/env python3
"""
Capacity planning calculator
"""
# Current metrics
current_sessions = 10
current_routes_announced = 100
current_routes_received = 1000
# Growth projections (next 12 months)
growth_factor = 2.0 # 100% growth
# Projected metrics
projected_sessions = current_sessions * growth_factor
projected_routes_announced = current_routes_announced * growth_factor
projected_routes_received = current_routes_received * growth_factor
# Memory calculation
base_memory = 30 # MB
session_memory = projected_sessions * 2 # MB per session
route_ann_memory = (projected_routes_announced * 100) / 1024 / 1024 # MB
route_rcv_memory = (projected_routes_received * 200) / 1024 / 1024 # MB
total_memory = base_memory + session_memory + route_ann_memory + route_rcv_memory
print(f"Projected BGP sessions: {projected_sessions:.0f}")
print(f"Projected routes announced: {projected_routes_announced:.0f}")
print(f"Projected routes received: {projected_routes_received:.0f}")
print(f"Estimated memory requirement: {total_memory:.0f} MB")
print(f"Recommended server: {total_memory * 2:.0f} MB RAM") # 2x headroomConfiguration:
- Use appropriate BGP timers for your network
- Enable only necessary address families
- Disable route reception if not processing routes
- Use graceful restart for smooth failovers
API Processes:
- Always include time.sleep() in loops
- Use bulk announcements (announce attributes ... nlri) for same-attribute routes (ExaBGP 4.0+)
- Batch route announcements when possible
- Flush stdout after batch, not after each route
- Use efficient data structures (sets, dicts)
- Avoid external commands (use Python libraries)
- Cache results to avoid redundant checks
System:
- Tune TCP parameters for your workload
- Use appropriate MTU (jumbo frames if supported)
- Monitor resource usage over time
- Implement log rotation
- Use dedicated interface for BGP if possible
Monitoring:
- Track CPU and memory usage (see the sketch after this list)
- Monitor BGP session stability
- Alert on resource exhaustion
- Benchmark periodically
- Capacity plan for growth
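A minimal sketch of such a resource check. It is Linux-only (it reads VmRSS from /proc), and the 200 MB threshold and the use of pgrep are illustrative choices to adapt to your own capacity plan and alerting pipeline:
#!/usr/bin/env python3
"""Warn when ExaBGP memory (RSS) crosses a threshold. Linux /proc based."""
import subprocess
import sys
import time

THRESHOLD_MB = 200  # illustrative threshold - tune to your capacity plan

def exabgp_rss_mb():
    """Return RSS in MB of the first matching exabgp process, or None."""
    try:
        pid = subprocess.check_output(['pgrep', '-f', 'exabgp'], text=True).split()[0]
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024  # VmRSS is reported in kB
    except (subprocess.CalledProcessError, OSError, IndexError, ValueError):
        pass
    return None

while True:
    rss = exabgp_rss_mb()
    if rss is None:
        sys.stderr.write("exabgp process not found\n")
    elif rss > THRESHOLD_MB:
        sys.stderr.write(f"WARNING: ExaBGP RSS {rss:.0f} MB exceeds {THRESHOLD_MB} MB\n")
    time.sleep(60)  # keep the monitor itself cheap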
Don't:
❌ Run API processes without sleep
❌ Flush stdout after every single route
❌ Use external commands when Python libraries exist
❌ Store all routes in memory unnecessarily
❌ Receive routes if not processing them
❌ Use overly aggressive BGP timers
❌ Run dozens of API processes simultaneously
❌ Parse logs with grep | awk | sed chains (use Python)
❌ Ignore resource monitoring
❌ Over-optimize before measuring
Diagnosis:
# Check which process is consuming CPU
top -p $(pgrep -f exabgp)
# Check API processes
ps aux | grep -E 'python|exabgp' | grep -v grep
Common causes:
- API process tight loop (missing sleep)
- High route churn
- JSON parsing overhead
- Inefficient health checks
Fix:
# Add sleep to API process
while True:
    do_work()
    time.sleep(5)  # Critical!
Diagnosis:
# Check memory details
pmap $(pgrep -f exabgp)
# Check for memory leaks
watch -n 5 'ps aux | grep exabgp | grep -v grep'
Common causes:
- Receiving large route tables
- Storing routes in API process
- Memory leak in API script
- Too many API processes
Fix:
# Disable route reception if not needed
api {
receive {
# parsed; # Disable if not processing
}
}Diagnosis:
# Measure convergence time
time_start=$(date +%s.%N)
# Trigger route change
time_end=$(date +%s.%N)
echo "Convergence: $(echo "$time_end - $time_start" | bc)s"Common causes:
- Long BGP timers
- Router-side processing delays
- Network latency
- API process delays
Fix:
# Reduce hold-time (carefully!)
hold-time 90; # Down from 180
See also:
- Monitoring - Monitor ExaBGP performance
- Debugging - Troubleshooting guide
- Service HA - High availability patterns
Need help optimizing? Join our Slack community.
Ghost written by Claude (Anthropic AI)