Production Best Practices

Thomas Mangin edited this page Nov 13, 2025 · 4 revisions

Comprehensive guide to deploying ExaBGP in production environments



Introduction

Deploying ExaBGP in production requires careful attention to security, reliability, and day-to-day operations. This guide covers hardening, monitoring, high availability, disaster recovery, and the operational procedures around them.

Production Readiness Checklist

Before going to production:

  • ✅ Security hardening applied
  • ✅ Monitoring and alerting configured
  • ✅ HA architecture tested
  • ✅ Disaster recovery plan documented
  • ✅ Logging centralized
  • ✅ Runbooks created
  • ✅ Load testing completed
  • ✅ Rollback procedures tested

Critical reminder:

🔴 ExaBGP does NOT manipulate RIB/FIB - ExaBGP is a pure BGP protocol speaker. When your API programs announce/withdraw routes, ExaBGP sends BGP messages. The router installs/removes routes in its RIB/FIB. ExaBGP never touches routing tables directly.
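
To make that division of labour concrete, here is a minimal sketch of everything an API program does: it prints text commands, one per line, and ExaBGP turns them into BGP messages. The helper names `announce_command`, `withdraw_command`, and `send` are hypothetical; the command strings follow the text-encoder form used by the health-check scripts later in this guide.

```python
import sys

def announce_command(prefix: str, next_hop: str = "self") -> str:
    """Text command asking ExaBGP to send a BGP UPDATE announcing `prefix`."""
    return f"announce route {prefix} next-hop {next_hop}"

def withdraw_command(prefix: str) -> str:
    """Text command asking ExaBGP to send a BGP UPDATE withdrawing `prefix`."""
    return f"withdraw route {prefix}"

def send(command: str) -> None:
    """Write one command per line to STDOUT (read by ExaBGP) and flush."""
    sys.stdout.write(command + "\n")
    sys.stdout.flush()
```

Nothing here touches a kernel routing table: whether the prefix ends up in a RIB or FIB is decided entirely by the router that receives the resulting UPDATE.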


Security Hardening

Process Isolation

Run ExaBGP as dedicated user:

# Create exabgp user
sudo useradd -r -s /bin/false -d /var/lib/exabgp exabgp

# Set ownership
sudo chown -R exabgp:exabgp /etc/exabgp
sudo chown -R exabgp:exabgp /var/log/exabgp
sudo chown -R exabgp:exabgp /var/lib/exabgp

# Restrict permissions
sudo chmod 750 /etc/exabgp
sudo chmod 640 /etc/exabgp/*.conf

Systemd Hardening

Systemd service with security restrictions:

[Unit]
Description=ExaBGP BGP Speaker
After=network.target
Documentation=https://github.com/Exa-Networks/exabgp/wiki

[Service]
Type=simple
User=exabgp
Group=exabgp
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
Restart=on-failure
RestartSec=5s

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/exabgp /var/lib/exabgp
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictRealtime=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
SystemCallErrorNumber=EPERM

# Resource limits
LimitNOFILE=65536
LimitNPROC=100
MemoryMax=512M
CPUQuota=100%

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable exabgp
sudo systemctl start exabgp

BGP Authentication

MD5 authentication (recommended for production):

neighbor 192.168.1.1 {
    router-id 192.168.1.2;
    local-address 192.168.1.2;
    local-as 65001;
    peer-as 65000;

    # MD5 authentication
    md5-password "your-strong-password-here";

    # TTL security (GTSM)
    ttl-security 255;  # Only accept packets with TTL=255

    family {
        ipv4 unicast;
        ipv4 flowspec;
    }
}

Generate strong passwords:

# Generate random 32-character password
openssl rand -base64 24
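
If you provision secrets from Python rather than the shell, the same result can be sketched with the standard library: 24 random bytes encode to exactly 32 base64 characters. The function names and the idea of writing the secret with mode 0600 are assumptions for illustration, not part of ExaBGP.

```python
import base64
import os
import secrets

def generate_md5_password(num_bytes: int = 24) -> str:
    """Return `num_bytes` of OS randomness, base64-encoded (24 bytes -> 32 chars)."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")

def write_secret(path: str, secret: str) -> None:
    """Write the secret to `path`, creating the file readable by its owner only.

    The 0o600 mode applies when the file is created; an existing file
    keeps whatever permissions it already has.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(secret + "\n")
```

A typical use would be `write_secret("/etc/exabgp/secrets/bgp_password", generate_md5_password())`, run as a user allowed to write to that directory.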

Store passwords securely:

# Use environment variables
export BGP_MD5_PASSWORD="$(cat /etc/exabgp/secrets/bgp_password)"

# Reference in config (ExaBGP 4.x+)
neighbor 192.168.1.1 {
    md5-password env['BGP_MD5_PASSWORD'];
}

API Security

Restrict API programs:

process healthcheck {
    # Use absolute paths
    run /usr/local/bin/exabgp-healthcheck.py;

    # Set working directory
    working-directory /var/lib/exabgp;

    # Limit environment
    env {
        SERVICE_IP = '100.10.0.100';
        CHECK_INTERVAL = '5';
    }

    encoder text;
}

Validate API program permissions:

# API programs should be owned by root, not writable by exabgp
sudo chown root:root /usr/local/bin/exabgp-healthcheck.py
sudo chmod 755 /usr/local/bin/exabgp-healthcheck.py

# Prevent tampering
sudo chattr +i /usr/local/bin/exabgp-healthcheck.py  # Make immutable
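
A deployment or monitoring job can verify these ownership rules before ExaBGP starts. The sketch below is one way to express the check; `is_tamper_resistant` is a hypothetical helper, and it only inspects owner and write bits (it does not look at the immutable attribute set by `chattr`).

```python
import os
import stat

def is_tamper_resistant(path: str, trusted_uid: int = 0) -> bool:
    """True if `path` is a regular file owned by `trusted_uid`
    and not writable by group or others."""
    info = os.stat(path)
    if not stat.S_ISREG(info.st_mode):
        return False
    if info.st_uid != trusted_uid:
        return False
    return not (info.st_mode & (stat.S_IWGRP | stat.S_IWOTH))
```

With the permissions shown above (root:root, mode 755), `is_tamper_resistant("/usr/local/bin/exabgp-healthcheck.py")` would be expected to return True.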

Network Security

Firewall rules (iptables):

# Allow BGP from specific peers only
sudo iptables -A INPUT -p tcp --dport 179 -s 192.168.1.1 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 179 -j DROP

# Allow established connections
sudo iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Save rules
# (the redirection must also run as root, so wrap the whole command)
sudo sh -c 'iptables-save > /etc/iptables/rules.v4'

Firewall rules (nftables):

# /etc/nftables.conf
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;

        # Allow established connections
        ct state established,related accept

        # Allow BGP from specific peer
        ip saddr 192.168.1.1 tcp dport 179 accept

        # Drop other BGP
        tcp dport 179 drop
    }
}

Monitoring and Observability

Prometheus Metrics Exporter

Complete metrics exporter:

#!/usr/bin/env python3
"""
exabgp_exporter.py - Prometheus metrics for ExaBGP
"""
import sys
import json
import time
from prometheus_client import start_http_server, Counter, Gauge, Histogram

# Metrics
bgp_session_up = Gauge('exabgp_bgp_session_up',
                       'BGP session state (1=up, 0=down)',
                       ['peer', 'local_as', 'peer_as'])

bgp_routes_announced = Counter('exabgp_routes_announced_total',
                                'Total routes announced',
                                ['peer', 'afi', 'safi'])

bgp_routes_withdrawn = Counter('exabgp_routes_withdrawn_total',
                                'Total routes withdrawn',
                                ['peer', 'afi', 'safi'])

bgp_active_routes = Gauge('exabgp_active_routes',
                          'Number of active routes',
                          ['peer', 'afi', 'safi'])

bgp_notifications = Counter('exabgp_notifications_total',
                            'BGP NOTIFICATION messages',
                            ['peer', 'code', 'subcode'])

bgp_update_processing_time = Histogram('exabgp_update_processing_seconds',
                                       'Time to process UPDATE messages',
                                       ['peer'])

# State tracking
route_counts = {}

def log(message):
    """Log to STDERR"""
    sys.stderr.write(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {message}\n")
    sys.stderr.flush()

def handle_state(msg):
    """Process STATE messages"""
    peer = msg['neighbor']['address']['peer']
    local_as = msg['neighbor']['asn']['local']
    peer_as = msg['neighbor']['asn']['peer']
    state = msg['neighbor']['state']

    # Update metric
    value = 1 if state == 'up' else 0
    bgp_session_up.labels(peer=peer, local_as=local_as, peer_as=peer_as).set(value)

    log(f"Session {peer}: {state}")

def handle_update(msg):
    """Process UPDATE messages"""
    start_time = time.time()

    peer = msg['neighbor']['address']['peer']
    update = msg['neighbor']['message']['update']

    # Process announcements
    if 'announce' in update:
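        for family, routes in update['announce'].items():
            afi_safi = family  # e.g., "ipv4 unicast"
            # ExaBGP 4.x groups announced NLRIs by next-hop
            # ({next-hop: [nlri, ...]}); count the NLRIs, not the next-hops
            if isinstance(routes, dict):
                count = sum(len(nlris) for nlris in routes.values())
            else:
                count = len(routes)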

            bgp_routes_announced.labels(peer=peer, afi=family.split()[0],
                                       safi=family.split()[1]).inc(count)

            # Update active count
            key = (peer, afi_safi)
            route_counts[key] = route_counts.get(key, 0) + count
            bgp_active_routes.labels(peer=peer, afi=family.split()[0],
                                    safi=family.split()[1]).set(route_counts[key])

    # Process withdrawals
    if 'withdraw' in update:
        for family, routes in update['withdraw'].items():
            count = len(routes)

            bgp_routes_withdrawn.labels(peer=peer, afi=family.split()[0],
                                       safi=family.split()[1]).inc(count)

            # Update active count
            key = (peer, family)
            route_counts[key] = max(0, route_counts.get(key, 0) - count)
            bgp_active_routes.labels(peer=peer, afi=family.split()[0],
                                    safi=family.split()[1]).set(route_counts[key])

    # Record processing time
    duration = time.time() - start_time
    bgp_update_processing_time.labels(peer=peer).observe(duration)

def handle_notification(msg):
    """Process NOTIFICATION messages"""
    peer = msg['neighbor']['address']['peer']
    notification = msg['neighbor']['message']['notification']

    code = notification.get('code', 0)
    subcode = notification.get('subcode', 0)

    bgp_notifications.labels(peer=peer, code=code, subcode=subcode).inc()

    log(f"NOTIFICATION from {peer}: code={code} subcode={subcode}")

def main():
    """Main metrics exporter"""
    # Start Prometheus HTTP server
    port = 9101
    start_http_server(port)
    log(f"Prometheus metrics server started on port {port}")

    # Process BGP messages
    while True:
        line = sys.stdin.readline()
        if not line:
            break

        try:
            msg = json.loads(line.strip())
            msg_type = msg.get('type')

            if msg_type == 'state':
                handle_state(msg)
            elif msg_type == 'update':
                handle_update(msg)
            elif msg_type == 'notification':
                handle_notification(msg)

        except Exception as e:
            log(f"Error processing message: {e}")

if __name__ == '__main__':
    main()

ExaBGP configuration:

process prometheus_exporter {
    run /usr/local/bin/exabgp_exporter.py;
    encoder json;
    receive {
        parsed;
        updates;
        neighbor-changes;
    }
}

neighbor 192.168.1.1 {
    router-id 192.168.1.2;
    local-address 192.168.1.2;
    local-as 65001;
    peer-as 65000;

    api {
        processes [ prometheus_exporter ];
    }
}
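
Before wiring an exporter into a live session, it helps to replay a message offline and check what it would count. The sketch below assumes the ExaBGP 4.x JSON layout, in which announcements are grouped by next-hop and withdrawals are a flat list; `SAMPLE` is hand-written for illustration, so compare it against a message captured from your own version.

```python
import json

# Hand-written sample resembling an ExaBGP 4.x JSON UPDATE (assumption)
SAMPLE = json.dumps({
    "type": "update",
    "neighbor": {
        "address": {"local": "192.168.1.2", "peer": "192.168.1.1"},
        "message": {"update": {
            "announce": {"ipv4 unicast": {
                "192.168.1.1": [{"nlri": "10.0.0.0/24"}, {"nlri": "10.0.1.0/24"}],
            }},
            "withdraw": {"ipv4 unicast": [{"nlri": "10.0.2.0/24"}]},
        }},
    },
})

def count_nlris(routes) -> int:
    """Count NLRIs whether they are grouped by next-hop (dict) or flat (list)."""
    if isinstance(routes, dict):
        return sum(len(nlris) for nlris in routes.values())
    return len(routes)

def summarize(line: str) -> dict:
    """Return {(action, family): nlri_count} for one JSON UPDATE line."""
    update = json.loads(line)["neighbor"]["message"]["update"]
    summary = {}
    for action in ("announce", "withdraw"):
        for family, routes in update.get(action, {}).items():
            summary[(action, family)] = count_nlris(routes)
    return summary
```

Calling `summarize(SAMPLE)` reports two announced and one withdrawn `ipv4 unicast` prefix, which is what the route counters should increase by for this message.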

Prometheus Scrape Config

# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'exabgp'
    static_configs:
      - targets: ['localhost:9101']
        labels:
          instance: 'bgp-server-1'
          datacenter: 'dc1'

Grafana Dashboard

Key metrics to visualize:

  1. BGP Session Status

    • Query: exabgp_bgp_session_up
    • Type: Stat panel
    • Alert: Session down
  2. Route Count

    • Query: exabgp_active_routes
    • Type: Graph
    • Alert: Sudden drops
  3. Announcement Rate

    • Query: rate(exabgp_routes_announced_total[5m])
    • Type: Graph
    • Alert: Unusual spikes
  4. NOTIFICATION Messages

    • Query: rate(exabgp_notifications_total[5m])
    • Type: Graph
    • Alert: Any notifications

Health Check Monitoring

Monitor ExaBGP process:

#!/bin/bash
# /usr/local/bin/check_exabgp.sh

# Check if ExaBGP is running
if ! systemctl is-active --quiet exabgp; then
    echo "CRITICAL: ExaBGP is not running"
    exit 2
fi

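# Check if BGP session is established
# (ss prints the state before the addresses, so filter on state and port
# rather than grepping for ':179.*ESTAB', which never matches)
if ! ss -tn state established '( sport = :179 or dport = :179 )' | grep -q ':179'; then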
    echo "WARNING: No established BGP sessions"
    exit 1
fi

echo "OK: ExaBGP running with established sessions"
exit 0

Nagios/Icinga check:

# /etc/nagios/nrpe.d/exabgp.cfg
command[check_exabgp]=/usr/local/bin/check_exabgp.sh

Centralized Logging

Syslog configuration:

# ExaBGP environment variables
# (shell variable names cannot contain dots, so use the underscore form)
export exabgp_log_destination=syslog
export exabgp_log_level=INFO
export exabgp_log_rib=true
export exabgp_log_packets=false  # Disable in production (noisy)

Rsyslog configuration:

# /etc/rsyslog.d/exabgp.conf
if $programname == 'exabgp' then /var/log/exabgp/exabgp.log
& stop

Logrotate:

# /etc/logrotate.d/exabgp
/var/log/exabgp/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 exabgp exabgp
    sharedscripts
    postrotate
        systemctl reload rsyslog
    endscript
}

High Availability Architecture

Active-Active HA

Multiple ExaBGP instances announcing same routes:

┌────────────────────────────────────────────────────────┐
│              Router (ECMP enabled)                     │
└────────────────────────────────────────────────────────┘
         ↑                ↑                ↑
         │ BGP            │ BGP            │ BGP
         │                │                │
    ┌────┴────┐      ┌────┴────┐      ┌────┴────┐
    │ ExaBGP  │      │ ExaBGP  │      │ ExaBGP  │
    │ Server1 │      │ Server2 │      │ Server3 │
    └─────────┘      └─────────┘      └─────────┘
         ↓                ↓                ↓
    ┌─────────┐      ┌─────────┐      ┌─────────┐
    │ Service │      │ Service │      │ Service │
    │ Healthy │      │ Healthy │      │ Healthy │
    └─────────┘      └─────────┘      └─────────┘

Benefits:

  • ✅ No single point of failure
  • ✅ Automatic load distribution (ECMP)
  • ✅ Fast failover (BGP convergence)
  • ✅ Horizontal scaling

Configuration on each server:

neighbor 192.168.1.1 {
    router-id 192.168.1.2;  # Unique per server
    local-address 192.168.1.2;  # Unique per server
    local-as 65001;
    peer-as 65000;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}

process healthcheck {
    run /usr/local/bin/healthcheck.py;
    encoder text;
}

Active-Passive HA with MED

Primary/backup failover:

#!/usr/bin/env python3
# Primary server (low MED)
# is_healthy() stands for your own service check (see API Process Optimization)
SERVICE_IP = "100.10.0.100"
MED = 100  # Lower is preferred

if is_healthy():
    # flush=True: STDOUT is a pipe to ExaBGP and would otherwise be buffered
    print(f"announce route {SERVICE_IP}/32 next-hop self med {MED}", flush=True)

#!/usr/bin/env python3
# Backup server (high MED)
SERVICE_IP = "100.10.0.100"
MED = 200  # Higher MED = backup

if is_healthy():
    print(f"announce route {SERVICE_IP}/32 next-hop self med {MED}", flush=True)
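
The effect of the two MED values can be sketched as a simplified model of the router's choice. This is not how any router is implemented: MED is only compared when earlier best-path attributes (local preference, AS path length, and so on) tie, and by default only between paths learned from the same neighbouring AS. The function name and input shape are hypothetical.

```python
from typing import Dict, Optional

def preferred_path(paths: Dict[str, Optional[int]]) -> Optional[str]:
    """Pick the announcing server with the lowest MED.

    `paths` maps a server name to the MED it announces, or to None when
    the server is unhealthy and has withdrawn its route.
    """
    candidates = {name: med for name, med in paths.items() if med is not None}
    if not candidates:
        return None  # nobody announces the prefix: traffic is dropped upstream
    return min(candidates, key=candidates.get)
```

With both servers healthy the primary (MED 100) wins; once it withdraws, the backup (MED 200) becomes the only candidate and takes over without any configuration change.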

Geographic HA

Multi-datacenter deployment:

       Internet
           │
    ┌──────┴──────┐
    │   Router    │
    └──────┬──────┘
           │ BGP sessions
    ┌──────┴──────────────┐
    │                     │
┌───▼───┐             ┌───▼───┐
│  DC1  │             │  DC2  │
│       │             │       │
│ExaBGP │             │ExaBGP │
│Service│             │Service│
└───────┘             └───────┘
  MED=100               MED=200

Benefits:

  • ✅ Geographic redundancy
  • ✅ Disaster recovery
  • ✅ Reduced latency (users routed to nearest DC)

Keepalived Integration

Use keepalived for local HA:

# /etc/keepalived/keepalived.conf
vrrp_script check_exabgp {
    script "/usr/local/bin/check_exabgp.sh"
    interval 2
    weight -20
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100

    virtual_ipaddress {
        192.168.1.100/24
    }

    track_script {
        check_exabgp
    }

    # Run ExaBGP when becoming MASTER
    notify_master "/usr/local/bin/exabgp_start.sh"
    notify_backup "/usr/local/bin/exabgp_stop.sh"
}

Performance Tuning

System Tuning

Sysctl parameters:

# /etc/sysctl.d/99-exabgp.conf

# Increase connection tracking
net.netfilter.nf_conntrack_max = 262144

# TCP tuning
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3

# Socket buffers
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Backlog
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 8192

Apply the settings:

sudo sysctl -p /etc/sysctl.d/99-exabgp.conf

ExaBGP Tuning

Environment variables:

# Performance tuning
# (shell variable names cannot contain dots, so use the underscore form)
export exabgp_daemon_daemonize=true  # Not under systemd Type=simple, which expects a foreground process
export exabgp_log_packets=false  # Disable packet logging (high overhead)
export exabgp_log_parser=false   # Disable parser logging
export exabgp_tcp_bind=''        # Empty = do not listen for incoming connections
export exabgp_tcp_once=false     # Keep retrying; true makes a single connection attempt per peer

# Cache tuning
export exabgp_cache_attributes=true
export exabgp_cache_nexthops=true

# Performance mode
export exabgp_profile_enable=false  # Disable profiling

API Process Optimization

Efficient health checking:

#!/usr/bin/env python3
"""
Optimized health check - minimize overhead
"""
import sys
import time
import socket

# Configuration
SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80
CHECK_INTERVAL = 5
SOCKET_TIMEOUT = 2

# A fresh socket is created for every probe: a TCP socket cannot be
# reconnected once it has been closed
def create_socket():
    """Create configured socket"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(SOCKET_TIMEOUT)
    # TCP_NODELAY only matters if the check is extended to send data
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

def check_health_fast():
    """Fast health check: can a TCP connection to the local service be opened?"""
    sock = create_socket()
    try:
        result = sock.connect_ex(('127.0.0.1', SERVICE_PORT))
        return result == 0
    except OSError:
        return False
    finally:
        sock.close()

# State tracking (avoid redundant announcements)
announced = False

# Main loop
time.sleep(2)

while True:
    healthy = check_health_fast()

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32\n")
        sys.stdout.flush()
        announced = False

    time.sleep(CHECK_INTERVAL)

Resource Limits

Systemd limits:

[Service]
# Limit memory
MemoryMax=512M
MemoryHigh=400M

# Limit CPU
CPUQuota=50%

# Limit file descriptors
LimitNOFILE=10000

# Limit processes
LimitNPROC=100

Logging and Alerting

Structured Logging

JSON logging for easy parsing:

import json
import socket
import sys
import time

def log_json(level, event, **kwargs):
    """Structured JSON logging"""
    entry = {
        'timestamp': time.time(),
        'level': level,
        'event': event,
        'hostname': socket.gethostname(),
        **kwargs
    }
    print(json.dumps(entry), file=sys.stderr)

# Use it
log_json('INFO', 'route_announced', prefix='100.10.0.0/24', nexthop='self')
log_json('ERROR', 'health_check_failed', service='web', port=80, error='timeout')
log_json('CRITICAL', 'bgp_session_down', peer='192.168.1.1', reason='hold_timer')

Alertmanager Integration

Alert on BGP events:

# /etc/prometheus/alerts/exabgp.yml
groups:
  - name: exabgp
    interval: 30s
    rules:
      # BGP session down
      - alert: BGPSessionDown
        expr: exabgp_bgp_session_up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "BGP session down: {{ $labels.peer }}"
          description: "BGP session to {{ $labels.peer }} has been down for 2 minutes"

      # Route count dropped
      - alert: RouteCountDropped
        expr: |
          (exabgp_active_routes - exabgp_active_routes offset 5m)
          / exabgp_active_routes offset 5m < -0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Route count dropped >50%"

      # High notification rate
      - alert: HighNotificationRate
        expr: rate(exabgp_notifications_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High BGP NOTIFICATION rate"
          description: "Receiving {{ $value }} notifications/sec from {{ $labels.peer }}"

      # No route announcements
      - alert: NoRouteAnnouncements
        expr: rate(exabgp_routes_announced_total[10m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No routes announced in 15 minutes"
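
The RouteCountDropped expression is a relative change against the value five minutes earlier. The arithmetic can be checked with a small sketch; the function names are hypothetical, and the guard for a zero baseline mirrors the fact that the PromQL division then yields NaN or +Inf, neither of which is below the threshold.

```python
def relative_change(current: float, previous: float) -> float:
    """(current - previous) / previous, as in the alert expression."""
    return (current - previous) / previous

def route_drop_alert(current: float, previous: float, threshold: float = -0.5) -> bool:
    """True when the route count fell by more than 50% (for the default threshold)."""
    if previous == 0:
        return False  # no meaningful baseline to compare against
    return relative_change(current, previous) < threshold
```

Going from 100 routes to 40 is a change of -0.6 and matches; going from 100 to 50 is exactly -0.5 and does not, because the comparison is strict.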

PagerDuty Integration

Alert on critical events:

import os
import socket

import requests

# log_json() is the helper defined under Structured Logging

def alert_pagerduty(severity, summary, details):
    """Send alert to PagerDuty"""
    PAGERDUTY_API_KEY = os.getenv('PAGERDUTY_API_KEY')

    event = {
        'routing_key': PAGERDUTY_API_KEY,
        'event_action': 'trigger',
        'payload': {
            'summary': summary,
            'severity': severity,
            'source': socket.gethostname(),
            'custom_details': details
        }
    }

    try:
        response = requests.post(
            'https://events.pagerduty.com/v2/enqueue',
            json=event,
            timeout=5
        )
        response.raise_for_status()
        log_json('INFO', 'pagerduty_alert_sent', summary=summary)
    except Exception as e:
        log_json('ERROR', 'pagerduty_alert_failed', error=str(e))

# Use it (consecutive_failures and SERVICE_IP come from your health-check loop)
if consecutive_failures > 10:
    alert_pagerduty(
        severity='critical',
        summary='ExaBGP health check failing',
        details={'failures': consecutive_failures, 'service': SERVICE_IP}
    )

Disaster Recovery

Backup Procedures

Backup ExaBGP configuration:

#!/bin/bash
# /usr/local/bin/backup_exabgp.sh

BACKUP_DIR="/var/backups/exabgp"
DATE=$(date +%Y%m%d_%H%M%S)

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup config
tar czf "$BACKUP_DIR/exabgp-config-$DATE.tar.gz" \
    /etc/exabgp/*.conf \
    /etc/exabgp/api/ \
    /etc/systemd/system/exabgp.service

# Backup state (if using stateful mode)
cp -a /var/lib/exabgp "$BACKUP_DIR/exabgp-state-$DATE"

# Delete backups older than 30 days (archives and state directories)
find "$BACKUP_DIR" -maxdepth 1 -name "exabgp-*" -mtime +30 -exec rm -rf {} +

echo "Backup completed: $BACKUP_DIR/exabgp-config-$DATE.tar.gz"

Automate with cron:

# /etc/cron.d/exabgp-backup
0 2 * * * root /usr/local/bin/backup_exabgp.sh
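
The `find ... -mtime +30` line in the backup script prunes by age. If you would rather keep a fixed number of backups regardless of age, the selection can be sketched as below. It relies on the `YYYYmmdd_HHMMSS` timestamp in the file names sorting lexicographically; `backups_to_delete` is a hypothetical helper that only decides what to remove and leaves the deletion to the caller.

```python
def backups_to_delete(names, keep=30, prefix="exabgp-config-"):
    """Return the backups to remove so that only the newest `keep` remain.

    Only names starting with `prefix` are considered; the fixed-width
    timestamp in each name makes a plain sort chronological.
    """
    matching = sorted(name for name in names if name.startswith(prefix))
    if keep <= 0:
        return matching
    return matching[:-keep]
```

The caller would pass `os.listdir("/var/backups/exabgp")` and delete the returned entries.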

Recovery Procedures

Restore from backup:

#!/bin/bash
# /usr/local/bin/restore_exabgp.sh

if [ $# -ne 1 ]; then
    echo "Usage: $0 <backup-file>"
    exit 1
fi

BACKUP_FILE="$1"

# Stop ExaBGP
systemctl stop exabgp

# Restore config
tar xzf "$BACKUP_FILE" -C /

# Verify config
exabgp --test /etc/exabgp/exabgp.conf

if [ $? -eq 0 ]; then
    echo "Config valid, starting ExaBGP"
    systemctl start exabgp
else
    echo "ERROR: Invalid config, ExaBGP not started"
    exit 1
fi

Disaster Recovery Plan

Document DR procedures:

# ExaBGP Disaster Recovery Plan

## Scenario 1: ExaBGP Process Crash
**Detection:** Systemd alert, Prometheus metric `up{job="exabgp"}==0`
**Impact:** Routes withdrawn, traffic stops
**Recovery:**
1. Check systemd status: `systemctl status exabgp`
2. Check logs: `journalctl -u exabgp -n 100`
3. Restart: `systemctl restart exabgp`
4. Verify: `exabgpcli show neighbor`

## Scenario 2: BGP Session Down
**Detection:** `exabgp_bgp_session_up==0`
**Impact:** Routes not advertised
**Recovery:**
1. Check peer reachability: `ping <peer-ip>`
2. Check firewall: `iptables -L -n | grep 179`
3. Verify config: `grep <peer-ip> /etc/exabgp/exabgp.conf`
4. Check peer logs
5. Restart session: `systemctl restart exabgp`

## Scenario 3: Server Failure
**Detection:** Host unreachable
**Impact:** Routes withdrawn, traffic fails over to backup
**Recovery:**
1. Verify failover occurred
2. Monitor backup server load
3. Replace/repair failed server
4. Test before returning to service
5. Gradual traffic shift back

## Scenario 4: Configuration Error
**Detection:** ExaBGP fails to start
**Impact:** No BGP announcements
**Recovery:**
1. Validate config: `exabgp --test /etc/exabgp/exabgp.conf`
2. Check syntax errors in logs
3. Restore from backup: `/usr/local/bin/restore_exabgp.sh`
4. Test restored config
5. Start ExaBGP

## Scenario 5: API Program Crash Loop
**Detection:** Repeated restarts in logs
**Impact:** Inconsistent route announcements
**Recovery:**
1. Check API program logs
2. Test API program standalone
3. Disable API program temporarily
4. Fix bug
5. Deploy fixed version
6. Re-enable

Configuration Management

Version Control

Store configs in Git:

# Initialize repo
cd /etc/exabgp
git init
git add *.conf api/
git commit -m "Initial ExaBGP configuration"

# Add remote
git remote add origin git@<git-host>:yourorg/exabgp-configs.git
git push -u origin master

Automated deployment:

#!/bin/bash
# /usr/local/bin/deploy_exabgp_config.sh

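# Pull latest config, remembering the current commit for rollback
cd /etc/exabgp || exit 1
PREVIOUS=$(git rev-parse HEAD)
git pull origin master

# Validate config
if exabgp --test /etc/exabgp/exabgp.conf; then
    # Reload ExaBGP (needs ExecReload= in the unit file; otherwise use restart)
    systemctl reload exabgp
    echo "Configuration deployed successfully"
else
    # Roll back to the commit that was deployed before the pull
    # (HEAD^ would undo only one commit, and a pull may bring several)
    git reset --hard "$PREVIOUS"
    echo "ERROR: Invalid configuration, rolled back"
    exit 1
fi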

Ansible Automation

Ansible playbook:

---
# playbooks/exabgp.yml
- name: Deploy ExaBGP
  hosts: bgp_servers
  become: yes

  vars:
    exabgp_version: "4.2.25"
    service_ip: "100.10.0.100"
    peer_ip: "192.168.1.1"
    local_as: 65001
    peer_as: 65000

  tasks:
    - name: Install ExaBGP
      pip:
        name: "exabgp=={{ exabgp_version }}"
        state: present

    - name: Create exabgp user
      user:
        name: exabgp
        system: yes
        shell: /bin/false
        home: /var/lib/exabgp

    - name: Create directories
      file:
        path: "{{ item }}"
        state: directory
        owner: exabgp
        group: exabgp
        mode: 0750
      loop:
        - /etc/exabgp
        - /etc/exabgp/api
        - /var/log/exabgp
        - /var/lib/exabgp

    - name: Deploy configuration
      template:
        src: templates/exabgp.conf.j2
        dest: /etc/exabgp/exabgp.conf
        owner: exabgp
        group: exabgp
        mode: 0640
      notify: restart exabgp

    - name: Deploy API programs
      copy:
        src: "{{ item }}"
        dest: /etc/exabgp/api/
        owner: root
        group: root
        mode: 0755
      loop:
        - files/healthcheck.py
        - files/exporter.py
      notify: restart exabgp

    - name: Deploy systemd service
      template:
        src: templates/exabgp.service.j2
        dest: /etc/systemd/system/exabgp.service
        mode: 0644
      notify:
        - reload systemd
        - restart exabgp

    - name: Enable ExaBGP service
      systemd:
        name: exabgp
        enabled: yes
        state: started

  handlers:
    - name: reload systemd
      systemd:
        daemon_reload: yes

    - name: restart exabgp
      systemd:
        name: exabgp
        state: restarted

Real-World Deployment Patterns

Pattern 1: Anycast DNS

Use case: Global DNS service with anycast IPs

#!/usr/bin/env python3
"""
DNS anycast health check
"""
import sys
import time
import dns.resolver

SERVICE_IP = "1.1.1.1"  # Anycast IP
DNS_PORT = 53
TEST_QUERY = "example.com"

def check_dns():
    """Check if DNS resolver is working"""
    try:
        resolver = dns.resolver.Resolver()
        resolver.nameservers = ['127.0.0.1']
        resolver.timeout = 2
        resolver.lifetime = 2

        answer = resolver.resolve(TEST_QUERY, 'A')
        return len(answer) > 0
    except Exception:
        return False

announced = False
time.sleep(2)

while True:
    healthy = check_dns()

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32\n")
        sys.stdout.flush()
        announced = False

    time.sleep(5)

Pattern 2: DDoS Scrubbing Center

Use case: Redirect attack traffic to scrubbing center via FlowSpec

#!/usr/bin/env python3
"""
DDoS mitigation with FlowSpec
"""
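import sys
import time

SCRUBBING_VRF = "65001:999"

def log(message):
    """Log to STDERR (STDOUT is reserved for ExaBGP commands)"""
    sys.stderr.write(message + "\n")
    sys.stderr.flush()

def announce_flowspec_block(src_prefix, dst_port, protocol):
    """Announce a FlowSpec rule redirecting matching traffic to the scrubbing VRF"""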
    rule = (
        f"announce flow route {{ "
        f"match {{ source {src_prefix}; destination-port ={dst_port}; protocol ={protocol}; }} "
        f"then {{ redirect {SCRUBBING_VRF}; }} "
        f"}}"
    )
    sys.stdout.write(rule + "\n")
    sys.stdout.flush()

def detect_attack():
    """Detect DDoS attack (integrate with your IDS)"""
    # Example: Read from IDS output
    # Return (source, dest_port, protocol) if attack detected
    return None

time.sleep(2)

while True:
    attack = detect_attack()

    if attack:
        src, port, proto = attack
        announce_flowspec_block(src, port, proto)
        log(f"Redirected {src} to port {port} to scrubbing VRF {SCRUBBING_VRF}")

    time.sleep(1)
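
The loop above only ever announces rules, so a redirect stays in place after the attack ends and the same rule is re-announced every second while it lasts. One possible refinement is to track active rules and withdraw those that have not been seen for a while. This is a sketch under two assumptions: that a `withdraw flow route { ... }` command mirrors the announce syntax, and that a fixed time-to-live is an acceptable expiry policy. `flowspec_command` and `RuleTable` are hypothetical names.

```python
def flowspec_command(action, src_prefix, dst_port, protocol, redirect_target):
    """Build an announce/withdraw command in the same shape as the script above."""
    if action not in ("announce", "withdraw"):
        raise ValueError(f"unsupported action: {action}")
    return (
        f"{action} flow route {{ "
        f"match {{ source {src_prefix}; destination-port ={dst_port}; protocol ={protocol}; }} "
        f"then {{ redirect {redirect_target}; }} "
        f"}}"
    )

class RuleTable:
    """Tracks active rules and expires those not refreshed within `ttl` seconds."""

    def __init__(self, redirect_target, ttl):
        self.redirect_target = redirect_target
        self.ttl = ttl
        self._expires = {}  # (src_prefix, dst_port, protocol) -> expiry time

    def observe(self, rule, now):
        """Record a detection; return an announce command only for a new rule."""
        is_new = rule not in self._expires
        self._expires[rule] = now + self.ttl
        if is_new:
            return flowspec_command("announce", *rule, self.redirect_target)
        return None

    def expire(self, now):
        """Return withdraw commands for rules whose time-to-live has elapsed."""
        stale = [rule for rule, deadline in self._expires.items() if deadline <= now]
        for rule in stale:
            del self._expires[rule]
        return [flowspec_command("withdraw", *rule, self.redirect_target) for rule in stale]
```

In the main loop, `observe()` would replace the direct announce call and `expire(time.time())` would run once per iteration, with every returned command written to STDOUT and flushed.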

Pattern 3: Multi-Tier Load Balancing

Facebook/Meta Katran pattern:

┌─────────────────────────────────────────────────┐
│           Border Router (ECMP)                  │
└─────────────────┬───────────────────────────────┘
                  │
       ┌──────────┼──────────┐
       ▼          ▼          ▼
  ┌─────────┐ ┌─────────┐ ┌─────────┐
  │ ExaBGP  │ │ ExaBGP  │ │ ExaBGP  │
  │ + L4LB  │ │ + L4LB  │ │ + L4LB  │
  │ (XDP)   │ │ (XDP)   │ │ (XDP)   │
  └────┬────┘ └────┬────┘ └────┬────┘
       │           │           │
  ┌────▼───────────▼───────────▼────┐
  │      Backend Servers (ECMP)     │
  └─────────────────────────────────┘

Testing Strategies

Integration Testing

Test BGP session establishment:

#!/bin/bash
# test_bgp_session.sh

# Validate the configuration (parse only; no BGP session is opened)
timeout 30 exabgp /etc/exabgp/exabgp.conf --test

if [ $? -eq 0 ]; then
    echo "✓ Configuration valid"
else
    echo "✗ Configuration invalid"
    exit 1
fi

# Start ExaBGP
systemctl start exabgp
sleep 5

# Check BGP session (ss prints the state before the addresses, so filter
# on state and port rather than grepping for ':179.*ESTAB')
if ss -tn state established '( sport = :179 or dport = :179 )' | grep -q ':179'; then
    echo "✓ BGP session established"
else
    echo "✗ BGP session not established"
    systemctl stop exabgp
    exit 1
fi

# Verify route announcement
if exabgpcli show adj-rib out | grep -q '100.10.0.0/24'; then
    echo "✓ Route announced"
else
    echo "✗ Route not announced"
    systemctl stop exabgp
    exit 1
fi

echo "All tests passed"

Load Testing

Simulate high route count:

#!/usr/bin/env python3
"""
Load test - announce many routes
"""
import sys
import time

# Announce 10,000 routes
time.sleep(2)

for i in range(10000):
    prefix = f"100.{i // 256}.{i % 256}.0/24"
    sys.stdout.write(f"announce route {prefix} next-hop self\n")

    if i % 100 == 0:
        sys.stdout.flush()
        time.sleep(0.1)  # Rate limit

sys.stdout.flush()

# Keep running
while True:
    time.sleep(60)
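
When changing the route count or prefix length, it is easy to generate overlapping or malformed prefixes with hand-written arithmetic. A sketch using the standard `ipaddress` module produces the same sequence as the loop above and can be checked for uniqueness before anything is announced. `lab_prefixes` is a hypothetical helper, and the `100.0.0.0/8` supernet is carried over from the load test as a lab-only value.

```python
import ipaddress
from itertools import islice

def lab_prefixes(count, supernet="100.0.0.0/8", new_prefix=24):
    """Return the first `count` subnets of `supernet` as strings, in order."""
    network = ipaddress.ip_network(supernet)
    return [str(subnet) for subnet in islice(network.subnets(new_prefix=new_prefix), count)]
```

For 10,000 routes this yields `100.0.0.0/24` through `100.39.15.0/24`, identical to the prefixes built with `i // 256` and `i % 256`.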

Capacity Planning

Route Capacity

Estimate memory usage:

Memory per route (IPv4 unicast):
- Route: ~100 bytes
- Attributes: ~200 bytes
Total: ~300 bytes per route

Example:
- 100,000 routes = 30 MB
- 1,000,000 routes = 300 MB

Add 100 MB for ExaBGP overhead
Add 50 MB per API process

Total for 1M routes with 2 API processes:
300 + 100 + 100 = 500 MB
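
The same estimate as a small function, using the per-route, base, and per-process figures quoted above (rough planning numbers, not measurements). The constant and function names are hypothetical.

```python
BYTES_PER_ROUTE = 300      # ~100 bytes route + ~200 bytes attributes
BASE_OVERHEAD_MB = 100     # ExaBGP itself
PER_API_PROCESS_MB = 50    # each API process

def estimate_memory_mb(routes: int, api_processes: int) -> float:
    """Estimated memory in MB for `routes` IPv4 unicast routes."""
    route_mb = routes * BYTES_PER_ROUTE / 1_000_000
    return route_mb + BASE_OVERHEAD_MB + PER_API_PROCESS_MB * api_processes
```

`estimate_memory_mb(1_000_000, 2)` reproduces the 500 MB figure; leave headroom above it when setting systemd memory limits.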

System requirements:

| Routes | RAM | CPU |
|--------|-----|-----|
| 1,000 | 256 MB | 1 core |
| 10,000 | 512 MB | 1 core |
| 100,000 | 1 GB | 2 cores |
| 1,000,000 | 2 GB | 4 cores |

BGP Session Limits

Sessions per server:

ExaBGP can handle:
- 100+ BGP sessions per server (tested)
- Limited by CPU and network bandwidth
- Use separate ExaBGP instances for isolation

Operational Procedures

Deployment Checklist

Before deploying to production:

  • Configuration validated with --test
  • MD5 authentication configured
  • Firewall rules applied
  • Monitoring configured (Prometheus)
  • Alerts configured (Alertmanager/PagerDuty)
  • Logs centralized (syslog)
  • Backups automated (cron)
  • DR procedures documented
  • Runbooks created
  • Team trained
  • Load testing completed
  • Failover tested
  • Rollback plan tested

Runbook: BGP Session Troubleshooting

# Runbook: BGP Session Not Establishing

## Symptoms
- `exabgp_bgp_session_up==0`
- No routes announced
- Log shows: "connection refused" or "timeout"

## Diagnosis

### 1. Check ExaBGP Status

systemctl status exabgp
journalctl -u exabgp -n 50

### 2. Check Network Connectivity

ping <peer-ip>
traceroute <peer-ip>

### 3. Check BGP Port

# From ExaBGP server
telnet <peer-ip> 179

# Check listening
ss -tlnp | grep 179

### 4. Check Firewall

iptables -L -n -v | grep 179

### 5. Verify Configuration

grep <peer-ip> /etc/exabgp/exabgp.conf
exabgp --test /etc/exabgp/exabgp.conf

## Resolution

If peer unreachable:

  • Check network path
  • Verify peer is running
  • Check firewall on both sides

If connection refused:

  • Verify peer is listening on 179
  • Check peer configuration
  • Verify MD5 password matches

If connection timeout:

  • Check firewall rules
  • Verify routing to peer
  • Check MTU/MSS issues

## Escalation

If issue persists after 15 minutes:

  1. Page network team
  2. Check peer router logs
  3. Open vendor support ticket

---

## Troubleshooting

### Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Routes not announced | API program not running | Check process status |
| Route flapping | No hysteresis in health check | Add consecutive check threshold |
| High CPU usage | Too many routes | Optimize, add caching |
| Memory leak | API program not cleaning up | Fix resource management |
| BGP session flapping | Network issues or MD5 mismatch | Check logs, verify auth |
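
The "consecutive check threshold" suggested for route flapping can be sketched as a small hysteresis gate: the reported state only changes after several identical results in a row. The class name and the default thresholds (`rise=3`, `fall=2`) are assumptions to illustrate the idea, not values from ExaBGP.

```python
class Hysteresis:
    """Debounce a health check: flip state only after `rise` consecutive
    successes (to become healthy) or `fall` consecutive failures."""

    def __init__(self, rise: int = 3, fall: int = 2, healthy: bool = False):
        self.rise = rise
        self.fall = fall
        self.healthy = healthy
        self._streak = 0  # consecutive results that disagree with current state

    def update(self, check_ok: bool) -> bool:
        """Feed one check result and return the debounced state."""
        if check_ok == self.healthy:
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.rise if check_ok else self.fall
        if self._streak >= needed:
            self.healthy = check_ok
            self._streak = 0
        return self.healthy
```

In a health-check loop, announce or withdraw based on the value returned by `update()` instead of the raw check result, so a single failed probe no longer withdraws the route.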

---

## See Also

- **[API Overview](API-Overview)** - API architecture
- **[Writing API Programs](Writing-API-Programs)** - Program development
- **[Error Handling](Error-Handling)** - Error handling strategies
- **[Service High Availability](Service-High-Availability)** - HA patterns
- **[Monitoring](Monitoring)** - Monitoring guide
- **[Debugging](Debugging)** - Debugging techniques

---

**👻 Ghost written by Claude (Anthropic AI)**