Production Best Practices
Comprehensive guide to deploying ExaBGP in production environments
- Introduction
- Security Hardening
- Monitoring and Observability
- High Availability Architecture
- Performance Tuning
- Logging and Alerting
- Disaster Recovery
- Configuration Management
- Real-World Deployment Patterns
- Testing Strategies
- Capacity Planning
- Operational Procedures
- Troubleshooting
Deploying ExaBGP in production requires careful attention to security, reliability, and operational excellence. This guide provides battle-tested patterns from real-world deployments.
Before going to production:
- ✅ Security hardening applied
- ✅ Monitoring and alerting configured
- ✅ HA architecture tested
- ✅ Disaster recovery plan documented
- ✅ Logging centralized
- ✅ Runbooks created
- ✅ Load testing completed
- ✅ Rollback procedures tested
Critical reminder:
🔴 ExaBGP does NOT manipulate RIB/FIB - ExaBGP is a pure BGP protocol speaker. When your API programs announce/withdraw routes, ExaBGP sends BGP messages. The router installs/removes routes in its RIB/FIB. ExaBGP never touches routing tables directly.
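To make that division of labor concrete, here is a minimal sketch of what an API program does: it writes text commands to standard output, and ExaBGP turns them into BGP messages. The helper names `announce`, `withdraw`, and `send` are illustrative and not part of ExaBGP; the command strings follow the `announce route ... next-hop self` form used throughout this guide.

```python
import sys
from typing import Optional, TextIO


def announce(prefix: str, med: Optional[int] = None) -> str:
    """Build the text command that asks ExaBGP to announce a prefix."""
    command = f"announce route {prefix} next-hop self"
    if med is not None:
        command += f" med {med}"
    return command


def withdraw(prefix: str) -> str:
    """Build the text command that asks ExaBGP to withdraw a prefix."""
    return f"withdraw route {prefix}"


def send(command: str, out: TextIO = sys.stdout) -> None:
    """Write one command per line and flush so ExaBGP sees it immediately."""
    out.write(command + "\n")
    out.flush()


if __name__ == "__main__":
    send(announce("100.10.0.100/32", med=100))
    send(withdraw("100.10.0.100/32"))
```

Nothing in this program touches a routing table. The peer router decides what to install after it receives the resulting UPDATE.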
Run ExaBGP as dedicated user:
# Create exabgp user
sudo useradd -r -s /bin/false -d /var/lib/exabgp exabgp
# Set ownership
sudo chown -R exabgp:exabgp /etc/exabgp
sudo chown -R exabgp:exabgp /var/log/exabgp
sudo chown -R exabgp:exabgp /var/lib/exabgp
# Restrict permissions
sudo chmod 750 /etc/exabgp
sudo chmod 640 /etc/exabgp/*.conf
Systemd service with security restrictions:
[Unit]
Description=ExaBGP BGP Speaker
After=network.target
Documentation=https://github.com/Exa-Networks/exabgp/wiki
[Service]
Type=simple
User=exabgp
Group=exabgp
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
Restart=on-failure
RestartSec=5s
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/exabgp /var/lib/exabgp
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictRealtime=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
SystemCallErrorNumber=EPERM
# Resource limits
LimitNOFILE=65536
LimitNPROC=100
# MemoryMax= supersedes the deprecated MemoryLimit= on cgroup v2 systems
MemoryMax=512M
CPUQuota=100%
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable exabgp
sudo systemctl start exabgp
MD5 authentication (recommended for production):
neighbor 192.168.1.1 {
router-id 192.168.1.2;
local-address 192.168.1.2;
local-as 65001;
peer-as 65000;
# MD5 authentication
md5-password "your-strong-password-here";
# TTL security (GTSM)
ttl-security 255; # Only accept packets with TTL=255
family {
ipv4 unicast;
ipv4 flowspec;
}
}
Generate strong passwords:
# Generate random 32-character password
openssl rand -base64 24
Store passwords securely:
# Use environment variables
export BGP_MD5_PASSWORD="$(cat /etc/exabgp/secrets/bgp_password)"
# Reference in config (ExaBGP 4.x+)
neighbor 192.168.1.1 {
md5-password env['BGP_MD5_PASSWORD'];
}
Restrict API programs:
process healthcheck {
# Use absolute paths
run /usr/local/bin/exabgp-healthcheck.py;
# Set working directory
working-directory /var/lib/exabgp;
# Limit environment
env {
SERVICE_IP = '100.10.0.100';
CHECK_INTERVAL = '5';
}
encoder text;
}
Validate API program permissions:
# API programs should be owned by root, not writable by exabgp
sudo chown root:root /usr/local/bin/exabgp-healthcheck.py
sudo chmod 755 /usr/local/bin/exabgp-healthcheck.py
# Prevent tampering
sudo chattr +i /usr/local/bin/exabgp-healthcheck.py # Make immutable
Firewall rules (iptables):
# Allow established connections first
sudo iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Allow BGP from specific peers only
sudo iptables -A INPUT -p tcp --dport 179 -s 192.168.1.1 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 179 -j DROP
# Save rules (the redirect must run as root, so wrap it in a root shell)
sudo sh -c 'iptables-save > /etc/iptables/rules.v4'
Firewall rules (nftables):
# /etc/nftables.conf
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
# Allow established connections
ct state established,related accept
# Allow BGP from specific peer
ip saddr 192.168.1.1 tcp dport 179 accept
# Drop other BGP
tcp dport 179 drop
}
}
Complete metrics exporter:
#!/usr/bin/env python3
"""
exabgp_exporter.py - Prometheus metrics for ExaBGP
"""
import sys
import json
import time
from prometheus_client import start_http_server, Counter, Gauge, Histogram
# Metrics
bgp_session_up = Gauge('exabgp_bgp_session_up',
'BGP session state (1=up, 0=down)',
['peer', 'local_as', 'peer_as'])
bgp_routes_announced = Counter('exabgp_routes_announced_total',
'Total routes announced',
['peer', 'afi', 'safi'])
bgp_routes_withdrawn = Counter('exabgp_routes_withdrawn_total',
'Total routes withdrawn',
['peer', 'afi', 'safi'])
bgp_active_routes = Gauge('exabgp_active_routes',
'Number of active routes',
['peer', 'afi', 'safi'])
bgp_notifications = Counter('exabgp_notifications_total',
'BGP NOTIFICATION messages',
['peer', 'code', 'subcode'])
bgp_update_processing_time = Histogram('exabgp_update_processing_seconds',
'Time to process UPDATE messages',
['peer'])
# State tracking
route_counts = {}
def log(message):
    """Log to STDERR"""
    sys.stderr.write(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {message}\n")
    sys.stderr.flush()

def handle_state(msg):
    """Process STATE messages"""
    peer = msg['neighbor']['address']['peer']
    local_as = msg['neighbor']['asn']['local']
    peer_as = msg['neighbor']['asn']['peer']
    state = msg['neighbor']['state']
    # Update metric
    value = 1 if state == 'up' else 0
    bgp_session_up.labels(peer=peer, local_as=local_as, peer_as=peer_as).set(value)
    log(f"Session {peer}: {state}")
def handle_update(msg):
    """Process UPDATE messages"""
    start_time = time.time()
    peer = msg['neighbor']['address']['peer']
    update = msg['neighbor']['message']['update']
    # Process announcements
    if 'announce' in update:
        for family, routes in update['announce'].items():
            afi, safi = family.split(maxsplit=1)  # e.g., "ipv4 unicast"
            # Assumption: announcements are grouped by next hop
            # ({next_hop: [nlri, ...]}), as in ExaBGP 4.x JSON, so count the
            # NLRIs rather than the next hops. Verify against your version.
            if isinstance(routes, dict):
                count = sum(len(nlris) for nlris in routes.values())
            else:
                count = len(routes)
            bgp_routes_announced.labels(peer=peer, afi=afi, safi=safi).inc(count)
            # Update active count
            key = (peer, family)
            route_counts[key] = route_counts.get(key, 0) + count
            bgp_active_routes.labels(peer=peer, afi=afi, safi=safi).set(route_counts[key])
    # Process withdrawals
    if 'withdraw' in update:
        for family, routes in update['withdraw'].items():
            afi, safi = family.split(maxsplit=1)
            count = len(routes)
            bgp_routes_withdrawn.labels(peer=peer, afi=afi, safi=safi).inc(count)
            # Update active count
            key = (peer, family)
            route_counts[key] = max(0, route_counts.get(key, 0) - count)
            bgp_active_routes.labels(peer=peer, afi=afi, safi=safi).set(route_counts[key])
    # Record processing time
    duration = time.time() - start_time
    bgp_update_processing_time.labels(peer=peer).observe(duration)
def handle_notification(msg):
    """Process NOTIFICATION messages"""
    peer = msg['neighbor']['address']['peer']
    notification = msg['neighbor']['message']['notification']
    code = notification.get('code', 0)
    subcode = notification.get('subcode', 0)
    bgp_notifications.labels(peer=peer, code=code, subcode=subcode).inc()
    log(f"NOTIFICATION from {peer}: code={code} subcode={subcode}")

def main():
    """Main metrics exporter"""
    # Start Prometheus HTTP server
    port = 9101
    start_http_server(port)
    log(f"Prometheus metrics server started on port {port}")
    # Process BGP messages
    while True:
        line = sys.stdin.readline()
        if not line:
            break
        try:
            msg = json.loads(line.strip())
            msg_type = msg.get('type')
            if msg_type == 'state':
                handle_state(msg)
            elif msg_type == 'update':
                handle_update(msg)
            elif msg_type == 'notification':
                handle_notification(msg)
        except Exception as e:
            log(f"Error processing message: {e}")

if __name__ == '__main__':
    main()
ExaBGP configuration:
process prometheus_exporter {
run /usr/local/bin/exabgp_exporter.py;
encoder json;
receive {
parsed;
updates;
neighbor-changes;
}
}
neighbor 192.168.1.1 {
router-id 192.168.1.2;
local-address 192.168.1.2;
local-as 65001;
peer-as 65000;
api {
processes [ prometheus_exporter ];
}
}
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'exabgp'
    static_configs:
      - targets: ['localhost:9101']
        labels:
          instance: 'bgp-server-1'
          datacenter: 'dc1'
Key metrics to visualize:
- BGP Session Status
  - Query: exabgp_bgp_session_up
  - Type: Stat panel
  - Alert: Session down
- Route Count
  - Query: exabgp_active_routes
  - Type: Graph
  - Alert: Sudden drops
- Announcement Rate
  - Query: rate(exabgp_routes_announced_total[5m])
  - Type: Graph
  - Alert: Unusual spikes
- NOTIFICATION Messages
  - Query: rate(exabgp_notifications_total[5m])
  - Type: Graph
  - Alert: Any notifications
Monitor ExaBGP process:
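#!/bin/bash
# /usr/local/bin/check_exabgp.sh
# Check if ExaBGP is running
if ! systemctl is-active --quiet exabgp; then
    echo "CRITICAL: ExaBGP is not running"
    exit 2
fi
# Check if a BGP session is established (local or remote port 179).
# ss prints the state before the addresses, so filter by state instead of
# grepping for ':179.*ESTAB', which can never match.
if ! ss -tn state established '( sport = :179 or dport = :179 )' | grep -q ':179'; then
    echo "WARNING: No established BGP sessions"
    exit 1
fi
echo "OK: ExaBGP running with established sessions"
exit 0
Nagios/Icinga check: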
# /etc/nagios/nrpe.d/exabgp.cfg
command[check_exabgp]=/usr/local/bin/check_exabgp.sh
Syslog configuration:
# ExaBGP environment variables
# (underscore form: the shell rejects dots in names passed to export)
export exabgp_log_destination=syslog
export exabgp_log_level=INFO
export exabgp_log_rib=true
export exabgp_log_packets=false # Disable in production (noisy)
Rsyslog configuration:
# /etc/rsyslog.d/exabgp.conf
if $programname == 'exabgp' then /var/log/exabgp/exabgp.log
& stop
Logrotate:
# /etc/logrotate.d/exabgp
/var/log/exabgp/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 0640 exabgp exabgp
sharedscripts
postrotate
systemctl reload rsyslog
endscript
}
Multiple ExaBGP instances announcing same routes:
┌─────────────────────────────────┐
│      Router (ECMP enabled)      │
└─────────────────────────────────┘
     │           │           │
     │ BGP       │ BGP       │ BGP
     │           │           │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ ExaBGP  │ │ ExaBGP  │ │ ExaBGP  │
│ Server1 │ │ Server2 │ │ Server3 │
└─────────┘ └─────────┘ └─────────┘
     │           │           │
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Service │ │ Service │ │ Service │
│ Healthy │ │ Healthy │ │ Healthy │
└─────────┘ └─────────┘ └─────────┘
Benefits:
- ✅ No single point of failure
- ✅ Automatic load distribution (ECMP)
- ✅ Fast failover (BGP convergence)
- ✅ Horizontal scaling
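The load-distribution and failover benefits can be illustrated with a small sketch of flow hashing. This is only a model: real routers use vendor-specific hash functions and fields, and the function name `pick_next_hop` and the extra server addresses are hypothetical. It shows that a given flow always maps to the same server, and that when a server withdraws its route, flows are spread over the servers that remain.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Map a flow tuple onto one of the next hops currently announcing the route."""
    if not next_hops:
        raise ValueError("no next hop is announcing the route")
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return sorted(next_hops)[index]


if __name__ == "__main__":
    # 192.168.1.2 appears in this guide; the other two servers are made up.
    servers = ["192.168.1.2", "192.168.1.3", "192.168.1.4"]
    flow = ("203.0.113.7", 51512, "100.10.0.100", 80, "tcp")

    chosen = pick_next_hop(flow, servers)
    print("flow goes to", chosen)

    # The chosen server fails its health check and withdraws the route.
    remaining = [s for s in servers if s != chosen]
    print("after withdrawal, flow goes to", pick_next_hop(flow, remaining))
```

With plain modulo hashing, removing one server can also move flows that were on healthy servers. Some platforms offer resilient or consistent hashing to limit that.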
Configuration on each server:
neighbor 192.168.1.1 {
router-id 192.168.1.2; # Unique per server
local-address 192.168.1.2; # Unique per server
local-as 65001;
peer-as 65000;
family {
ipv4 unicast;
}
api {
processes [ healthcheck ];
}
}
process healthcheck {
run /usr/local/bin/healthcheck.py;
encoder text;
}
Primary/backup failover:
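#!/usr/bin/env python3
# Primary server (low MED)
SERVICE_IP = "100.10.0.100"
MED = 100 # Lower is preferred
# is_healthy() stands for your own service check (not shown here)
if is_healthy():
    # flush=True so ExaBGP receives the command immediately
    print(f"announce route {SERVICE_IP}/32 next-hop self med {MED}", flush=True)

#!/usr/bin/env python3
# Backup server (high MED)
SERVICE_IP = "100.10.0.100"
MED = 200 # Higher MED = backup
# is_healthy() stands for your own service check (not shown here)
if is_healthy():
    print(f"announce route {SERVICE_IP}/32 next-hop self med {MED}", flush=True)
Multi-datacenter deployment: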
        Internet
           │
    ┌──────┴──────┐
    │   Router    │
    └──────┬──────┘
           │ BGP sessions
    ┌──────┴──────────────┐
    │                     │
┌───▼───┐             ┌───▼───┐
│  DC1  │             │  DC2  │
│       │             │       │
│ExaBGP │             │ExaBGP │
│Service│             │Service│
└───────┘             └───────┘
 MED=100               MED=200
Benefits:
- ✅ Geographic redundancy
- ✅ Disaster recovery
- ✅ Reduced latency (users routed to nearest DC)
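How the MED values above steer traffic can be sketched in a few lines. This is a simplification under stated assumptions: it models only the MED comparison, which a router normally reaches after earlier best-path steps (local preference, AS path length, origin) have tied, and typically only between routes learned from the same neighboring AS. The `Path` class and `preferred` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    site: str
    med: int


def preferred(paths):
    """Return the path with the lowest MED, or None if nothing is announced."""
    return min(paths, key=lambda p: p.med, default=None)


if __name__ == "__main__":
    dc1 = Path("DC1", med=100)
    dc2 = Path("DC2", med=200)

    print(preferred([dc1, dc2]))  # both healthy: DC1 carries the traffic
    print(preferred([dc2]))       # DC1 withdrew its route: DC2 takes over
```

Failover needs no coordination between sites: each site announces or withdraws based on its own health check, and the router re-runs best-path selection.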
Use keepalived for local HA:
# /etc/keepalived/keepalived.conf
vrrp_script check_exabgp {
script "/usr/local/bin/check_exabgp.sh"
interval 2
weight -20
}
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 51
priority 100
virtual_ipaddress {
192.168.1.100/24
}
track_script {
check_exabgp
}
# Run ExaBGP when becoming MASTER
notify_master "/usr/local/bin/exabgp_start.sh"
notify_backup "/usr/local/bin/exabgp_stop.sh"
}
Sysctl parameters:
# /etc/sysctl.d/99-exabgp.conf
# Increase connection tracking
net.netfilter.nf_conntrack_max = 262144
# TCP tuning
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3
# Socket buffers
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# Backlog
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 8192
# Apply
sudo sysctl -p /etc/sysctl.d/99-exabgp.conf
Environment variables:
# Performance tuning
# (underscore form: the shell rejects dots in names passed to export)
# Only daemonize when ExaBGP is not supervised by systemd; Type=simple expects a foreground process
export exabgp_daemon_daemonize=true
export exabgp_log_packets=false # Disable packet logging (high overhead)
export exabgp_log_parser=false # Disable parser logging
export exabgp_tcp_bind='' # Empty = listen on all interfaces
export exabgp_tcp_once=false # Keep listening after session established
# Cache tuning
export exabgp_cache_attributes=true
export exabgp_cache_nexthops=true
# Performance mode
export exabgp_profile_enable=false # Disable profiling
Efficient health checking:
#!/usr/bin/env python3
"""
Optimized health check - minimize overhead
"""
import sys
import time
import socket

# Configuration
SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80
CHECK_INTERVAL = 5
SOCKET_TIMEOUT = 2

# A TCP socket cannot be reused after connect/close, so build a fresh one per check
def create_socket():
    """Create configured socket"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(SOCKET_TIMEOUT)
    # Set TCP_NODELAY for faster checks
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

def check_health_fast():
    """Fast health check: TCP connect to the local service"""
    sock = create_socket()
    try:
        return sock.connect_ex(('127.0.0.1', SERVICE_PORT)) == 0
    except OSError:
        return False
    finally:
        sock.close()

# State tracking (avoid redundant announcements)
announced = False

# Main loop
time.sleep(2)
while True:
    healthy = check_health_fast()
    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32\n")
        sys.stdout.flush()
        announced = False
    time.sleep(CHECK_INTERVAL)
Systemd limits:
[Service]
# Limit memory (MemoryMax= supersedes the deprecated MemoryLimit=)
MemoryMax=512M
MemoryHigh=400M
# Limit CPU
CPUQuota=50%
# Limit file descriptors
LimitNOFILE=10000
# Limit processes
LimitNPROC=100
JSON logging for easy parsing:
import json
import socket
import sys
import time

def log_json(level, event, **kwargs):
    """Structured JSON logging"""
    entry = {
        'timestamp': time.time(),
        'level': level,
        'event': event,
        'hostname': socket.gethostname(),
        **kwargs
    }
    # Log to STDERR; STDOUT is reserved for commands sent to ExaBGP
    print(json.dumps(entry), file=sys.stderr)

# Use it
log_json('INFO', 'route_announced', prefix='100.10.0.0/24', nexthop='self')
log_json('ERROR', 'health_check_failed', service='web', port=80, error='timeout')
log_json('CRITICAL', 'bgp_session_down', peer='192.168.1.1', reason='hold_timer')
Alert on BGP events:
# /etc/prometheus/alerts/exabgp.yml
groups:
  - name: exabgp
    interval: 30s
    rules:
      # BGP session down
      - alert: BGPSessionDown
        expr: exabgp_bgp_session_up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "BGP session down: {{ $labels.peer }}"
          description: "BGP session to {{ $labels.peer }} has been down for 2 minutes"
      # Route count dropped
      - alert: RouteCountDropped
        expr: |
          (exabgp_active_routes - exabgp_active_routes offset 5m)
          / exabgp_active_routes offset 5m < -0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Route count dropped >50%"
      # High notification rate
      - alert: HighNotificationRate
        expr: rate(exabgp_notifications_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High BGP NOTIFICATION rate"
          description: "Receiving {{ $value }} notifications/sec from {{ $labels.peer }}"
      # No route announcements
      - alert: NoRouteAnnouncements
        expr: rate(exabgp_routes_announced_total[10m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No routes announced in 15 minutes"
Alert on critical events:
import os
import socket
import requests

def alert_pagerduty(severity, summary, details):
    """Send alert to PagerDuty"""
    PAGERDUTY_API_KEY = os.getenv('PAGERDUTY_API_KEY')
    event = {
        'routing_key': PAGERDUTY_API_KEY,
        'event_action': 'trigger',
        'payload': {
            'summary': summary,
            'severity': severity,
            'source': socket.gethostname(),
            'custom_details': details
        }
    }
    try:
        response = requests.post(
            'https://events.pagerduty.com/v2/enqueue',
            json=event,
            timeout=5
        )
        response.raise_for_status()
        log_json('INFO', 'pagerduty_alert_sent', summary=summary)
    except Exception as e:
        log_json('ERROR', 'pagerduty_alert_failed', error=str(e))

# Use it (consecutive_failures and SERVICE_IP come from your health check loop)
if consecutive_failures > 10:
    alert_pagerduty(
        severity='critical',
        summary='ExaBGP health check failing',
        details={'failures': consecutive_failures, 'service': SERVICE_IP}
    )
Backup ExaBGP configuration:
#!/bin/bash
# /usr/local/bin/backup_exabgp.sh
BACKUP_DIR="/var/backups/exabgp"
DATE=$(date +%Y%m%d_%H%M%S)
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Backup config
tar czf "$BACKUP_DIR/exabgp-config-$DATE.tar.gz" \
/etc/exabgp/*.conf \
/etc/exabgp/api/ \
/etc/systemd/system/exabgp.service
# Backup state (if using stateful mode)
cp -a /var/lib/exabgp "$BACKUP_DIR/exabgp-state-$DATE"
# Delete backups older than 30 days (state backups are directories, so use rm -rf)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -name "exabgp-*" -mtime +30 -exec rm -rf {} +
echo "Backup completed: $BACKUP_DIR/exabgp-config-$DATE.tar.gz"
Automate with cron:
# /etc/cron.d/exabgp-backup
0 2 * * * root /usr/local/bin/backup_exabgp.sh
Restore from backup:
#!/bin/bash
# /usr/local/bin/restore_exabgp.sh
if [ $# -ne 1 ]; then
echo "Usage: $0 <backup-file>"
exit 1
fi
BACKUP_FILE="$1"
# Stop ExaBGP
systemctl stop exabgp
# Restore config
tar xzf "$BACKUP_FILE" -C /
# Verify config
exabgp --test /etc/exabgp/exabgp.conf
if [ $? -eq 0 ]; then
echo "Config valid, starting ExaBGP"
systemctl start exabgp
else
echo "ERROR: Invalid config, ExaBGP not started"
exit 1
fi
Document DR procedures:
# ExaBGP Disaster Recovery Plan
## Scenario 1: ExaBGP Process Crash
**Detection:** Systemd alert, Prometheus metric `up{job="exabgp"}==0`
**Impact:** Routes withdrawn, traffic stops
**Recovery:**
1. Check systemd status: `systemctl status exabgp`
2. Check logs: `journalctl -u exabgp -n 100`
3. Restart: `systemctl restart exabgp`
4. Verify: `exabgpcli show neighbor`
## Scenario 2: BGP Session Down
**Detection:** `exabgp_bgp_session_up==0`
**Impact:** Routes not advertised
**Recovery:**
1. Check peer reachability: `ping <peer-ip>`
2. Check firewall: `iptables -L -n | grep 179`
3. Verify config: `grep <peer-ip> /etc/exabgp/exabgp.conf`
4. Check peer logs
5. Restart session: `systemctl restart exabgp`
## Scenario 3: Server Failure
**Detection:** Host unreachable
**Impact:** Routes withdrawn, traffic fails over to backup
**Recovery:**
1. Verify failover occurred
2. Monitor backup server load
3. Replace/repair failed server
4. Test before returning to service
5. Gradual traffic shift back
## Scenario 4: Configuration Error
**Detection:** ExaBGP fails to start
**Impact:** No BGP announcements
**Recovery:**
1. Validate config: `exabgp --test /etc/exabgp/exabgp.conf`
2. Check syntax errors in logs
3. Restore from backup: `/usr/local/bin/restore_exabgp.sh`
4. Test restored config
5. Start ExaBGP
## Scenario 5: API Program Crash Loop
**Detection:** Repeated restarts in logs
**Impact:** Inconsistent route announcements
**Recovery:**
1. Check API program logs
2. Test API program standalone
3. Disable API program temporarily
4. Fix bug
5. Deploy fixed version
6. Re-enable
Store configs in Git:
# Initialize repo
cd /etc/exabgp
git init
git add *.conf api/
git commit -m "Initial ExaBGP configuration"
# Add remote
git remote add origin git@<your-git-host>:yourorg/exabgp-configs.git
git push -u origin master
Automated deployment:
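#!/bin/bash
# /usr/local/bin/deploy_exabgp_config.sh
cd /etc/exabgp || exit 1
# Remember the current commit so a rollback returns to exactly this state
PREVIOUS_COMMIT=$(git rev-parse HEAD)
# Pull latest config
git pull origin master || exit 1
# Validate config
if exabgp --test /etc/exabgp/exabgp.conf; then
    # Reload ExaBGP (assumes the unit defines ExecReload; otherwise use restart)
    systemctl reload exabgp
    echo "Configuration deployed successfully"
else
    # Rollback to the commit that was deployed before the pull
    git reset --hard "$PREVIOUS_COMMIT"
    echo "ERROR: Invalid configuration, rolled back"
    exit 1
fi
Ansible playbook: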
---
# playbooks/exabgp.yml
- name: Deploy ExaBGP
  hosts: bgp_servers
  become: yes
  vars:
    exabgp_version: "4.2.25"
    service_ip: "100.10.0.100"
    peer_ip: "192.168.1.1"
    local_as: 65001
    peer_as: 65000
  tasks:
    - name: Install ExaBGP
      pip:
        name: "exabgp=={{ exabgp_version }}"
        state: present
    - name: Create exabgp user
      user:
        name: exabgp
        system: yes
        shell: /bin/false
        home: /var/lib/exabgp
    - name: Create directories
      file:
        path: "{{ item }}"
        state: directory
        owner: exabgp
        group: exabgp
        mode: '0750'
      loop:
        - /etc/exabgp
        - /etc/exabgp/api
        - /var/log/exabgp
        - /var/lib/exabgp
    - name: Deploy configuration
      template:
        src: templates/exabgp.conf.j2
        dest: /etc/exabgp/exabgp.conf
        owner: exabgp
        group: exabgp
        mode: '0640'
      notify: restart exabgp
    - name: Deploy API programs
      copy:
        src: "{{ item }}"
        dest: /etc/exabgp/api/
        owner: root
        group: root
        mode: '0755'
      loop:
        - files/healthcheck.py
        - files/exporter.py
      notify: restart exabgp
    - name: Deploy systemd service
      template:
        src: templates/exabgp.service.j2
        dest: /etc/systemd/system/exabgp.service
        mode: '0644'
      notify:
        - reload systemd
        - restart exabgp
    - name: Enable ExaBGP service
      systemd:
        name: exabgp
        enabled: yes
        state: started
  handlers:
    - name: reload systemd
      systemd:
        daemon_reload: yes
    - name: restart exabgp
      systemd:
        name: exabgp
        state: restarted
Use case: Global DNS service with anycast IPs
#!/usr/bin/env python3
"""
DNS anycast health check
"""
import sys
import time
import dns.resolver  # Provided by the dnspython package

SERVICE_IP = "1.1.1.1" # Anycast IP
DNS_PORT = 53
TEST_QUERY = "example.com"

def check_dns():
    """Check if DNS resolver is working"""
    try:
        resolver = dns.resolver.Resolver()
        resolver.nameservers = ['127.0.0.1']
        resolver.timeout = 2
        resolver.lifetime = 2
        answer = resolver.resolve(TEST_QUERY, 'A')
        return len(answer) > 0
    except Exception:
        # Any resolution failure (timeout, NXDOMAIN, no answer) counts as unhealthy
        return False

announced = False
time.sleep(2)
while True:
    healthy = check_dns()
    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32\n")
        sys.stdout.flush()
        announced = False
    time.sleep(5)
Use case: Redirect attack traffic to scrubbing center via FlowSpec
#!/usr/bin/env python3
"""
DDoS mitigation with FlowSpec
"""
import sys
import time

SCRUBBING_VRF = "65001:999"

def log(message):
    """Log to STDERR (STDOUT is reserved for commands sent to ExaBGP)"""
    sys.stderr.write(f"{message}\n")
    sys.stderr.flush()

def announce_flowspec_block(src_prefix, dst_port, protocol):
    """Announce a FlowSpec rule redirecting matching traffic to the scrubbing VRF"""
    rule = (
        f"announce flow route {{ "
        f"match {{ source {src_prefix}; destination-port ={dst_port}; protocol ={protocol}; }} "
        f"then {{ redirect {SCRUBBING_VRF}; }} "
        f"}}"
    )
    sys.stdout.write(rule + "\n")
    sys.stdout.flush()

def detect_attack():
    """Detect DDoS attack (integrate with your IDS)"""
    # Example: Read from IDS output
    # Return (source, dest_port, protocol) if attack detected
    return None

time.sleep(2)
while True:
    attack = detect_attack()
    if attack:
        src, port, proto = attack
        announce_flowspec_block(src, port, proto)
        log(f"Redirected {src} to port {port}")
    time.sleep(1)
Facebook/Meta Katran pattern:
┌─────────────────────────────────┐
│      Border Router (ECMP)       │
└────────────────┬────────────────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ ExaBGP  │ │ ExaBGP  │ │ ExaBGP  │
│ + L4LB  │ │ + L4LB  │ │ + L4LB  │
│  (XDP)  │ │  (XDP)  │ │  (XDP)  │
└────┬────┘ └────┬────┘ └────┬────┘
     │           │           │
┌────▼───────────▼───────────▼────┐
│     Backend Servers (ECMP)      │
└─────────────────────────────────┘
Test BGP session establishment:
#!/bin/bash
# test_bgp_session.sh
# Validate the configuration in test mode
if timeout 30 exabgp /etc/exabgp/exabgp.conf --test; then
    echo "✓ Configuration valid"
else
    echo "✗ Configuration invalid"
    exit 1
fi
# Start ExaBGP
systemctl start exabgp
sleep 5
# Check BGP session (ss prints the state before the addresses, so filter by
# state instead of grepping for ':179.*ESTAB', which can never match)
if ss -tn state established '( sport = :179 or dport = :179 )' | grep -q ':179'; then
    echo "✓ BGP session established"
else
    echo "✗ BGP session not established"
    systemctl stop exabgp
    exit 1
fi
# Verify route announcement
if exabgpcli show adj-rib out | grep -q '100.10.0.0/24'; then
    echo "✓ Route announced"
else
    echo "✗ Route not announced"
    systemctl stop exabgp
    exit 1
fi
echo "All tests passed"
Simulate high route count:
#!/usr/bin/env python3
"""
Load test - announce many routes
"""
import sys
import time

# Announce 10,000 routes
time.sleep(2)
for i in range(10000):
    prefix = f"100.{i // 256}.{i % 256}.0/24"
    sys.stdout.write(f"announce route {prefix} next-hop self\n")
    if i % 100 == 0:
        sys.stdout.flush()
        time.sleep(0.1) # Rate limit
sys.stdout.flush()

# Keep running
while True:
    time.sleep(60)
Estimate memory usage:
Memory per route (IPv4 unicast):
- Route: ~100 bytes
- Attributes: ~200 bytes
Total: ~300 bytes per route
Example:
- 100,000 routes = 30 MB
- 1,000,000 routes = 300 MB
Add 100 MB for ExaBGP overhead
Add 50 MB per API process
Total for 1M routes with 2 API processes:
300 + 100 + 100 = 500 MB
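The arithmetic above can be wrapped in a small helper for other route counts. This is a rough sketch that uses only the per-route, base, and per-process figures quoted in this section (with 1 MB taken as 1,000,000 bytes); the function name `estimate_memory_mb` is illustrative, and real usage should be confirmed by load testing.

```python
BYTES_PER_ROUTE = 300       # ~100 bytes route + ~200 bytes attributes
BASE_OVERHEAD_MB = 100      # ExaBGP overhead
PER_API_PROCESS_MB = 50     # each API process


def estimate_memory_mb(routes: int, api_processes: int) -> float:
    """Estimate total memory in MB for a route count and number of API processes."""
    route_mb = routes * BYTES_PER_ROUTE / 1_000_000
    return route_mb + BASE_OVERHEAD_MB + api_processes * PER_API_PROCESS_MB


if __name__ == "__main__":
    for routes in (100_000, 1_000_000):
        total = estimate_memory_mb(routes, api_processes=2)
        print(f"{routes:>9,} routes, 2 API processes: {total:.0f} MB")
```

Compare the result with the memory limit in the systemd unit: the 1M-route example already sits at about 500 MB, just under a 512M cap.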
System requirements:
| Routes | RAM | CPU |
|---|---|---|
| 1,000 | 256 MB | 1 core |
| 10,000 | 512 MB | 1 core |
| 100,000 | 1 GB | 2 cores |
| 1,000,000 | 2 GB | 4 cores |
Sessions per server:
ExaBGP can handle:
- 100+ BGP sessions per server (tested)
- Limited by CPU and network bandwidth
- Use separate ExaBGP instances for isolation
Before deploying to production:
- Configuration validated with --test
- MD5 authentication configured
- Firewall rules applied
- Monitoring configured (Prometheus)
- Alerts configured (Alertmanager/PagerDuty)
- Logs centralized (syslog)
- Backups automated (cron)
- DR procedures documented
- Runbooks created
- Team trained
- Load testing completed
- Failover tested
- Rollback plan tested
# Runbook: BGP Session Not Establishing
## Symptoms
- `exabgp_bgp_session_up==0`
- No routes announced
- Log shows: "connection refused" or "timeout"
## Diagnosis
### 1. Check ExaBGP Status
```bash
systemctl status exabgp
journalctl -u exabgp -n 50

# Check network connectivity
ping <peer-ip>
traceroute <peer-ip>

# Check the BGP port from the ExaBGP server
telnet <peer-ip> 179
# Check listening
ss -tlnp | grep 179

# Check firewall
iptables -L -n -v | grep 179

# Verify configuration
grep <peer-ip> /etc/exabgp/exabgp.conf
exabgp --test /etc/exabgp/exabgp.conf
```
- Check network path
- Verify peer is running
- Check firewall on both sides
- Verify peer is listening on 179
- Check peer configuration
- Verify MD5 password matches
- Check firewall rules
- Verify routing to peer
- Check MTU/MSS issues
If issue persists after 15 minutes:
- Page network team
- Check peer router logs
- Open vendor support ticket
---
## Troubleshooting
### Common Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| Routes not announced | API program not running | Check process status |
| Route flapping | No hysteresis in health check | Add consecutive check threshold |
| High CPU usage | Too many routes | Optimize, add caching |
| Memory leak | API program not cleaning up | Fix resource management |
| BGP session flapping | Network issues or MD5 mismatch | Check logs, verify auth |
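The "consecutive check threshold" fix for route flapping can be sketched as a small state machine. This is one possible approach, not ExaBGP's own mechanism: the class name `Hysteresis` and the `rise`/`fall` parameters are illustrative. The route is only announced after `rise` successful checks in a row and only withdrawn after `fall` failed checks in a row, so a single blip does not cause a BGP update.

```python
from typing import Optional


class Hysteresis:
    """Turn raw health check results into stable announce/withdraw decisions."""

    def __init__(self, rise: int = 3, fall: int = 3):
        self.rise = rise          # consecutive successes needed to announce
        self.fall = fall          # consecutive failures needed to withdraw
        self.healthy = False      # current stable state (False = not announced)
        self._streak = 0          # consecutive results contradicting the state

    def update(self, check_ok: bool) -> Optional[str]:
        """Feed one check result; return 'announce', 'withdraw', or None."""
        if check_ok == self.healthy:
            self._streak = 0      # result agrees with the state: reset the streak
            return None
        self._streak += 1
        threshold = self.rise if check_ok else self.fall
        if self._streak < threshold:
            return None
        self.healthy = check_ok
        self._streak = 0
        return "announce" if check_ok else "withdraw"


if __name__ == "__main__":
    state = Hysteresis(rise=2, fall=3)
    for result in [True, True, False, True, False, False, False]:
        print(result, "->", state.update(result))
```

In a health check loop, write the matching `announce route` or `withdraw route` command to standard output only when `update()` returns a non-`None` value.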
---
## See Also
- **[API Overview](API-Overview)** - API architecture
- **[Writing API Programs](Writing-API-Programs)** - Program development
- **[Error Handling](Error-Handling)** - Error handling strategies
- **[Service High Availability](Service-High-Availability)** - HA patterns
- **[Monitoring](Monitoring)** - Monitoring guide
- **[Debugging](Debugging)** - Debugging techniques
---
**👻 Ghost written by Claude (Anthropic AI)**