
Conversation

@krishna-samy
Contributor

Problem statement:

 - Under large route churn with identical ECMP sets, Zebra spends excessive CPU
   in nexthop_active_update during rib_process.
 - This is caused by repeated, identical NH active checks, NHE hash lookups and
   hash entry transitions during the resolution process.
 - With the current code, each incoming route entry in a burst goes through
   all of the above processing individually.
 - This can be optimized for more efficient processing during route churn.

Fix:

 - Introduce the new fields below to cache the resolved NHE ID for each incoming NHE received from the protocol.
        struct route_entry {
        ...
        struct nhg_hash_entry *nhe_received;
        ...
        };

 - On the received/unresolved NHE:
        struct nhg_hash_entry {
        ...
        uint32_t resolved_nhe_id; // Cached resolved NHE ID (0 = not cached)
        uint32_t cache_gen_num; // Validation stamp for cache
        };

 - 're->nhe_received' stores the NH set received from the protocol.
 - 'nhe_received->resolved_nhe_id' stores the resolved NHE ID in the slow path
   and is used for the lookup on the fast path.
 - 'global_nh_epoch' tracks system-wide events so that cached NHEs can be invalidated.
 - On the fast path, if cache_gen_num matches global_nh_epoch, the cached resolved
   NHG is adopted directly and the heavy resolution is skipped (see the sketch
   after this list).
 - global_nh_epoch is incremented on route-map changes, interface up/down/address
   events and MPLS label updates to invalidate the cache.
 - The validity of any cached 'resolved_nhe_id' is therefore determined by the
   equality check 'cache_gen_num == global_nh_epoch'.
 - Special cases such as labels, route-maps and self-pointing NHs skip the caching.
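A minimal, self-contained sketch of the fast-path check described above. The type and
function names here (nhg_hash_entry_sketch, nh_cache_lookup, nh_cache_store) are
illustrative stand-ins rather than the actual zebra structures or APIs; only the
epoch-validated caching pattern is what the change relies on.

        #include <stdint.h>

        /* Bumped on route-map changes, interface events and label updates. */
        static uint32_t global_nh_epoch;

        struct nhg_hash_entry_sketch {
                uint32_t id;
                uint32_t resolved_nhe_id; /* cached resolved NHE ID (0 = not cached) */
                uint32_t cache_gen_num;   /* value of global_nh_epoch when cached */
        };

        /* Fast path: return the cached resolved NHE ID if the cache is still
         * valid, or 0 to force the regular slow-path resolution. */
        static uint32_t nh_cache_lookup(const struct nhg_hash_entry_sketch *received)
        {
                if (received->resolved_nhe_id == 0)
                        return 0; /* nothing cached yet */
                if (received->cache_gen_num != global_nh_epoch)
                        return 0; /* stale: an invalidation event happened since */
                return received->resolved_nhe_id;
        }

        /* Slow path: after full resolution, remember the result so the next
         * identical received NHE in the burst can take the fast path. */
        static void nh_cache_store(struct nhg_hash_entry_sketch *received,
                                   uint32_t resolved_id)
        {
                received->resolved_nhe_id = resolved_id;
                received->cache_gen_num = global_nh_epoch;
        }

Because validity is a single integer comparison, invalidating every cached entry is
just one increment of global_nh_epoch; no per-NHE walk is needed.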

Result:

Overall ~30% improvement in zebra CPU time during route churn with identical ECMP sets; the work_queue_run event rows are shown below.
Before fix:
0 14684.091 1317 11149 25199 11185 25841 0 0 0 TE work_queue_run
After fix:
0 7307.182 1166 6266 18044 6292 18051 0 0 0 TE work_queue_run

@frrbot added labels: tests (Topotests, make check, etc), zebra on Nov 13, 2025
@krishna-samy
Contributor Author

Some of the test results:

r1# sh ip route 33.1.1.1/32 nexthop-group
Routing entry for 33.1.1.1/32
  Known via "bgp", distance 20, metric 0, best
  Last update 00:00:16 ago
  Flags: Selected
  Status: Installed
  Nexthop Group ID: 30 									>>> candidate nhe/installed_nhe
  Received Nexthop Group ID: 25 						>>> nhe_received from protocol
  * 10.0.1.2, via r1-eth0, weight 1
  * 10.0.2.2, via r1-eth1, weight 1
  * 10.0.3.2, via r1-eth2, weight 1
  * 10.0.4.2, via r1-eth3, weight 1

r1#
r1# sh nexthop-group rib 25
ID: 25 (zebra)
     RefCnt: 1
     Cache: cached nhe 30, gen_id 14, global_count 14 		>>> cache
     Uptime: 00:00:25
     VRF: default(No AFI)
     Nexthop Count: 4
     Flags: 0x400
     Depends: (26) (27) (28) (29)
        via 10.0.1.2 (vrf default) inactive, weight 1
        via 10.0.2.2 (vrf default) inactive, weight 1
        via 10.0.3.2 (vrf default) inactive, weight 1
        via 10.0.4.2 (vrf default) inactive, weight 1
r1#
r1# sh nexthop-group rib 30
ID: 30 (zebra)
     RefCnt: 1
     Uptime: 00:00:31
     VRF: default(No AFI)
     Nexthop Count: 4
     Flags: 0x3
     Valid, Installed
     Depends: (31) (32) (33) (34)
        via 10.0.1.2, r1-eth0 (vrf default), weight 1
        via 10.0.2.2, r1-eth1 (vrf default), weight 1
        via 10.0.3.2, r1-eth2 (vrf default), weight 1
        via 10.0.4.2, r1-eth3 (vrf default), weight 1

r1# sh ip route 34.1.1.1/32 nexthop-group
Routing entry for 34.1.1.1/32
  Known via "bgp", distance 20, metric 0, best
  Last update 00:00:43 ago
  Flags: Recursion Selected
  Status: Installed
  Nexthop Group ID: 36 									>>> candidate nhe
  Installed Nexthop Group ID: 14 						>>> installed nhe
  Received Nexthop Group ID: 35 						>>> received nhe from protocol
    3.3.3.3 (recursive), weight 1
  *   10.0.5.2, via r1-eth4, weight 1

r1#
r1# sh nexthop-group rib 35
ID: 35 (zebra)
     RefCnt: 1 
     Cache: cached nhe 36, gen_id 14, global_count 14  	>>> cache
     Uptime: 00:00:52
     VRF: default(IPv4)
     Nexthop Count: 1
     Flags: 0x400
        via 3.3.3.3 (vrf default) inactive, weight 1
r1#
r1# sh nexthop-group rib 36
ID: 36 (zebra)
     RefCnt: 1
     Uptime: 00:00:57
     VRF: default(IPv4)
     Nexthop Count: 1
     Flags: 0x9
     Valid, Recursive
     Depends: (14)
        via 3.3.3.3 (vrf default) (recursive), weight 1
           via 10.0.5.2, r1-eth4 (vrf default), weight 1
r1#
r1# sh nexthop-group rib 14
ID: 14 (zebra)
     RefCnt: 3
     Uptime: 00:01:01
     VRF: default(IPv4)
     Nexthop Count: 1
     Flags: 0x3
     Valid, Installed
     Interface Index: 9
        via 10.0.5.2, r1-eth4 (vrf default), weight 1
     Dependents: (36)
r1#


scale test results - 10k routes with 512 ECMP:

without fix:

Event statistics for zebra:

Showing statistics for pthread default
--------------------------------------
                               CPU (user+system): Real (wall-clock):
Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  CPU_Warn Wall_Warn Starv_Warn   Type  Event
    1          0.362        18       20       121       20       122         0         0          0  R      msg_conn_read
    1          2.673       399        6        29        8        47         0         0          0    T    wheel_timer_thread
    0        527.147      2968      177     64272      197     64274         0         0          0     E   rib_process_dplane_results
    1          1.328        16       83       409       84       410         0         0          0  R      vtysh_accept
    1          0.969         9      107       151      113       173         0         0          0  R      zserv_accept
    1         19.332       828       23       230       23       230         0         0          0  R      kernel_read
    0          7.281       521       13       106       14       106         0         0          0    T    timer_walk_continue
    0          0.018         4        4        10        5        10         0         0          0   W     vtysh_write
    0          2.306       520        4        63        4        66         0         0          0    T    if_zebra_speed_update
    0      14684.091      1317    11149     25199    11185     25841         0         0          0    TE   work_queue_run
    0          0.003         1        3         3        4         4         0         0          0    T    rib_sweep_route
    0         42.496        12     3541     27457     6794     44307         0         0          0     E   msg_conn_proc_msgs
    0          6.896        51      135      1885      143      1886         0         0          0  R      vtysh_read
    0          0.167         2       83       101       84       101         0         0          0    T    timer_walk_start
    0          0.010         1       10        10       11        11         0         0          0    T    zebra_evpn_mh_startup_delay_exp_cb
    0          0.062         1       62        62       64        64         0         0          0     E   msg_client_connect_timer
    0       8436.697      2356     3580     11176     3619     11204         0         0          0     E   zserv_process_messages
    0          0.013         1       13        13       13        13         0         0          0     E   frr_config_read_in
    0          1.017       274        3        60        8        61         0         0          0   W     msg_conn_write

Total Event statistics
-------------------------
                               CPU (user+system): Real (wall-clock):
Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  CPU_Warn Wall_Warn Starv_Warn   Type  Event
   15      24470.230     36800      664     64272      676     64274         0         0          0  RWTEX  TOTAL

with fix:

Showing statistics for pthread default
--------------------------------------
                               CPU (user+system): Real (wall-clock):
Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  CPU_Warn Wall_Warn Starv_Warn   Type  Event
    0          1.627       274        5       106        6       106         0         0          0   W     msg_conn_write
    0          0.002         1        2         2        4         4         0         0          0    T    zebra_evpn_mh_startup_delay_exp_cb
    0          0.004         1        4         4        4         4         0         0          0    T    rib_sweep_route
    0          0.020         1       20        20       21        21         0         0          0     E   frr_config_read_in
    1         18.800       856       21       202       23       394         0         0          0  R      kernel_read
    0          7.152       521       13       138       14       281         0         0          0    T    timer_walk_continue
    0         36.848        12     3070     23226     6560     54172         0         0          0     E   msg_conn_proc_msgs
    1          1.843       255        7        33        8        34         0         0          0    T    wheel_timer_thread
    0       7307.182      1166     6266     18044     6292     18051         0         0          0    TE   work_queue_run 									>>> reduced by 50%
    0          0.184         2       92       125       92       125         0         0          0    T    timer_walk_start
    1          0.129        18        7        24        7        24         0         0          0  R      msg_conn_read
    0          1.432        42       34       224       34       226         0         0          0  R      vtysh_read
    0        467.711      2714      172     38401      189     48303         0         0          0     E   rib_process_dplane_results
    0          0.092         1       92        92       95        95         0         0          0     E   msg_client_connect_timer
    0       8477.756      2197     3858     14297     3890     14296         0         0          0     E   zserv_process_messages
    0          1.902       520        3        63        4        65         0         0          0    T    if_zebra_speed_update
    1          0.821         9       91       124       97       124         0         0          0  R      zserv_accept
    1          0.466        13       35       124       36       125         0         0          0  R      vtysh_accept

Total Event statistics
-------------------------
                               CPU (user+system): Real (wall-clock):
Active   Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  CPU_Warn Wall_Warn Starv_Warn   Type  Event
   15      16940.404     35841      472     38401      482     54172         0         0          0  RWTEX  TOTAL 											>>> overall reduction by ~30%

@krishna-samy force-pushed the krishna/nh_active_optimization branch 6 times, most recently from 03d8d7c to 447be1a on November 17, 2025 14:37
The original implementation of nhe_received was meant to store the
received NHs from protocols as-is during early route processing.
However, route_entry_update_nhe() was incorrectly overwriting
re->nhe_received during NH resolution updates.

This caused the received NHE to be replaced with the resolved NHE,
defeating the purpose of tracking the original nexthops sent by the
protocol. Fix this so that re->nhe_received is preserved across
resolution updates (a sketch of the intended behavior follows below).

Signed-off-by: Krishnasamy <[email protected]>
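A rough sketch of the separation this commit restores, with simplified stand-in types
and hypothetical helper names (route_entry_set_received, route_entry_update_resolved);
the real route_entry_update_nhe() code path differs. Only the rule being illustrated is
taken from the commit message: nhe_received is set once from the protocol and left
untouched by resolution updates.

        struct nhg_hash_entry_sketch; /* opaque here; see the earlier sketch */

        struct route_entry_sketch {
                struct nhg_hash_entry_sketch *nhe;          /* current/resolved NHE */
                struct nhg_hash_entry_sketch *nhe_received; /* as sent by the protocol */
        };

        /* Called when the protocol (e.g. BGP) first hands zebra the route. */
        static void route_entry_set_received(struct route_entry_sketch *re,
                                             struct nhg_hash_entry_sketch *from_proto)
        {
                re->nhe_received = from_proto; /* preserved for the cache lookup */
                re->nhe = from_proto;
        }

        /* Called during NH resolution: only the working NHE pointer changes. */
        static void route_entry_update_resolved(struct route_entry_sketch *re,
                                                struct nhg_hash_entry_sketch *resolved)
        {
                re->nhe = resolved;
                /* re->nhe_received is intentionally left untouched here */
        }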
Problem statement:
 - Under large route churn with identical ECMP sets, Zebra spends excessive CPU
   in nexthop_active_update during rib_process.
 - This is caused by repeated, identical NH active checks, NHE hash lookups and
   hash entry transitions during the resolution process.
 - With the current code, each incoming route entry in a burst goes through
   all of the above processing individually.
 - This can be optimized for more efficient processing during route churn.

Fix:
 - Introduce the new fields below to cache the resolved NHE ID for each incoming NHE received from the protocol.
	struct route_entry {
	...
	struct nhg_hash_entry *nhe_received;
	...
	};

 - On the received/unresolved NHE:
	struct nhg_hash_entry {
	...
	uint32_t resolved_nhe_id; // Cached resolved NHE ID (0 = not cached)
	uint32_t cache_gen_num; // Validation stamp for cache
	};

 - 're->nhe_received' stores the NH set received from the protocol.
 - 'nhe_received->resolved_nhe_id' stores the resolved NHE ID in the slow path
   and is used for the lookup on the fast path.
 - 'global_nh_epoch' tracks system-wide events so that cached NHEs can be invalidated.
 - On the fast path, if cache_gen_num matches global_nh_epoch, the cached resolved
   NHG is adopted directly and the heavy resolution is skipped.
 - global_nh_epoch is incremented on route-map changes, interface up/down/address
   events and label updates to invalidate the cache (a sketch of this invalidation
   follows the commit message).
 - The validity of any cached 'resolved_nhe_id' is therefore determined by the
   equality check 'cache_gen_num == global_nh_epoch'.
 - Special cases such as labels, route-maps and self-pointing NHs skip the caching.

Signed-off-by: Krishnasamy <[email protected]>
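A small sketch of the invalidation side referenced above. The handler names
(on_route_map_change, on_interface_event, on_label_update) are hypothetical call
sites, not the actual zebra hooks; the point is only that a single counter increment
invalidates every cached resolution at once.

        #include <stdint.h>

        static uint32_t global_nh_epoch;

        /* One shared invalidation primitive.  Any NHE whose cache_gen_num no
         * longer equals global_nh_epoch fails the fast-path check and falls
         * back to full resolution, so no per-entry walk is required. */
        static void nh_cache_invalidate_all(void)
        {
                global_nh_epoch++;
        }

        /* Hypothetical call sites for the events listed in the commit message. */
        static void on_route_map_change(void) { nh_cache_invalidate_all(); }
        static void on_interface_event(void)  { nh_cache_invalidate_all(); }
        static void on_label_update(void)     { nh_cache_invalidate_all(); }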
@krishna-samy force-pushed the krishna/nh_active_optimization branch from 447be1a to c4c7153 on November 18, 2025 14:28
@krishna-samy
Contributor Author

ci:rerun
