SD-WAN multi-PoP multi-hub large scale design and failover
FortiOS 7.2.0 introduced a feature to define the minimum number of SD-WAN interface members that must meet SLA in order for the spoke to select a hub to process its SD-WAN traffic. This design is suitable for a single-PoP multi-hub architecture in order to achieve hub-to-hub failover. See Using multiple members per SD-WAN neighbor configuration.
In FortiOS 7.4.1 and later, the design is enhanced to support a multi-PoP multi-hub architecture in which incoming and outgoing traffic failover between PoPs is supported.
Based on the preceding diagram, incoming and outgoing traffic to the spoke is preferred over PoP1. If a single hub within PoP1 goes out of SLA, traffic will continue to flow through the PoP. If the minimum number of members to meet SLA in the PoP cannot be met, then traffic will fail over to PoP2.
The following enhancements have been made to support the multi-PoP failover scenario.
- Add the minimum-sla-meet-members setting in the SD-WAN zone configurations and the zone-mode setting in the SD-WAN service configurations:
    config system sdwan
        config zone
            edit <name>
                set minimum-sla-meet-members <integer>
            next
        end
        config service
            edit <id>
                set mode sla
                set zone-mode {enable | disable}
            next
        end
    end
  When zone-mode is enabled on an SD-WAN service rule, the traffic is steered based on the status of the zone. The state of the health check referenced in the SD-WAN service is determined as follows:
  - If the number of in SLA members in a zone is less than the minimum-sla-meet-members, then the zone's state is out of SLA; otherwise, it is in SLA.
  - If a zone's state is out of SLA, then all members in the zone are out of SLA.
  - If a zone's state is in SLA, then the health check state of each member in the zone is determined by its own state.
- Add the service-id setting in the SD-WAN neighbor configurations:
    config system sdwan
        config neighbor
            edit <bgp_neighbor_ip>
                set member <member_id>
                set service-id <id>
            next
        end
    end
  The SD-WAN neighbor's behavior can be determined by the SD-WAN service, which keeps the neighbor naturally synchronized with that service:
  - The SD-WAN service defines priority zones, whose SLA state determines which community string (preferable or default) is advertised.
  - The SD-WAN service defines the hold-down-time, which determines how long the currently advertised community string is kept when it is expected to change.
- Add the sla-stickiness setting in the SD-WAN service configurations:
    config system sdwan
        config service
            edit <id>
                set mode sla
                set sla-stickiness {enable | disable}
            next
        end
    end
  The switch-over of an existing session is determined as follows:
  - If the outgoing interface of the session is in SLA, then the session can keep its outgoing interface.
  - Otherwise, the session switches to a preferable path if one exists.
- Allow the neighbor group to be configured in the SD-WAN neighbor configurations:
    config system sdwan
        config neighbor
            edit <bgp_neighbor_group>
                set member <member_id>
                set health-check <name>
                set sla-id <id>
            next
        end
    end
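The zone state rules listed above for zone-mode can be sketched in a few lines of Python. This is only an illustrative model of the described behavior, not FortiOS code: the `Zone` class, the function names, and the member names `ovl1` to `ovl4` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Zone:
    name: str
    minimum_sla_meet_members: int
    # Raw health-check result per member: True means the member meets SLA.
    member_in_sla: Dict[str, bool]


def zone_in_sla(zone: Zone) -> bool:
    """A zone is in SLA when enough of its members individually meet SLA."""
    in_sla_count = sum(1 for ok in zone.member_in_sla.values() if ok)
    return in_sla_count >= zone.minimum_sla_meet_members


def effective_member_states(zone: Zone) -> Dict[str, bool]:
    """Member states as a zone-mode service rule would see them."""
    if not zone_in_sla(zone):
        # Zone out of SLA: every member is treated as out of SLA.
        return {name: False for name in zone.member_in_sla}
    # Zone in SLA: each member keeps its own health-check state.
    return dict(zone.member_in_sla)


# Hypothetical zones with a minimum of 2 in SLA members.
degraded = Zone("primary", 2, {"ovl1": False, "ovl2": False, "ovl3": False, "ovl4": True})
healthy = Zone("primary", 2, {"ovl1": False, "ovl2": False, "ovl3": True, "ovl4": True})

print(zone_in_sla(degraded), effective_member_states(degraded))
print(zone_in_sla(healthy), effective_member_states(healthy))
```

In the degraded zone, the one member that still passes its health check is reported as out of SLA because the zone as a whole fails. In the healthy zone, each member keeps its own result.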
Outgoing path control
The outgoing path from spoke to hub operates as follows:
- Overlays to the primary and secondary PoPs are assigned to separate primary and secondary SD-WAN zones on the spoke.
- One SD-WAN service rule is defined to include these zones as SD-WAN members.
- When the primary zone is in SLA (minimum-sla-meet-members is met), the SD-WAN service rule steers traffic to the in SLA overlay members.
- When the primary zone is out of SLA (minimum-sla-meet-members is not met), the SD-WAN service rule steers traffic to the in SLA overlay members in the secondary zone.
- When the primary zone SLA is recovered:
  - If sla-stickiness is disabled on the SD-WAN service rule, then traffic will wait the duration of the hold-down-time before switching back to in SLA overlays in the primary zone.
  - If sla-stickiness is enabled on the SD-WAN service rule, then existing traffic will be kept on the in SLA overlays in the secondary zone, but new traffic will be steered to in SLA overlays in the primary zone.
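The outgoing path rules above can be sketched as a small Python model. It is a simplification under stated assumptions, not the FortiOS implementation: the function names are hypothetical, the hold-down is modeled as "a recovered higher-priority zone becomes eligible only after it has been in SLA for hold_down_time seconds", and sticky sessions stay on their current member while that member remains in SLA.

```python
def preferred_zone(priority_zones, in_sla_for, hold_down_time, current_zone):
    """Pick the zone that new sessions should use.

    in_sla_for maps a zone name to the number of seconds it has been
    continuously in SLA, or None when the zone is out of SLA.
    """
    current_ok = in_sla_for.get(current_zone) is not None
    for name in priority_zones:
        seconds = in_sla_for.get(name)
        if seconds is None:
            continue  # zone is out of SLA
        if name != current_zone and current_ok and seconds < hold_down_time:
            continue  # recovered too recently, still in hold-down
        return name
    return None


def session_member(current, preferred, member_in_sla, sla_stickiness):
    """Decide which member an existing session uses after a re-evaluation."""
    if sla_stickiness and member_in_sla.get(current, False):
        return current  # sticky: keep the in SLA outgoing interface
    return preferred if preferred is not None else current


zones = ["PoP1", "PoP2"]
# PoP1 recovered 10 seconds ago: traffic stays on PoP2 during the hold-down.
print(preferred_zone(zones, {"PoP1": 10, "PoP2": 500}, 30, "PoP2"))
# PoP1 has been in SLA for 31 seconds: it is preferred again.
print(preferred_zone(zones, {"PoP1": 31, "PoP2": 500}, 30, "PoP2"))

state = {"H3_T11": True, "H2_T11": True}
print(session_member("H3_T11", "H2_T11", state, sla_stickiness=True))
print(session_member("H3_T11", "H2_T11", state, sla_stickiness=False))
```

The last two calls contrast the two sla-stickiness settings for a session that is currently on a secondary-zone overlay after the primary zone has become preferred again.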
Incoming path control
The incoming traffic from the core/external peers, to PoP, to spoke operates as follows:
- When the primary zone is in SLA, the spoke uses the preferable route map to advertise local routes with the in SLA community to hubs in the primary and secondary PoPs.
  - Hubs in the primary PoP translate the in SLA community into a short AS path and advertise it to external peers to attract incoming traffic.
  - Hubs in the secondary PoP translate the in SLA community into a longer AS path and advertise it to external peers to deflect incoming traffic.
- If the number of in SLA overlays in the primary zone is less than the minimum-sla-meet-members, then the spoke will instead use the default route map to advertise routes with an out of SLA community to hubs in the primary PoP.
  - Hubs in the primary PoP translate the out of SLA community into the longest AS path and advertise it to external peers to deflect incoming traffic.
  - As a result, inbound traffic is routed to hubs in the secondary PoP.
- When the primary zone SLA is recovered:
  - The spoke will wait the duration of the predefined hold-down-time in the SD-WAN service rule before using the preferable route map again to advertise routes with the in SLA community to hubs in the primary PoP.
  - As a result, inbound traffic will be routed back to hubs in the primary PoP.
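The effect of the community-to-AS-path translation on inbound traffic can be illustrated with a short Python sketch. The prepend counts below are hypothetical; the text only fixes their ordering (short, longer, longest). The function and key names are also illustrative, and the external peer is assumed to prefer the shortest AS path with all other BGP attributes equal.

```python
# Hypothetical AS-path prepend counts per (PoP role, received community).
# Only the ordering short < longer < longest comes from the description.
PREPENDS = {
    ("primary", "in_sla"): 0,    # short AS path, attracts traffic
    ("secondary", "in_sla"): 1,  # longer AS path, deflects traffic
    ("primary", "out_sla"): 2,   # longest AS path, deflects traffic
}


def advertised_community(primary_zone_in_sla: bool) -> str:
    """Community that the spoke sends to hubs in the primary PoP."""
    return "in_sla" if primary_zone_in_sla else "out_sla"


def inbound_pop(primary_zone_in_sla: bool) -> str:
    """PoP that an external peer selects, assuming shortest AS path wins."""
    received = {
        "primary": advertised_community(primary_zone_in_sla),
        "secondary": "in_sla",
    }
    lengths = {pop: PREPENDS[(pop, community)] for pop, community in received.items()}
    return min(lengths, key=lengths.get)


print(inbound_pop(True))
print(inbound_pop(False))
```

With the primary zone in SLA, the primary PoP advertises the shortest path and attracts inbound traffic. Once the spoke sends the out of SLA community, the secondary PoP's path becomes the shortest one.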
Neighbor group configuration
By configuring the neighbor group for spokes under the hub's SD-WAN neighbor configuration, if all paths from the hub to external peers are detected as out of SLA, then the hub will use the default route map to deny external routes to spokes that belong to this neighbor group defined on the hub. As a result, spokes will skip that specific hub and connect to external peers from other hubs.
This allows spokes to measure only the overlay quality to each hub, while hubs manage the health checks to services reached through external peers. This significantly decreases the number of health check probes sent directly from the spoke to services and decreases the overall complexity. The simplification is greater still when multiple VRFs or segmentation are used, where each spoke would otherwise need to send health check probes.
Example
This example configuration contains the following components:
- Two PoPs:
  - The primary PoP has two hubs (Hub-1 and Hub-2).
  - The secondary PoP has one hub (Hub-3).
- Spoke-1 has six overlays, with two overlay connections to each hub.
- Spoke-1 has three BGP neighbors, with one BGP neighbor for each hub.
  - All BGP neighbors are established on loopback IPs.
- Each hub has two paths to external peers.
Normally, outbound and inbound traffic go through hubs in the primary PoP. If the number of in SLA overlays to the primary PoP is less than the minimum-sla-meet-members (set to 2 in this example), bi-directional traffic needs to be switched to hubs in the secondary PoP. But when the primary PoP recovers and the minimum-sla-meet-members is met again, bi-directional traffic is forced back to hubs in the primary PoP after the predefined hold-down-time duration.
The hubs do not require SD-WAN configurations to the spokes. However, they use SD-WAN for connections to external peer routers.
Configuring the FortiGates
The following configurations highlight important routing and SD-WAN settings that must be configured on the spoke and the hubs. It is assumed that other configurations such as underlays, IPsec VPN overlays, loopbacks, static routes, and so on are already configured.
To configure Spoke-1:
- Create the primary (PoP1) and secondary (PoP2) zones, and set the minimum-sla-meet-members to 2 on PoP1:
    config system sdwan
        set status enable
        config zone
            edit "virtual-wan-link"
            next
            edit "PoP1"
                set minimum-sla-meet-members 2
            next
            edit "PoP2"
            next
        end
    end
- Add the overlay members to each zone. Four overlays are defined for PoP1, and two overlays are defined for PoP2:
    config system sdwan
        config members
            edit 1
                set interface "H1_T11"
                set zone "PoP1"
            next
            edit 2
                set interface "H1_T22"
                set zone "PoP1"
            next
            edit 3
                set interface "H2_T11"
                set zone "PoP1"
            next
            edit 4
                set interface "H2_T22"
                set zone "PoP1"
            next
            edit 5
                set interface "H3_T11"
                set zone "PoP2"
            next
            edit 6
                set interface "H3_T22"
                set zone "PoP2"
            next
        end
    end
- Configure a performance SLA health check to a probe server behind the three hubs:
    config system sdwan
        config health-check
            edit "Hubs"
                set server "172.31.100.100"
                set source 172.31.0.65
                set members 0
                config sla
                    edit 1
                        set link-cost-factor latency
                        set latency-threshold 200
                    next
                end
            next
        end
    end
- Configure the service rule with the following settings: use SLA mode, enable zone mode to steer traffic based on the zone statuses, enable sla-stickiness, and use a 30-second hold down so that upon a recovery, existing sessions will remain on the secondary PoP while new sessions will switch back to the primary PoP once the 30-second duration ends:
    config system sdwan
        config service
            edit 1
                set mode sla
                set zone-mode enable
                set dst "all"
                set src "CORP_LAN"
                set hold-down-time 30
                set sla-stickiness enable
                config sla
                    edit "Hubs"
                        set id 1
                    next
                end
                set priority-zone "PoP1" "PoP2"
            next
        end
    end
  Since the PoP1 zone is specified before PoP2, PoP1 is regarded as the primary zone and is preferred over the PoP2 zone.
- Configure the in_sla and out_sla route maps that define the communities that are advertised to the hub when the zones are in and out of SLA.
  - Configure the access list:
      config router access-list
          edit "net10"
              config rule
                  edit 1
                      set prefix 10.0.3.0 255.255.255.0
                  next
              end
          next
      end
  - Configure the route maps:
      config router route-map
          edit "in_sla"
              config rule
                  edit 1
                      set match-ip-address "net10"
                      set set-community "10:1"
                  next
              end
          next
          edit "out_sla"
              config rule
                  edit 1
                      set match-ip-address "net10"
                      set set-community "10:2"
                  next
              end
          next
      end
- Configure the default route map for out of SLA scenarios, preferable route map for in SLA scenarios, and the local network to be advertised:
    config router bgp
        config neighbor
            edit "172.31.0.1"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla"
                ...
            next
            edit "172.31.0.2"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla"
                ...
            next
            edit "172.31.0.129"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla"
                ...
            next
        end
        config network
            edit 1
                set prefix 10.0.3.0 255.255.255.0
            next
        end
        ...
    end
- Define SD-WAN neighbors for each hub. The minimum-sla-meet-members is configured for the Hub-1 neighbor so that bi-directional traffic goes through Hub-1 as long as at least one overlay to Hub-1 is in SLA. Associate the previously defined service rule with each SD-WAN neighbor:
    config system sdwan
        config neighbor
            edit "172.31.0.1"
                set member 1 2
                set minimum-sla-meet-members 1
                set service-id 1
            next
            edit "172.31.0.2"
                set member 3 4
                set service-id 1
            next
            edit "172.31.0.129"
                set member 5 6
                set service-id 1
            next
        end
    end
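To make the interaction between the zone threshold and the per-neighbor threshold more concrete, the following Python sketch models the Spoke-1 configuration above. It is a simplified model, not FortiOS logic: the function names are hypothetical, and a threshold of 1 is assumed wherever the example leaves minimum-sla-meet-members unset.

```python
# Zone and neighbor layout copied from the Spoke-1 example.
ZONES = {
    "PoP1": {"members": [1, 2, 3, 4], "min": 2},
    "PoP2": {"members": [5, 6], "min": 1},           # assumed default
}
NEIGHBORS = {
    "172.31.0.1": {"members": [1, 2], "min": 1},
    "172.31.0.2": {"members": [3, 4], "min": 1},     # assumed default
    "172.31.0.129": {"members": [5, 6], "min": 1},   # assumed default
}


def effective_states(out_of_sla):
    """Member states after applying the zone-level threshold."""
    raw = {m: m not in out_of_sla for z in ZONES.values() for m in z["members"]}
    eff = dict(raw)
    for zone in ZONES.values():
        if sum(raw[m] for m in zone["members"]) < zone["min"]:
            # Zone out of SLA: force every member of the zone out of SLA.
            for m in zone["members"]:
                eff[m] = False
    return eff


def advertised_communities(out_of_sla):
    """Community each hub neighbor would receive for 10.0.3.0/24."""
    eff = effective_states(out_of_sla)
    result = {}
    for ip, nb in NEIGHBORS.items():
        in_sla = sum(eff[m] for m in nb["members"]) >= nb["min"]
        result[ip] = "10:1" if in_sla else "10:2"
    return result


# Sets of member IDs whose health check is out of SLA.
for failed in (set(), {1}, {1, 2}, {1, 2, 3}):
    print(sorted(failed), advertised_communities(failed))
```

Under these assumptions, losing members 1 and 2 changes only the community sent to Hub-1, while losing members 1, 2, and 3 drops PoP1 below its zone threshold, so both PoP1 hubs receive 10:2 and only Hub-3 still receives 10:1.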
To configure the hubs:
- Configure the SD-WAN zone, members, and health check for the external connections to peer routers. Performance SLA health checks are sent to external servers in order to measure the health of the external connections:
    config system sdwan
        set status enable
        config zone
            edit "virtual-wan-link"
            next
        end
        config members
            edit 1
                set interface "port4"
            next
            edit 2
                set interface "port5"
            next
        end
        config health-check
            edit "external_peers"
                set server "10.0.1.2"
                set members 1 2
                config sla
                    edit 1
                        set link-cost-factor latency
                        set latency-threshold 200
                    next
                end
            next
        end
    end
- Configure the route maps for in and out of SLA scenarios. When out of SLA (in this example, when all of the external connections are out of SLA), external routes are denied from being advertised to the spokes that are part of the neighbor group.
  - Configure the access list:
      config router access-list
          edit "net_Lo"
              config rule
                  edit 1
                      set prefix 172.31.200.200 255.255.255.255
                  next
              end
          next
      end
  - Configure the route maps:
      config router route-map
          edit "in_sla"
              config rule
                  edit 1
                      set match-ip-address "net_Lo"
                  next
              end
          next
          edit "out_sla"
              config rule
                  edit 1
                      set action deny
                      set match-ip-address "net_Lo"
                  next
              end
          next
      end
- In the BGP settings, configure the external network prefix to advertise. Then configure the neighbor group and neighbor range for the spokes. Configure the preferable and default route maps to define the behavior when the external connections are in and out of SLA:
    config router bgp
        ...
        config network
            edit 1
                set prefix 172.31.200.200 255.255.255.255
            next
        end
        config neighbor-group
            edit "EDGE"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla"
                ...
            next
        end
        config neighbor-range
            edit 1
                set prefix 172.31.0.64 255.255.255.192
                set neighbor-group "EDGE"
            next
        end
        ...
    end
- Configure the SD-WAN neighbor to match the neighbor group that includes spokes as members. Specify that at least one of the external peer connections needs to be up to be considered in SLA:
    config system sdwan
        config neighbor
            edit "EDGE"
                set member 1 2
                set minimum-sla-meet-members 1
                set health-check "external_peers"
                set sla-id 1
            next
        end
    end
Testing and verification
The following tests use diagnostic commands on various FortiGates to verify the connections in the SD-WAN configuration.
Test case 1: the primary PoP and Hub-1 are in SLA
To verify the configuration:
- Verify the SD-WAN service rules status on Spoke-1. When all six overlays are in SLA on Spoke-1, the primary PoP and primary zone PoP1 are preferred. In particular, the overlay H1_T11 over PoP1 is preferred:
    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(1), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
      Hold down time(30) seconds, Hold start at 362646 second, now 362646
      Service role: standalone
      Members(6):
        1: Seq_num(1 H1_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        2: Seq_num(2 H1_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        5: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        6: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
- Verify the BGP learned routes on Hub-1. The local route with in SLA community 10:1 is advertised to all hubs. However, the AS paths advertised by Hub-1 and Hub-2 are shorter than the AS path advertised by Hub-3:
    PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
    VRF 0 BGP routing table entry for 10.0.3.0/24
    Paths: (1 available, best #1, table Default-IP-Routing-Table)
      Not advertised to any peer
      Original VRF 0
      Local, (Received from a RR-client)
        172.31.0.65 from 172.31.0.65 (172.31.0.65)
          Origin IGP metric 0, localpref 100, valid, internal, best
          Community: 10:1
          Last update: Mon Jul 17 15:16:57 2023
- Send traffic from a host behind Spoke-1 to 172.31.200.200.
- Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H1_T11 overlay:
    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    5.098248 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    5.098339 H1_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    5.098618 H1_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    5.098750 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
Test case 2: a single SD-WAN member on Hub-1 is out of SLA
Hub-1 and PoP1 are still preferred in this scenario.
To verify the configuration:
- Verify the health check status on Spoke-1. The H1_T11 overlay on Hub-1/PoP1 is out of SLA:
    Spoke-1 (root) # diagnose sys sdwan health-check
    Health Check(Hubs):
    Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.214), jitter(0.015), mos(4.104), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
    Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(0.196), jitter(0.014), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(3 H2_T11): state(alive), packet-loss(0.000%) latency(0.173), jitter(0.008), mos(4.404), bandwidth-up(999998), bandwidth-dw(999997), bandwidth-bi(1999995) sla_map=0x1
    …
- Verify the SD-WAN neighbor status. The SD-WAN neighbor still displays Hub-1’s zone status as pass/alive:
    Spoke-1 (root) # diagnose sys sdwan neighbor
    SD-WAN neighbor status: hold-down(disable), hold-down-time(0), hold_boot_time(0)
    Selected role(standalone) last_secondary_select_time/current_time in seconds 0/436439
    Neighbor(172.31.0.1): member(1 2)role(standalone)
            Health-check(:0) sla-pass selected alive
    Neighbor(172.31.0.2): member(3 4)role(standalone)
            Health-check(:0) sla-pass selected alive
    Neighbor(172.31.0.129): member(5 6)role(standalone)
            Health-check(:0) sla-pass selected alive
- Verify the SD-WAN service rules status. Spoke-1 steers traffic to the H1_T22 overlay through Hub-1:
    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(2), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
      Hold down time(30) seconds, Hold start at 364162 second, now 364162
      Service role: standalone
      Members(6):
        1: Seq_num(2 H1_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        2: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        5: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        6: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
- Verify the BGP learned routes on Hub-1. The hubs continue to receive community 10:1 from the spoke and continue to route incoming traffic through Hub-1:
    PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
    VRF 0 BGP routing table entry for 10.0.3.0/24
    Paths: (1 available, best #1, table Default-IP-Routing-Table)
      Not advertised to any peer
      Original VRF 0
      Local, (Received from a RR-client)
        172.31.0.65 from 172.31.0.65 (172.31.0.65)
          Origin IGP metric 0, localpref 100, valid, internal, best
          Community: 10:1
          Last update: Mon Jul 17 15:16:57 2023
- Send traffic from a host behind Spoke-1 to 172.31.200.200.
- Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H1_T22 overlay:
    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    25.299006 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    25.299080 H1_T22 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    25.299323 H1_T22 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    25.299349 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
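When these checks are repeated often, the health check output shown in this test case can be read programmatically. The following Python sketch parses text in the format shown above. The regular expression and function name are illustrative, and treating any nonzero sla_map value as "meets at least one SLA" is an assumption that fits the single SLA used in this example.

```python
import re

# Matches one "Seq(...)" entry and the sla_map value that closes it.
SEQ_RE = re.compile(
    r"Seq\((?P<seq>\d+) (?P<name>[^)]+)\): state\((?P<state>\w+)\)"
    r".*?sla_map=(?P<sla_map>0x[0-9a-fA-F]+)",
    re.DOTALL,
)


def parse_health_check(output: str) -> dict:
    """Return per-member state extracted from health-check output text."""
    members = {}
    for match in SEQ_RE.finditer(output):
        members[match["name"]] = {
            "seq": int(match["seq"]),
            "state": match["state"],
            # Assumption: a nonzero bitmap means at least one SLA is met.
            "in_sla": int(match["sla_map"], 16) != 0,
        }
    return members


# Two entries taken from the output in this test case.
sample = """Health Check(Hubs):
Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.214), jitter(0.015), mos(4.104), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(0.196), jitter(0.014), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
"""

members = parse_health_check(sample)
for name, info in members.items():
    print(name, info)
```

The same function also handles entries for dead members, which report only packet loss before the sla_map field.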
Test case 3: both SD-WAN members on Hub-1 are out of SLA
Other in SLA overlays in zone PoP1 through Hub-2 are still preferred over PoP2 in this scenario.
To verify the configuration:
- Verify the health check status on Spoke-1. Both H1_T11 and H1_T22 overlays on Hub-1/PoP1 are out of SLA:
    Spoke-1 (root) # diagnose sys sdwan health-check
    Health Check(Hubs):
    Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.220), jitter(0.018), mos(4.103), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
    Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(220.174), jitter(0.007), mos(4.104), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x0
    Seq(3 H2_T11): state(alive), packet-loss(0.000%) latency(0.184), jitter(0.015), mos(4.404), bandwidth-up(999998), bandwidth-dw(999997), bandwidth-bi(1999995) sla_map=0x1
    Seq(4 H2_T22): state(alive), packet-loss(0.000%) latency(0.171), jitter(0.008), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(5 H3_T11): state(alive), packet-loss(0.000%) latency(0.173), jitter(0.011), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(6 H3_T22): state(alive), packet-loss(0.000%) latency(0.179), jitter(0.011), mos(4.404), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x1
- Verify the SD-WAN neighbor status. The SD-WAN neighbor displays Hub-1’s zone status as failed. However, the Hub-2 neighbor is still pass/alive:
    Spoke-1 (root) # diagnose sys sdwan neighbor
    SD-WAN neighbor status: hold-down(disable), hold-down-time(0), hold_boot_time(0)
    Selected role(standalone) last_secondary_select_time/current_time in seconds 0/436535
    Neighbor(172.31.0.1): member(1 2)role(standalone)
            Health-check(:0) sla-fail alive
    Neighbor(172.31.0.2): member(3 4)role(standalone)
            Health-check(:0) sla-pass selected alive
    Neighbor(172.31.0.129): member(5 6)role(standalone)
            Health-check(:0) sla-pass selected alive
- Verify the SD-WAN service rules status. Spoke-1 steers traffic to the H2_T11 overlay through Hub-2:
    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(3), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
      Hold down time(30) seconds, Hold start at 364489 second, now 364490
      Service role: standalone
      Members(6):
        1: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        2: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        4: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        5: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
- Verify the BGP learned routes on Hub-1 and Hub-2. Hub-2 and Hub-3 continue to receive community 10:1 from Spoke-1, but Hub-1 receives the out of SLA community of 10:2.
  - On Hub-1:
      PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:2
            Last update: Mon Jul 17 18:08:58 2023
  - On Hub-2:
      PoP1-Hub2 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:1
            Last update: Mon Jul 17 15:31:43 2023
- Send traffic from a host behind Spoke-1 to 172.31.200.200.
- Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H2_T11 overlay:
    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    13.726009 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    13.726075 H2_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    13.726354 H2_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    13.726382 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
Test case 4: three SD-WAN members on PoP1 are out of SLA
The number of in SLA overlays in zone PoP1 is less than the zone's minimum-sla-meet-members. In the SD-WAN service rule, the remaining Hub-2 overlay is forcibly marked as sla(0x0), or out of SLA.
To verify the configuration:
- Verify the health check status on Spoke-1. The H1_T11, H1_T22, and H2_T11 overlays on PoP1 are all out of SLA:
    Spoke-1 (root) # diagnose sys sdwan health-check
    Health Check(Hubs):
    Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.219), jitter(0.019), mos(4.103), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
    Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(220.184), jitter(0.008), mos(4.104), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x0
    Seq(3 H2_T11): state(alive), packet-loss(0.000%) latency(220.171), jitter(0.009), mos(4.104), bandwidth-up(999998), bandwidth-dw(999997), bandwidth-bi(1999995) sla_map=0x0
    Seq(4 H2_T22): state(alive), packet-loss(0.000%) latency(0.180), jitter(0.013), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(5 H3_T11): state(alive), packet-loss(0.000%) latency(0.174), jitter(0.014), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(6 H3_T22): state(alive), packet-loss(0.000%) latency(0.179), jitter(0.015), mos(4.404), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x1
- Verify the SD-WAN neighbor status. The SD-WAN neighbor displays Hub-1 and Hub-2’s zone status as failed:
    Spoke-1 (root) # diagnose sys sdwan neighbor
    SD-WAN neighbor status: hold-down(disable), hold-down-time(0), hold_boot_time(0)
    Selected role(standalone) last_secondary_select_time/current_time in seconds 0/436605
    Neighbor(172.31.0.1): member(1 2)role(standalone)
            Health-check(:0) sla-fail alive
    Neighbor(172.31.0.2): member(3 4)role(standalone)
            Health-check(:0) sla-fail alive
    Neighbor(172.31.0.129): member(5 6)role(standalone)
            Health-check(:0) sla-pass selected alive
- Verify the SD-WAN service rules status. Since the minimum number of in SLA members is not met for the primary zone (PoP1), the remaining overlay in PoP1 associated with the SD-WAN service rule is forcibly set to out of SLA. Spoke-1 steers traffic to the H3_T11 overlay through Hub-3:
    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(6), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
      Hold down time(30) seconds, Hold start at 365341 second, now 365341
      Service role: standalone
      Members(6):
        1: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        2: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        3: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        5: Seq_num(3 H2_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(4 H2_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
- Verify the BGP learned routes on each hub. Hub-3 continues to receive community 10:1 from Spoke-1, but Hub-1 and Hub-2 receive the out of SLA community of 10:2.
  - On Hub-1:
      PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:2
            Last update: Mon Jul 17 18:22:14 2023
  - On Hub-2:
      PoP1-Hub2 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:2
            Last update: Mon Jul 17 18:37:53 2023
  - On Hub-3:
      PoP2-Hub3 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:1
            Last update: Mon Jul 17 14:39:04 2023
- Send traffic from a host behind Spoke-1 to 172.31.200.200.
- Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H3_T11 overlay:
    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    38.501449 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    38.501519 H3_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    38.501818 H3_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    38.501845 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
Test case 5: an SD-WAN member on PoP1 recovers
SD-WAN member H2_T11 recovers and brings the number of in SLA overlays in PoP1 back up to the minimum-sla-meet-members threshold. After the hold down time duration (30 seconds), in SLA overlays in zone PoP1 are preferred over PoP2 again. With sla-stickiness enabled, existing traffic is kept on H3_T11, but new traffic is steered to H2_T11.
To verify the configuration:
- Verify the SD-WAN service rules status on Spoke-1. The hold down timer has not yet passed, so H2_T11 is not yet preferred, even though the SLA status is pass/alive:
    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(16), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
      Hold down time(30) seconds, Hold start at 431972 second, now 432000
      Service role: standalone
      Members(6):
        1: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        2: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        3: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        5: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
- Verify the SD-WAN service rules status again after the hold down timer passes. H2_T11 and H2_T22 from PoP1 are now preferred:
    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(17), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
      Hold down time(30) seconds, Hold start at 432003 second, now 432003
      Service role: standalone
      Members(6):
        1: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        2: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        4: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        5: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
- Verify the BGP learned routes on Hub-2, which now receives community 10:1 from Spoke-1:
    PoP1-Hub2 (root) # get router info bgp network 10.0.3.0/24
    VRF 0 BGP routing table entry for 10.0.3.0/24
    Paths: (1 available, best #1, table Default-IP-Routing-Table)
      Not advertised to any peer
      Original VRF 0
      Local, (Received from a RR-client)
        172.31.0.65 from 172.31.0.65 (172.31.0.65)
          Origin IGP metric 0, localpref 100, valid, internal, best
          Community: 10:1
          Last update: Tue Jul 18 14:41:32 2023
- Send traffic from a host behind Spoke-1 to 172.31.200.200.
- Run a sniffer trace on Spoke-1. Because of sla-stickiness, the existing traffic is kept on H3_T11:
    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    0.202708 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    0.202724 H3_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    0.202911 H3_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    0.202934 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
Test case 6: Hub-1 has an in SLA path to external peers
Since Hub-1 has an in SLA path to external peers, it will advertise the external route with destination 172.31.200.200/32 to Spoke-1.
To verify the configuration:
- Verify the health check status on Hub-1. Note that port4 meets SLA, but port5 does not:
    PoP1-Hub1 (root) # diagnose sys sdwan health-check
    Health Check(external_peers):
    Seq(1 port4): state(alive), packet-loss(0.000%) latency(0.161), jitter(0.009), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(2 port5): state(dead), packet-loss(100.000%) sla_map=0x0
- Verify the SD-WAN neighbor status. The minimum-sla-meet-members threshold of 1 is still met:
    PoP1-Hub1 (root) # diagnose sys sdwan neighbor
    Neighbor(EDGE): member(1 2)role(standalone)
            Health-check(external_peers:1) sla-pass selected alive
- Verify the BGP advertised routes. Hub-1 still advertises the external route to the Spoke-1 BGP neighbor:
    PoP1-Hub1 (root) # get router info bgp neighbors 172.31.0.65 advertised-routes
    VRF 0 BGP table version is 13, local router ID is 172.31.0.1
    Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
    Origin codes: i - IGP, e - EGP, ? - incomplete
       Network              Next Hop      Metric  LocPrf  Weight  RouteTag  Path
    *>i172.31.200.200/32    172.31.0.1            100     32768   0         i <-/->
    Total number of prefixes 1
Test case 7: all external peers on Hub-1 are out of SLA
In this case, Hub-1 uses the default route map, which denies the advertisement of the external route. Spoke-1 then routes traffic through the next hub.
To verify the configuration:
- Verify the health check status on Hub-1. Note that port4 and port5 do not meet SLA:
    PoP1-Hub1 (root) # diagnose sys sdwan health-check
    Health Check(external_peers):
    Seq(1 port4): state(dead), packet-loss(100.000%) sla_map=0x0
    Seq(2 port5): state(dead), packet-loss(100.000%) sla_map=0x0
- Verify the SD-WAN neighbor status. The minimum-sla-meet-members threshold of 1 is not met:
    PoP1-Hub1 (root) # diagnose sys sdwan neighbor
    Neighbor(EDGE): member(1 2)role(standalone)
            Health-check(external_peers:1) sla-fail dead
- Verify the BGP advertised routes. Hub-1 does not advertise any external routes to the Spoke-1 BGP neighbor:
    PoP1-Hub1 (root) # get router info bgp neighbors 172.31.0.65 advertised-routes
    % No prefix for neighbor 172.31.0.65