Administration Guide

SD-WAN multi-PoP multi-hub large scale design and failover

FortiOS 7.2.0 introduced a feature to define the minimum number of SD-WAN interface members that must meet SLA in order for the spoke to select a hub to process its SD-WAN traffic. This design is suitable for a single-PoP multi-hub architecture in order to achieve hub-to-hub failover. See Using multiple members per SD-WAN neighbor configuration.

In FortiOS 7.4.1 and later, the design is enhanced to support a multi-PoP multi-hub architecture in which incoming and outgoing traffic failover between PoPs is supported.

Based on the preceding diagram, incoming and outgoing traffic to the spoke is preferred over PoP1. If a single hub within PoP1 goes out of SLA, traffic will continue to flow through the PoP. If the minimum number of members to meet SLA in the PoP cannot be met, then traffic will fail over to PoP2.

The following enhancements have been made to support the multi-PoP failover scenario.

  • Add minimum-sla-meet-members setting in the SD-WAN zone configurations and zone-mode setting in the SD-WAN service configurations:

    config system sdwan
        config zone
            edit <name>
                set minimum-sla-meet-members <integer> 
            next
        end
        config service
            edit <id>
                set mode sla
                set zone-mode {enable | disable}
            next
        end
    end

    When zone-mode is enabled on an SD-WAN service rule, the traffic is steered based on the status of the zone.

    The state of the health check referenced in the SD-WAN service can be defined as follows:

    • If the number of in SLA members in a zone is less than the minimum-sla-meet-members, then the zone's state is out of SLA; otherwise, it is in SLA.

    • If a zone's state is out of SLA, then all members in the zone are out of SLA.

    • If a zone's state is in SLA, then the health check's state of individual members in the zone is determined by its own state.

  • Add service-id setting in the SD-WAN neighbor configurations:

    config system sdwan
        config neighbor
            edit <bgp_neighbor_ip>
                set member <member_id>
                set service-id <id>
            next
        end
    end

    The SD-WAN neighbor's behavior is determined by the referenced SD-WAN service, so the neighbor stays synchronized with that service:

    • The SD-WAN service defines the priority zones, whose SLA state determines whether the preferable community string is advertised.

    • The SD-WAN service defines the hold-down-time, which determines how long the currently advertised community string is kept when it is due to change.

  • Add sla-stickiness setting in the SD-WAN service configurations:

    config system sdwan
        config service
            edit <id>
                set mode sla
                set sla-stickiness {enable | disable}
            next
        end
    end

    The switch-over of an existing session is determined as follows:

    • If the outgoing interface of the session is in SLA, then the session can keep its outgoing interface.

    • Otherwise, the session switches to a preferable path if one exists.

  • Allow the neighbor group to be configured in the SD-WAN neighbor configurations:

    config system sdwan
        config neighbor
            edit <bgp_neighbor_group>
                set member <member_id>
                set health-check <name>
                set sla-id <id>
            next
        end
    end
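
The zone state rules listed under the zone-mode setting above can be sketched in a few lines of Python. This is only an illustrative model of those three rules, not FortiOS code: the function names and the dictionary layout are assumptions made for this sketch, and the overlay names are borrowed from the example later in this topic.

```python
def zone_in_sla(member_sla, minimum_sla_meet_members):
    """Return True if enough members pass SLA for the zone to be in SLA."""
    in_sla_count = sum(1 for passed in member_sla.values() if passed)
    return in_sla_count >= minimum_sla_meet_members


def effective_member_sla(member_sla, minimum_sla_meet_members):
    """Combine the zone state with each member's own health-check state."""
    if not zone_in_sla(member_sla, minimum_sla_meet_members):
        # Zone is out of SLA: every member is treated as out of SLA.
        return {name: False for name in member_sla}
    # Zone is in SLA: each member keeps its own state.
    return dict(member_sla)


# Hypothetical input: only one of four overlays passes its health check.
pop1 = {"H1_T11": False, "H1_T22": False, "H2_T11": False, "H2_T22": True}
print(effective_member_sla(pop1, 2))
```

With a minimum of 2, the single passing overlay is also reported as out of SLA, which is the behavior that lets a service rule move on to the next zone.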

Outgoing path control

The outgoing path from spoke to hub operates as follows:

  1. Overlays to the primary and secondary PoP are assigned separately into an SD-WAN primary and secondary zone on the spoke.

  2. One SD-WAN service rule is defined to include these zones as SD-WAN members.

  3. When the primary zone is in SLA (minimum-sla-meet-members is met), the SD-WAN service rule steers traffic to the in SLA overlay members.

  4. When the primary zone is out of SLA (minimum-sla-meet-members is not met), the SD-WAN service rule steers traffic to the in SLA overlay members in the secondary zone.

  5. When the primary zone SLA is recovered:

    1. If sla-stickiness is disabled on the SD-WAN service rule, then traffic will wait the duration of the hold-down-time before switching back to in SLA overlays in the primary zone.

    2. If sla-stickiness is enabled on the SD-WAN service rule, then existing traffic will be kept on the in SLA overlays in the secondary zone, but new traffic will be steered to in SLA overlays in the primary zone.
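
The outgoing path steps above can be modeled roughly as follows. This is a sketch under stated assumptions, not FortiOS logic: the function names are invented for illustration, a zone without an explicit minimum-sla-meet-members is assumed to need one in SLA member, and the hold-down-time wait is not modeled (the second function describes the decision once that wait is over).

```python
from typing import Dict, List, Optional, Set


def steering_zone(priority_zones: List[str],
                  zone_members: Dict[str, List[str]],
                  zone_minimum: Dict[str, int],
                  in_sla: Set[str]) -> Optional[str]:
    """Return the first zone, in priority order, that meets its minimum."""
    for zone in priority_zones:
        passing = [m for m in zone_members[zone] if m in in_sla]
        # Assumption: a zone with no configured minimum needs one member.
        if len(passing) >= zone_minimum.get(zone, 1):
            return zone
    return None  # No zone is in SLA; that case is outside this sketch.


def zone_for_session(session_zone: Optional[str],
                     session_zone_in_sla: bool,
                     preferred_zone: str,
                     sla_stickiness: bool) -> str:
    """Pick the zone for a session; session_zone is None for new traffic."""
    if session_zone is not None and sla_stickiness and session_zone_in_sla:
        return session_zone  # Existing session keeps its in SLA path.
    return preferred_zone


ZONES = {"PoP1": ["H1_T11", "H1_T22", "H2_T11", "H2_T22"],
         "PoP2": ["H3_T11", "H3_T22"]}
MINIMUM = {"PoP1": 2}

# Only one PoP1 overlay is in SLA, so the rule falls through to PoP2.
print(steering_zone(["PoP1", "PoP2"], ZONES, MINIMUM,
                    {"H2_T22", "H3_T11", "H3_T22"}))           # PoP2
# After recovery, a sticky existing session stays on PoP2.
print(zone_for_session("PoP2", True, "PoP1", sla_stickiness=True))  # PoP2
```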

Incoming path control

The incoming traffic from the core/external peers, to PoP, to spoke operates as follows:

  1. When the primary zone is in SLA, the spoke uses the preferable route map to advertise local routes with the in SLA community to hubs in the primary and secondary PoPs.

    1. Hubs in the primary PoP translate the in SLA community into a short AS path and advertise it to external peers to attract incoming traffic.

    2. Hubs in the secondary PoP translate the in SLA community into a longer AS path and advertise it to external peers to deflect incoming traffic.

  2. If the number of in SLA overlays in the primary zone is less than the minimum-sla-meet-members, then the spoke will instead use the default route map to advertise routes with an out of SLA community to hubs in the primary PoP.

    1. Hubs in the primary PoP translate the out of SLA community into the longest AS path, and advertise it to external peers to deflect incoming traffic.

    2. As a result, inbound traffic is routed to hubs in the secondary PoP.

  3. When the primary zone SLA is recovered:

    1. The spoke will wait the duration of the predefined hold-down-time in the SD-WAN service rule before using the preferable route map again to advertise routes with the in SLA community to hubs in the primary PoP.

    2. As a result, inbound traffic will be routed back to hubs in the primary PoP.
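
The effect of the AS path translation on external peers can be illustrated with a small Python model. The prepend counts below are hypothetical; only their relative order (short, longer, longest) comes from the steps above. The sketch also assumes that AS path length is the deciding attribute for the external peer, with all earlier BGP tie-breakers equal.

```python
# Hypothetical prepend counts; only their relative order reflects the text.
PREPENDS = {
    ("primary", "in_sla"): 0,    # short AS path: attracts inbound traffic
    ("secondary", "in_sla"): 1,  # longer AS path: deflects inbound traffic
    ("primary", "out_sla"): 2,   # longest AS path: deflects inbound traffic
}


def preferred_pop(communities):
    """communities maps a PoP role to the community its hubs received."""
    return min(communities,
               key=lambda role: PREPENDS[(role, communities[role])])


print(preferred_pop({"primary": "in_sla", "secondary": "in_sla"}))   # primary
print(preferred_pop({"primary": "out_sla", "secondary": "in_sla"}))  # secondary
```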

Neighbor group configuration

When the neighbor group for spokes is configured under the hub's SD-WAN neighbor configuration and all paths from the hub to external peers are detected as out of SLA, the hub uses the default route map to deny external routes to the spokes that belong to this neighbor group. As a result, spokes will skip that specific hub and connect to external peers from other hubs.

This allows spokes to only measure overlay quality to each hub, and hubs to manage health checks to services by external peers. This significantly decreases the number of health check probes directly from the spoke to services and decreases the overall complexity. The simplification is greater still in deployments that use multiple VRFs or segmentation, where each spoke would otherwise need to send its own health check probes.

Example

This example configuration contains the following components:

  • Two PoPs:

    • The primary PoP has two hubs (Hub-1 and Hub-2).

    • The secondary PoP has one hub (Hub-3).

  • Spoke-1 has six overlays, with two overlay connections to each hub.

  • Spoke-1 has three BGP neighbors, with one BGP neighbor for each hub.

    • All BGP neighbors are established on loopback IPs.

  • Each hub has two paths to external peers.

Normally, outbound and inbound traffic go through hubs in the primary PoP. If the number of in SLA overlays to the primary PoP is less than the minimum-sla-meet-members (set to 2 in this example), bi-directional traffic needs to be switched to hubs in the secondary PoP. But when the primary PoP recovers and the minimum-sla-meet-members is met again, bi-directional traffic is forced back to hubs in the primary PoP after the predefined hold-down-time duration.

The hubs do not require SD-WAN configurations to the spokes. However, they use SD-WAN for connections to external peer routers.

Configuring the FortiGates

The following configurations highlight important routing and SD-WAN settings that must be configured on the spoke and the hubs. It is assumed that other configurations such as underlays, IPsec VPN overlays, loopbacks, static routes, and so on are already configured.

To configure Spoke-1:
  1. Create the primary (PoP1) and secondary (PoP2) zones, and set the minimum-sla-meet-members to 2 on PoP1:

    config system sdwan
        set status enable
        config zone
            edit "virtual-wan-link"
            next
            edit "PoP1"
                set minimum-sla-meet-members 2 
            next
            edit "PoP2"
            next
        end
    end
  2. Add the overlay members to each zone. Four overlays are defined for PoP1, and two overlays are defined for PoP2:

    config system sdwan
        config members
            edit 1
                set interface "H1_T11"
                set zone "PoP1"
            next
            edit 2
                set interface "H1_T22"
                set zone "PoP1"
            next
            edit 3
                set interface "H2_T11"
                set zone "PoP1"
            next
            edit 4
                set interface "H2_T22"
                set zone "PoP1"
            next
            edit 5
                set interface "H3_T11"
                set zone "PoP2"
            next
            edit 6
                set interface "H3_T22"
                set zone "PoP2"
            next
        end
    end
  3. Configure a performance SLA health check to a probe server behind the three hubs:

    config system sdwan
        config health-check
            edit "Hubs"
                set server "172.31.100.100"
                set source 172.31.0.65
                set members 0
                config sla
                    edit 1
                        set link-cost-factor latency
                        set latency-threshold 200
                    next
                end
            next
        end
    end
  4. Configure the service rule to use SLA mode, enable zone mode to steer traffic based on the zone statuses, enable sla-stickiness, and set a 30-second hold-down time. Upon a recovery, existing sessions will remain on the secondary PoP, while new sessions will switch back to the primary PoP once the 30-second duration ends:

    config system sdwan
        config service
            edit 1
                set mode sla
                set zone-mode enable
                set dst "all"
                set src "CORP_LAN"                                           
                set hold-down-time 30
                set sla-stickiness enable
                config sla
                    edit "Hubs"
                        set id 1
                    next
                end
                set priority-zone "PoP1" "PoP2"
            next
        end
    end

    Since the PoP1 zone is specified before PoP2, PoP1 is regarded as the primary and preferred over the PoP2 zone.

  5. Configure the in_sla and out_sla route maps that define the communities that are advertised to the hub when the zones are in and out of SLA.

    1. Configure the access list:

      config router access-list
          edit "net10"
              config rule
                  edit 1
                      set prefix 10.0.3.0 255.255.255.0
                  next
              end
          next
      end
    2. Configure the route maps:

      config router route-map
          edit "in_sla"
              config rule
                  edit 1
                      set match-ip-address "net10"
                      set set-community "10:1" 
                  next
              end
          next
          edit "out_sla"
              config rule
                  edit 1
                      set match-ip-address "net10"
                      set set-community "10:2" 
                  next
              end
          next
      end
  6. Configure the default route map for out of SLA scenarios, preferable route map for in SLA scenarios, and the local network to be advertised:

    config router bgp
        config neighbor
            edit "172.31.0.1"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla"
                ...
            next
            edit "172.31.0.2"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla"
                ...
            next
            edit "172.31.0.129"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla"
                ...
            next
        end
        config network
            edit 1
                set prefix 10.0.3.0 255.255.255.0
            next
        end
        ...
    end
  7. Define SD-WAN neighbors for each hub. The minimum-sla-meet-members is set to 1 for the Hub-1 neighbor so that bi-directional traffic goes through Hub-1 as long as at least one overlay to Hub-1 is in SLA. Associate the previously defined service rule to each SD-WAN neighbor:

    config system sdwan
        config neighbor
            edit "172.31.0.1"
                set member 1 2
                set minimum-sla-meet-members 1
                set service-id 1
            next
            edit "172.31.0.2"
                set member 3 4
                set service-id 1
            next
            edit "172.31.0.129"
                set member 5 6
                set service-id 1
            next
        end
    end
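
As a cross-check of the spoke configuration above, the following Python sketch models which route map Spoke-1 would apply toward each hub for a given set of in SLA members. It is an illustrative model inferred from the behavior described in this topic, not a description of FortiOS internals: the function names are invented, and zones and neighbors without an explicit minimum-sla-meet-members are assumed to need one in SLA member.

```python
ZONE_MEMBERS = {"PoP1": [1, 2, 3, 4], "PoP2": [5, 6]}
ZONE_MINIMUM = {"PoP1": 2}            # PoP2: assumed default of 1
NEIGHBOR_MEMBERS = {
    "172.31.0.1": [1, 2],             # Hub-1
    "172.31.0.2": [3, 4],             # Hub-2
    "172.31.0.129": [5, 6],           # Hub-3
}
NEIGHBOR_MINIMUM = {"172.31.0.1": 1}  # others: assumed default of 1


def effective_in_sla(health_check_in_sla):
    """Drop members whose zone misses its minimum-sla-meet-members."""
    effective = set()
    for zone, members in ZONE_MEMBERS.items():
        passing = [m for m in members if m in health_check_in_sla]
        if len(passing) >= ZONE_MINIMUM.get(zone, 1):
            effective.update(passing)
    return effective


def route_maps(health_check_in_sla):
    """Return the route map name applied toward each BGP neighbor."""
    effective = effective_in_sla(health_check_in_sla)
    result = {}
    for neighbor, members in NEIGHBOR_MEMBERS.items():
        passing = [m for m in members if m in effective]
        meets = len(passing) >= NEIGHBOR_MINIMUM.get(neighbor, 1)
        result[neighbor] = "in_sla" if meets else "out_sla"
    return result


# Members 1, 2, and 3 fail their health check; member 4 still passes.
print(route_maps({4, 5, 6}))
```

In this model, member 4 passes its own health check but PoP1 misses its minimum of 2, so both PoP1 neighbors fall back to the out_sla route map while the Hub-3 neighbor keeps in_sla.
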
To configure the hubs:
  1. Configure the SD-WAN zone, members, and health check for the external connections to peer routers. Performance SLA health checks are sent to external servers in order to measure the health of the external connections:

    config system sdwan
        set status enable
        config zone
            edit "virtual-wan-link"
            next
        end
        config members
            edit 1
                set interface "port4"
            next
            edit 2
                set interface "port5"
            next
        end
        config health-check
            edit "external_peers"
                set server "10.0.1.2"
                set members 1 2
                config sla
                    edit 1
                        set link-cost-factor latency
                        set latency-threshold 200
                    next
                end
            next
        end
    end
  2. Configure the route maps for in and out of SLA scenarios. When out of SLA (all of the external connections are out of SLA), external routes are not advertised to the spokes that are part of the neighbor group.

    1. Configure the access list:

      config router access-list
          edit "net_Lo"
              config rule
                  edit 1
                      set prefix 172.31.200.200 255.255.255.255
                  next
              end
          next
      end
    2. Configure the route maps:

      config router route-map
          edit "in_sla"
              config rule
                  edit 1                                                  
                      set match-ip-address "net_Lo"
                  next
              end
          next
          edit "out_sla"
              config rule
                  edit 1
                      set action deny
                      set match-ip-address "net_Lo"
                  next
              end
          next
      end
  3. In the BGP settings, configure the external network prefix to advertise. Then configure the neighbor group and neighbor range for the spokes. Configure the preferable and default route maps to define the behavior when the external connections are in and out of SLA:

    config router bgp
        ...
        config network
            edit 1
                set prefix 172.31.200.200 255.255.255.255
            next
        end
        config neighbor-group
            edit "EDGE"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla" 
                ...
            next
        end
        config neighbor-range
            edit 1
                set prefix 172.31.0.64 255.255.255.192
                set neighbor-group "EDGE"                                
            next
        end
        ...
    end
  4. Configure the SD-WAN neighbor to match the neighbor group that includes spokes as members. Specify that at least one of the external peer connections needs to be up to be considered in SLA:

    config system sdwan
        config neighbor
            edit "EDGE" 
                set member 1 2
                set minimum-sla-meet-members 1
                set health-check "external_peers"
                set sla-id 1
            next
        end
    end

Testing and verification

The following tests use diagnostic commands on various FortiGates to verify the connections in the SD-WAN configuration.

Test case 1: the primary PoP and Hub-1 are in SLA

To verify the configuration:
  1. Verify the SD-WAN service rules status on Spoke-1. When all six overlays are in SLA on Spoke-1, the primary PoP and primary zone PoP1 are preferred. In particular, the overlay H1_T11 over PoP1 is preferred:

    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(1), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 362646 second, now 362646
      Service role: standalone
      Members(6):
        1: Seq_num(1 H1_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected 
        2: Seq_num(2 H1_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        5: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        6: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
  2. Verify the BGP learned routes on Hub-1. The local route is advertised to all hubs with the in SLA community 10:1. However, Hub-1 and Hub-2 advertise it to external peers with a shorter AS path than Hub-3 does:

    PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
    VRF 0 BGP routing table entry for 10.0.3.0/24
    Paths: (1 available, best #1, table Default-IP-Routing-Table)
      Not advertised to any peer
      Original VRF 0
      Local, (Received from a RR-client)
        172.31.0.65 from 172.31.0.65 (172.31.0.65)
          Origin IGP metric 0, localpref 100, valid, internal, best
          Community: 10:1
          Last update: Mon Jul 17 15:16:57 2023
  3. Send traffic from a host behind Spoke-1 to 172.31.200.200.

  4. Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H1_T11 overlay:

    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    5.098248 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    5.098339 H1_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    5.098618 H1_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    5.098750 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply

Test case 2: a single SD-WAN member on Hub-1 is out of SLA

Hub-1 and PoP1 are still preferred in this scenario.

To verify the configuration:
  1. Verify the health check status on Spoke-1. The H1_T11 overlay on Hub-1/PoP1 is out of SLA:

    Spoke-1 (root) # diagnose sys sdwan health-check
    Health Check(Hubs):
    Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.214), jitter(0.015), mos(4.104), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
    Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(0.196), jitter(0.014), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(3 H2_T11): state(alive), packet-loss(0.000%) latency(0.173), jitter(0.008), mos(4.404), bandwidth-up(999998), bandwidth-dw(999997), bandwidth-bi(1999995) sla_map=0x1
    …
    
  2. Verify the SD-WAN neighbor status. The SD-WAN neighbor still displays Hub-1’s zone status as pass/alive:

    Spoke-1 (root) # diagnose sys sdwan neighbor
    SD-WAN neighbor status: hold-down(disable), hold-down-time(0), hold_boot_time(0)
            Selected role(standalone) last_secondary_select_time/current_time in seconds 0/436439
    Neighbor(172.31.0.1): member(1 2)role(standalone)
            Health-check(:0)  sla-pass selected alive
    Neighbor(172.31.0.2): member(3 4)role(standalone)
            Health-check(:0)  sla-pass selected alive
    Neighbor(172.31.0.129): member(5 6)role(standalone)
            Health-check(:0)  sla-pass selected alive
  3. Verify the SD-WAN service rules status. Spoke-1 steers traffic to the H1_T22 overlay through Hub-1:

    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(2), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 364162 second, now 364162
      Service role: standalone
      Members(6):
        1: Seq_num(2 H1_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        2: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        5: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        6: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
    
  4. Verify the BGP learned routes on Hub-1. The hubs continue to receive community 10:1 from the spoke and continue to route incoming traffic through Hub-1:

    PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
    VRF 0 BGP routing table entry for 10.0.3.0/24
    Paths: (1 available, best #1, table Default-IP-Routing-Table)
      Not advertised to any peer
      Original VRF 0
      Local, (Received from a RR-client)
        172.31.0.65 from 172.31.0.65 (172.31.0.65)
          Origin IGP metric 0, localpref 100, valid, internal, best
          Community: 10:1
          Last update: Mon Jul 17 15:16:57 2023
    
  5. Send traffic from a host behind Spoke-1 to 172.31.200.200.

  6. Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H1_T22 overlay:

    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    25.299006 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    25.299080 H1_T22 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    25.299323 H1_T22 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    25.299349 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
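
When many overlays are involved, the health check output shown in this test case can be summarized with a short script. The following Python sketch parses text pasted from `diagnose sys sdwan health-check`; the regular expression is written against the output format shown in this topic only, the function name is invented for illustration, and treating a non-zero sla_map as "in SLA" is an assumption.

```python
import re

# Matches member lines in the format shown in this topic.
LINE = re.compile(
    r"Seq\((?P<seq>\d+) (?P<name>[^)\s]+)\): state\((?P<state>\w+)\)"
    r".*?latency\((?P<latency>[\d.]+)\)"
    r".*?sla_map=(?P<sla_map>0x[0-9a-fA-F]+)"
)


def parse_health_check(output):
    """Return one dict per member line found in the pasted output."""
    members = []
    for match in LINE.finditer(output):
        members.append({
            "seq": int(match["seq"]),
            "name": match["name"],
            "state": match["state"],
            "latency": float(match["latency"]),
            # Assumption: non-zero sla_map means an SLA target is met.
            "in_sla": int(match["sla_map"], 16) != 0,
        })
    return members


sample = """\
Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.214), jitter(0.015), mos(4.104), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(0.196), jitter(0.014), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
"""

for member in parse_health_check(sample):
    print(member["name"], member["latency"], member["in_sla"])
```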
    

Test case 3: both SD-WAN members on Hub-1 are out of SLA

Other in SLA overlays in zone PoP1 through Hub-2 are still preferred over PoP2 in this scenario.

To verify the configuration:
  1. Verify the health check status on Spoke-1. Both H1_T11 and H1_T22 overlays on Hub-1/PoP1 are out of SLA:

    Spoke-1 (root) # diagnose sys sdwan health-check
    Health Check(Hubs):
    Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.220), jitter(0.018), mos(4.103), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
    Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(220.174), jitter(0.007), mos(4.104), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x0
    Seq(3 H2_T11): state(alive), packet-loss(0.000%) latency(0.184), jitter(0.015), mos(4.404), bandwidth-up(999998), bandwidth-dw(999997), bandwidth-bi(1999995) sla_map=0x1
    Seq(4 H2_T22): state(alive), packet-loss(0.000%) latency(0.171), jitter(0.008), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(5 H3_T11): state(alive), packet-loss(0.000%) latency(0.173), jitter(0.011), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(6 H3_T22): state(alive), packet-loss(0.000%) latency(0.179), jitter(0.011), mos(4.404), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x1
    
  2. Verify the SD-WAN neighbor status. The SD-WAN neighbor displays Hub-1’s zone status as failed. However, the Hub-2 neighbor is still pass/alive:

    Spoke-1 (root) # diagnose sys sdwan neighbor
    SD-WAN neighbor status: hold-down(disable), hold-down-time(0), hold_boot_time(0)
            Selected role(standalone) last_secondary_select_time/current_time in seconds 0/436535
    Neighbor(172.31.0.1): member(1 2)role(standalone)
            Health-check(:0)  sla-fail alive
    Neighbor(172.31.0.2): member(3 4)role(standalone)
            Health-check(:0)  sla-pass selected alive
    Neighbor(172.31.0.129): member(5 6)role(standalone)
            Health-check(:0)  sla-pass selected alive
  3. Verify the SD-WAN service rules status. Spoke-1 steers traffic to the H2_T11 overlay through Hub-2:

    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(3), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 364489 second, now 364490
      Service role: standalone
      Members(6):
        1: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        2: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        4: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        5: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
    
  4. Verify the BGP learned routes on Hub-1 and Hub-2. Hub-2 and Hub-3 continue to receive community 10:1 from Spoke-1, but Hub-1 receives the out of SLA community of 10:2.

    1. On Hub-1:

      PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:2
            Last update: Mon Jul 17 18:08:58 2023
      
    2. On Hub-2:

      PoP1-Hub2 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:1
            Last update: Mon Jul 17 15:31:43 2023
      
  5. Send traffic from a host behind Spoke-1 to 172.31.200.200.

  6. Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H2_T11 overlay:

    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    13.726009 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    13.726075 H2_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request                              
    13.726354 H2_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    13.726382 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    

Test case 4: three SD-WAN members on PoP1 are out of SLA

The number of in SLA overlays in zone PoP1 is less than the minimum-sla-meet-members in zone PoP1. In the SD-WAN service rule, the remaining overlay to Hub-2 (H2_T22) is forcibly marked as sla(0x0), or out of SLA.

To verify the configuration:
  1. Verify the health check status on Spoke-1. The H1_T11, H1_T22, and H2_T11 overlays on PoP1 are all out of SLA:

    Spoke-1 (root) # diagnose sys sdwan health-check
    Health Check(Hubs):
    Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.219), jitter(0.019), mos(4.103), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
    Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(220.184), jitter(0.008), mos(4.104), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x0
    Seq(3 H2_T11): state(alive), packet-loss(0.000%) latency(220.171), jitter(0.009), mos(4.104), bandwidth-up(999998), bandwidth-dw(999997), bandwidth-bi(1999995) sla_map=0x0
    Seq(4 H2_T22): state(alive), packet-loss(0.000%) latency(0.180), jitter(0.013), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(5 H3_T11): state(alive), packet-loss(0.000%) latency(0.174), jitter(0.014), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(6 H3_T22): state(alive), packet-loss(0.000%) latency(0.179), jitter(0.015), mos(4.404), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x1
    
  2. Verify the SD-WAN neighbor status. The SD-WAN neighbor displays Hub-1 and Hub-2’s zone status as failed:

    Spoke-1 (root) # diagnose sys sdwan neighbor
    SD-WAN neighbor status: hold-down(disable), hold-down-time(0), hold_boot_time(0)
            Selected role(standalone) last_secondary_select_time/current_time in seconds 0/436605
    Neighbor(172.31.0.1): member(1 2)role(standalone)
            Health-check(:0)  sla-fail alive
    Neighbor(172.31.0.2): member(3 4)role(standalone)
            Health-check(:0)  sla-fail alive
    Neighbor(172.31.0.129): member(5 6)role(standalone)
            Health-check(:0)  sla-pass selected alive
  3. Verify the SD-WAN service rules status. Since the minimum SLA members is not met for the primary zone (PoP1), the remaining overlay in PoP1 associated with the SD-WAN service rule is forcibly set to out of SLA. Spoke-1 steers traffic to the H3_T11 overlay through Hub-3:

    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(6), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 365341 second, now 365341
      Service role: standalone
      Members(6):
        1: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected 
        2: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        3: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected      
        5: Seq_num(3 H2_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(4 H2_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected 
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
    
  4. Verify the BGP learned routes on each hub. Hub-3 continues to receive community 10:1 from Spoke-1, but Hub-1 and Hub-2 receive the out of SLA community of 10:2.

    1. On Hub-1:

      PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:2
            Last update: Mon Jul 17 18:22:14 2023
      
    2. On Hub-2:

      PoP1-Hub2 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:2
            Last update: Mon Jul 17 18:37:53 2023
      
    3. On Hub-3:

      PoP2-Hub3 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:1
            Last update: Mon Jul 17 14:39:04 2023
      
  5. Send traffic from a host behind Spoke-1 to 172.31.200.200.

  6. Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H3_T11 overlay:

    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    38.501449 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    38.501519 H3_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    38.501818 H3_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    38.501845 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    

Test case 5: an SD-WAN member on PoP1 recovers

SD-WAN member H2_T11 recovers, which brings the number of in SLA overlays in PoP1 back up to the minimum-sla-meet-members threshold. After the hold down time (30 seconds) passes, in SLA overlays in zone PoP1 are preferred over PoP2 again. With sla-stickiness enabled, existing traffic is kept on H3_T11, but new traffic is steered to H2_T11.

To verify the configuration:
  1. Verify the SD-WAN service rules status on Spoke-1. The hold down timer has not yet passed, so H2_T11 is not yet preferred, even though its SLA status is pass/alive:

    Spoke-1 (root) # diagnose sys sdwan service
    
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(16), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 431972 second, now 432000
      Service role: standalone
      Members(6):
        1: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        2: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        3: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        5: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
    
  2. Verify the SD-WAN service rules status again after the hold down timer passes. H2_T11 and H2_T22 from PoP1 are now preferred:

    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(17), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 432003 second, now 432003
      Service role: standalone
      Members(6):
        1: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        2: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        4: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        5: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
    
  3. Verify the BGP learned routes on Hub-2, which now receives community 10:1 from Spoke-1:

    PoP1-Hub2 (root) #  get router info bgp network 10.0.3.0/24
    VRF 0 BGP routing table entry for 10.0.3.0/24
    Paths: (1 available, best #1, table Default-IP-Routing-Table)
      Not advertised to any peer
      Original VRF 0
      Local, (Received from a RR-client)
        172.31.0.65 from 172.31.0.65 (172.31.0.65)
          Origin IGP metric 0, localpref 100, valid, internal, best
          Community: 10:1
          Last update: Tue Jul 18 14:41:32 2023
    
  4. Send traffic from a host behind Spoke-1 to 172.31.200.200.

  5. Run a sniffer trace on Spoke-1. Because of sla-stickiness, the existing traffic is kept on H3_T11:

    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    
    0.202708 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    0.202724 H3_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    0.202911 H3_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    0.202934 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
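
The hold down check in step 1 of this test case comes down to arithmetic on the Hold down time, Hold start at, and now fields of the service output. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that both counters are in seconds on the same clock and that the hold down is over once the elapsed time reaches the configured value.

```python
import re

# Matches the hold down line printed by "diagnose sys sdwan service" in this example.
HOLD_RE = re.compile(
    r"Hold down time\((\d+)\) seconds, Hold start at (\d+) second, now (\d+)"
)


def hold_down_remaining(line):
    """Return (elapsed, remaining) in seconds, or None if the line does not match."""
    match = HOLD_RE.search(line)
    if match is None:
        return None
    hold_time, start, now = (int(value) for value in match.groups())
    elapsed = now - start
    # Assumption: the hold down is over once elapsed reaches the configured time.
    remaining = max(0, hold_time - elapsed)
    return elapsed, remaining


step1 = "Hold down time(30) seconds, Hold start at 431972 second, now 432000"
print(hold_down_remaining(step1))  # (28, 2)
```

With the values from step 1, 28 of the 30 seconds have elapsed, which is consistent with H2_T11 not yet being preferred in that output.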
    

Test case 6: Hub-1 has an in SLA path to external peers

Since Hub-1 has an in SLA path to external peers, it will advertise the external route with destination 172.31.200.200/32 to Spoke-1.

To verify the configuration:
  1. Verify the health check status on Hub-1. Note that port4 meets SLA, but port5 does not:

    PoP1-Hub1 (root) # diagnose sys sdwan health-check
    Health Check(external_peers):
    Seq(1 port4): state(alive), packet-loss(0.000%) latency(0.161), jitter(0.009), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(2 port5): state(dead), packet-loss(100.000%) sla_map=0x0
    
  2. Verify the SD-WAN neighbor status. The minimum-sla-meet-members threshold of 1 is still met:

    PoP1-Hub1 (root) # diagnose sys sdwan neighbor
    Neighbor(EDGE): member(1 2)role(standalone)
            Health-check(external_peers:1)  sla-pass selected alive
  3. Verify the BGP advertised routes. Hub-1 still advertises the external route to the Spoke-1 BGP neighbor:

    PoP1-Hub1 (root) # get router info bgp neighbors 172.31.0.65 advertised-routes
    VRF 0 BGP table version is 13, local router ID is 172.31.0.1
    Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
    Origin codes: i - IGP, e - EGP, ? - incomplete
       Network                 Next Hop      Metric     LocPrf  Weight  RouteTag Path
    *>i172.31.200.200/32     172.31.0.1     100        32768        0          i <-/->
    Total number of prefixes 1
    

Test case 7: all external peers on Hub-1 are out of SLA

In this case, Hub-1 uses the default route map, which denies the advertisement of the external route. Spoke-1 then routes traffic to the next hub.

To verify the configuration:
  1. Verify the health check status on Hub-1. Note that port4 and port5 do not meet SLA:

    PoP1-Hub1 (root) # diagnose sys sdwan health-check
    Health Check(external_peers):
    Seq(1 port4): state(dead), packet-loss(100.000%) sla_map=0x0
    Seq(2 port5): state(dead), packet-loss(100.000%) sla_map=0x0
  2. Verify the SD-WAN neighbor status. The minimum-sla-meet-members threshold of 1 is not met:

    PoP1-Hub1 (root) # diagnose sys sdwan neighbor
    Neighbor(EDGE): member(1 2)role(standalone)
            Health-check(external_peers:1)  sla-fail dead
  3. Verify the BGP advertised routes. Hub-1 does not advertise any external routes to the Spoke-1 BGP neighbor:

    PoP1-Hub1 (root) # get router info bgp neighbors 172.31.0.65 advertised-routes
    % No prefix for neighbor 172.31.0.65

SD-WAN multi-PoP multi-hub large scale design and failover

FortiOS 7.2.0 introduced a feature to define the minimum number of SD-WAN interface members that must meet SLA in order for the spoke to select a hub to process its SD-WAN traffic. This design is suitable for a single-PoP multi-hub architecture in order to achieve hub-to-hub failover. See Using multiple members per SD-WAN neighbor configuration.

In FortiOS 7.4.1 and later, the design is enhanced to support a multi-PoP multi-hub architecture in which incoming and outgoing traffic failover between PoPs is supported.

Based on the preceding diagram, incoming and outgoing traffic to the spoke is preferred over PoP1. If a single hub within PoP1 goes out of SLA, traffic will continue to flow through the PoP. If the minimum number of members to meet SLA in the PoP cannot be met, then traffic will fail over to PoP2.

The following enhancements have been made to support the multi-PoP failover scenario.

  • Add minimum-sla-meet-members setting in the SD-WAN zone configurations and zone-mode setting in the SD-WAN service configurations:

    config system sdwan
        config zone
            edit <name>
                set minimum-sla-meet-members <integer> 
            next
        end
        config service
            edit <id>
                set mode sla
                set zone-mode {enable | disable}
            next
        end
    end

    When zone-mode is enabled on an SD-WAN service rule, the traffic is steered based on the status of the zone.

    The state of the health check referenced in the SD-WAN service can be defined as follows:

    • If the number of in SLA members in a zone is less than the minimum-sla-meet-members, then the zone's state is out of SLA; otherwise, it is in SLA.

    • If a zone's state is out of SLA, then all members in the zone are out of SLA.

    • If a zone's state is in SLA, then the health check's state of individual members in the zone is determined by its own state.

  • Add service-id setting in the SD-WAN neighbor configurations:

    config system sdwan
        config neighbor
            edit <bgp_neighbor_ip>
                set member <member_id>
                set service-id <id>
            next
        end
    end

    The SD-WAN neighbor’s behavior is determined by the referenced SD-WAN service, so the neighbor stays synchronized with that service.

    • The SD-WAN service defines the priority zones, whose SLA state determines which community string is advertised (preferable or default).

    • The SD-WAN service defines the hold-down-time, which determines how long the currently advertised community string is kept when it is due to change.

  • Add sla-stickiness setting in the SD-WAN service configurations:

    config system sdwan
        config service
            edit <id>
                set mode sla
                set sla-stickiness {enable | disable}
            next
        end
    end

    The switch-over of an existing session is determined as follows:

    • If the outgoing interface of the session is in SLA, then the session can keep its outgoing interface.

    • Otherwise, the session switches to a preferable path if one exists.

  • Allow the neighbor group to be configured in the SD-WAN neighbor configurations:

    config system sdwan
        config neighbor
            edit <bgp_neighbor_group>
                set member <member_id>
                set health-check <name>
                set sla-id <id>
            next
        end
    end
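
The zone-mode rules listed under the first enhancement above can be modeled in a few lines. The following Python sketch is not FortiOS code; the class and function names are hypothetical, the member names are borrowed from the example later in this topic, and it assumes that a zone without an explicit minimum-sla-meet-members value needs one member in SLA.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    zone: str
    in_sla: bool  # result of the member's own health check


def zone_in_sla(members, zone, threshold):
    """A zone is in SLA when at least `threshold` of its members are in SLA."""
    passing = sum(1 for m in members if m.zone == zone and m.in_sla)
    return passing >= threshold


def effective_states(members, thresholds):
    """Map member name -> SLA state as seen by a zone-mode service rule."""
    states = {}
    for m in members:
        # Assumption: a zone without an explicit threshold needs one member in SLA.
        threshold = thresholds.get(m.zone, 1)
        zone_ok = zone_in_sla(members, m.zone, threshold)
        # An out of SLA zone forces every member out of SLA;
        # otherwise the member keeps its own state.
        states[m.name] = zone_ok and m.in_sla
    return states


members = [
    Member("H1_T11", "PoP1", False),
    Member("H1_T22", "PoP1", False),
    Member("H2_T11", "PoP1", False),
    Member("H2_T22", "PoP1", True),
    Member("H3_T11", "PoP2", True),
    Member("H3_T22", "PoP2", True),
]
print(effective_states(members, {"PoP1": 2}))
```

In this scenario only one PoP1 member passes its own health check, so with a threshold of 2 the PoP1 zone is out of SLA and H2_T22 is treated as out of SLA as well.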

Outgoing path control

The outgoing path from spoke to hub operates as follows:

  1. Overlays to the primary and secondary PoP are assigned separately into an SD-WAN primary and secondary zone on the spoke.

  2. One SD-WAN service rule is defined to include these zones as SD-WAN members.

  3. When the primary zone is in SLA (minimum-sla-meet-members is met), the SD-WAN service rule steers traffic to the in SLA overlay members.

  4. When the primary zone is out of SLA (minimum-sla-meet-members is not met), the SD-WAN service rule steers traffic to the in SLA overlay members in the secondary zone.

  5. When the primary zone SLA is recovered:

    1. If sla-stickiness is disabled on the SD-WAN service rule, then traffic waits for the hold-down-time to elapse before switching back to in SLA overlays in the primary zone.

    2. If sla-stickiness is enabled on the SD-WAN service rule, then existing traffic will be kept on the in SLA overlays on the secondary zone, but new traffic will be steered to in SLA overlays in the primary zone.
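
The per-session decision described in the steps above can be sketched as follows. This Python model is an illustration under stated assumptions, not the FortiOS implementation: the function name is hypothetical, the hold-down-time is not modeled (the in_sla input is taken to be the state after any hold down has elapsed), and the case where no member is in SLA is left undefined because this topic does not describe it.

```python
def pick_member(ordered_members, in_sla, current=None, sla_stickiness=False):
    """Pick the outgoing member for a session.

    ordered_members: member names, most preferred first (priority zone order).
    in_sla: dict of member name -> effective SLA state.
    current: member an existing session already uses, or None for a new session.
    """
    # With sla-stickiness, an existing session keeps its interface while it is in SLA.
    if sla_stickiness and current is not None and in_sla.get(current, False):
        return current
    # Otherwise use the most preferred member that is in SLA.
    for name in ordered_members:
        if in_sla.get(name, False):
            return name
    # This topic does not describe the case where nothing is in SLA.
    return None


order = ["H1_T11", "H1_T22", "H2_T11", "H2_T22", "H3_T11", "H3_T22"]
states = {"H1_T11": False, "H1_T22": False, "H2_T11": True,
          "H2_T22": True, "H3_T11": True, "H3_T22": True}

print(pick_member(order, states, current="H3_T11", sla_stickiness=True))  # H3_T11
print(pick_member(order, states))                                         # H2_T11
```

The sample data uses the member names from the example in this topic: an existing session stays on the secondary zone overlay, while a new session is steered to the first in SLA overlay in the primary zone.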

Incoming path control

The incoming traffic from the core/external peers, to PoP, to spoke operates as follows:

  1. When the primary zone is in SLA, the spoke uses the preferable route map to advertise local routes with the in SLA community to hubs in the primary and secondary PoPs.

    1. Hubs in the primary PoP translate the in SLA community into a short AS path and advertise it to external peers to attract incoming traffic.

    2. Hubs in the secondary PoP translate the in SLA community into a longer AS path and advertise it to external peers to deflect incoming traffic.

  2. If the number of in SLA overlays in the primary zone is less than the minimum-sla-meet-members, then the spoke will instead use the default route map to advertise routes with an out of SLA community to hubs in the primary PoP.

    1. Hubs in the primary PoP translate the out of SLA community into the longest AS path, and advertise it to external peers to deflect incoming traffic.

    2. As a result, inbound traffic is routed to hubs in the secondary PoP.

  3. When the primary zone SLA is recovered:

    1. The spoke waits for the hold-down-time defined in the SD-WAN service rule to elapse, and then uses the preferable route map again to advertise routes with the in SLA community to hubs in the primary PoP.

    2. As a result, inbound traffic will be routed back to hubs in the primary PoP.
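
The route map switching described above behaves like a small state machine: a failure switches to the default route map at once, while a recovery only takes effect after the hold-down-time. The following Python sketch models that behavior for the primary PoP. The class name is hypothetical, the route map names in_sla and out_sla are borrowed from the example in this topic, and the exact timer handling inside FortiOS is an assumption.

```python
class RouteMapSelector:
    """Track which route map the spoke applies toward hubs in the primary PoP."""

    def __init__(self, hold_down_time, preferable="in_sla", default="out_sla"):
        self.hold_down_time = hold_down_time
        self.preferable = preferable
        self.default = default
        self.active = preferable   # start in the healthy state
        self.recovered_at = None   # time at which the zone was first seen recovered

    def update(self, now, primary_zone_in_sla):
        """Feed the current time (seconds) and zone state; return the active route map."""
        if not primary_zone_in_sla:
            # Failure: switch to the default route map immediately.
            self.active = self.default
            self.recovered_at = None
        elif self.active == self.default:
            # Recovery: wait for the hold-down-time before switching back.
            if self.recovered_at is None:
                self.recovered_at = now
            if now - self.recovered_at >= self.hold_down_time:
                self.active = self.preferable
                self.recovered_at = None
        return self.active


selector = RouteMapSelector(hold_down_time=30)
for now, zone_ok in [(0, True), (10, False), (20, True), (49, True), (50, True)]:
    print(now, selector.update(now, zone_ok))
```

In this run the zone recovers at 20 seconds, so the preferable route map is only used again from 50 seconds onward.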

Neighbor group configuration

By configuring the neighbor group for spokes under the hub's SD-WAN neighbor configuration, if all paths from the hub to external peers are detected as out of SLA, then the hub will use the default route map to deny external routes to spokes that belong to this neighbor group defined on the hub. As a result, spokes will skip that specific hub and connect to external peers from other hubs.

This allows spokes to measure only the overlay quality to each hub, while hubs manage the health checks to services reached through external peers. This significantly decreases the number of health check probes sent directly from the spokes to services and decreases the overall complexity. The reduction is greater still in deployments that use multiple VRFs or segmentation, where each spoke would otherwise need to send health check probes.
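
The hub-side decision can be expressed as a single threshold check. The following Python sketch is illustrative: the function name is hypothetical, and the interface names (port4, port5) and route map names (in_sla, out_sla) are borrowed from the example in this topic.

```python
def hub_route_map(external_in_sla, minimum_sla_meet_members=1):
    """Return the route map a hub applies toward the spoke neighbor group.

    external_in_sla: dict of member interface -> True if it meets SLA.
    """
    passing = sum(1 for ok in external_in_sla.values() if ok)
    if passing >= minimum_sla_meet_members:
        return "in_sla"   # preferable route map: external routes are advertised
    return "out_sla"      # default route map: external routes are denied


print(hub_route_map({"port4": True, "port5": False}))   # in_sla
print(hub_route_map({"port4": False, "port5": False}))  # out_sla
```

With a threshold of 1, the hub keeps advertising external routes while any one external path is in SLA, and stops only when all of them are out of SLA.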

Example

This example configuration contains the following components:

  • Two PoPs:

    • The primary PoP has two hubs (Hub-1 and Hub-2).

    • The secondary PoP has one hub (Hub-3).

  • Spoke-1 has six overlays, with two overlay connections to each hub.

  • Spoke-1 has three BGP neighbors, with one BGP neighbor for each hub.

    • All BGP neighbors are established on loopback IPs.

  • Each hub has two paths to external peers.

Normally, outbound and inbound traffic go through hubs in the primary PoP. If the number of in SLA overlays to the primary PoP is less than the minimum-sla-meet-members (set to 2 in this example), bi-directional traffic needs to be switched to hubs in the secondary PoP. But when the primary PoP recovers and the minimum-sla-meet-members is met again, bi-directional traffic is forced back to hubs in the primary PoP after the predefined hold-down-time duration.

The hubs do not require SD-WAN configurations to the spokes. However, they use SD-WAN for connections to external peer routers.

Configuring the FortiGates

The following configurations highlight important routing and SD-WAN settings that must be configured on the spoke and the hubs. It is assumed that other configurations such as underlays, IPsec VPN overlays, loopbacks, static routes, and so on are already configured.

To configure Spoke-1:
  1. Create the primary (PoP1) and secondary (PoP2) zones, and set the minimum-sla-meet-members to 2 on PoP1:

    config system sdwan
        set status enable
        config zone
            edit "virtual-wan-link"
            next
            edit "PoP1"
                set minimum-sla-meet-members 2 
            next
            edit "PoP2"
            next
        end
    end
  2. Add the overlay members to each zone. Four overlays are defined for PoP1, and two overlays are defined for PoP2:

    config system sdwan
        config members
            edit 1
                set interface "H1_T11"
                set zone "PoP1"
            next
            edit 2
                set interface "H1_T22"
                set zone "PoP1"
            next
            edit 3
                set interface "H2_T11"
                set zone "PoP1"
            next
            edit 4
                set interface "H2_T22"
                set zone "PoP1"
            next
            edit 5
                set interface "H3_T11"
                set zone "PoP2"
            next
            edit 6
                set interface "H3_T22"
                set zone "PoP2"
            next
        end
    end
  3. Configure a performance SLA health check to a probe server behind the three hubs:

    config system sdwan
        config health-check
            edit "Hubs"
                set server "172.31.100.100"
                set source 172.31.0.65
                set members 0
                config sla
                    edit 1
                        set link-cost-factor latency
                        set latency-threshold 200
                    next
                end
            next
        end
    end
  4. Configure the service rule with the following settings: use SLA mode, enable zone mode to steer traffic based on the zone statuses, enable sla-stickiness, and use a 30-second hold down so that upon a recovery, existing sessions will remain on the secondary PoP while new sessions will switch back to the primary PoP once the 30-second duration ends:

    config system sdwan
        config service
            edit 1
                set mode sla
                set zone-mode enable
                set dst "all"
                set src "CORP_LAN"                                           
                set hold-down-time 30
                set sla-stickiness enable
                config sla
                    edit "Hubs"
                        set id 1
                    next
                end
                set priority-zone "PoP1" "PoP2"
            next
        end
    end

    Since the PoP1 zone is specified before PoP2, PoP1 is regarded as the primary and preferred over the PoP2 zone.

  5. Configure the in_sla and out_sla route maps that define the communities that are advertised to the hub when the zones are in and out of SLA.

    1. Configure the access list:

      config router access-list
          edit "net10"
              config rule
                  edit 1
                      set prefix 10.0.3.0 255.255.255.0
                  next
              end
          next
      end
    2. Configure the route maps:

      config router route-map
          edit "in_sla"
              config rule
                  edit 1
                      set match-ip-address "net10"
                      set set-community "10:1" 
                  next
              end
          next
          edit "out_sla"
              config rule
                  edit 1
                      set match-ip-address "net10"
                      set set-community "10:2" 
                  next
              end
          next
      end
  6. Configure the default route map for out of SLA scenarios, preferable route map for in SLA scenarios, and the local network to be advertised:

    config router bgp
        config neighbor
            edit "172.31.0.1"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla"
                ...
            next
            edit "172.31.0.2"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla"
                ...
            next
            edit "172.31.0.129"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla"
                ...
            next
        end
        config network
            edit 1
                set prefix 10.0.3.0 255.255.255.0
            next
        end
        ...
    end
  7. Define SD-WAN neighbors for each hub. The minimum-sla-meet-members is configured for the Hub-1 neighbor so that bi-directional traffic goes through Hub-1 as long as at least one overlay to Hub-1 is in SLA. Associate the previously defined service rule with each SD-WAN neighbor:

    config system sdwan
        config neighbor
            edit "172.31.0.1"
                set member 1 2
                set minimum-sla-meet-members 1
                set service-id 1
            next
            edit "172.31.0.2"
                set member 3 4
                set service-id 1
            next
            edit "172.31.0.129"
                set member 5 6
                set service-id 1
            next
        end
    end
To configure the hubs:
  1. Configure the SD-WAN zone, members, and health check for the external connections to peer routers. Performance SLA health checks are sent to external servers in order to measure the health of the external connections:

    config system sdwan
        set status enable
        config zone
            edit "virtual-wan-link"
            next
        end
        config members
            edit 1
                set interface "port4"
            next
            edit 2
                set interface "port5"
            next
        end
        config health-check
            edit "external_peers"
                set server "10.0.1.2"
                set members 1 2
                config sla
                    edit 1
                        set link-cost-factor latency
                        set latency-threshold 200
                    next
                end
            next
        end
    end
  2. Configure the route maps for in and out of SLA scenarios. When out of SLA (all of the external connections are out of SLA), the hub denies the advertisement of external routes to the spokes that are part of the neighbor group.

    1. Configure the access list:

      config router access-list
          edit "net_Lo"
              config rule
                  edit 1
                      set prefix 172.31.200.200 255.255.255.255
                  next
              end
          next
      end
    2. Configure the route maps:

      config router route-map
          edit "in_sla"
              config rule
                  edit 1                                                  
                      set match-ip-address "net_Lo"
                  next
              end
          next
          edit "out_sla"
              config rule
                  edit 1
                      set action deny
                      set match-ip-address "net_Lo"
                  next
              end
          next
      end
  3. In the BGP settings, configure the external network prefix to advertise. Then configure the neighbor group and neighbor range for the spokes. Configure the preferable and default route maps to define the behavior when the external connections are in and out of SLA:

    config router bgp
        ...
        config network
            edit 1
                set prefix 172.31.200.200 255.255.255.255
            next
        end
        config neighbor-group
            edit "EDGE"
                ...
                set route-map-out "out_sla"
                set route-map-out-preferable "in_sla" 
                ...
            next
        end
        config neighbor-range
            edit 1
                set prefix 172.31.0.64 255.255.255.192
                set neighbor-group "EDGE"                                
            next
        end
        ...
    end
  4. Configure the SD-WAN neighbor to match the neighbor group that includes spokes as members. Specify that at least one of the external peer connections needs to be up to be considered in SLA:

    config system sdwan
        config neighbor
            edit "EDGE" 
                set member 1 2
                set minimum-sla-meet-members 1
                set health-check "external_peers"
                set sla-id 1
            next
        end
    end

Testing and verification

The following tests use diagnostic commands on various FortiGates to verify the connections in the SD-WAN configuration.

Test case 1: the primary PoP and Hub-1 are in SLA

To verify the configuration:
  1. Verify the SD-WAN service rules status on Spoke-1. When all six overlays are in SLA on Spoke-1, the primary PoP and primary zone PoP1 are preferred. In particular, the overlay H1_T11 over PoP1 is preferred:

    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(1), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 362646 second, now 362646
      Service role: standalone
      Members(6):
        1: Seq_num(1 H1_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected 
        2: Seq_num(2 H1_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        5: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        6: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
  2. Verify the BGP learned routes on Hub-1. The local route with the in SLA community 10:1 is advertised to all hubs. However, the hubs in the primary PoP (Hub-1 and Hub-2) translate it into a shorter AS path than Hub-3 does when advertising to external peers:

    PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
    VRF 0 BGP routing table entry for 10.0.3.0/24
    Paths: (1 available, best #1, table Default-IP-Routing-Table)
      Not advertised to any peer
      Original VRF 0
      Local, (Received from a RR-client)
        172.31.0.65 from 172.31.0.65 (172.31.0.65)
          Origin IGP metric 0, localpref 100, valid, internal, best
          Community: 10:1
          Last update: Mon Jul 17 15:16:57 2023
  3. Send traffic from a host behind Spoke-1 to 172.31.200.200.

  4. Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H1_T11 overlay:

    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    5.098248 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    5.098339 H1_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    5.098618 H1_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    5.098750 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
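
When the service output has to be checked repeatedly, the Members list shown in step 1 can be parsed with a short script. The following Python sketch is a convenience example, not a Fortinet tool: the function name is hypothetical, the pattern is written against the output format shown in this topic, and it assumes that a non-zero sla() value means the member is in SLA.

```python
import re

# Matches one entry of the Members list in "diagnose sys sdwan service" output.
MEMBER_RE = re.compile(
    r"Seq_num\((\d+) ([^\s)]+) ([^\s)]+)\), (\w+), sla\((0x[0-9a-fA-F]+)\)"
)


def parse_members(output):
    """Return the member entries of a service rule in the order they are listed."""
    members = []
    for match in MEMBER_RE.finditer(output):
        seq, interface, zone, state, sla = match.groups()
        members.append({
            "seq": int(seq),
            "interface": interface,
            "zone": zone,
            "alive": state == "alive",
            # Assumption: a non-zero sla() bitmap means the member meets an SLA target.
            "in_sla": int(sla, 16) != 0,
        })
    return members


sample = """
    1: Seq_num(1 H1_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
    5: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
"""
for member in parse_members(sample):
    print(member["interface"], member["zone"], member["in_sla"])
```

Because the entries are returned in listed order, the first entry with in_sla set corresponds to the overlay that the output shows as most preferred.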

Test case 2: a single SD-WAN member on Hub-1 is out of SLA

Hub-1 and PoP1 are still preferred in this scenario.

To verify the configuration:
  1. Verify the health check status on Spoke-1. The H1_T11 overlay on Hub-1/PoP1 is out of SLA:

    Spoke-1 (root) # diagnose sys sdwan health-check
    Health Check(Hubs):
    Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.214), jitter(0.015), mos(4.104), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
    Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(0.196), jitter(0.014), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(3 H2_T11): state(alive), packet-loss(0.000%) latency(0.173), jitter(0.008), mos(4.404), bandwidth-up(999998), bandwidth-dw(999997), bandwidth-bi(1999995) sla_map=0x1
    …
    
  2. Verify the SD-WAN neighbor status. The SD-WAN neighbor still displays Hub-1’s zone status as pass/alive:

    Spoke-1 (root) # diagnose sys sdwan neighbor
    SD-WAN neighbor status: hold-down(disable), hold-down-time(0), hold_boot_time(0)
            Selected role(standalone) last_secondary_select_time/current_time in seconds 0/436439
    Neighbor(172.31.0.1): member(1 2)role(standalone)
            Health-check(:0)  sla-pass selected alive
    Neighbor(172.31.0.2): member(3 4)role(standalone)
            Health-check(:0)  sla-pass selected alive
    Neighbor(172.31.0.129): member(5 6)role(standalone)
            Health-check(:0)  sla-pass selected alive
  3. Verify the SD-WAN service rules status. Spoke-1 steers traffic to the H1_T22 overlay through Hub-1:

    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(2), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 364162 second, now 364162
      Service role: standalone
      Members(6):
        1: Seq_num(2 H1_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        2: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        5: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        6: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
    
  4. Verify the BGP learned routes on Hub-1. The hubs continue to receive community 10:1 from the spoke and continue to route incoming traffic through Hub-1:

    PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
    VRF 0 BGP routing table entry for 10.0.3.0/24
    Paths: (1 available, best #1, table Default-IP-Routing-Table)
      Not advertised to any peer
      Original VRF 0
      Local, (Received from a RR-client)
        172.31.0.65 from 172.31.0.65 (172.31.0.65)
          Origin IGP metric 0, localpref 100, valid, internal, best
          Community: 10:1
          Last update: Mon Jul 17 15:16:57 2023
    
  5. Send traffic from a host behind Spoke-1 to 172.31.200.200.

  6. Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H1_T22 overlay:

    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    25.299006 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    25.299080 H1_T22 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    25.299323 H1_T22 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    25.299349 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
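
When repeating this check across several test runs, the overlay used in each direction can be read from the sniffer output with a short script. The following is a minimal Python sketch, not a Fortinet tool. The function name `overlay_path` and the assumption that `port4` is the LAN-side interface are illustrative only; the sample lines are copied from the trace above.

```python
import re

# Sample lines copied from the sniffer trace above.
TRACE = """\
25.299006 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
25.299080 H1_T22 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
25.299323 H1_T22 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
25.299349 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
"""

# timestamp, interface, direction, source, destination
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d+\.\d+)\s+(?P<intf>\S+)\s+(?P<dir>in|out)\s+"
    r"(?P<src>\S+)\s+->\s+(?P<dst>[^:\s]+):"
)


def overlay_path(trace, lan_intf="port4"):
    """Return (egress, ingress) overlay names, ignoring the LAN-side interface.

    lan_intf is an assumption for this example topology.
    """
    egress = ingress = None
    for line in trace.splitlines():
        m = LINE_RE.match(line)
        if not m or m["intf"] == lan_intf:
            continue
        if m["dir"] == "out" and egress is None:
            egress = m["intf"]
        elif m["dir"] == "in" and ingress is None:
            ingress = m["intf"]
    return egress, ingress


egress, ingress = overlay_path(TRACE)
print(egress, ingress, "symmetric" if egress == ingress else "asymmetric")
```

For the trace above, both directions resolve to H1_T22, which matches the expected result for this test case.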
    

Test case 3: both SD-WAN members on Hub-1 are out of SLA

The remaining in SLA overlays in zone PoP1, which go through Hub-2, are still preferred over PoP2 in this scenario.

To verify the configuration:
  1. Verify the health check status on Spoke-1. Both H1_T11 and H1_T22 overlays on Hub-1/PoP1 are out of SLA:

    Spoke-1 (root) # diagnose sys sdwan health-check
    Health Check(Hubs):
    Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.220), jitter(0.018), mos(4.103), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
    Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(220.174), jitter(0.007), mos(4.104), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x0
    Seq(3 H2_T11): state(alive), packet-loss(0.000%) latency(0.184), jitter(0.015), mos(4.404), bandwidth-up(999998), bandwidth-dw(999997), bandwidth-bi(1999995) sla_map=0x1
    Seq(4 H2_T22): state(alive), packet-loss(0.000%) latency(0.171), jitter(0.008), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(5 H3_T11): state(alive), packet-loss(0.000%) latency(0.173), jitter(0.011), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(6 H3_T22): state(alive), packet-loss(0.000%) latency(0.179), jitter(0.011), mos(4.404), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x1
    
  2. Verify the SD-WAN neighbor status. The output shows the Hub-1 neighbor (172.31.0.1) as sla-fail, while the Hub-2 neighbor (172.31.0.2) is still sla-pass and alive:

    Spoke-1 (root) # diagnose sys sdwan neighbor
    SD-WAN neighbor status: hold-down(disable), hold-down-time(0), hold_boot_time(0)
            Selected role(standalone) last_secondary_select_time/current_time in seconds 0/436535
    Neighbor(172.31.0.1): member(1 2)role(standalone)
            Health-check(:0)  sla-fail alive
    Neighbor(172.31.0.2): member(3 4)role(standalone)
            Health-check(:0)  sla-pass selected alive
    Neighbor(172.31.0.129): member(5 6)role(standalone)
            Health-check(:0)  sla-pass selected alive
  3. Verify the SD-WAN service rules status. Spoke-1 steers traffic to the H2_T11 overlay through Hub-2:

    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(3), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 364489 second, now 364490
      Service role: standalone
      Members(6):
        1: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        2: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        4: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        5: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
    
  4. Verify the BGP learned routes on Hub-1 and Hub-2. Hub-2 and Hub-3 continue to receive community 10:1 from Spoke-1, but Hub-1 receives the out of SLA community of 10:2.

    1. On Hub-1:

      PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:2
            Last update: Mon Jul 17 18:08:58 2023
      
    2. On Hub-2:

      PoP1-Hub2 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:1
            Last update: Mon Jul 17 15:31:43 2023
      
  5. Send traffic from a host behind Spoke-1 to 172.31.200.200.

  6. Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H2_T11 overlay:

    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    13.726009 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    13.726075 H2_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    13.726354 H2_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    13.726382 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
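
The per-member SLA state in step 1 can also be extracted programmatically. The following is a minimal Python sketch that parses the `diagnose sys sdwan health-check` output shown above. The function name `parse_health_check` and the dictionary keys are hypothetical, and the sketch assumes that a non-zero `sla_map` value means the member meets at least one SLA target.

```python
import re

# Output copied from step 1 of this test case.
HEALTH_CHECK = """\
Health Check(Hubs):
Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.220), jitter(0.018), mos(4.103), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(220.174), jitter(0.007), mos(4.104), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x0
Seq(3 H2_T11): state(alive), packet-loss(0.000%) latency(0.184), jitter(0.015), mos(4.404), bandwidth-up(999998), bandwidth-dw(999997), bandwidth-bi(1999995) sla_map=0x1
Seq(4 H2_T22): state(alive), packet-loss(0.000%) latency(0.171), jitter(0.008), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
Seq(5 H3_T11): state(alive), packet-loss(0.000%) latency(0.173), jitter(0.011), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
Seq(6 H3_T22): state(alive), packet-loss(0.000%) latency(0.179), jitter(0.011), mos(4.404), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x1
"""

SEQ_RE = re.compile(r"Seq\((\d+) (\S+)\): state\((\w+)\)")
SLA_RE = re.compile(r"sla_map=(0x[0-9a-fA-F]+)")
LAT_RE = re.compile(r"latency\(([\d.]+)\)")


def parse_health_check(text):
    """Return one dict per member line; lines without Seq(...) are skipped."""
    members = []
    for line in text.splitlines():
        seq = SEQ_RE.search(line)
        sla = SLA_RE.search(line)
        if not (seq and sla):
            continue
        lat = LAT_RE.search(line)  # absent when the member is dead
        members.append({
            "seq": int(seq.group(1)),
            "name": seq.group(2),
            "state": seq.group(3),
            "latency": float(lat.group(1)) if lat else None,
            # Assumption: any bit set in sla_map means "in SLA".
            "in_sla": int(sla.group(1), 16) != 0,
        })
    return members


out_of_sla = [m["name"] for m in parse_health_check(HEALTH_CHECK) if not m["in_sla"]]
print(out_of_sla)
```

For this test case, the list contains only the two Hub-1 overlays.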
    

Test case 4: three SD-WAN members on PoP1 are out of SLA

The number of in SLA overlays in zone PoP1 is less than the minimum-sla-meet-members setting for zone PoP1. As a result, the remaining in SLA overlay to Hub-2 (H2_T22) is forcibly marked as sla(0x0), or out of SLA, in the SD-WAN service rule.
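
The zone rule can be illustrated with a small model. The following Python sketch is not FortiOS code; it only restates the rule that a zone with fewer in SLA members than its minimum-sla-meet-members value has all of its members treated as out of SLA. The threshold values (2 for PoP1, 1 for PoP2) and all names (`Member`, `effective_sla`) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    zone: str
    in_sla: bool  # raw health-check result for this member


def effective_sla(members, min_meet):
    """Return {member name: effective SLA state} after applying the zone rule."""
    # Count the members that meet SLA in each zone.
    in_sla_count = {}
    for m in members:
        in_sla_count[m.zone] = in_sla_count.get(m.zone, 0) + int(m.in_sla)

    result = {}
    for m in members:
        zone_ok = in_sla_count[m.zone] >= min_meet[m.zone]
        # A member is in SLA only if it passes and its zone is in SLA.
        result[m.name] = m.in_sla and zone_ok
    return result


# Raw health-check states for this test case.
members = [
    Member("H1_T11", "PoP1", False),
    Member("H1_T22", "PoP1", False),
    Member("H2_T11", "PoP1", False),
    Member("H2_T22", "PoP1", True),
    Member("H3_T11", "PoP2", True),
    Member("H3_T22", "PoP2", True),
]

# Assumed thresholds, chosen to match the behavior described in this section.
thresholds = {"PoP1": 2, "PoP2": 1}

for name, state in effective_sla(members, thresholds).items():
    print(name, "in SLA" if state else "out of SLA")
```

With these inputs, H2_T22 passes its own health check but is reported as out of SLA because only one PoP1 member is in SLA.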

To verify the configuration:
  1. Verify the health check status on Spoke-1. Three of the PoP1 overlays (H1_T11, H1_T22, and H2_T11) are out of SLA:

    Spoke-1 (root) # diagnose sys sdwan health-check
    Health Check(Hubs):
    Seq(1 H1_T11): state(alive), packet-loss(0.000%) latency(220.219), jitter(0.019), mos(4.103), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x0
    Seq(2 H1_T22): state(alive), packet-loss(0.000%) latency(220.184), jitter(0.008), mos(4.104), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x0
    Seq(3 H2_T11): state(alive), packet-loss(0.000%) latency(220.171), jitter(0.009), mos(4.104), bandwidth-up(999998), bandwidth-dw(999997), bandwidth-bi(1999995) sla_map=0x0
    Seq(4 H2_T22): state(alive), packet-loss(0.000%) latency(0.180), jitter(0.013), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(5 H3_T11): state(alive), packet-loss(0.000%) latency(0.174), jitter(0.014), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(6 H3_T22): state(alive), packet-loss(0.000%) latency(0.179), jitter(0.015), mos(4.404), bandwidth-up(999999), bandwidth-dw(999998), bandwidth-bi(1999997) sla_map=0x1
    
  2. Verify the SD-WAN neighbor status. The output shows both the Hub-1 neighbor (172.31.0.1) and the Hub-2 neighbor (172.31.0.2) as sla-fail:

    Spoke-1 (root) # diagnose sys sdwan neighbor
    SD-WAN neighbor status: hold-down(disable), hold-down-time(0), hold_boot_time(0)
            Selected role(standalone) last_secondary_select_time/current_time in seconds 0/436605
    Neighbor(172.31.0.1): member(1 2)role(standalone)
            Health-check(:0)  sla-fail alive
    Neighbor(172.31.0.2): member(3 4)role(standalone)
            Health-check(:0)  sla-fail alive
    Neighbor(172.31.0.129): member(5 6)role(standalone)
            Health-check(:0)  sla-pass selected alive
  3. Verify the SD-WAN service rules status. Since the minimum number of in SLA members is not met for the primary zone (PoP1), the remaining overlay in PoP1 associated with the SD-WAN service rule is forcibly set to out of SLA. Spoke-1 steers traffic to the H3_T11 overlay through Hub-3:

    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(6), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 365341 second, now 365341
      Service role: standalone
      Members(6):
        1: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        2: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        3: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        5: Seq_num(3 H2_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(4 H2_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
      Src address(1):
            10.0.0.0-10.255.255.255
      Dst address(1):
            0.0.0.0-255.255.255.255
    
  4. Verify the BGP learned routes on each hub. Hub-3 continues to receive community 10:1 from Spoke-1, but Hub-1 and Hub-2 receive the out of SLA community of 10:2.

    1. On Hub-1:

      PoP1-Hub1 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:2
            Last update: Mon Jul 17 18:22:14 2023
      
    2. On Hub-2:

      PoP1-Hub2 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:2
            Last update: Mon Jul 17 18:37:53 2023
      
    3. On Hub-3:

      PoP2-Hub3 (root) # get router info bgp network 10.0.3.0/24
      VRF 0 BGP routing table entry for 10.0.3.0/24
      Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Not advertised to any peer
        Original VRF 0
        Local, (Received from a RR-client)
          172.31.0.65 from 172.31.0.65 (172.31.0.65)
            Origin IGP metric 0, localpref 100, valid, internal, best
            Community: 10:1
            Last update: Mon Jul 17 14:39:04 2023
      
  5. Send traffic from a host behind Spoke-1 to 172.31.200.200.

  6. Run a sniffer trace on Spoke-1. Traffic leaves and returns on the H3_T11 overlay:

    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    38.501449 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    38.501519 H3_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    38.501818 H3_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    38.501845 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
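
The community check in step 4 can be scripted when several hubs need to be compared. The following is a minimal Python sketch that reads the `Community:` line from the `get router info bgp network` output. The helper names and the `COMMUNITY_MEANING` mapping are hypothetical; the mapping reflects how this example uses 10:1 for in SLA and 10:2 for out of SLA.

```python
import re

# Output copied from Hub-2 in step 4 of this test case.
BGP_OUTPUT = """\
VRF 0 BGP routing table entry for 10.0.3.0/24
Paths: (1 available, best #1, table Default-IP-Routing-Table)
  Not advertised to any peer
  Original VRF 0
  Local, (Received from a RR-client)
    172.31.0.65 from 172.31.0.65 (172.31.0.65)
      Origin IGP metric 0, localpref 100, valid, internal, best
      Community: 10:2
      Last update: Mon Jul 17 18:37:53 2023
"""

# Meaning of the communities as used in this example only.
COMMUNITY_MEANING = {"10:1": "in SLA", "10:2": "out of SLA"}


def communities(text):
    """Return the list of communities on the first Community: line, if any."""
    m = re.search(r"^\s*Community:\s*(.+)$", text, re.MULTILINE)
    return m.group(1).split() if m else []


def sla_signal(text):
    """Translate the received community into the SLA state it signals."""
    for community in communities(text):
        if community in COMMUNITY_MEANING:
            return COMMUNITY_MEANING[community]
    return "unknown"


print(communities(BGP_OUTPUT), sla_signal(BGP_OUTPUT))
```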
    

Test case 5: an SD-WAN member on PoP1 recovers

SD-WAN member H2_T11 recovers, which brings the number of in SLA overlays in PoP1 back up to the minimum-sla-meet-members threshold. After the hold down time (30 seconds) elapses, in SLA overlays in zone PoP1 are preferred over PoP2 again. With sla-stickiness enabled, existing traffic is kept on H3_T11, but new traffic is steered to H2_T11.
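
The time left on the hold down timer can be estimated from the `Hold down time` line in the `diagnose sys sdwan service` output. The following Python sketch assumes that the timer expires when `now` reaches `Hold start` plus the hold down time; this interpretation is inferred from the outputs in this test case, and the function name is hypothetical.

```python
import re

HOLD_RE = re.compile(
    r"Hold down time\((\d+)\) seconds, Hold start at (\d+) second, now (\d+)"
)


def hold_down_remaining(line):
    """Return the seconds left before the hold down timer expires (never negative)."""
    m = HOLD_RE.search(line)
    if m is None:
        raise ValueError("not a hold down line")
    hold, start, now = map(int, m.groups())
    # Assumption: the timer runs from 'start' for 'hold' seconds.
    return max(0, hold - (now - start))


# Line copied from the first service output in this test case.
line = "Hold down time(30) seconds, Hold start at 431972 second, now 432000"
print(hold_down_remaining(line))  # 2
```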

To verify the configuration:
  1. Verify the SD-WAN service rules status on Spoke-1. The hold down timer has not yet expired, so H2_T11 is not yet preferred, even though its SLA status is pass/alive:

    Spoke-1 (root) # diagnose sys sdwan service
    
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(16), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 431972 second, now 432000
      Service role: standalone
      Members(6):
        1: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        2: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        3: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        4: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        5: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
    
  2. Verify the SD-WAN service rules status again after the hold down timer passes. H2_T11 and H2_T22 from PoP1 are now preferred:

    Spoke-1 (root) # diagnose sys sdwan service
    Service(1): Address Mode(IPV4) flags=0x1c200 use-shortcut-sla use-shortcut sla-stickiness
     Tie break: cfg
      Gen(17), TOS(0x0/0x0), Protocol(0): src(1->65535):dst(1->65535), Mode(sla), sla-compare-order
    Hold down time(30) seconds, Hold start at 432003 second, now 432003
      Service role: standalone
      Members(6):
        1: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        2: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
        3: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        4: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
        5: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
        6: Seq_num(2 H1_T22 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
    
  3. Verify the BGP learned routes on Hub-2, which now receives community 10:1 from Spoke-1:

    PoP1-Hub2 (root) # get router info bgp network 10.0.3.0/24
    VRF 0 BGP routing table entry for 10.0.3.0/24
    Paths: (1 available, best #1, table Default-IP-Routing-Table)
      Not advertised to any peer
      Original VRF 0
      Local, (Received from a RR-client)
        172.31.0.65 from 172.31.0.65 (172.31.0.65)
          Origin IGP metric 0, localpref 100, valid, internal, best
          Community: 10:1
          Last update: Tue Jul 18 14:41:32 2023
    
  4. Send traffic from a host behind Spoke-1 to 172.31.200.200.

  5. Run a sniffer trace on Spoke-1. Because of sla-stickiness, the existing traffic is kept on H3_T11:

    Spoke-1 (root) # diagnose sniffer packet any 'host 172.31.200.200' 4
    interfaces=[any]
    filters=[host 172.31.200.200]
    
    0.202708 port4 in 10.0.3.2 -> 172.31.200.200: icmp: echo request
    0.202724 H3_T11 out 10.0.3.2 -> 172.31.200.200: icmp: echo request
    0.202911 H3_T11 in 172.31.200.200 -> 10.0.3.2: icmp: echo reply
    0.202934 port4 out 172.31.200.200 -> 10.0.3.2: icmp: echo reply
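
The change in member order between steps 1 and 2 can be compared with a short script. The following is a minimal Python sketch that parses the `Members` lines of the `diagnose sys sdwan service` output. The names are hypothetical, and the sketch assumes that the first listed member is the one currently preferred, which matches the outputs in this section.

```python
import re

# Member lines copied from step 1 (before the hold down timer expires).
BEFORE = """\
    1: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
    2: Seq_num(6 H3_T22 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
    3: Seq_num(1 H1_T11 PoP1), alive, sla(0x0), gid(0), cfg_order(0), local cost(0), selected
"""

# Member lines copied from step 2 (after the hold down timer expires).
AFTER = """\
    1: Seq_num(3 H2_T11 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
    2: Seq_num(4 H2_T22 PoP1), alive, sla(0x1), gid(0), cfg_order(0), local cost(0), selected
    3: Seq_num(5 H3_T11 PoP2), alive, sla(0x1), gid(0), cfg_order(1), local cost(0), selected
"""

MEMBER_RE = re.compile(
    r"^\s*(?P<rank>\d+): Seq_num\((?P<seq>\d+) (?P<name>\S+) (?P<zone>\S+)\), "
    r"(?P<state>\w+), sla\((?P<sla>0x[0-9a-fA-F]+)\), gid\(\d+\), "
    r"cfg_order\((?P<cfg_order>\d+)\)"
)


def parse_members(text):
    """Return the members in listed order as dicts."""
    members = []
    for line in text.splitlines():
        m = MEMBER_RE.match(line)
        if m:
            members.append({
                "rank": int(m["rank"]),
                "name": m["name"],
                "zone": m["zone"],
                "in_sla": int(m["sla"], 16) != 0,
                "cfg_order": int(m["cfg_order"]),
            })
    return members


for label, text in (("before", BEFORE), ("after", AFTER)):
    first = parse_members(text)[0]  # assumed to be the preferred member
    print(label, first["name"], first["zone"])
```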
    

Test case 6: Hub-1 has an in SLA path to external peers

Since Hub-1 has an in SLA path to external peers, it will advertise the external route with destination 172.31.200.200/32 to Spoke-1.

To verify the configuration:
  1. Verify the health check status on Hub-1. Note that port4 meets SLA, but port5 does not:

    PoP1-Hub1 (root) # diagnose sys sdwan health-check
    Health Check(external_peers):
    Seq(1 port4): state(alive), packet-loss(0.000%) latency(0.161), jitter(0.009), mos(4.404), bandwidth-up(999999), bandwidth-dw(999999), bandwidth-bi(1999998) sla_map=0x1
    Seq(2 port5): state(dead), packet-loss(100.000%) sla_map=0x0
    
  2. Verify the SD-WAN neighbor status. The minimum-sla-meet-members threshold of 1 is still met:

    PoP1-Hub1 (root) # diagnose sys sdwan neighbor
    Neighbor(EDGE): member(1 2)role(standalone)
            Health-check(external_peers:1)  sla-pass selected alive
  3. Verify the BGP advertised routes. Hub-1 still advertises the external route to the Spoke-1 BGP neighbor:

    PoP1-Hub1 (root) # get router info bgp neighbors 172.31.0.65 advertised-routes
    VRF 0 BGP table version is 13, local router ID is 172.31.0.1
    Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
    Origin codes: i - IGP, e - EGP, ? - incomplete
       Network                 Next Hop      Metric     LocPrf  Weight  RouteTag Path
    *>i172.31.200.200/32     172.31.0.1     100        32768        0          i <-/->
    Total number of prefixes 1
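
The neighbor status in step 2 can be parsed in the same way as the other diagnostic outputs. The following is a minimal Python sketch for the `diagnose sys sdwan neighbor` output. The function name and dictionary keys are hypothetical. The first sample is copied from step 2 above and the second from the Spoke-1 output in test case 4.

```python
import re

HUB1 = """\
Neighbor(EDGE): member(1 2)role(standalone)
        Health-check(external_peers:1)  sla-pass selected alive
"""

SPOKE1 = """\
Neighbor(172.31.0.1): member(1 2)role(standalone)
        Health-check(:0)  sla-fail alive
Neighbor(172.31.0.2): member(3 4)role(standalone)
        Health-check(:0)  sla-fail alive
Neighbor(172.31.0.129): member(5 6)role(standalone)
        Health-check(:0)  sla-pass selected alive
"""

NEIGHBOR_RE = re.compile(r"Neighbor\((?P<name>[^)]+)\): member\((?P<members>[\d ]+)\)")
HC_RE = re.compile(r"Health-check\((?P<hc>[^)]*)\)\s+(?P<flags>.+)")


def parse_neighbors(text):
    """Pair each Neighbor(...) line with the Health-check(...) line that follows it."""
    neighbors = []
    current = None
    for line in text.splitlines():
        n = NEIGHBOR_RE.search(line)
        if n:
            current = {
                "neighbor": n["name"],
                "members": [int(x) for x in n["members"].split()],
                "health_check": None,
                "flags": [],
            }
            neighbors.append(current)
            continue
        h = HC_RE.search(line)
        if h and current is not None:
            current["health_check"] = h["hc"]
            current["flags"] = h["flags"].split()
    return neighbors


for n in parse_neighbors(HUB1) + parse_neighbors(SPOKE1):
    print(n["neighbor"], n["members"], "sla-pass" in n["flags"])
```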
    

Test case 7: all external peers on Hub-1 are out of SLA

In this case, Hub-1 now applies the default route map, which denies the advertisement of the external route. Spoke-1 then routes traffic through the next hub.

To verify the configuration:
  1. Verify the health check status on Hub-1. Note that port4 and port5 do not meet SLA:

    PoP1-Hub1 (root) # diagnose sys sdwan health-check
    Health Check(external_peers):
    Seq(1 port4): state(dead), packet-loss(100.000%) sla_map=0x0
    Seq(2 port5): state(dead), packet-loss(100.000%) sla_map=0x0
  2. Verify the SD-WAN neighbor status. The minimum-sla-meet-members threshold of 1 is not met:

    PoP1-Hub1 (root) # diagnose sys sdwan neighbor
    Neighbor(EDGE): member(1 2)role(standalone)
            Health-check(external_peers:1)  sla-fail dead
  3. Verify the BGP advertised routes. Hub-1 does not advertise any external routes to the Spoke-1 BGP neighbor:

    PoP1-Hub1 (root) # get router info bgp neighbors 172.31.0.65 advertised-routes
    % No prefix for neighbor 172.31.0.65
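
The results of test cases 6 and 7 can be compared by extracting the advertised prefixes from each output. The following is a minimal Python sketch; the function name `advertised_prefixes` is hypothetical, and the pattern only matches IPv4 route lines that start with the `*` (valid) status code, as in the outputs above.

```python
import re

# Output copied from test case 6, step 3.
CASE6 = """\
VRF 0 BGP table version is 13, local router ID is 172.31.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network                 Next Hop      Metric     LocPrf  Weight  RouteTag Path
*>i172.31.200.200/32     172.31.0.1     100        32768        0          i <-/->
Total number of prefixes 1
"""

# Output copied from test case 7, step 3.
CASE7 = "% No prefix for neighbor 172.31.0.65\n"

# A route line starts with '*', optional status codes, then the prefix.
ROUTE_RE = re.compile(r"^\s*\*[>sdhi ]*(\d+\.\d+\.\d+\.\d+/\d+)", re.MULTILINE)


def advertised_prefixes(text):
    """Return the advertised IPv4 prefixes; an empty list if none are listed."""
    return ROUTE_RE.findall(text)


print(advertised_prefixes(CASE6))
print(advertised_prefixes(CASE7))
```

The external route 172.31.200.200/32 appears only in the first result, which corresponds to Hub-1 withdrawing the route once all external peers are out of SLA.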