In a traditional WAN, an application's network path is typically determined by a routing protocol such as BGP, OSPF, IS-IS, or EIGRP. These legacy routing protocols have been around for many years and have become very good at detecting network blackout conditions such as direct and indirect failures of WAN circuits. When calculating the network path for given application traffic, they typically consider the destination prefix, routing metric, and link-state information. However, these legacy routing protocols were not designed to detect brownout conditions, or soft failures, in the network, such as performance degradation of transport circuits. As network engineers, we have all seen many times that packet loss, latency, and jitter can suddenly appear on any WAN circuit, especially on low-cost Internet links.
An Application-Aware Routing (App-route) policy allows the SD-WAN fabric to monitor the quality of network paths in real time and perform SLA-driven routing across the best available transports. App-route policies rely on BFD to detect performance issues such as packet loss, latency, and jitter on the overlay tunnels. If BFD reports that an IPsec tunnel's characteristics have become worse than the defined SLA class for a given application, the App-route policy automatically moves the matched application traffic off the network path that is experiencing performance degradation. Once the network recovers from the brownout condition, the App-route policy automatically returns the application traffic to the original path. Figure 1 shows where an App-route policy fits into the Cisco SD-WAN policy framework.
The Application-Aware routing policies allow WAN Edge routers to consider the performance characteristics of the available overlay tunnels when making network path decisions for given application traffic.
The Structure of an App-route Policy
An Application-Aware Routing (App-route) policy is a special type of centralized data policy, and it follows much the same provisioning and activation process. The main difference is that App-route policies are always applied in the “from-service” direction because they are used to determine which egress overlay tunnel given traffic should be sent through.
Figure 2 illustrates the mechanics of an Application-Aware Routing policy in four key steps:
Step 1 - Provisioning an App-route policy on vManage: The structure of an App-route policy is very similar to a centralized data policy. First, we define the required lists that we will later invoke in the match-action rules. In the case of an application-aware routing policy, there is a special type of list called sla-class that defines the maximum packet loss, latency, and jitter for a data plane tunnel. The basic building blocks of an application-aware routing policy are as follows:
- VPN-list that specifies which VPN IDs the policy will affect; an App-route policy is always applied to a vpn-list;
- App-list that matches the applications of interest. We can use DPI signatures or, alternatively, any other list type (e.g., prefix-list) or value (e.g., DSCP) that matches the applications of interest based on a 6-tuple filter;
- App-probe-class that specifies the DSCP value BFD uses when probing the WAN (optional);
- SLA-class that establishes the maximum packet loss, latency, and jitter for the applications;
- The match-action rules of the App-route policy itself;
- Site-list that specifies which sites will receive the app-route policy;
The exact order in which we configure each of these components is irrelevant to vManage's GUI or vSmart's CLI parser. However, from the network administrator's point of view, a logical order would be to first define all lists that will later be invoked in the match-action rules of the application-aware routing policy. Once all the necessary lists are specified, the logic of the App-route policy itself would be as follows:
app-route-policy {name}
 vpn-list {name}
  sequence {number}
   match
    <applications based on L3/L4 header info or DPI signatures>
   !
   action
    <choose outgoing tunnel based on SLA-class>
   !
  !
 !
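To make the skeleton concrete, a minimal end-to-end configuration might look like the sketch below. The list names, VPN/site IDs, thresholds, and colors here are illustrative assumptions, not values taken from the text, and the exact syntax should be verified against the vSmart policy configuration guide for your software release:

```
policy
 sla-class VOICE-SLA
  loss    1
  latency 100
  jitter  50
 !
 lists
  vpn-list CORP-VPN
   vpn 10
  !
  app-list VOICE-APPS
   app-family audio-video
  !
  site-list BRANCHES
   site-id 100-199
  !
 !
 app-route-policy CORP-APP-ROUTE
  vpn-list CORP-VPN
   sequence 10
    match
     app-list VOICE-APPS
    !
    action
     sla-class VOICE-SLA preferred-color mpls
    !
   !
  !
 !
!
apply-policy
 site-list BRANCHES
  app-route-policy CORP-APP-ROUTE
 !
!
```

Note how the policy is applied to a site-list under apply-policy, while the SLA decision itself is made per sequence inside the app-route-policy.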
Step 2 - Pushing the policy to vEdges: Once the App-route policy is activated through the GUI, vManage pushes it to the vSmart controller via NETCONF, and the policy becomes part of vSmart's running configuration. The vSmart controller then encodes the App-route policy parameters into OMP updates and sends them down to all WAN edge devices matched by the applied site-list;
Step 3 - Monitoring and measuring the characteristics of overlay tunnels: The affected vEdges receive the policy in a read-only fashion via OMP and then execute it in memory. The next step in the process is to measure the performance characteristics of the overlay tunnels and determine which tunnels are in compliance with the defined SLA in the received sla-class list. This performance information is collected from BFD, which runs on top of each overlay tunnel in the SD-WAN fabric;
Step 4 - Mapping app traffic to overlay tunnels: The Application-Aware Routing policy allows vEdges to consider the tunnel performance characteristics that have been determined by BFD in the path selection process. WAN edge routers evaluate the performance information of the tunnels against the defined SLA classes and determine which overlay tunnels are in compliance. The vEdges then make their forwarding decisions with respect to these SLA compliance states.
BFD
In Cisco SD-WAN, once an overlay data tunnel is established, Bidirectional Forwarding Detection (BFD) is automatically started on top of it and cannot be disabled. It runs on all data tunnels between vEdges across all transports because multiple advanced features depend on it. vEdges use BFD to detect the link-state of tunnels and to measure real-time performance characteristics such as latency, jitter, and packet loss. Application-Aware Routing policies depend on BFD to provide real-time performance characteristics for every tunnel.
BFD Hello & BFD Multiplier
Cisco SD-WAN utilizes the Bi-directional Forwarding Detection (BFD) protocol in echo mode. A vEdge router tests an overlay tunnel by sending BFD probes to the remote vEdge without involving the remote system. The remote vEdge returns the BFD probes through the data plane without processing them.
In the context of Echo mode, the hello-interval determines how frequently a vEdge router sends BFD probes over a given overlay tunnel. The Multiplier value determines how many consecutive BFD probes must be lost before a vEdge router declares a tunnel as down, as illustrated in figure 3 below.
The BFD defaults in Cisco SD-WAN are a hello-interval of 1000 ms, a multiplier of 7, and a DSCP value of 48. However, all BFD settings are configurable per color on a per-vEdge basis, as shown below:
bfd color [color]
hello-interval [100..300000]
multiplier [1..60]
dscp [0..63]
!
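Since the multiplier counts consecutive lost probes and each probe is sent every hello-interval, the worst-case time to declare a hard tunnel failure can be approximated as the product of the two. A small sketch of this arithmetic (the function name is ours, not a Cisco API):

```python
def bfd_detection_time_ms(hello_interval_ms: int, multiplier: int) -> int:
    """Approximate worst-case time for BFD to declare a tunnel down:
    `multiplier` consecutive probes, one every `hello_interval_ms`."""
    return hello_interval_ms * multiplier

# With the Cisco SD-WAN defaults (hello-interval 1000 ms, multiplier 7),
# a blackout on a tunnel is detected in roughly 7 seconds.
print(bfd_detection_time_ms(1000, 7))   # 7000

# Relaxed timers (e.g., hello-interval 3000 ms, multiplier 10)
# stretch detection to about 30 seconds.
print(bfd_detection_time_ms(3000, 10))  # 30000
```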
This raises an interesting question, though: what happens if a tunnel is established between different colors with different BFD settings? For example, biz-internet color with default settings establishes a tunnel to lte color with BFD hello-interval of 3000 msec and multiplier of 10. In such cases, the vEdge routers on both ends negotiate the greater of the two configured values as shown in the output below:
vEdge# show bfd sessions | t
SITE DETECT TX
SRC IP DST IP PROTO ID LOCAL COLOR COLOR STATE MULTIPLIER INTERVAL
----------------------------------------------------------------------------------
39.3.0.2 39.3.0.1 ipsec 1 biz-internet biz-internet up 7 1000
39.3.0.2 39.3.0.3 ipsec 1 biz-internet lte up 10 3000
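The negotiation shown in the output above can be summarized as a per-parameter maximum. This is a simplified model of the behavior described in the text, not Cisco code:

```python
def negotiate_bfd(local: dict, remote: dict) -> dict:
    """Per the behavior above, each BFD parameter on a tunnel between two
    differently configured colors settles on the greater configured value."""
    return {param: max(local[param], remote[param]) for param in local}

biz_internet = {"hello_interval_ms": 1000, "multiplier": 7}   # defaults
lte = {"hello_interval_ms": 3000, "multiplier": 10}

print(negotiate_bfd(biz_internet, lte))
# {'hello_interval_ms': 3000, 'multiplier': 10}
```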
Monitoring Tunnel Performance
App-route Poll Interval
The same BFD probes that are used to determine the link-state of an overlay tunnel are also used to measure the tunnel's performance characteristics. WAN edge routers collect packet loss, latency, and jitter information for every BFD probe. This information is accumulated over a period of time called the Poll Interval, which is 10 minutes by default. Since the default BFD hello probe is sent every second, one poll interval accumulates information from 600 probes, from which the average packet loss, latency, and jitter for an overlay tunnel are calculated.
The poll interval value (in milliseconds) is configurable per WAN edge router using the following command:
vEdge(config)# bfd app-route poll-interval [1..4294967295]
However, as a general rule of thumb, when we reduce the poll-interval we should also adjust the BFD hello timers so that the performance statistics are calculated from at least 500 probes. Otherwise, we can fall into the following situation: if we configure the poll interval to be 1 minute, the performance of a tunnel will be measured over only 60 probes. This means that a single lost probe results in 1.66% packet loss (1/60 × 100), which can be enough to trigger the app-route policy to move particular application traffic off a tunnel. Keeping in mind that in production environments these types of changes are typically deployed using templates, a disproportionate poll-interval/BFD hello-interval ratio can create traffic instabilities at scale. If we reduce the poll-interval to 1 minute, we should also reduce the BFD hello-interval to 100 milliseconds so that the performance of a tunnel is again measured over 600 probes.
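The probe-count arithmetic above can be sketched in a few lines (the function names are ours, purely for illustration):

```python
def probes_per_poll(poll_interval_ms: int, hello_interval_ms: int) -> int:
    """Number of BFD probes that feed one app-route poll interval."""
    return poll_interval_ms // hello_interval_ms

def loss_per_probe_pct(probes: int) -> float:
    """Measured packet-loss granularity: what one lost probe represents."""
    return round(100 / probes, 2)

# Defaults: 10-minute poll interval, 1-second hellos -> 600 probes,
# so one lost probe registers as only ~0.17% loss.
print(probes_per_poll(600_000, 1000))  # 600
print(loss_per_probe_pct(600))         # 0.17

# A 1-minute poll interval with unchanged hellos leaves 60 probes,
# so one lost probe now registers as 1.67% loss.
print(probes_per_poll(60_000, 1000))   # 60
print(loss_per_probe_pct(60))          # 1.67

# Dropping the hello-interval to 100 ms restores 600 probes per minute.
print(probes_per_poll(60_000, 100))    # 600
```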
App-route Multiplier
The packet loss, latency, and jitter information from each poll-interval is preserved for a number of poll intervals specified by the App-route Multiplier value. The default value is 6, which means that 60-minute averages (6 x 10min poll-interval) for loss, latency, and jitter are compared against the SLA thresholds defined in configured sla-class before an out-of-threshold forwarding decision is made. The App-route Multiplier value is configurable per WAN edge router using the following command:
vEdge(config)# bfd app-route multiplier [1..6]
By default, the app-route multiplier defines a sliding window of 6 poll-intervals. This means that when the seventh poll interval is measured, the information from the earliest poll interval (the first) is discarded to make way for the latest information (the seventh) and another comparison against the SLA thresholds is made using the newest performance data. This process is rolling, which means that the sliding window is moved ahead with each new poll interval.
Let’s look at an example of the process of detecting an out-of-threshold condition and see how much time does the SD-WAN fabric need to recognize performance degradation. Figure 5 illustrates a scenario where the latency suddenly increases from 20 ms to 200 ms at the beginning of poll interval 6. However, while the sliding window calculates the averages between poll-intervals 1 through 6, the average latency is well below the threshold of 100 ms ((20+20+20+20+20+200)/6=50). When the sliding window moves to poll interval 7, the new average latency becomes 80ms ((20+20+20+20+200+200)/6). Only when the sliding window moves to poll interval 8, the latency average over 6 intervals jumps over the configured SLA threshold of 100ms ((20+20+20+200+200+200)/6 = 110ms).
Therefore, when the latency suddenly jumps to 200 ms at the beginning of poll interval 6, the App-route policy needs three poll intervals (6, 7, and 8) to detect the out-of-threshold condition. Using the default poll time of 10 minutes, the application-aware routing policy would take 30 minutes to move the traffic off the tunnel that is experiencing the latency jump from 20 to 200 ms. One solution to this problem is to lower the app-route multiplier value to 2, which shortens the sliding window, as shown in figure 6.
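The sliding-window calculation from the scenario above can be reproduced with a short simulation (a simplified model of the averaging, not vEdge code):

```python
def window_averages(samples: list, multiplier: int) -> list:
    """Average of each full sliding window of `multiplier` poll intervals."""
    return [
        sum(samples[i - multiplier:i]) / multiplier
        for i in range(multiplier, len(samples) + 1)
    ]

# Latency per poll interval: 20 ms for intervals 1-5, 200 ms from interval 6.
latency = [20, 20, 20, 20, 20, 200, 200, 200]

# With the default multiplier of 6, the 100 ms SLA threshold is only
# crossed once the window ends at poll interval 8.
print(window_averages(latency, multiplier=6))
# [50.0, 80.0, 110.0]

# With the multiplier lowered to 2, the same degradation already breaches
# the threshold when the window ends at poll interval 6 (average 110 ms).
print(window_averages(latency, multiplier=2))
# [20.0, 20.0, 20.0, 20.0, 110.0, 200.0, 200.0]
```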