The Science of Transport Performance – MX SD-WAN Best Practices – Cisco Meraki

Meraki MX security appliances use UDP probes of approximately 100 bytes to continuously monitor performance across all available transport paths. These probes are unidirectional, so an MX high-availability pair with dual WAN links has multiple probes traversing each available WAN uplink, including cellular uplinks. The default probe interval is 1 second. For very large deployments with thousands of Auto VPN nodes, the interval is automatically increased, up to 10 seconds, to avoid overloading devices with probe traffic and the associated monitoring overhead. Although the Meraki support team can be engaged to change the probe interval manually, the Meraki way is to let the automation handle it. Automation is a key factor that has allowed Meraki SD-WAN to be simplified and scaled for any industry.
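To see why the probe interval matters at scale, the following rough sketch estimates aggregate probe traffic from the figures above. The function name and the path count of 4,000 are hypothetical, chosen only for illustration. The 100-byte probe size is the approximate figure from the text, and the estimate assumes that size covers the whole packet.

```python
def probe_overhead_bps(num_paths: int, probe_bytes: int = 100,
                       interval_s: float = 1.0) -> float:
    """Approximate probe traffic, in bits per second, across all monitored paths.

    Assumes one probe of `probe_bytes` bytes per path every `interval_s` seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return num_paths * probe_bytes * 8 / interval_s


if __name__ == "__main__":
    paths = 4000  # hypothetical number of monitored paths on a large hub
    for interval in (1.0, 10.0):
        kbps = probe_overhead_bps(paths, interval_s=interval) / 1000
        print(f"{paths} paths at a {interval:g}-second interval: {kbps:g} kbps")
```

Under these assumptions, stretching the interval from 1 second to 10 seconds cuts probe traffic by a factor of ten, which is the trade-off the automatic adjustment makes for very large deployments.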

All Meraki devices synchronize their local Network Time Protocol (NTP) clock with the Meraki cloud over the management plane. MX security appliances rely on this synchronization to calculate the latency on each transport path from timestamps carried in the data payload of the probes. The timestamps and the remaining data in the probe response packets are used to calculate the round-trip time (RTT), latency, jitter, and loss, all of which feed into the Mean Opinion Score (MOS) of a transport path. The MOS is a commonly used general performance metric that gives a quick measure of the average link quality of a transport path based on these metrics. The MOS is calculated as a value between 1 and 5, with a higher value representing a better-quality path.
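The calculation described above can be sketched as follows. This is an illustration only and not Meraki's actual algorithm, which the text does not specify. The `Probe` record and both function names are hypothetical. Jitter is simplified to the mean difference between consecutive latencies. The MOS estimate uses a commonly circulated simplification of the ITU-T G.107 E-model, whose constants are assumptions here and whose output tops out at about 4.5.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Probe:
    """Hypothetical probe record; timestamps come from NTP-synchronized clocks."""
    seq: int
    sent_ms: float
    recv_ms: Optional[float]  # None means the probe was lost


def path_metrics(probes: List[Probe]) -> dict:
    """Derive one-way latency, jitter, and loss from probe timestamps."""
    latencies = [p.recv_ms - p.sent_ms for p in probes if p.recv_ms is not None]
    loss_pct = 100.0 * (1 - len(latencies) / len(probes))
    # Simplified jitter: mean absolute change between consecutive latencies.
    deltas = [abs(b - a) for a, b in zip(latencies, latencies[1:])]
    return {
        "latency_ms": mean(latencies) if latencies else None,
        "jitter_ms": mean(deltas) if deltas else 0.0,
        "loss_pct": loss_pct,
    }


def estimate_mos(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Simplified E-model style estimate (assumed constants, not Meraki's formula)."""
    effective = latency_ms + 2.0 * jitter_ms + 10.0
    if effective < 160.0:
        r = 93.2 - effective / 40.0
    else:
        r = 93.2 - (effective - 120.0) / 10.0
    r -= 2.5 * loss_pct
    r = max(0.0, min(100.0, r))  # keep the R-factor within its defined range
    mos = 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)
    return max(1.0, mos)


if __name__ == "__main__":
    probes = [Probe(0, 0, 20), Probe(1, 1000, 1030),
              Probe(2, 2000, None), Probe(3, 3000, 3020)]
    m = path_metrics(probes)
    print(m, round(estimate_mos(m["latency_ms"], m["jitter_ms"], m["loss_pct"]), 2))
```

The point of the sketch is the shape of the computation: synchronized clocks make per-probe latency a simple subtraction, and the remaining metrics are derived from the resulting series before being combined into a single score.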

Meraki aggregates probe data over 30-second periods for monitoring and reporting SD-WAN metrics. In a typical network, this provides 30 data points (one probe per second) for determining link quality. For extremely large Auto VPN deployments where a 10-second probe interval is used, each 30-second period still contains three probes per uplink path from which to calculate the network performance metrics. This probe logic is optimized to trigger failover between transport paths as quickly as possible, and it can reach sub-second failover times in some scenarios. As a result, Meraki devices with proper SD-WAN implementations can provide high levels of reliability and connectivity for critical traffic across multiple transport circuits with minimal configuration.
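The window arithmetic above can be checked with a short sketch. The helper names and the sample format, a list of (time in seconds, latency in ms) pairs, are hypothetical. Only the 30-second window and the 1-second and 10-second intervals come from the text.

```python
from collections import defaultdict
from statistics import mean

WINDOW_S = 30  # aggregation period described in the text


def samples_per_window(probe_interval_s: float, window_s: float = WINDOW_S) -> int:
    """Number of probes that fall into one aggregation window."""
    return int(window_s // probe_interval_s)


def aggregate(samples, window_s: float = WINDOW_S) -> dict:
    """Group (time_s, latency_ms) samples into fixed windows and summarize each."""
    buckets = defaultdict(list)
    for t, latency_ms in samples:
        buckets[int(t // window_s)].append(latency_ms)
    return {
        window: {"count": len(values), "mean_latency_ms": mean(values)}
        for window, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    print(samples_per_window(1), samples_per_window(10))
    # One minute of probes at a 10-second interval, all at 20 ms latency.
    print(aggregate([(t, 20.0) for t in range(0, 60, 10)]))
```

With a 10-second interval, each window summarizes only three samples, so a single lost or delayed probe shifts the aggregate far more than it would at the default 1-second interval.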

Table 6-1 shows the various services and their expected failover and failback times for Meraki devices. The main point to note is the sub-second performance of dynamic path selection (DPS) compared to the other services. That is where the value of SD-WAN is realized.

Table 6-1 Meraki Service Failover Times

Note that the DPS time is the only SD-WAN–based failover time listed in Table 6-1 and that the true failover time depends heavily on the policy type and configuration. In the vast majority of scenarios, failover occurs in 1 to 3 seconds, but with proper policy configurations, dynamic path failover can take less than 500 ms. In the case of a complete circuit failure, the time to fail over to a secondary path is nearly instantaneous, at less than 100 ms. Additionally, the 300-second time listed for general WAN connectivity failover is an absolute worst-case scenario for a device experiencing intermittent WAN degradation. It is not an SD-WAN failover time, despite often being touted as such by competitors.

Pro Tip

You can find the MOS and the other measured performance metrics by navigating to the Security & SD-WAN > VPN Status page for the network of one peer and then selecting the entry for the related VPN peer in the Site-to-Site Peers table.