IN PRACTICE: Detecting ‘problem device’ at SD-WAN provider’s PoP
by December 4, 2018

IN PRACTICE | This article is part of a series of posts sharing examples of how AppNeta users have leveraged the service to solve performance problems.

Continuous, real-time visibility into the condition of the core network is crucial to ensure consistent performance for end users. Without network visibility, problems may go undetected until they are severe enough to negatively impact end users. When issues are finally reported, a lack of visibility will make diagnosing the cause that much harder. Avoiding this outcome is one of the main reasons customers implement AppNeta Performance Manager.

To understand network performance and mitigate issues, we need to:

  1. Gain real-time visibility into the network’s performance
  2. Observe performance over time and identify anomalous behavior
  3. Drill in to narrow down the root cause
  4. Compare results against other paths to validate and build a convincing case
  5. Provide evidence to the team responsible for the culprit device
  6. Compare performance before and after the fix to verify improved performance

Let’s take a look at an example of how AppNeta recently worked with an SD-WAN provider to identify an issue at one of their Points of Presence (PoPs) and the specific device causing a bottleneck.

1. Gain visibility

AppNeta Monitoring Points were deployed into each of the SD-WAN provider’s PoPs in a network stack that routed their end customer traffic through a public interface. We then monitored the performance of the network segment commonly used by customers to access the SD-WAN service by creating network paths in Delivery from the PoP’s monitoring point to the external IP interface.

2. Observe behavior

AppNeta accurately measures a network path’s total and utilized capacity without impacting network performance. At one of the provider’s PoPs with a 10 Gbps link, AppNeta Performance Manager consistently reported less than 500 Mbps of total capacity over several days, along with extreme spikes in round-trip time (RTT) during peak business hours.

The current path route has a very high RTT of 54 ms. Based on proximity to the target, it should be sub-millisecond.

Total capacity of the 10 Gbps link is measured at less than 500 Mbps over several days. Vertical lines are annotations marking alert profile violations (in this case, whenever total capacity drops below 400 Mbps).

A pattern of significant spikes in round-trip time is detected on a network path. The recurring pattern indicates RTT increases during business hours.
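
One way to confirm a recurring business-hours pattern like this outside the charts is to split RTT samples by time of day and compare the medians. The following is a minimal sketch, not AppNeta's method: the 09:00 to 17:00 weekday window, the function names, and the sample values are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import median

# Assumption: "business hours" means 09:00-16:59 on weekdays, local time.
BUSINESS_HOURS = range(9, 17)


def split_by_business_hours(samples):
    """Split (timestamp, rtt_ms) samples into business-hour and off-hour RTT lists."""
    busy, quiet = [], []
    for ts, rtt_ms in samples:
        is_weekday = ts.weekday() < 5  # Monday=0 ... Friday=4
        if is_weekday and ts.hour in BUSINESS_HOURS:
            busy.append(rtt_ms)
        else:
            quiet.append(rtt_ms)
    return busy, quiet


def business_hours_ratio(samples):
    """Return median business-hour RTT divided by median off-hour RTT, or None."""
    busy, quiet = split_by_business_hours(samples)
    if not busy or not quiet:
        return None  # not enough data in one of the two groups
    return median(busy) / median(quiet)


# Made-up samples for illustration only; not data from the case described here.
samples = [
    (datetime(2018, 11, 26, 3, 0), 0.8),
    (datetime(2018, 11, 26, 10, 0), 48.0),
    (datetime(2018, 11, 26, 14, 0), 54.0),
    (datetime(2018, 11, 26, 22, 0), 0.9),
    (datetime(2018, 11, 27, 11, 0), 51.0),
    (datetime(2018, 11, 27, 23, 0), 1.0),
]

ratio = business_hours_ratio(samples)
print(f"business-hours median RTT is {ratio:.1f}x the off-hours median")
```

A ratio well above 1 suggests load-dependent latency, which is consistent with a congested or overloaded device rather than a fixed routing detour.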

3. Narrow down cause

Once monitoring is in place, alert profiles define the thresholds for acceptable performance. An automatic diagnostic test runs on the path each time a condition in the alert profile is violated. Here, the diagnostic detects high utilization at the very first hop on the path, pointing to it as the likely bottleneck. But before we make assumptions, we should gather more evidence.

A diagnostic shows high utilization detected at first hop
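
The threshold logic behind an alert profile can be sketched in a few lines. This is an illustrative model only, not AppNeta's API: the `AlertProfile` class, the `violations` function, and the field names are hypothetical. The 400 Mbps capacity threshold and the 54 ms RTT come from this example; the 1 ms RTT threshold and the 380 Mbps sample are assumptions.

```python
from dataclasses import dataclass


@dataclass
class AlertProfile:
    """Hypothetical stand-in for an alert profile: one threshold per metric."""
    min_total_capacity_mbps: float
    max_rtt_ms: float


def violations(profile, sample):
    """Return the names of the profile conditions that this sample violates."""
    violated = []
    if sample["total_capacity_mbps"] < profile.min_total_capacity_mbps:
        violated.append("total_capacity")
    if sample["rtt_ms"] > profile.max_rtt_ms:
        violated.append("rtt")
    return violated


# 400 Mbps is the threshold from this example; 1 ms is an assumed RTT limit.
profile = AlertProfile(min_total_capacity_mbps=400.0, max_rtt_ms=1.0)

# Illustrative sample: 380 Mbps is made up, 54 ms is the RTT reported above.
sample = {"total_capacity_mbps": 380.0, "rtt_ms": 54.0}

hit = violations(profile, sample)
if hit:
    print("violated:", ", ".join(hit), "-> trigger a diagnostic run")
```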

4. Collect supporting evidence

A second path to the same target, one that bypasses the suspected problem router, is set up in AppNeta Performance Manager. Sure enough, round-trip time on this path drops to the expected sub-millisecond range and total capacity is higher.

Round-trip time is substantially lower on the route that is bypassing the suspected problem router
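
The comparison between the two paths can be expressed as a simple check on median RTT. The sketch below is a hedged illustration: `compare_paths`, the 5x cutoff, and all sample values are assumptions, chosen only to show the shape of the reasoning.

```python
from statistics import median


def compare_paths(original_rtts_ms, bypass_rtts_ms, min_ratio=5.0):
    """Compare median RTT on the original path against the bypass path.

    Returns (ratio, implicated). `implicated` is True when the original path
    is at least `min_ratio` times slower than the bypass path, which points
    at the device that only the original path traverses.
    """
    ratio = median(original_rtts_ms) / median(bypass_rtts_ms)
    return ratio, ratio >= min_ratio


# Made-up samples for illustration only.
original = [52.0, 54.0, 49.0, 55.0]  # path through the suspected router
bypass = [0.7, 0.8, 0.9, 0.8]        # path that avoids it

ratio, implicated = compare_paths(original, bypass)
print(f"original path is {ratio:.1f}x slower; router implicated: {implicated}")
```

Because both paths share the same source and target, a large gap between them isolates the one element that differs, which is what makes the bypass path persuasive evidence.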

5. Provide data to the team responsible for bottleneck device

To build a strong case for the network engineers responsible for the router, we provided the identity of the specific router, the results of several diagnostic tests showing expected performance at other PoPs and sub-par performance at the PoP in question, and the results from the bypass path.

6. Compare before and after the fix to confirm improved performance

Engineers were able to improve performance on the problem router, and the improvement was reflected in AppNeta Performance Manager. Round-trip time on the original path is now below 1 ms and total capacity has increased to 1500 Mbps. The engineering team is working to further optimize infrastructure capacity.

Total capacity triples after router re-configuration. Vertical lines indicate alert profile violations (in this example, whenever total capacity drops below 400 Mbps).

Average and maximum round-trip time decrease substantially after router re-configuration
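
The before-and-after figures can be checked with a little arithmetic. This sketch uses only the numbers reported in this article; since capacity before the fix was "less than 500 Mbps" and RTT afterwards is "below 1 ms", both values are treated as bounds, so the computed factors are minimums. The variable names are illustrative.

```python
LINK_RATE_MBPS = 10_000  # the 10 Gbps link, expressed in Mbps

capacity_before_mbps = 500.0   # upper bound: reported as "less than 500 Mbps"
capacity_after_mbps = 1500.0   # reported after the router re-configuration

rtt_before_ms = 54.0           # reported RTT on the original path
rtt_after_bound_ms = 1.0       # upper bound: reported as "below 1 ms"

# Capacity grew by at least this factor (the "before" value is an upper bound).
capacity_factor = capacity_after_mbps / capacity_before_mbps

# RTT shrank by at least this factor (the "after" value is an upper bound).
rtt_factor = rtt_before_ms / rtt_after_bound_ms

# Share of the link rate that the measured total capacity now represents.
share_of_link = capacity_after_mbps / LINK_RATE_MBPS

print(f"capacity: at least {capacity_factor:.0f}x higher")
print(f"RTT: at least {rtt_factor:.0f}x lower")
print(f"measured capacity is {share_of_link:.0%} of the link rate")
```

The measured capacity is still a fraction of the 10 Gbps link rate, which is consistent with the engineering team continuing to optimize.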

Unbeknownst to the SD-WAN vendor, this routing device caused a significant performance impact for any customer connected to one of their data centers. Gaining continuous real-time visibility into the performance of their core network with AppNeta enabled the network team to detect anomalous behavior, narrow down the likely root cause, and provide evidence to the team responsible for the fix.