IN PRACTICE: Detecting a ‘problem device’ at an SD-WAN provider’s PoP
by Justin Tiearney | December 4, 2018
IN PRACTICE | This article is part of a series of posts sharing examples of how AppNeta users have leveraged the service to solve performance problems.
Continuous, real-time visibility into the condition of the core network is crucial to ensure consistent performance for end users. Without network visibility, problems may go undetected until they are severe enough to negatively impact end users. When issues are finally reported, a lack of visibility will make diagnosing the cause that much harder. Avoiding this outcome is one of the main reasons customers implement AppNeta Performance Manager.
To understand network performance and mitigate issues, we need to:
- Gain real-time visibility into the network’s performance
- Observe performance over time and identify anomalous behavior
- Drill in to narrow down the root cause
- Compare results against other paths to validate and build a convincing case
- Provide evidence to the team responsible for the culprit device
- Compare performance before and after the fix to verify improved performance
Let’s take a look at an example of how AppNeta recently worked with an SD-WAN provider to identify an issue at one of their Points of Presence (PoPs) and the specific device causing a bottleneck.
1. Gain visibility
AppNeta Monitoring Points were deployed into each of the SD-WAN provider’s PoPs, in a network stack that routed their end-customer traffic through a public interface. We then monitored the network segment customers commonly use to access the SD-WAN service by creating network paths in Delivery from each PoP’s Monitoring Point to the external IP interface.
2. Observe behavior
AppNeta accurately measures a network path’s total and utilized capacity without impacting network performance. At one of the provider’s PoPs with a 10 Gbps link, AppNeta Performance Manager consistently reported less than 500 Mbps of total capacity over several days, along with extreme spikes in round-trip time (RTT) during peak business hours.
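The pattern described above — a sub-millisecond baseline with occasional large excursions — can be flagged programmatically. Here is a minimal sketch in plain Python (not AppNeta code; the function name, spike factor, and sample data are all illustrative assumptions):

```python
from statistics import median

def find_rtt_spikes(samples_ms, spike_factor=5.0, floor_ms=1.0):
    """Flag round-trip-time samples far above the path's baseline.

    samples_ms  : list of (hour_of_day, rtt_ms) measurements
    spike_factor: a sample counts as a spike when it exceeds
                  spike_factor * baseline (hypothetical threshold choice)
    floor_ms    : guards against a near-zero baseline on a fast path
    """
    baseline = max(median(rtt for _, rtt in samples_ms), floor_ms)
    return [(hour, rtt) for hour, rtt in samples_ms
            if rtt > spike_factor * baseline]

# Sub-millisecond baseline, with spikes during peak business hours
samples = [(2, 0.4), (6, 0.5), (10, 9.8), (14, 12.1), (19, 0.6), (23, 0.4)]
print(find_rtt_spikes(samples))  # → [(10, 9.8), (14, 12.1)]
```

The median-based baseline keeps a few extreme samples from dragging the reference point up, which a mean would do.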
3. Narrow down cause
Once monitoring is in place, alert profiles define what counts as acceptable performance. Automatic diagnostic tests run on the path each time a condition in the alert profile is violated. Here, the diagnostics detected high utilization at the very first hop on the path, indicating it as the likely bottleneck. But before making assumptions, we should gather more evidence.
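The hop-by-hop reasoning in this step amounts to: walk the path in order and stop at the first link whose utilization is close to its reported capacity. A sketch of that logic in Python (AppNeta’s actual diagnostics are proprietary; the threshold, hop names, and numbers below are illustrative):

```python
def likely_bottleneck(hops, utilization_threshold=0.9):
    """Return the first hop whose link utilization exceeds the threshold.

    hops: list of (hop_name, utilized_mbps, total_capacity_mbps), in path
    order. The 0.9 threshold is an illustrative choice, not a product default.
    """
    for name, used, total in hops:
        if total and used / total > utilization_threshold:
            return name
    return None

# The first hop is nearly saturated relative to the capacity it reports
hops = [("edge-router-1", 480, 500),
        ("core-switch-2", 900, 10_000),
        ("provider-gw", 1_200, 10_000)]
print(likely_bottleneck(hops))  # → edge-router-1
```

Checking hops in path order matters: a congested first hop limits what every later hop can observe, so the earliest saturated link is the most likely culprit.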
4. Collect supporting evidence
A second path was set up in AppNeta Performance Manager to the same target, bypassing the suspected problem router. Sure enough, on that path round-trip time dropped to the expected sub-millisecond range and total capacity was higher.
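The bypass test is essentially an A/B comparison: same target, two paths, one variable changed. A small sketch of that comparison (sample values are illustrative, chosen to echo the measurements described in this case):

```python
from statistics import mean

def compare_paths(original_ms, bypass_ms):
    """Summarize mean RTT on the original path vs. a bypass path.

    A large ratio supports the hypothesis that the bypassed device is the
    bottleneck. Input lists are RTT samples in milliseconds.
    """
    orig, byp = mean(original_ms), mean(bypass_ms)
    return {"original_ms": round(orig, 2),
            "bypass_ms": round(byp, 2),
            "ratio": round(orig / byp, 1)}

original = [9.8, 12.1, 8.7, 11.4]   # through the suspect router
bypass = [0.5, 0.4, 0.6, 0.5]       # same target, suspect router bypassed
print(compare_paths(original, bypass))
# → {'original_ms': 10.5, 'bypass_ms': 0.5, 'ratio': 21.0}
```

Because both paths terminate at the same target, the only meaningful difference between them is the suspected router, which is what makes the comparison convincing evidence rather than a coincidence.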
5. Provide data to the team responsible for bottleneck device
To build a strong case for the network engineers responsible for the router, we provided the identity of the specific router, the results of several diagnostic tests showing expected performance at other PoPs and sub-par performance at the PoP in question, and the results from the bypass path.
6. Compare before and after the fix to confirm improved performance
Engineers were able to improve performance on the problem router, and the change was reflected in AppNeta Performance Manager: round-trip time on the original path is now below 1 ms, and total capacity has increased to 1,500 Mbps. The engineering team is working to further optimize infrastructure capacity.
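Verifying a fix is a matter of checking the post-fix measurements against both the pre-fix baseline and the agreed targets. A minimal sketch, using the sub-1 ms RTT and 1,500 Mbps figures reported above as the targets (the field names and helper are hypothetical):

```python
def verify_fix(before, after, rtt_target_ms=1.0, capacity_target_mbps=1500):
    """Check that post-fix measurements beat the baseline and meet targets.

    before/after: dicts with 'rtt_ms' and 'capacity_mbps' keys (illustrative
    shape, not an AppNeta API). Returns (all_passed, per-check detail).
    """
    checks = {
        "rtt_improved": after["rtt_ms"] < before["rtt_ms"],
        "rtt_meets_target": after["rtt_ms"] < rtt_target_ms,
        "capacity_meets_target": after["capacity_mbps"] >= capacity_target_mbps,
    }
    return all(checks.values()), checks

before = {"rtt_ms": 10.5, "capacity_mbps": 480}
after = {"rtt_ms": 0.6, "capacity_mbps": 1500}
ok, detail = verify_fix(before, after)
print(ok)  # → True
```

Keeping the same monitoring paths running after the fix means the before/after comparison uses identical measurement conditions, so any residual regression shows up immediately.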
Unbeknownst to the SD-WAN vendor, this routing device caused a significant performance impact for any customer connected to one of their data centers. Gaining continuous real-time visibility into the performance of their core network with AppNeta enabled the network team to detect anomalous behavior, narrow down the likely root cause, and provide evidence to the team responsible for the fix.