IN PRACTICE: Identifying the cause of SD-WAN degradation
by June 19, 2018

Context

A major US retail chain had been running the AppNeta Performance Manager for 10 months to monitor the WAN services between its primary data center and its stores and to validate that its circuits were performing. Following a growing trend in enterprise networking, the network team was migrating the stores from a traditional MPLS network to an SD-WAN solution. During the migration, they discovered a lag between when packet loss occurred and when it was reported in the SD-WAN provider’s controller dashboard. The dashboard also didn’t consistently pick up intermittent conditions, such as a circuit flapping or going down and recovering quickly. Because the AppNeta Performance Manager was already monitoring these paths, the team could use it to detect such issues.

How AppNeta Identified the Issue

The AppNeta Performance Manager (APM) alerted the Network & Telecom team to excessive packet loss, latency and jitter on the paths from their SD-WAN sites into the main data center. Because the problem was a degradation of the links’ performance, not a complete loss of service, users at retail branches were not yet reporting issues and other systems were not raising alarms. The team investigated only because APM reported high packet loss.

Data loss comparison report for eight flagship stores (4 on SD-WAN, 4 on MPLS).

That level of degradation at nearly 20 sites would have impaired the stores’ ability to put through sales transactions and could have hurt revenue. All internal applications (POS, business applications) would have been affected, with users suffering increased load times and reduced application reliability. Had the issue continued, it would most likely have caused applications to fail intermittently. By the time users reported problems, the network team, without the visibility provided by APM, would have looked to their other systems for answers and come up empty. According to their Network Engineer, without AppNeta,

“we could have probably gone back and forth for weeks if not months on it.”

With AppNeta’s data, the network team identified which locations were experiencing the loss without relying on any user input. That pattern pointed them to infrastructure and configuration shared by all the affected paths, and closer examination of that infrastructure revealed a configuration issue that was causing the loss.

“Honestly, if it weren’t for AppNeta reporting those sites as having 5-10% packet loss, we probably wouldn’t have even noticed it and it probably would have gone on for months.”

Steps Taken to Solve the Issue

1. Notice the issue

The retailer’s network team continuously monitors the links from their primary data center out to each of their stores. The company’s Senior Network Engineer keeps an eye on the AppNeta Performance Manager’s Service Level Compliance – Past 24 Hours dashboard to understand the performance trends for his network. Typically his environment hovers around 85-90% compliance, with an expected dip in the morning when a large batch job runs on the network. Because he understands the normal baseline for his environment, he was quick to notice when compliance dropped significantly. When the level held steady at 60% instead of recovering, he knew something abnormal was going on in the network.

Side by side comparison of AppNeta dashboard prior to issue and as service violations increase
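The baseline reasoning in this step can be sketched in a few lines of Python. This is only an illustration, not how the AppNeta Performance Manager computes or alerts on compliance: the function name `compliance_alert`, the 10-point tolerance, and all sample values are assumptions made up for the example. The idea is to flag a drop only when every recent sample sits well below the normal baseline, so that a brief, expected dip (such as the morning batch job) does not trigger it.

```python
from statistics import mean

def compliance_alert(baseline, recent, tolerance=10.0):
    """Return True when every recent sample is more than `tolerance`
    percentage points below the average of the baseline samples."""
    floor = mean(baseline) - tolerance
    return all(sample < floor for sample in recent)

# Hypothetical compliance percentages, not taken from the case study data.
baseline = [88, 90, 86, 85, 89, 87]   # normal range of roughly 85-90%
morning_dip = [72, 88, 87]            # short dip during the batch job, then recovery
sustained_drop = [61, 60, 60]         # compliance holding steady near 60%

print(compliance_alert(baseline, morning_dip))     # False
print(compliance_alert(baseline, sustained_drop))  # True
```

Requiring all recent samples to fall below the floor is a deliberately simple choice; a real deployment would more likely rely on the monitoring tool's own alert conditions.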

2. Look for high level commonalities

The first thing the network engineer looked for was commonalities between the stores reporting service quality violations. He noticed that all affected sites used the SD-WAN solution and that the sites still on their traditional MPLS connections were unaffected.

There are a few ways to identify which paths are having issues:

a. On the Network Path List within the Delivery component, filtering by path status will quickly return all paths currently in a violation state.

network paths

b. Alternatively, from the Network Path List, grouping by a path characteristic is an easy way to tell at a glance if the violated paths have something in common.

c. In this example, because the engineer was on the AppNeta Performance Manager dashboard when he noticed the decrease in service compliance, he could click on the chart to run a report of the last day’s service quality. From there, a ‘top offenders’ report can be run to identify which paths were in violation most often and for the longest duration.

top offenders
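For readers who export path data and want to reproduce the grouping and ‘top offenders’ ideas outside the product, the logic looks roughly like the sketch below. Everything here is hypothetical: the record layout, the store names, and the violation minutes are invented for illustration and do not reflect AppNeta’s data model or the retailer’s actual figures.

```python
from collections import Counter

# Hypothetical records: (path name, WAN type, minutes in violation over the last day)
paths = [
    ("store-101", "SD-WAN", 340),
    ("store-102", "SD-WAN", 410),
    ("store-103", "MPLS", 0),
    ("store-104", "SD-WAN", 275),
    ("store-105", "MPLS", 0),
]

def violations_by_group(records):
    """Map each WAN type to (violated path count, total path count)."""
    totals = Counter()
    violated = Counter()
    for _name, wan_type, minutes in records:
        totals[wan_type] += 1
        if minutes > 0:
            violated[wan_type] += 1
    return {wan: (violated[wan], totals[wan]) for wan in totals}

def top_offenders(records, limit=3):
    """Return the violated paths with the most violation minutes, worst first."""
    violated = [r for r in records if r[2] > 0]
    return sorted(violated, key=lambda r: r[2], reverse=True)[:limit]

print(violations_by_group(paths))  # {'SD-WAN': (3, 3), 'MPLS': (0, 2)}
print([name for name, _wan, _minutes in top_offenders(paths)])
# ['store-102', 'store-101', 'store-104']
```

When every violated path falls into one group and none into the other, as in this made-up data, the shared characteristic is the natural place to look next.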

3. Narrow focus

Once the network engineer realized the degradation only occurred at the SD-WAN sites, he narrowed his focus to the infrastructure shared between those sites. Using basic troubleshooting through the CLI, he then narrowed the issue down to a specific uplink that was having input errors.
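The article does not say which commands or counters the engineer used, so the following is only a sketch of the general technique: take two snapshots of each interface’s counters and compare the growth in input errors to the growth in received packets. The interface names, counter keys, values, and the 1% threshold are all assumptions for illustration, and the sketch ignores counter resets and wraparound.

```python
def input_error_rate(before, after):
    """Fraction of packets received between two snapshots that were input errors."""
    packets = after["packets_in"] - before["packets_in"]
    errors = after["input_errors"] - before["input_errors"]
    if packets <= 0:
        return 0.0  # no new traffic (or a counter reset): nothing to measure
    return errors / packets

def suspect_interfaces(snap_before, snap_after, threshold=0.01):
    """Return {interface: error rate} for interfaces whose rate exceeds the threshold."""
    suspects = {}
    for name, before in snap_before.items():
        rate = input_error_rate(before, snap_after[name])
        if rate > threshold:
            suspects[name] = rate
    return suspects

# Hypothetical counter snapshots taken a few minutes apart.
earlier = {
    "uplink-1": {"packets_in": 1_000_000, "input_errors": 12},
    "uplink-2": {"packets_in": 2_000_000, "input_errors": 500},
}
later = {
    "uplink-1": {"packets_in": 1_200_000, "input_errors": 12},
    "uplink-2": {"packets_in": 2_100_000, "input_errors": 7_500},
}

print(suspect_interfaces(earlier, later))  # {'uplink-2': 0.07}
```

Comparing deltas rather than absolute counters matters because an interface can carry a large historical error count while being perfectly healthy right now.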

From the input errors, he traced the degradation to an upstream link issue on the firewall. Once the team cleared the issue, service quality returned to normal.

Data loss comparison chart showing performance before, during and after issue resolution for eight flagship stores (4 on SD-WAN, 4 on MPLS).

Thanks to AppNeta, this retail chain was able to quickly identify a serious packet loss issue during their MPLS-to-SD-WAN migration and resolve it before it affected revenue.

Has the AppNeta Performance Manager helped you identify similar issues? We’d love to hear your story!