AppNeta in Action: Excessive Jitter Alert
Just like our enterprise clients, we use AppNeta Performance Manager internally to monitor our private cloud deployment, giving our corporate IT team the visibility they need to get to the bottom of issues affecting our own network environments.
Recently, our Director of IT, Jason Hislop, was able to use the solution to identify and resolve a jitter issue before it could compound and begin chronically impacting users at our office in Vancouver, British Columbia.
Jason was first alerted to the issue by email notifications indicating that a number of paths (inbound and outbound) were exceeding thresholds for data jitter. A violation at 14:29 triggered the initial investigation, with additional alerts arriving ten to twelve minutes later.
Looking specifically at the Delivery chart for Data Jitter, we can see that while jitter had been increasing since 13:00 and had actually exceeded the threshold earlier (around 13:15), that violation did not persist long enough to trigger an alert.
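This alerting behavior (a brief spike clears without firing, while a sustained violation does alert) can be sketched as a simple persistence check. This is an illustrative model only, not AppNeta's actual alert logic; the threshold and persistence window below are hypothetical values.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JitterAlerter:
    """Fire an alert only when jitter stays above a threshold for a
    sustained window. Hypothetical values, not AppNeta's real logic."""
    threshold_ms: float = 30.0   # assumed data-jitter threshold
    persist_minutes: int = 10    # how long a violation must persist
    _violation_start: Optional[int] = None

    def observe(self, minute: int, jitter_ms: float) -> bool:
        """Return True once a violation has persisted long enough."""
        if jitter_ms <= self.threshold_ms:
            self._violation_start = None       # violation cleared
            return False
        if self._violation_start is None:
            self._violation_start = minute     # violation begins
        return minute - self._violation_start >= self.persist_minutes

# A brief spike (like the one around 13:15) clears before alerting...
alerter = JitterAlerter()
brief = [alerter.observe(m, 45.0 if m == 15 else 5.0) for m in range(30)]
print(any(brief))      # → False

# ...while a sustained violation (like the one starting ~14:19) fires.
alerter = JitterAlerter()
sustained = [alerter.observe(m, 45.0) for m in range(16)]
print(any(sustained))  # → True
```

The key design point is that the alerter tracks when a violation *started* rather than reacting to each sample, which is what suppresses short-lived spikes.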
As a next step for troubleshooting, our team checked Usage, but didn’t see any significant traffic spikes within the last few hours.
A quick check of the firewall’s LAN interface confirmed that current traffic throughput was normal (green indicates inbound traffic, pink outbound, and blue the total). Note that the throughput spike at 12:45 precedes the 14:30 to 15:15 window of the data jitter spike but does not persist into it.
What was abnormal was the CPU usage on the firewall. Jason found that the ITIM solution had also alerted on performance, reporting repeated spikes to 100 percent. From this, he was able to isolate a VPN tunnel carrying traffic to AWS infrastructure, which was driving the major increase in CPU usage on the firewall.
To identify what was behind the traffic, Jason went back to AppNeta Performance Manager, specifically Usage, to look for any device sending large amounts of traffic to AppNeta AWS infrastructure (10.x.x.x subnets), and found a single machine that had sent 31 GB of traffic in the previous four hours.
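Conceptually, this Usage query boils down to filtering flow records by destination subnet and totaling bytes per source host. A minimal sketch of that aggregation, using made-up flow records and host names rather than AppNeta's actual API or data:

```python
import ipaddress
from collections import defaultdict

# Hypothetical flow records: (source host, destination IP, bytes sent).
flows = [
    ("ws-042", "10.1.2.3", 21_000_000_000),
    ("ws-042", "10.1.2.4", 10_000_000_000),
    ("ws-007", "192.168.5.9", 400_000_000),
    ("srv-01", "10.9.0.1", 750_000_000),
]

aws_net = ipaddress.ip_network("10.0.0.0/8")  # the 10.x.x.x subnets

# Sum bytes per host, counting only flows destined for the subnet.
bytes_by_host = defaultdict(int)
for host, dst, nbytes in flows:
    if ipaddress.ip_address(dst) in aws_net:
        bytes_by_host[host] += nbytes

top_host, total = max(bytes_by_host.items(), key=lambda kv: kv[1])
print(top_host, total / 1e9)  # → ws-042 31.0
```

Sorting the per-host totals surfaces the single machine responsible for the bulk of the traffic, which is exactly the pivot from "something is sending a lot of data" to "this host is."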
With Usage, Jason identified the host and user topping the list of network flows and spoke with the end user, who confirmed they had been copying a large amount of data to AWS. They cancelled the transfer, and the firewall’s CPU immediately dropped back to a more normal state around 15:12.
Problem solved! Without Usage, Jason might still be searching for the cause of this issue. For a more detailed view of the issue, check out the charts below: the issue starts at 13:00 and ends around 15:20. In retrospect, utilization and jitter both drop off when the transfer stops, which confirms the problem had been found and resolved.
Four Dimensions of Network Performance Monitoring
To learn more about what a comprehensive performance monitoring solution should deliver, download our whitepaper.
Filed Under: company news