IN PRACTICE: Pinpointing the Cause of Point of Sale App Latency
IN PRACTICE | This article is part of a series of posts sharing examples of how AppNeta users have leveraged the service to solve performance problems.
What Problem Was the Customer Attempting to Solve?
In retail, slow performance in Point of Sale (POS) applications has an immediate impact. When numerous complaints about performance slowdowns came in for a large national retailer, the corporate networking team that oversees the store connections sprang into action. As the team began investigating and troubleshooting, they found that pinpointing whether the latency was caused by the network or the application was difficult in their complex environment.
With recent technology advances designed to bring the retail sales staff out to the customer, revenue-driving applications take on greater complexity. Instead of a static cash machine connected to the LAN via wired interfaces, the Point of Sale app is now deployed on a mobile device connecting to any one of the wireless access points in the store.
The troubleshooting event was a classic example of the network and application teams coming together on a conference call, each with disparate data sets, attempting to prove whether the root of the user-experience degradation lay in the network or the application layer.
The retailer utilizes AppDynamics Real-User Monitoring (RUM) which uses server-side instrumentation to get timing data from clients of the application. The application team saw the poor end-user experience from a retail store reflected in the RUM data below (image 1).
Image 1: Chart from AppDynamics showing the time taken (in milliseconds) for users to connect to the POS system throughout the day. Two large spikes in connection time confirm the user report of slowness.
During the investigation in the AppDynamics environment, the RUM data drew attention to a few latency spikes in the user connection (TimeTakenToConnect), but no outlying events within the application environment itself. The data suggests that the web application's performance is not the source of the slowdown, which points troubleshooting toward the network. While RUM did not identify the issue in this case, this type of monitoring is great for understanding the user experience of applications you host; on its own, however, RUM does not provide insight into the root cause of poor network performance.
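As a rough illustration of how connection-time spikes like these stand out against a steady baseline, here is a minimal sketch in Python. The threshold logic and sample values are hypothetical assumptions, not AppDynamics' actual anomaly detection.

```python
from statistics import median

def flag_spikes(connect_times_ms, factor=3.0):
    """Return indexes of samples exceeding `factor` times the median.

    A simple way to surface TimeTakenToConnect spikes against a steady
    baseline; the factor-of-the-median rule is illustrative only.
    """
    baseline = median(connect_times_ms)
    return [i for i, t in enumerate(connect_times_ms) if t > factor * baseline]

# Hypothetical per-minute connect times (ms): mostly ~40 ms, two spikes.
samples = [38, 41, 40, 39, 420, 42, 40, 37, 510, 39]
print(flag_spikes(samples))  # → [4, 8]
```

With a baseline around 40 ms, only the two multi-hundred-millisecond samples clear the threshold, mirroring the two spikes visible in the RUM chart.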
How AppNeta Helped to Troubleshoot
AppNeta’s Usage monitoring gathers information from the network packet stream via a SPAN port, or even inline for remote locations where a SPAN is not available. Usage analysis automatically identifies about 2,000 applications and allows customers to create Custom Applications by adding rules around proprietary application traffic. Performance is measured on a per-user, per-application basis, making it easy to assess where teams should focus their efforts to improve performance the most.
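The rule-based Custom Application idea can be sketched as follows. The rule fields, subnets, and port numbers here are illustrative assumptions, not AppNeta's configuration schema (Custom Applications are defined in the product, not in code).

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass
class Rule:
    name: str      # custom application label
    network: str   # server subnet the app lives on (CIDR)
    ports: set     # server-side ports the app uses

# Hypothetical rules for proprietary application traffic.
RULES = [
    Rule("POS", "10.20.0.0/16", {8443}),
    Rule("Inventory", "10.30.0.0/16", {443, 8080}),
]

def classify(dst_ip, dst_port):
    """Map a flow's server endpoint to a custom application name."""
    for rule in RULES:
        if ip_address(dst_ip) in ip_network(rule.network) and dst_port in rule.ports:
            return rule.name
    return "unclassified"

print(classify("10.20.5.9", 8443))  # → POS
print(classify("192.0.2.1", 80))    # → unclassified
```

Once each flow is tagged with an application name, per-user, per-application metrics fall out naturally from grouping flows by that tag.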
The benefit of automatically identifying applications is that the network monitoring team can not only see what traffic is competing with business-critical applications; because Usage also captures the real-user experience of those business-critical applications, the team can quickly identify which segment of the environment to begin troubleshooting.
Of particular note in this example is the ability to call out a single host and view, for that host, the network latency and retransmit rate over the selected time period (image 2).
Image 2: Chart from AppNeta Performance Manager showing actual POS traffic from multiple retail stores. The highlighted row shows the POS app performance for the store that complained of poor performance.
Like RUM, Usage analysis is based on real user traffic, but its deeper insight can determine whether the latency lies in the network or in the application stack, and it identifies packet retransmit events for each user session independently. In this case, network latency five times that of application latency is an issue in itself, but the clear smoking gun is the 32.9% retransmit rate, which will render any application unusable.
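A rough sketch of how a session's delay can be split into network versus application latency from packet timestamps: the TCP handshake round trip approximates network latency, and the gap between a request and the first response byte, minus that round trip, approximates server "think" time. The timestamps and counts below are hypothetical, and this is not AppNeta's actual algorithm.

```python
def session_metrics(syn_ts, synack_ts, request_ts, first_byte_ts,
                    packets_sent, retransmitted):
    """Split one TCP session's delay into network vs. application
    latency and compute its retransmit rate. Timestamps in seconds.
    Illustrative model only, not AppNeta's implementation.
    """
    network_ms = (synack_ts - syn_ts) * 1000               # handshake RTT
    app_ms = (first_byte_ts - request_ts) * 1000 - network_ms  # server think time
    retransmit_rate = retransmitted / packets_sent
    return network_ms, max(app_ms, 0.0), retransmit_rate

# Hypothetical session: 150 ms handshake, ~30 ms of server processing,
# 329 of 1000 packets retransmitted (the 32.9% smoking gun).
net, app, rr = session_metrics(0.000, 0.150, 1.000, 1.180, 1000, 329)
print(f"network={net:.0f} ms  app={app:.0f} ms  retransmit={rr:.1%}")
```

Here network latency dwarfs application latency, which is exactly the pattern that steers troubleshooting away from the application stack and toward the network.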
We can now take that knowledge and dig into why the network was impacting the real users’ experience of the POS app. AppNeta’s Delivery monitoring provided the retail customer’s network and application teams with the end-to-end visibility to make an actionable decision (and cut down the time to resolution). This visibility allowed them to understand that the network was impacting user experience, and then using the full suite of tools, they were able to isolate where on the network the event had occurred.
The first KPI that sticks out is the capacity breakdown chart. Total capacity is as expected for a bonded T1 connection, but AppNeta identifies high utilization spikes (over 60%) on the link (image 3).
Image 3: AppNeta Performance Manager charts showing the total, utilized and available capacity of the bonded T1 link between the store and the POS system in the outbound (left) and inbound (right) direction.
As we continue down to the secondary KPIs, we find that jitter and RTT both exhibit spikes at the same time. What is interesting in this case is that we didn’t detect any loss or 1-way latency spikes, so no packets were dropped.
Image 4: AppNeta Performance Manager charts showing data loss, jitter, round trip time and 1-way latency between the store and the POS system at the time of the performance issues.
Whenever any measured metric falls outside of user-defined thresholds, the AppNeta Performance Manager system automatically performs a diagnostic test measuring performance at each layer 3 hop to identify the source of the issue, even if that device is not on your network. The diagnostic test revealed that hop 2 on the retailer’s LAN had relatively high utilization, along with extremely high spikes in jitter and RTT, both of which should be extremely low on a local network.
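The per-hop idea can be sketched in a few lines: given RTT measured at each layer 3 hop, the device introducing the delay is the first hop where latency jumps sharply relative to the hop before it, since on a healthy path each hop adds very little. The thresholds and RTT values below are illustrative assumptions, not AppNeta's diagnostic logic.

```python
def first_bad_hop(hop_rtts_ms, jump_factor=5.0, floor_ms=2.0):
    """Return the index of the first hop whose RTT jumps sharply
    past the previous hop's RTT, or None if the path looks healthy.
    Thresholds are illustrative only.
    """
    prev = hop_rtts_ms[0]
    for i, rtt in enumerate(hop_rtts_ms[1:], start=1):
        if rtt > prev * jump_factor + floor_ms:  # sharp jump vs. previous hop
            return i
        prev = rtt
    return None

# Hypothetical per-hop RTTs (ms): hop 2 (index 1) spikes, and the delay
# cascades to every hop behind it, as in the diagnostic results.
print(first_bad_hop([0.4, 85.0, 86.1, 87.3]))  # → 1 (hop 2)
```

Because the delay introduced at one hop carries through to every hop beyond it, only the first sharp jump matters; the later hops merely inherit it.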
The poor performance at this point in the network cascaded the high maximum jitter and RTT values to all subsequent hops as well. With this understanding, the retail customer’s network team could investigate the device logs and other device data sets, such as load and memory utilization, for device-level troubleshooting.
Image 5: Diagnostic test results showing performance measured at each layer 3 hop along the path.
We can see in this example that one point sticks out: hop 2 (a network firewall), which happened to be overloaded with requests during that time frame. The POS application was written such that if packets were delayed beyond a certain amount of time, the application considered them lost and requested them again. The congested firewall had a different timeout: it delayed packets for longer than the application expected, but because it eventually sent them, it never counted them as “packet loss,” and the interface logs would therefore never report high levels of packet loss.
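This timeout mismatch can be modeled in a few lines: responses delayed past the application's timeout trigger re-requests even though nothing is ever dropped, so the firewall's logs stay clean while retransmissions pile up on the wire. The timeout and delay values below are hypothetical.

```python
import math

def redundant_requests(app_timeout_ms, delivery_delay_ms):
    """Count duplicate requests the app sends while waiting for a
    response that is delayed but never actually lost.

    The app re-sends every `app_timeout_ms`; the congested firewall
    delivers each response after `delivery_delay_ms`. Delivery always
    happens eventually, so the firewall reports zero loss, yet every
    timeout adds a retransmission. Illustrative model only.
    """
    if delivery_delay_ms <= app_timeout_ms:
        return 0  # response arrives in time, no duplicates
    return math.ceil(delivery_delay_ms / app_timeout_ms) - 1

# App gives up after 200 ms; the firewall holds responses for 700 ms.
print(redundant_requests(200, 700))  # → 3 duplicate requests
```

This is why the retransmit rate in the Usage data was the smoking gun while the device-level loss counters showed nothing: the duplicates are generated above the transport layer.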
The underperforming firewall ultimately didn’t cause any packet drops; however, the periodic spikes in RTT caused application timeouts and packet retransmissions, ultimately impacting the user experience of the POS application at the retail store.
Has AppNeta Performance Manager helped you identify similar issues? We’d love to hear your story!