As part of the sales engineering team here at AppNeta, I see a lot of enterprise networks in action. We solved one interesting mystery recently.
I was working with a customer on a proof of concept for our monitoring solution, AppNeta Performance Manager. On February 6, they were unable to attend a call we had set up because of a Tier 1 network issue they were fighting. I started looking around the POC environment to see whether what we had configured to that point could shed some light on the issue. Lo and behold, it did. The customer was ultimately able to take the AppNeta monitoring data back to Bell Canada to prove network performance degradation between a Toronto location and a remote site in the southeastern United States.
How We Found the Cause of a Tier 1 Network Issue
Here’s how we gathered that data. We had already placed an AppNeta m35 appliance at both of the company’s enterprise locations and created a dual-ended path between those sites for continuous network analysis. Each location has two WAN connections, a private Bell Canada MPLS connection and a secondary business-class broadband connection, with routing controlled by Cisco’s iWAN technology.
You can see this in the charts below. What’s interesting about this first chart, aside from the massive black hole in the middle (we’ll get to that in a minute), is the wide variance in total capacity on the outbound and inbound KPI charts. The MPLS link is provisioned at 100 Mbps, and for most of the timeframe shown we see that full 100 Mbps delivered correctly. At various times, however, we see only 25 Mbps of total capacity. These are the periods when user traffic switches from the primary MPLS link to the secondary broadband connection.
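That capacity swing is a reliable fingerprint for failover. As a rough illustration (the function name, threshold, and sample values below are hypothetical, not AppNeta API output), a script could classify each total-capacity sample by the link it implies:

```python
# Illustrative sketch: infer which WAN link carried traffic from
# total-capacity samples. Thresholds and data are hypothetical.

MPLS_CAPACITY_MBPS = 100      # primary Bell Canada MPLS link
BROADBAND_CAPACITY_MBPS = 25  # secondary business-class link

def classify_samples(samples, threshold_mbps=60):
    """Label each (timestamp, total_capacity_mbps) sample with the link
    it most likely traversed: capacity near 100 Mbps implies the primary
    MPLS, while a drop toward 25 Mbps implies iWAN has rerouted traffic
    onto the secondary broadband connection."""
    return [
        (ts, "mpls" if capacity >= threshold_mbps else "broadband")
        for ts, capacity in samples
    ]

samples = [
    ("2018-02-05T09:00", 98.4),
    ("2018-02-05T09:05", 24.7),  # iWAN failover to broadband
    ("2018-02-05T09:10", 99.1),
]
for ts, link in classify_samples(samples):
    print(ts, link)
```

The threshold sits between the two provisioned rates so normal measurement noise on either link does not flip the classification.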
In this case, we decided to monitor both connections in a single path and follow the user routing. In similar customer scenarios, we create three paths: (1) a static route over the MPLS; (2) a static route over the broadband connection; and (3) a dynamic route that follows the user routing path.
Digging into the event at hand, we can zoom in (to one-minute granularity, if needed) and see that the performance-impacting event started just before midnight and lasted until about 6pm on February 6. Three items stick out here:
- The vertical black lines represent connectivity outages
- Total capacity on the link fluctuates drastically, while utilization (the dark blue shading) is higher than normal
- Data and voice loss is nearly rendering the link unusable
We had previously configured an Alert Profile to trigger notifications on connectivity and data loss, and to automatically kick off enhanced diagnostic testing.
Diagnostics take us from continuous path monitoring, which provides a summarized view and points out when events occur, to a high-definition view of the event itself.
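As a sketch of that trigger logic (the field names and the 5% loss threshold are assumptions for illustration, not AppNeta's actual Alert Profile schema), the decision reduces to two conditions, either of which escalates a path into enhanced diagnostics:

```python
# Hypothetical sketch of an Alert Profile's trigger conditions.
# Names and thresholds are illustrative, not AppNeta's schema.

from dataclasses import dataclass

@dataclass
class PathSample:
    connected: bool       # did the dual-ended path respond?
    data_loss_pct: float  # measured data loss on the path

def should_run_diagnostics(sample, loss_threshold_pct=5.0):
    """Kick off enhanced diagnostics on a connectivity outage, or when
    data loss crosses the alert threshold."""
    return (not sample.connected) or sample.data_loss_pct >= loss_threshold_pct

print(should_run_diagnostics(PathSample(connected=False, data_loss_pct=0.0)))  # True
print(should_run_diagnostics(PathSample(connected=True, data_loss_pct=8.2)))   # True
print(should_run_diagnostics(PathSample(connected=True, data_loss_pct=0.3)))   # False
```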
As we can see in the shot below (taken during the event), the local environment at the source is healthy and performing as expected. Once our testing packets (powered by TruPath) hit the middle hops in the MPLS network, we see non-standard MTU hops, significant spikes in data jitter and round-trip time, and very high data loss through the mid-path hops. We also see a common, poorly performing hop on both the inbound and outbound paths.
We can clearly see that the MPLS is the root of the issues on the path.
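That cross-direction correlation is the key reasoning step: a hop that breaches loss, jitter, and RTT thresholds in both directions implicates the carrier's mid-path rather than either enterprise site. A minimal sketch, where all hop records, field names, thresholds, and addresses (TEST-NET documentation ranges) are hypothetical:

```python
# Illustrative sketch: correlate per-hop diagnostics from both
# directions to isolate a common impaired mid-path hop.

def impaired_hops(hops, loss_pct=2.0, jitter_ms=20.0, rtt_ms=150.0):
    """Return the set of hop IPs whose loss, jitter, or round-trip
    time exceeds the given thresholds."""
    return {
        h["ip"]
        for h in hops
        if h["loss_pct"] >= loss_pct
        or h["jitter_ms"] >= jitter_ms
        or h["rtt_ms"] >= rtt_ms
    }

outbound = [
    {"ip": "192.0.2.1",    "loss_pct": 0.0,  "jitter_ms": 1.2,  "rtt_ms": 14.0},
    {"ip": "203.0.113.7",  "loss_pct": 11.5, "jitter_ms": 48.0, "rtt_ms": 210.0},
]
inbound = [
    {"ip": "203.0.113.7",  "loss_pct": 9.8,  "jitter_ms": 55.0, "rtt_ms": 198.0},
    {"ip": "198.51.100.4", "loss_pct": 0.1,  "jitter_ms": 2.0,  "rtt_ms": 16.0},
]

# A hop impaired in BOTH directions points at the carrier's mid-path,
# not at either enterprise location.
common = impaired_hops(outbound) & impaired_hops(inbound)
print(common)  # {'203.0.113.7'}
```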
Compare that to a normal diagnostic, where we do not see the same impairments at those mid-path hops.
Once we gathered this information, the customer took the data to the carrier to point to those mid-path hops and validate that the issue was not related to the enterprise locations. It turns out that Bell Canada had a major East Coast outage for traffic going to the U.S. via New York. They lost their 100G core link to New York and had rerouted traffic over to Chicago, which caused packet loss, latency and congestion.
This situation may sound familiar to those of you managing enterprise networks. How might this IT team respond differently next time? The key benefit for this customer was the proactive notifications, which let them see that an issue had occurred overnight. Knowing that, they could reroute all traffic to the backup WAN link to bypass the Bell connection altogether. From there, the data could be presented to the carrier immediately to cut the Mean Time to Resolution, and, for our customer's IT team, the Mean Time to Innocence. In this case, the cause of the performance problem was the network, but the fault fell well outside the network ops team's hands. It's not always easy to find the cause of a slowdown, or the fix, but we're glad this turned into a monitoring success story.