An Engineer’s Search: Who Broke the Network?
‘Why Is Our Network Broken?’
Even companies that provide network performance management services still have to manage their own networks! So a few weeks ago, when I received a notification that we had extremely low available bandwidth out to the internet, I knew I had to act fast to avoid failure of our critical services. The question was: which tool would solve the problem faster and more completely? Around the office, VoIP phone calls were failing and our web-based CRM had slowed to a s-l-o-w crawl.
I wanted to act fast, but I also thought it would be interesting to compare the speed and accuracy of PRTG, a very common SNMP tool, with PathView Cloud with FlowView, a breakthrough NetFlow and traffic analysis service.
1. Investigation with SNMP
Looking at SNMP data from switches, there was extremely high usage coming from a single interface. That one interface was consuming over 90% of our capacity to the Internet.
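For readers curious where a figure like "over 90%" comes from, here is a minimal sketch of the usual arithmetic: take two samples of an interface's SNMP octet counter, convert the difference to bits, and divide by what the link could have carried in that interval. This is not PRTG's actual implementation; the function name, parameters, and sample numbers are my own assumptions for illustration.

```python
def utilization_percent(octets_start, octets_end, interval_seconds,
                        link_speed_bps, counter_bits=32):
    """Percent of link capacity used between two octet-counter samples.

    All names here are hypothetical. The counter is assumed to be a
    wrapping SNMP octet counter (32-bit by default) that wrapped at
    most once between the two samples.
    """
    if interval_seconds <= 0 or link_speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")

    delta_octets = octets_end - octets_start
    if delta_octets < 0:
        # The counter rolled over between samples; add one full cycle.
        delta_octets += 2 ** counter_bits

    bits_transferred = delta_octets * 8
    return 100.0 * bits_transferred / (interval_seconds * link_speed_bps)


# Made-up sample: 675,000,000 octets in 60 seconds on a 100 Mbps link.
busy = utilization_percent(1_000_000, 676_000_000, 60, 100_000_000)
print(f"{busy:.1f}% of link capacity")
```

Note that this only tells you how busy a port is. It says nothing about who or what is behind the port, which is exactly the limitation I ran into next.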
The problem is that this was a physical port on a switch, and there was no easy way to identify the person involved or the activity they were engaged in. Having just moved into this new office, we did not yet have a complete map of switch ports to network drops, so I had to trace the cables from the switch to the punch-down, then find that port in the office. Like most companies, we have several switches of the same model, and the first time around I traced the wrong cable and falsely accused an intern of crashing our network (Sorry Bri). Once I had identified the correct switch and traced the right cable, I was able to take the offending system offline, but if I hadn't been physically in the office I would not have been able to trace the port to a specific person.
2. Investigation with PathView Cloud with FlowView
One of the key components of PathView is its ability to monitor key aspects of network performance and alert you to issues quickly and easily. We are pretty open with our internet usage policy, but there are several common-sense thresholds in place: some alert us when our key business services are impacted, and others alert us when available bandwidth capacity is less than 20% of total bandwidth. This is how we were alerted to this issue in the first place.
PathView Cloud includes FlowView, a complete system for analyzing network traffic, generating NetFlow records and reporting on the activity. This is a great system because we don't have to enable NetFlow on any of our network devices, which could slow down the devices and all traffic flowing through them. After logging into FlowView, I looked at the same time frame:
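The bandwidth threshold itself is simple to express. The sketch below is only an illustration of the rule described above (alert when available capacity drops under 20% of the total); it is not how PathView evaluates its alerts, and the function name and sample figures are assumptions.

```python
def available_bandwidth_alert(available_bps, total_bps,
                              min_available_percent=20):
    """Return True when available bandwidth is below the threshold.

    Hypothetical helper: inputs are bits per second, and the comparison
    is done with whole numbers to avoid floating-point edge cases.
    """
    if total_bps <= 0:
        raise ValueError("total bandwidth must be positive")
    return available_bps * 100 < min_available_percent * total_bps


# Made-up sample: only 8 Mbps left on a 100 Mbps link.
if available_bandwidth_alert(8_000_000, 100_000_000):
    print("ALERT: available bandwidth is under 20% of total")
```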
I drilled down into the applications and saw that HTTP traffic originating from Akamai was using the vast majority of the bandwidth, and that one person was responsible for 22.5Gb out of the total 23.2Gb traffic to the internet that morning. The hostname of that machine has been kept cryptic to protect the guilty, but I know who it was and what they were doing.
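To put that in perspective, 22.5 out of 23.2 is roughly 97% of the morning's traffic. The drill-down is essentially a "top talkers" report: group flow records by host and application, sum the bytes, and rank the results. Here is a minimal sketch of that idea; it is not FlowView's implementation, and the record layout, function name, and sample data are all made up for illustration.

```python
from collections import defaultdict


def top_talkers(flows, limit=5):
    """Rank (host, application) pairs by total bytes transferred.

    `flows` is a hypothetical iterable of (host, application, byte_count)
    tuples. Returns a list of (host, application, bytes, percent_of_total).
    """
    totals = defaultdict(int)
    for host, application, byte_count in flows:
        totals[(host, application)] += byte_count

    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

    report = []
    for (host, application), byte_count in ranked[:limit]:
        share = 100.0 * byte_count / grand_total if grand_total else 0.0
        report.append((host, application, byte_count, share))
    return report


# Made-up flow records; the hostnames are placeholders.
sample_flows = [
    ("host-a", "http", 900),
    ("host-b", "voip", 50),
    ("host-a", "http", 50),
]
for host, application, byte_count, share in top_talkers(sample_flows):
    print(f"{host:8} {application:6} {byte_count:6} bytes  {share:5.1f}%")
```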
A little further analysis of the traffic answered a few other questions: Who is going to Akamai? Only this person. Akamai is a content distribution network for many very popular websites and internet services, but a little snooping revealed this was iTunes (the only other hosts that connected to that target were iPhones and iPads). Apple doesn't limit or throttle bandwidth usage within iTunes; it downloads as fast as it can. As it turns out, this coworker has a long commute and was downloading a TV series from iTunes to watch on the train.
In the end, both tools could technically solve this problem. But with PRTG I needed physical access to the hardware and got a less-than-complete answer about exactly what was going on: I followed the wire, searched for plates on the wall and found that port number. With PathView Cloud, I could see the computer number, the source and the application from a single interface within seconds of logging in.