If I had a nickel for every time I heard “We have enough bandwidth,” I would have enough money to buy the new iPad. While this has not gotten me the iPad, it has given me some great insight into how important knowing your available and utilized bandwidth is to assuring network performance and delivering critical business services like VoIP, video conferencing and VDI.
I was recently working with one of our customers, a large global technology provider that was doing large replications across an MPLS network. To consolidate servers, they were delivering this service from an ESX 4.0 host running Server 2008. They had allocated more than adequate processing, memory and storage to the various systems on the ESX host. Additionally, the vNIC and internal network were rated for GigE. This GigE LAN was close to the edge of the datacenter (DC), so there was only a single layer 2 switch between it and the MPLS service. Once the MPLS service was reached, an OC3 (155 Mbps minus network overhead) pipe was provisioned from AT&T.
With all the above in place, it appeared to be a fairly robust and healthy infrastructure. However, when the project went into beta a major problem occurred. Replicating small amounts of data would take ages! With hours upon hours dedicated to small replications, the customer deemed it impossible to scale the project to what it should be.
Until they started using PathView Cloud.
Working with the network engineering team, we quickly set up two rackAppliances, one in each of their two datacenters. The datacenters were on opposite coasts, connected via an OC3 pipe. The appliances were up and running within minutes, with three core paths that we wanted to monitor and better understand.
The first two were the internal connections from the rackAppliance to the servers in question. From the screen shot below, you can see that GigE was achieved on the LAN to each server. However, the last path we wanted to analyze was from DC1 to DC2 over the OC3 link. This is where things became interesting.
Monitoring the path between the datacenters bi-directionally using UDP (we confirmed there was no policing of our packets) showed that the WAN pipe was only able to achieve ~50-60 Mbps. Now, I’ve dealt with carriers when I was only achieving 2.5 Mbps on a 2xT1 link, and they blamed it on “network overhead.” So forgive me for being skeptical and initially blaming the readings on the carrier.
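To put those readings in perspective, here is a rough back-of-the-envelope sketch in Python. It is only illustrative arithmetic, not anything PathView Cloud runs: the 155.52 Mbps (OC3) and 1.544 Mbps (T1) values are the standard nominal line rates, usable payload is somewhat lower than either, and the function name is my own.

```python
# Nominal line rates in Mbps (standard figures; usable payload is a bit lower).
OC3_MBPS = 155.52
T1_MBPS = 1.544


def fraction_of_nominal(measured_mbps: float, nominal_mbps: float) -> float:
    """Return measured throughput as a fraction of the nominal line rate."""
    if nominal_mbps <= 0:
        raise ValueError("nominal rate must be positive")
    return measured_mbps / nominal_mbps


# Readings mentioned in this post, paired with the nominal rate of each link.
readings = [
    ("OC3, low reading", 50.0, OC3_MBPS),
    ("OC3, high reading", 60.0, OC3_MBPS),
    ("2xT1", 2.5, 2 * T1_MBPS),
]

for label, measured, nominal in readings:
    share = fraction_of_nominal(measured, nominal)
    print(f"{label}: {measured} of {nominal:.2f} Mbps ({share:.0%})")
```

By this math, the OC3 readings come to roughly 32 to 39 percent of the nominal rate, while the old 2xT1 case was about 81 percent. Overhead can plausibly explain a gap like the second one; it is a much harder sell for the first.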
Upon further inspection, the carrier was going to be vindicated.
PathView Cloud quickly showed us that a faulty firewall had been implemented in datacenter 1. PathView’s hop-by-hop analysis showed that this firewall was limiting capacity to roughly 50 Mbps before traffic even hit the OC3 link!
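The idea behind a hop-by-hop view is simple: a path can carry no more than its slowest hop. The Python sketch below illustrates that principle only. It is not how PathView Cloud measures anything, and the hop names and capacities are placeholder values loosely based on the figures in this post.

```python
from typing import List, NamedTuple


class Hop(NamedTuple):
    name: str
    capacity_mbps: float


def find_bottleneck(path: List[Hop]) -> Hop:
    """Return the hop with the lowest capacity; it caps the whole path."""
    if not path:
        raise ValueError("path must contain at least one hop")
    return min(path, key=lambda hop: hop.capacity_mbps)


# Hypothetical DC1 -> DC2 path; the values are illustrative, not measurements.
path = [
    Hop("DC1 GigE LAN", 1000.0),
    Hop("DC1 firewall", 50.0),
    Hop("OC3 WAN link", 155.0),
    Hop("DC2 GigE LAN", 1000.0),
]

bottleneck = find_bottleneck(path)
print(f"Path is limited to {bottleneck.capacity_mbps} Mbps by {bottleneck.name}")
```

An end-to-end test alone would only report the ~50 Mbps ceiling. Knowing the capacity at each hop is what tells you which device to go fix.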
Therefore, the bottleneck was on the customer’s side, at the firewall, not in the carrier’s network. This may be the only time I say this: I’m sorry, AT&T, for blaming my bandwidth problems on you.