FlowView: How We Did It (Part 1)
If you’re into network traffic analysis, FlowView gives you incredible insight into what is running on your network with simple deployment. This month, we’re celebrating you, the people who take the time to look at your network traffic, with FlowVember: we’re donating 10% of all FlowView proceeds in November to the national Movember charities.
When we launched FlowView last year, it started with 60 days of data retention, but it also capped the amount of data you could store at 3 GB. Recently, we extended that to 90 days of storage with no data cap. FlowView has always been about figuring out what’s going on in your network right now. How much of your bandwidth is consumed by Netflix? Which user is consuming the majority of your uplink bandwidth by syncing huge videos back to their iCloud account? If you had an especially busy network or many locations to monitor, it was easy to blow through that cap. Worse yet, as more people transition to 10 Gbps networks, the volume of data continues to climb. Clearly, we needed a better solution. Where could we put our data?
The answer is the cloud! FlowView works by having customers deploy appliances on their own networks to watch the traffic there. Each appliance periodically uploads encrypted, compressed binary FlowView records to our collectors in AWS, where we process the binary records and store them for months at a time. From there, our hopefully happy customers can run ad-hoc queries across the whole dataset of FlowView records.
Traditional NetFlow is a network protocol developed by Cisco Systems for collecting IP traffic information. It tends to be built into network hardware, with the idea being that you send the NetFlow records to an external collector to be analyzed. Our appliances directly analyze the network traffic to generate NetFlow v9 template records. These compressed flow records are stored temporarily on the appliance and uploaded through an SSL tunnel back to our servers once every 5 minutes.
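As a rough sketch of that batch-and-upload cycle, here is the shape of the appliance side. This is purely illustrative: the real format is a compressed binary NetFlow v9 encoding, not gzip-over-JSON, and `UPLOAD_INTERVAL` and the record fields are stand-ins.

```python
import gzip
import json

UPLOAD_INTERVAL = 5 * 60  # seconds; appliances upload once every 5 minutes

def pack_batch(records: list[dict]) -> bytes:
    """Compress a batch of flow records for upload.

    gzip + JSON stands in for the real compressed binary format.
    """
    return gzip.compress(json.dumps(records).encode())

# Flow records are highly repetitive, so compression pays off.
batch = [{"src_ip": "10.0.0.5", "dst_ip": "8.8.8.8", "bytes": 1500}] * 100
payload = pack_batch(batch)
print(len(payload) < len(json.dumps(batch).encode()))  # True
```

The real appliance would run this on a timer, flushing whatever accumulated in the last `UPLOAD_INTERVAL` through the SSL tunnel.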
In most solutions, customers have to estimate flow storage based on their typical network traffic, then purchase expensive dedicated flow-based traffic analysis hardware. The typical rule of thumb is 50 flows/sec (fps) for each 10 Mbps of traffic, so 5,000 fps for 1 GigE and 50,000 fps for 10 GigE.
Each flow basically contains the following attributes:
- Source IP
- Destination IP
- Source port
- Destination port
- Application identifier
- QoS bits
- Flow direction (inbound / outbound / none)
- Total packets (measurement)
- Total bytes (measurement)
- Total retransmitted packets (measurement)
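The attribute list above can be modeled as a small fixed-shape record. The field names below are illustrative, not the actual NetFlow v9 template fields:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowRecord:
    src_ip: str       # Source IP
    dst_ip: str       # Destination IP
    src_port: int     # Source port
    dst_port: int     # Destination port
    app_id: int       # Application identifier
    qos: int          # QoS bits
    direction: str    # "inbound", "outbound", or "none"
    packets: int      # Total packets (measurement)
    bytes: int        # Total bytes (measurement)
    retransmits: int  # Total retransmitted packets (measurement)

# Example: a single outbound HTTPS flow
flow = FlowRecord("10.0.0.5", "93.184.216.34", 52311, 443,
                  app_id=80, qos=0, direction="outbound",
                  packets=1200, bytes=1_500_000, retransmits=3)
```

Note that the first seven fields are dimensions you might filter or group on, while the last three are additive measurements, which matters later when we talk about aggregation.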
As you can probably imagine, the number of flow records a network generates varies drastically depending on the type of traffic. If you have a lot of DNS lookups, the total bandwidth will be low but the number of individual requests will be high, which results in a relatively high number of flow records. On the other hand, when you FTP a huge file between two sites, the total byte count will be high, but the number of flow records generated is actually quite low.
Based on the first iteration of FlowView, we found the above rule of thumb to be pretty accurate. With that number in hand, we can start to extrapolate just how much storage we’ll need. Each flow record turns into a single row in our DB, so for planning purposes, we can draw out the ballpark size of the data we’re talking about.
| data retention / traffic volume | 10 Mbps | 100 Mbps | 1 GigE | 10 GigE |
| --- | --- | --- | --- | --- |
| 1 hour | 180,000 | 1,800,000 | 18 M | 180 M |
| 1 day | 4.32 M | 43.2 M | 432 M | 4.32 B |
| 90 days | 388.8 M | 3.88 B | 38.88 B | 388.8 B |
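The table is straightforward arithmetic from the 50 fps per 10 Mbps rule of thumb; a quick sanity check:

```python
FLOWS_PER_SEC_PER_10MBPS = 50  # rule of thumb from above

def estimated_rows(mbps: float, seconds: float) -> float:
    """Estimated flow records (DB rows) for a link of `mbps` over `seconds`."""
    fps = FLOWS_PER_SEC_PER_10MBPS * (mbps / 10)
    return fps * seconds

HOUR, DAY = 3600, 86_400
print(estimated_rows(10, HOUR))          # → 180000.0      (10 Mbps, 1 hour)
print(estimated_rows(1_000, DAY))        # → 432000000.0   (1 GigE, 1 day)
print(estimated_rows(10_000, 90 * DAY))  # → 388800000000.0 (10 GigE, 90 days)
```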
Yikes. Those are really big numbers! Remember, this is only what’s coming from one appliance, and our FlowView storage system, on day one, is going to see thousands of appliances. Some appliances collect flow records from multiple interfaces, which means those numbers can double.
Knowing the scale of the data, what kinds of ad-hoc queries does FlowView typically need to support?
- Which applications generate the most traffic?
- Which hosts (whether they appear as the source or the destination IP) generate the most traffic for a given application?
- Which applications generate traffic from a given host?
- Which sources generate the most outbound traffic?
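In relational terms, each of those queries is a group-by with a top-N cut. A toy in-memory version of the first one, assuming rows shaped like the flow attributes described earlier (field names are illustrative):

```python
from collections import Counter

rows = [
    {"app_id": 80, "src_ip": "10.0.0.1", "bytes": 5_000},
    {"app_id": 53, "src_ip": "10.0.0.2", "bytes": 300},
    {"app_id": 80, "src_ip": "10.0.0.1", "bytes": 7_000},
]

def top_apps(rows, n=10):
    """Top applications by total bytes, descending."""
    totals = Counter()
    for r in rows:
        totals[r["app_id"]] += r["bytes"]
    return totals.most_common(n)

print(top_apps(rows))  # → [(80, 12000), (53, 300)]
```

The hard part, of course, is doing the same group-by over billions of rows instead of three.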
Those queries typically ask for the top results from the last hour, day, week, or 30 days, and we need to ensure reasonable response times for them. To make the problem even more interesting, we support very flexible filters for the above queries. We need to support filtering on any combination of the following:
- one or more IPs or IP subnets, matched against the source, the destination, or either
- a specific application or group of applications
- specific QoS bits
- a specific flow direction, such as inbound or outbound
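As a sketch, here is what applying such a filter combination to a single flow row looks like. Field names and the match-either-endpoint subnet semantics are illustrative; how we evaluate this efficiently at scale is the subject of the next two parts.

```python
from ipaddress import ip_address, ip_network

def matches(row: dict, subnets=None, apps=None, qos=None, direction=None) -> bool:
    """Return True if a flow row passes every supplied filter."""
    if subnets is not None:
        nets = [ip_network(s) for s in subnets]
        # Match against either the source or the destination IP.
        if not any(ip_address(row[k]) in n
                   for n in nets for k in ("src_ip", "dst_ip")):
            return False
    if apps is not None and row["app_id"] not in apps:
        return False
    if qos is not None and row["qos"] != qos:
        return False
    if direction is not None and row["direction"] != direction:
        return False
    return True

row = {"src_ip": "10.1.2.3", "dst_ip": "8.8.8.8", "app_id": 53,
       "qos": 0, "direction": "outbound"}
print(matches(row, subnets=["10.0.0.0/8"], apps={53}))  # → True
print(matches(row, direction="inbound"))                # → False
```

Omitted filters simply don’t constrain the result, which is what makes "any combination" the expensive part: no single index covers every combination.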
That’s pretty much all the data we collect. At this point, we’re looking at how to query trillions of rows, with arbitrary filters, and return results fast enough that nobody gets bored.
Does this look like an interesting problem to solve? Do you want to hear about the bumpy road we went down to solve it? In part 2, we’ll look at approaches that seemed like they would work, but didn’t, and in part 3, we’ll describe the solution that’s currently in production.
Have you solved interesting big data problems of your own? Please share your experience in the comments!