What’s the best way to debug web systems performance problems? It starts simply: look at how the application is functioning. But if the system is at scale, with massive concurrency, how do we wrap our heads around all that performance data?
Traditionally we aggregate our data into summary statistics like averages or percentiles. These are great tools, but like anything else they have trade-offs. In order to truly understand our systems, we need to know when and how to sidestep those abstractions to get deep, detailed performance insight. Watch the video for the 5-minute version, or read on for the full story.
Abstraction: “the forest for the trees”
When raw data becomes overwhelming, we turn to abstraction to understand our world. In our systems, the data is always overwhelming. So what are the abstractions we’re using, exactly? Abstraction is a transformation from data to actionable information. In order to make decisions, we take a huge amount of granular raw “fact” data, and use abstractions to reduce it to a working set. Feeding a day of latency data points into a summary mechanism, we can get an average, a median, a 95th percentile, and so on for that day.
Seeking Data <=> Abstraction Fit
As software engineers, we can never get too far from statistics. In fact, if you look at enough performance data, you’ll start to notice some trends:
Latency data–for all parts of systems–typically has a lower bound and a long tail on the top end. This is often modeled as a log-normal distribution. The lower bound makes sense: at the very least, nothing is going to happen faster than zero milliseconds, and typically there’s actually a physically reasonable lower bound that’s greater than zero. Where does the long tail come from? Concurrency and complexity in systems are the major contributors: as more shared resources and more users sharing the resources become involved, the greater the likelihood of concurrency-related delays.
The upshot of this is that summary statistics that we typically think about with a normal distribution won’t give us the type of insight we expect from them. For instance, if you look at the average for a lot of performance data, it may be higher than the 95th percentile value due to the long tail.
It’s not a contrived or uncommon scenario to see avg > 95th percentile. We recently launched weekly reports that send you stats about your latency. You’d be shocked how many web apps have such a long tail of latency that they meet this criterion. In fact, over 40% of TraceView customers have at least one app that exhibits this kind of outlier-skewed behavior. (Does yours?)
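To make that concrete, here is a minimal Python sketch. The numbers are made up for illustration (97 requests at 100 ms, plus 3 that hung for a full minute), and the 95th percentile uses the simple nearest-rank definition, which is only one of several conventions in use.

```python
import statistics

# Hypothetical window: 97 requests at 100 ms, 3 requests that hung for 60 s.
latencies_ms = [100] * 97 + [60_000] * 3

ordered = sorted(latencies_ms)
n = len(ordered)

mean_ms = statistics.mean(ordered)
median_ms = statistics.median(ordered)
# Nearest-rank 95th percentile, using integer math for ceil(0.95 * n).
p95_ms = ordered[(95 * n + 99) // 100 - 1]

print(f"mean   = {mean_ms} ms")
print(f"median = {median_ms} ms")
print(f"p95    = {p95_ms} ms")
print("mean above p95:", mean_ms > p95_ms)
```

Only 3% of the requests are slow, so they sit entirely above the 95th percentile, yet they are extreme enough to drag the mean far past it.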
One of the reasons we see such skewed averages is long-tail outliers. Good continuous statistical models, by their definition, successfully account for most of the data systems generate. Unfortunately, there are frequently effects that are difficult to model, albeit still important to account for.
If we see a spike in mean latency, what does it represent? With summary statistics, we’ll tend to see random spikes, but without more information it’s nearly impossible to distinguish between a single left-field outlier and a systematic change.
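A small sketch of why the mean alone can’t tell these cases apart. The three windows below are invented for illustration: a quiet baseline, the same window with one wild outlier, and a window where every request got slower.

```python
import statistics

baseline = [100] * 100               # every request takes 100 ms
one_outlier = [100] * 99 + [10_100]  # same traffic, plus a single 10.1 s request
systematic = [200] * 100             # every request is now 100 ms slower

windows = {
    "baseline": baseline,
    "one outlier": one_outlier,
    "systematic": systematic,
}

for name, window in windows.items():
    mean_ms = statistics.mean(window)
    median_ms = statistics.median(window)
    print(f"{name:12} mean={mean_ms} ms  median={median_ms} ms")
```

Both problem windows report the same 200 ms mean. The median separates them here (100 ms versus 200 ms), but only because we went back to the underlying samples.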
When we look at performance data, we’re rarely seeing a population of samples that describes a single behavior. That means we’re rarely seeing a single normal distribution, or even a single log-normal distribution, underlying our data. The characteristics we measure are heavily influenced by the behavior of end users, the resources in place, infrastructure concurrency, and many more factors.
In the case of multi-modal data, averages are especially dangerous. Averaging latency across logged-in and logged-out user sessions might be mixing apples and oranges. Similarly, when we mix cold and warm cache hit performance data together, an average manufactures an experience that no user ever had: a value somewhere between two sides of a binary.
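Here is a quick sketch of the cache example, with hypothetical numbers: 80 warm-cache requests around 5 ms and 20 cold-cache requests around 400 ms.

```python
import statistics

warm_ms = [4, 5, 6, 5] * 20          # 80 warm-cache requests, around 5 ms
cold_ms = [380, 400, 420, 400] * 5   # 20 cold-cache requests, around 400 ms
mixed_ms = warm_ms + cold_ms

mean_ms = statistics.mean(mixed_ms)
# How many real requests landed within 50% of that mean?
near_mean = [x for x in mixed_ms if abs(x - mean_ms) <= 0.5 * mean_ms]

print(f"warm mean:  {statistics.mean(warm_ms)} ms")
print(f"cold mean:  {statistics.mean(cold_ms)} ms")
print(f"mixed mean: {mean_ms} ms")
print(f"requests anywhere near the mixed mean: {len(near_mean)}")
```

The mixed mean comes out at 84 ms, a latency that not a single request in the sample was even close to.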
Better Abstractions: Quantiles
Quantiles, usually percentiles, are the next step up from averages. These stats make a very specific and useful guarantee: at the 90th percentile, 90% of requests were that latency or faster, while 10% were slower. This immediately provides an understanding of the distribution of requests–something that averages are missing. For that reason percentiles form the basis for user experience scoring systems like APDEX. They’re also used in many SLAs and burstable billing models.
But percentiles also leave a lot of open questions: how far is my 95th percentile from the median? Is it a long tail, or are my values clustered tightly? Pretty soon you want a bunch of percentiles to understand your (again, often multi-modal) data.
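The percentile guarantee is easy to check by hand. The sketch below assumes the nearest-rank definition (other tools interpolate between samples, so their values may differ slightly) and uses a small made-up sample.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample with at least p% of samples at or below it.
    """
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]

latencies_ms = [20, 22, 25, 30, 31, 35, 40, 55, 120, 900]  # hypothetical samples

p90 = percentile(latencies_ms, 90)
at_or_faster = sum(1 for x in latencies_ms if x <= p90)
slower = len(latencies_ms) - at_or_faster
print(f"p90 = {p90} ms: {at_or_faster} requests at or faster, {slower} slower")

# One percentile rarely tells the whole story, so look at several at once.
spread = {p: percentile(latencies_ms, p) for p in (50, 75, 90, 95, 99)}
print(spread)
```

Even on ten samples, the jump from the median to the 95th percentile shows how much a single number hides about the tail.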
For instance, in the following graph, are we looking at a single, highly variant population, or rather multiple populations of performance data?
Maybe there are 3 different workloads being represented here: fast, moderate, and slow. And the fast requests simply stopped happening for a period of time, which led to the blip in our aggregates:
Understanding Data Distribution
So maybe what we’d like to do is get a good sense of our data’s underlying distribution, so we can figure out what stats might map to it well. A histogram is a good tool for this. Histograms plot value vs frequency; a histogram for the normal distribution gives us our classic “bell curve.”
But that’s not the data we work with; ours is much more interesting. So let’s take that colored multi-modal data from above and see what it would look like as a histogram:
This does a good job of showing us we have trimodal data!
Keep in mind that bucketization–the ‘width’ of our x-axis columns in terms of the values they span–has a pretty significant impact on what our histogram will look like. Wide buckets will obscure through smoothing, while small buckets will render the data too noisy. (CDFs are one way to avoid bucketization problems, but they’re basically unsuitable for human consumption so they are found primarily in academia.)
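To see the effect of bucket width, here is a minimal sketch that buckets the same hypothetical trimodal sample three different ways. The sample values and bucket widths are chosen purely for illustration.

```python
from collections import Counter

def histogram(samples, bucket_width):
    """Count samples per bucket; each key is the bucket's lower bound."""
    counts = Counter((x // bucket_width) * bucket_width for x in samples)
    return dict(sorted(counts.items()))

# Hypothetical trimodal latencies (ms): clusters near 10, 50 and 200.
samples = [8, 9, 10, 11, 12, 48, 50, 52, 55, 200, 205, 210, 215]

wide = histogram(samples, 500)   # one giant bucket swallows everything
medium = histogram(samples, 40)  # three modes stand out
narrow = histogram(samples, 1)   # every sample gets its own bucket

print("wide:  ", wide)
print("medium:", medium)
print("narrow:", narrow)
```

The wide buckets collapse everything into a single bar, the narrow ones produce thirteen bars of height one, and only the middle setting makes the three modes visible.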
Operational Data Distribution
Did we lose anything when turning to the histogram? We seem to have lost one of our favorite axes: the time axis! This is incredibly important in the way we think about operationalizing performance data. If we can’t perceive changes over time, maybe related to an incident, or a deploy, or increased request volume, we’d be missing out on some pretty important information.
Is there a way we can get time back into our histograms?
Yikes, that’s kind of messy. What about…
Though the term “heatmap” is used to describe a number of different visualizations, what ties them all together is the notion that color is used to represent intensity of a value. This density of color lets us represent up to 3 dimensions in a normally 2D chart, and it will even still make sense.
To see how this works, let’s first take a histogram representing the distribution of latency values over a 30-second span of performance data and color code bars by their height:
That seems redundant, right? We’re using both color and size to indicate the number of requests in each latency band. So now we don’t need the height anymore:
If we rotate it vertically, we’ve got an extra axis to play with–our time axis is back!
We’ve ended up with a visualization that’s robust to non-normal distributions and, best of all, reveals rather than conceals multi-modal data!
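The construction described above (one histogram per time slice, with the count mapped to color) can be sketched in a few lines of Python. This is an illustrative toy, not how any particular product renders its charts: the function names, the sample requests, the bucket sizes, and the text “shades” standing in for color are all assumptions.

```python
from collections import Counter

def heatmap_counts(requests, time_step_s, latency_step_ms):
    """Count requests per (time bucket, latency bucket) cell.

    Each column is the histogram for one time slice; the count in a
    cell is what gets mapped to color intensity.
    """
    cells = Counter()
    for timestamp_s, latency_ms in requests:
        cells[(timestamp_s // time_step_s, latency_ms // latency_step_ms)] += 1
    return cells

def render(cells, shades=" .:#"):
    """Draw a non-empty grid as text: slowest band on top, denser shade = more requests."""
    peak = max(cells.values())
    columns = range(max(t for t, _ in cells) + 1)
    rows = range(max(band for _, band in cells), -1, -1)
    lines = []
    for row in rows:
        line = ""
        for col in columns:
            count = cells.get((col, row), 0)
            # Empty cells stay blank; non-empty counts scale onto the other shades.
            level = 0 if count == 0 else 1 + (count - 1) * (len(shades) - 2) // max(1, peak - 1)
            line += shades[level]
        lines.append(line)
    return "\n".join(lines)

# Hypothetical (timestamp in seconds, latency in ms) pairs.
requests = [
    (0, 20), (5, 30), (10, 40), (20, 250),
    (35, 25), (40, 35), (45, 260), (50, 270),
    (70, 280), (80, 290), (85, 295),
]

cells = heatmap_counts(requests, time_step_s=30, latency_step_ms=100)
print(render(cells))
```

Reading the output left to right, the fast band along the bottom fades out while the slow band along the top fills in, which is exactly the kind of shift a single aggregate line would blur.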
Static heatmaps have incredible diagnostic power, but we can take them one step further. After plotting heatmaps of latency data, we frequently found ourselves asking “what is going on in those outliers?” or “what’s going on in this band of requests up here?” And that’s one of the hallmarks of a great visualization–not only can it help answer questions, but it allows you to learn the ones you wish you’d been asking in the first place.
We can add one more flourish to this visualization by making the chart itself selectable. This technique is particularly powerful if you have rich data behind it, like distributed transaction traces. In the graph above, we’re looking at the overall latency for requests to a single URL. To drill down on the slower requests, you can simply drag-select the region in question and view the source data points themselves.
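Under the hood, a drag-selection boils down to a rectangle in (time, latency) space that is used to filter the raw data points. A minimal sketch, assuming each request is a dict with hypothetical `trace_id`, `timestamp_s` and `latency_ms` fields (these names are illustrative, not any real product’s schema):

```python
def select_region(requests, t_start, t_end, latency_min, latency_max):
    """Return the raw requests inside the selected rectangle (upper bounds exclusive)."""
    return [
        r for r in requests
        if t_start <= r["timestamp_s"] < t_end
        and latency_min <= r["latency_ms"] < latency_max
    ]

# Hypothetical request records.
requests = [
    {"trace_id": "a1", "timestamp_s": 5, "latency_ms": 40},
    {"trace_id": "b2", "timestamp_s": 12, "latency_ms": 950},
    {"trace_id": "c3", "timestamp_s": 18, "latency_ms": 1200},
    {"trace_id": "d4", "timestamp_s": 40, "latency_ms": 1100},
]

# "What's going on in this band of slow requests during the first 30 seconds?"
slow = select_region(requests, t_start=0, t_end=30, latency_min=900, latency_max=2000)
print([r["trace_id"] for r in slow])
```

The selection returns the individual traces rather than another aggregate, which is what makes the drill-down useful.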
Beyond this blog post
- “Look at your data”, Velocity 2011 talk by John Rauser: http://www.youtube.com/watch?v=coNDCIMH8bk
- “The Statistics of Web Performance Analysis”, slides by Philip Tellis: http://www.slideshare.net/bluesmoon/the-statistics-of-web-performance-analysis
- Almost any blog post at Brendan Gregg’s blog: http://dtrace.org/blogs/brendan/