Filed under: Performance Monitoring
Most devops dashboards are filled to the brim with charts, and TraceView is no different. Since version 0.1, our dashboard has been one big chart that you can filter, slice, and dice. Our tweaks have focused almost exclusively on what’s below it, in an attempt to surface abnormalities in the data by looking at it from a number of different angles. All those tables of URLs and trend charts of breakdowns (like HTTP verbs, SQL operations, or API domains) tease apart our transaction traces into the most important bits, and summarize everything for your consumption.
The problem with tables and charts is that they’re hard to talk about. “Hey, are those full-collection scans in Mongo still too slow?” “Erm, the graph was … uh … blue. And it had a kink in it 2 days ago, where the latency was kind of going up slower than it usually does in the mornings. I think it’s fine!”
We figured we could do something to fix that:
The new Performance Summary lives above all the graphs and breakdowns inside your app. We didn’t actually remove anything; it’s just a new header for the rest of your data. We’ve hand-picked the most important numbers about your app: the metrics you’ve already internalized, the ones that, when one of them changes by 20%, make you think, “Wait, what happened?” Specifically:
- Average latency
- Average request volume
- Error frequency
Over the entire application, these metrics have pretty broad meaning. It’s still important to track them, but a 50% increase in average latency could be due to overloaded infrastructure, or it could simply be due to introducing a reporting feature that consistently takes 4 seconds to generate PDFs.
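To make that arithmetic concrete, here’s a minimal sketch (with made-up numbers, not real TraceView data) of how a low-volume but slow feature can move the application-wide average on its own:

```python
# Hypothetical request mix: 95% of traffic is fast page loads,
# 5% is a new PDF-reporting feature that takes ~4 s per request.
fast_requests = 950      # count of ordinary requests
fast_latency = 0.2       # seconds each

report_requests = 50     # count of PDF-report requests
report_latency = 4.0     # seconds each

total = fast_requests + report_requests
avg_with_reports = (fast_requests * fast_latency +
                    report_requests * report_latency) / total
avg_without = fast_latency

print(f"baseline average:  {avg_without:.2f}s")   # 0.20s
print(f"with PDF reports:  {avg_with_reports:.2f}s")  # 0.39s
# The app-wide average nearly doubles even though 95% of
# requests are exactly as fast as before.
```

Nothing broke here; the aggregate just absorbed a legitimately slow feature, which is exactly why an app-wide number alone can mislead.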
To help you tease that apart, all of these numbers behave exactly the same way the rest of the app does. Any filter — domain, URL, layer, operation, or anything else — applies to the Performance Summary as well. As you work on specific projects within your app, this lets you build that same sort of understanding around subsets of your performance data. Even better, if you’re digging into an incident after the fact, you can compare numbers directly by just selecting the data around the time the alert went off. Not only can you compare against your top-level baseline, you can also see the difference between problem and non-problem periods.
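As a rough sketch of what that kind of filtered comparison computes — the record format and `summarize` helper below are hypothetical, not TraceView’s actual API — filtering request records by URL and time window and then summarizing the three headline metrics might look like:

```python
from datetime import datetime

# Hypothetical trace records: (timestamp, url, latency_s, is_error).
requests = [
    (datetime(2013, 5, 1, 9, 0),  "/reports", 4.1, False),
    (datetime(2013, 5, 1, 9, 5),  "/home",    0.2, False),
    (datetime(2013, 5, 1, 9, 10), "/home",    0.3, True),
    (datetime(2013, 5, 1, 14, 0), "/home",    0.2, False),
]

def summarize(records, url=None, start=None, end=None):
    """Apply the same filters the charts use, then compute the
    three headline metrics for the matching subset."""
    subset = [r for r in records
              if (url is None or r[1] == url)
              and (start is None or r[0] >= start)
              and (end is None or r[0] < end)]
    if not subset:
        return None
    return {
        "avg_latency_s": sum(r[2] for r in subset) / len(subset),
        "request_count": len(subset),
        "error_rate": sum(r[3] for r in subset) / len(subset),
    }

# Compare the incident window against the whole day's baseline
# for the same URL.
incident = summarize(requests, url="/home",
                     start=datetime(2013, 5, 1, 9, 0),
                     end=datetime(2013, 5, 1, 10, 0))
baseline = summarize(requests, url="/home")
```

The point is that the same filter predicate feeds both the charts and the summary, so the numbers you compare are always computed over exactly the subset you’re looking at.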
We’ve been using this internally, and we’re surprised we managed to live this long without it. Let us know what you think!