Earlier this year my colleague TR wrote about the launch of our data API. I’m a big fan of making data more accessible, so I used our own API to write a Hubot script for exploring TraceView data without ever leaving HipChat. Even while I was writing that article, though, we were already lining up a host of new features based on customer feedback. There are a lot of improvements, but now that v2.0 of our API has been rolled out I wanted to write about my three favorites.
One of the most popular requests has been better API support for layers, which are how we represent components of your web application’s stack within TraceView. Traces are composed of calls to layers, and layers also feature prominently in our top-level views of app performance. By drilling down into an app’s layers, it’s possible to find performance problems like looped queries or webserver queueing.
But what if you want to access layer-based data programmatically, rather than using our heatmap? With APIv2 out the door, that’s now an option. A solid first step is to use our discovery API to retrieve a list of layers currently associated with your app in TraceView. This can be a quick sanity check to determine whether everything is functioning properly. Just to give an example, if your cache layer suddenly goes missing, that could indicate an unintended configuration change within your application’s architecture.
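As a rough sketch of that sanity check, the comparison boils down to a set difference between the layers you expect and the layers the discovery API reports. The layer names below are made up for illustration, and the `discovered` list is a stand-in for whatever the discovery response actually contains.

```python
def missing_layers(expected, discovered):
    """Return the expected layer names absent from the discovered list, sorted."""
    return sorted(set(expected) - set(discovered))


# Hypothetical data: in practice `discovered` would be parsed from the
# discovery API response for your app.
expected = ["nginx", "wsgi", "memcache", "mysql"]
discovered = ["nginx", "wsgi", "mysql"]

gone = missing_layers(expected, discovered)
if gone:
    print("Missing layers:", ", ".join(gone))
```

Extra layers in `discovered` are ignored here on purpose, since a new layer showing up is usually less alarming than an expected one vanishing.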
One of the neatest changes to our API is that many endpoints now accept layers as a type of filter. This dovetails nicely with the layer discovery endpoint described above, but you can specify layers manually if you’d prefer. For example, we could use the discovery endpoint to retrieve the list of layers associated with an application, then pass each one as a filter to the server latency timeseries endpoint to retrieve layer-specific timing information.
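A minimal sketch of that loop might look like the following. The `fetch_latency` callable is an assumption: it stands in for a request to the latency timeseries endpoint with a layer filter, and the `(timestamp, latency)` pair format is invented for the example rather than taken from the real response schema.

```python
from collections import defaultdict


def merge_layer_series(layers, fetch_latency):
    """Call fetch_latency once per layer and merge the results by timestamp.

    fetch_latency(layer) is a caller-supplied function (hypothetical here)
    returning a list of (timestamp, latency) pairs for that layer.
    """
    merged = defaultdict(dict)
    for layer in layers:  # one API call per layer
        for timestamp, latency in fetch_latency(layer):
            merged[timestamp][layer] = latency
    # Sort by timestamp so rows come out in chronological order.
    return [(ts, merged[ts]) for ts in sorted(merged)]


# Stand-in for real API responses, so the sketch runs without network access.
FAKE_RESPONSES = {
    "nginx": [(1000, 0.012), (1030, 0.015)],
    "mysql": [(1000, 0.140), (1030, 0.095)],
}

rows = merge_layer_series(["nginx", "mysql"], FAKE_RESPONSES.__getitem__)
for ts, by_layer in rows:
    print(ts, by_layer)
```

Passing the fetch function in as an argument keeps the merging logic separate from the HTTP details, so it can be exercised with canned data like this.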
However, looping over a list of layers could generate a lot of API calls for a large or complex application, and we’re doing extra work to merge the API responses, too. Archiving per-layer server latency is such a common task that we also added a by-layer app latency endpoint. This retrieves the same data displayed on TraceView’s layer breakdown screen.
But now that we have that data, we can do whatever we want with it, such as rendering the main chart with an alternative charting library.
The two most common reasons that our users pull data out of TraceView are to integrate it with dashboards (like we’ve done with Geckoboard), and to archive data for comparisons with historical performance. Latency data is a good start for both use cases, but a 50% speed increase might be the result of half of your requests being dropped as 500 errors. That’s why we’ve extended our API with a new endpoint to fetch the percentage of traces containing errors within each time span.
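To make that pitfall concrete, here is a small worked example with made-up numbers. It assumes, as a simplification, that a failed request returns almost instantly (0 ms), and it represents each trace as a hypothetical `(latency_ms, had_error)` pair.

```python
from statistics import mean


def summarize(traces):
    """Return (mean latency in ms, percentage of traces with errors)."""
    avg = mean(latency for latency, _ in traces)
    error_pct = 100.0 * sum(1 for _, is_error in traces if is_error) / len(traces)
    return avg, error_pct


# Made-up traces as (latency_ms, had_error) pairs.
healthy = [(200, False)] * 10
# Half the requests now fail immediately with a 500; the rest are unchanged.
failing = [(200, False)] * 5 + [(0, True)] * 5

print(summarize(healthy))
print(summarize(failing))
```

The mean latency halves from 200 ms to 100 ms even though no successful request got any faster, which is only visible once the 50% error rate is reported alongside it.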
We made layers a big part of this batch of API improvements, so we’ve also made sure that the error rate endpoint accepts layer filters. TraceView already supports alerting based on per-layer error rates, but programmatic access to this data opens up alternative use cases like ‘sanity checks’ in continuous integration environments. Even full test coverage of a web application’s code might not be enough to catch ‘devops problems’ like excess database connections or memcached timeouts caused by stored values increasing in memory consumption. The latency API already lets you require that your app isn’t making more calls, or making them more slowly, and it can now be extended so that builds can be rejected if they have a higher error rate. This can catch unanticipated scaling problems, or potentially even identify a misconfigured build environment that could cause false positives or negatives elsewhere.
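One way such a build gate could work is sketched below. The per-layer error percentages are invented, the one-point tolerance is an arbitrary choice, and the dictionaries stand in for values you would fetch from the error rate endpoint for the baseline and candidate builds.

```python
def error_rate_regressed(baseline_pct, candidate_pct, tolerance_pct=1.0):
    """True when the candidate exceeds the baseline by more than the tolerance."""
    return candidate_pct - baseline_pct > tolerance_pct


def gate_build(baseline_by_layer, candidate_by_layer, tolerance_pct=1.0):
    """Return the sorted layers whose error rate regressed; empty means pass."""
    return sorted(
        layer
        for layer, candidate_pct in candidate_by_layer.items()
        # A layer with no baseline is treated as having had a 0% error rate.
        if error_rate_regressed(
            baseline_by_layer.get(layer, 0.0), candidate_pct, tolerance_pct
        )
    )


# Hypothetical per-layer error percentages for two builds.
baseline = {"wsgi": 0.5, "memcache": 0.1, "mysql": 0.2}
candidate = {"wsgi": 0.6, "memcache": 4.0, "mysql": 0.2}

failed = gate_build(baseline, candidate)
exit_code = 1 if failed else 0
print("regressed layers:", failed, "exit code:", exit_code)
```

In a CI script you would pass `exit_code` to `sys.exit` so that a regression fails the build. The tolerance exists because small fluctuations in error rate between runs are normal and should not block a merge.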
What works for a single-server WordPress blog might stop working for a backend with hundreds of hosts spanning multiple service tiers. If you spin up servers on a per-customer basis, you’ll be glad to hear that you can now create new apps with our app management endpoint. Provisioning a new customer’s app infrastructure? Deploy the nodes pre-installed with TraceView using our Chef recipe, assign them to the customer’s new app as they come online, and add an annotation as you complete each step in your runbook. Your support team can follow the live data as it streams in, even if they weren’t the ones who kicked off the job. When it starts to hang at step 15 out of 37, they’ll know whether to talk to a sysadmin or a network engineer – or maybe even your billing department.
But wait, there’s more!
I haven’t managed to touch on every change we’ve made to our API, but even so I wanted to point out we’ve improved some endpoints just by developing other areas of TraceView. For instance, assigning apps via API works great with our new app-based host screen. And while many of our customers already add annotations to mark deployments, we’re now showing them on end user experience graphs too. That means easier monitoring of frontend code changes like deploying new A/B tests or switching the CDN you serve assets from. Developers are glad to segregate themselves into ‘frontend’ or ‘backend’, but TraceView helps you understand how and why the decisions of one can strongly impact the other.