Filed under: Case Studies
If you’ve ever been on the ops team for a web application, you’ve probably noticed that many performance degradations and even downtime incidents are caused by human error. Sometimes that takes the form of misconfiguration, but often it’s a performance or availability problem due to regressions in application code.
There’s therefore a strong incentive to maintain a record of all operations events (releases, configuration changes, upgrades) that can be easily correlated with performance and error data to quickly resolve problems and find root cause for issues. That’s why Tracelytics provides an API for recording operations events.
In this blog post, I’ll introduce you to using our command-line tool tlog to annotate events in Tracelytics, and then go through a quick case study of how we (and our customers) use it to track regressions in code.
The tlog tool is a command-line client for the API that allows you to report system-wide events that appear as annotations above the charts shown in Tracelytics. It is automatically installed as part of the tracelyzer package, so you can run tlog wherever you are tracing.
$ tlog --help
Usage: tlog [options] [MESSAGE]
Logs an annotation message for viewing in Tracelytics.
-h, --help show this help message and exit
-m MESSAGE specifies the message to report.
-a APPNAME specifies an app to which this message belongs. (optional)
-H HOSTNAME specifies an alternate hostname associated with this
It will then show up like this on your charts and heatmaps (though collapsed by default):
Integrating annotations with operations processes
Of course, for these ops annotations to be useful they have to exist! That’s why we provide the command line tlog client that’s easy to integrate with your deployment and operations processes. The most common way this is used is to note new code deployments.
You can see here that a performance issue was resolved by a subsequent deployment of code.
Okay, great, so how would I do that? For a basic approach, you’d just have your deploy scripts execute something along the lines of:
For a more interesting example, which integrates the git branch and commit hash, here’s a fabric snippet that we use in our deploy automation:
In practice: finding a performance regression
It’s always great eat your own dog food, even if sometimes it’s embarrassing to get caught eating dog food. So, here’s just one incident in which we used Tracelytics to ensure that Tracelytics was running well.
After noting that one of our chart endpoints was suddenly causing latency alerts, we investigated in Tracelytics and noticed that requests to our web app were spending more time in the MySQL layer than they had previously:
And it turned out that this coincided with a code release—how unexpected! Because we do very frequent releases inspired by continuous deployment, we already knew where to look. In fact, a re-factored query had lost a constraint on a subquery which caused it to be doing work that was later thrown away by app logic.
So we were able to quickly find the issue, but as usual there’s more than one way to skin a cat. In this case we were able to verify the problem using Tracelytics before even looking at the commits and code involved. Drilling down on the MySQL layer, we were led to a particular recently slow query, as well as traces showing the circumstances under which it manifested:
Fortunately, in addition to being easy to track down, the issue was also easy to resolve!
Looking to get the most out of Tracelytics? This isn’t the only way to extend your visibility: check out last week’s post on framework instrumentation, or head to the documentation to read about the custom event API.