Recording Operations Events: tlog
March 21, 2012

Filed under: Performance Monitoring

If you’ve ever been on the ops team for a web application, you’ve probably noticed that many performance degradations and even downtime incidents are caused by human error. Sometimes that takes the form of misconfiguration, but often it’s a performance or availability problem due to regressions in application code.

There’s therefore a strong incentive to maintain a record of all operations events (releases, configuration changes, upgrades) that can be easily correlated with performance and error data, making it faster to resolve problems and find the root cause of issues. That’s why Tracelytics provides an API for recording operations events.

In this blog post, I’ll introduce you to using our command-line tool tlog to annotate events in Tracelytics, and then go through a quick case study of how we (and our customers) use it to track regressions in code.

Introducing tlog

The tlog tool is a command-line client for that API; it lets you report system-wide events that appear as annotations above the charts shown in Tracelytics. It is installed automatically as part of the tracelyzer package, so you can run tlog wherever you are tracing.

$ tlog --help
Usage: tlog [options] [MESSAGE]
Logs an annotation message for viewing in Tracelytics.
Options:
  -h, --help   show this help message and exit
  -m MESSAGE   specifies the message to report.
  -a APPNAME   specifies an app to which this message belongs. (optional)
  -H HOSTNAME  specifies an alternate hostname associated with this
                   message. (optional)
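
For example, to record a configuration change against a particular app and host (the app name and hostname here are just placeholders for your own):

$ tlog -m "Upgraded nginx on the frontend" -a "My_AppName" -H "web01.example.com"   # example names only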

The annotation will then show up like this on your charts and heatmaps (though collapsed by default):

[Screenshot: an annotation displayed above a Tracelytics chart]

Integrating annotations with operations processes

Of course, for these ops annotations to be useful they have to exist! That’s why we provide the command line tlog client that’s easy to integrate with your deployment and operations processes. The most common way this is used is to note new code deployments.

[Screenshot: a latency chart with deployment annotations]

You can see here that a performance issue was resolved by a subsequent deployment of code.

Okay, great, so how would I do that?  For a basic approach, you’d just have your deploy scripts execute something along the lines of:

$ tlog -m "Pushed a new release." -a "My_AppName"

For a more interesting example, which integrates the git branch and commit hash, here’s a fabric snippet that we use in our deploy automation:

from fabric.api import cd, local, run, settings

def _tlog_deploy(app, extra=None):
    # Who is deploying (run locally) and which host the code is going to.
    user = local("whoami", capture=True)
    host = run("hostname")
    # Grab the current branch name and commit hash from the deployed checkout.
    with cd("/deploy/tracelons/tracelytics"):
        branch = run("git branch | grep '*' | cut -d' ' -f2")
        commit = run("git log | head -n1 | cut -d' ' -f2")
    extra = ' ' + extra if extra else ''
    # warn_only: a failed annotation should never abort the deploy itself.
    with settings(warn_only=True):
        run('tlog -m "[deploy] %s deployed %s (%s) on %s%s" -a "%s"'
                % (user, branch, commit, host, extra, app))
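
With hypothetical values filled in for the user, branch, commit, and host, the command that ends up running looks something like this:

$ tlog -m "[deploy] alice deployed master (3f2c1ab) on web01" -a "My_AppName"   # example values only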

In practice: finding a performance regression

It’s always great to eat your own dog food, even if sometimes it’s embarrassing to get caught eating dog food. So, here’s just one incident in which we used Tracelytics to ensure that Tracelytics was running well.

After noting that one of our chart endpoints was suddenly causing latency alerts, we investigated in Tracelytics and noticed that requests to our web app were spending more time in the MySQL layer than they had previously:

[Screenshot: latency breakdown showing increased time in the MySQL layer]

And it turned out that this coincided with a code release (how unexpected!). Because we do very frequent releases inspired by continuous deployment, we already knew where to look: a refactored query had lost a constraint on a subquery, causing it to do work that was later thrown away by application logic.

So we were able to find the issue quickly, but as usual there’s more than one way to skin a cat. In this case we were able to verify the problem using Tracelytics before even looking at the commits and code involved: drilling down on the MySQL layer led us to a particular query that had recently become slow, as well as traces showing the circumstances under which it manifested:

[Screenshot: drill-down into the slow MySQL query and its associated traces]

Fortunately, in addition to being easy to track down, the issue was also easy to resolve!

Beyond tlog

Looking to get the most out of Tracelytics?  This isn’t the only way to extend your visibility: check out last week’s post on framework instrumentation, or head to the documentation to read about the custom event API.