Using TraceView to Identify and Solve Slow Query Problems
by March 4, 2013

Filed under: Application Performance Management, Performance Management Tech

AppNeta no longer blogs on DevOps topics like this one.

Feel free to enjoy it, and check out what we can do for monitoring end user experience of the apps you use to drive your business at www.appneta.com.

In this blog post, we’re going to continue our series of videos on common web application problems with a classic problem: finding a slow query. Watch through it below, or scroll down for the (edited and editorialized) transcript.

The app we’re going to look at today is actually the TraceView backend; a Python app with a MySQL DB. (We also use Cassandra, S3, and a couple other technologies, but today we’ll concentrate on the MySQL part of it). The MySQL access is done from a Thrift service, behind the Pythons app. Since we’re specifically looking for slow queries, let’s start with just the average call time for that layer.

Slow Query 1

Average MySQL query time over the last day

From this chart, it looks like this application is pretty fast: the average query time is less than 15 ms. However, the average can be misleading: many fast queries can hide the real effect of a couple slow ones. Another way to look at this is the SQL heatmap.

Slow Query 2

Distribution of MySQL query times over the last day (slowest 2% hidden by default)

Slow Query 3

Entire distribution of MySQL query times over the last day

The distribution gives us a bit more confidence that our queries are generally performing well. The default view hides the slowest 2%, or, put another way, the top of the graph here at around 50 ms is our 98th percentile line. That’s somewhat slower than our average, but still acceptably fast. The story changes when we include those slowest outliers. It turns out that our latency tail is quite long — some of the queries here are upwards of 10 seconds!

Slow Query 4

The slowest queries in our app

Slow Query 5

The endpoints experiencing the slowest queries in our app

In order to understand where those queries are coming from, we’ve selected them in the heatmap, and that’s filtered the Top URLs table to display only endpoints that created those queries. Not only can we see the query that’s being made, but it’s all from the same endpoint. Since this endpoint isn’t loaded that frequently, this problem was hiding behind the average; most places in the app may be fast, but for users of this particular (AJAX) endpoint, it’s incredibly slow!

Slow Query 6

A single request to /layer_host_series, with the offending query

Inside a single request, we can see that this is a real problem; virtually the entire page load time is due to this slow query. We can even see exactly what the query was, with all its bound parameters, by drilling down on the green libmysql layer. We can also see where in the code it came from, by looking at the stack trace in the sqlalchemy layer.

At this point, we could use EXPLAIN on this query, or go straight to the code to eliminate extra work. In this case, it turns out there is information in what we don’t see. The results of this query are typically cached in Cassandra, and there’s no call to Cassandra here! Re-enabling that caching layer cut the average time by 80%, and eliminated all queries over 5 seconds total.

More videos to follow

This is the second video in a series we are producing about identifying and solving common web app performance problems.  If there is a performance problem you would like us to cover, leave a comment below.