In this blog post, we’re going to continue our video series on common web application problems with a classic: finding a slow query. Watch through it below, or scroll down for the (edited and editorialized) transcript.
The app we’re going to look at today is actually the TraceView backend: a Python app with a MySQL DB. (We also use Cassandra, S3, and a couple of other technologies, but today we’ll concentrate on the MySQL part of it.) The MySQL access is done from a Thrift service behind the Python app. Since we’re specifically looking for slow queries, let’s start with just the average call time for that layer.
From this chart, it looks like this application is pretty fast: the average query time is less than 15 ms. However, the average can be misleading: many fast queries can hide the real effect of a couple slow ones. Another way to look at this is the SQL heatmap.
The distribution gives us a bit more confidence that our queries are generally performing well. The default view hides the slowest 2%, or, put another way, the top of the graph here at around 50 ms is our 98th percentile line. That’s somewhat slower than our average, but still acceptably fast. The story changes when we include those slowest outliers. It turns out that our latency tail is quite long — some of the queries here are upwards of 10 seconds!
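To see why the average alone was misleading, here’s a minimal sketch with made-up numbers (not real TraceView data): mostly fast queries, a small mid-range band, and two multi-second outliers. The average stays under 15 ms even though the worst queries take over 10 seconds.

```python
# Hypothetical latency distribution (seconds) illustrating how a healthy
# average can hide a long tail. These numbers are invented for the example.
import statistics

latencies = [0.010] * 9700 + [0.050] * 298 + [10.0, 12.0]

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(int(round(p / 100 * len(ordered))) - 1, 0)
    return ordered[rank]

print(f"average: {statistics.mean(latencies) * 1000:.1f} ms")  # ~13 ms
print(f"p98:     {percentile(latencies, 98) * 1000:.1f} ms")   # 50 ms
print(f"max:     {max(latencies):.1f} s")                      # 12.0 s
```

Two pathological queries out of ten thousand barely move the mean, which is exactly why a heatmap or percentile view is worth a look even when the average chart looks fine.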
In order to understand where those queries are coming from, we’ve selected them in the heatmap, and that’s filtered the Top URLs table to display only endpoints that created those queries. Not only can we see the query being made, but we can also see that all of these slow queries come from the same endpoint. Since this endpoint isn’t loaded that frequently, the problem was hiding behind the average; most of the app may be fast, but for users of this particular (AJAX) endpoint, it’s incredibly slow!
Inside a single request, we can see that this is a real problem; virtually the entire page load time is due to this slow query. We can even see exactly what the query was, with all its bound parameters, by drilling down on the green libmysql layer. We can also see where in the code it came from, by looking at the stack trace in the sqlalchemy layer.
At this point, we could use EXPLAIN on this query, or go straight to the code to eliminate extra work. In this case, it turns out the clue is in what we don’t see. The results of this query are typically cached in Cassandra, and there’s no call to Cassandra here! Re-enabling that caching layer cut the average time by 80%, and entirely eliminated queries over 5 seconds.
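The fix described above — check the cache first, fall back to MySQL, and write the result back — is the classic cache-aside pattern. A minimal sketch follows; every name in it (`cache_get` semantics via a plain dict, `run_query`, `cached_query`) is a hypothetical stand-in, not TraceView’s actual code or Cassandra API.

```python
# Cache-aside sketch. The dict stands in for the Cassandra cache layer,
# and run_query stands in for the slow MySQL query. All names here are
# hypothetical, for illustration only.
import hashlib

cache = {}  # stand-in for the external cache

def cache_key(sql, params):
    """Derive a stable cache key from the query text and bound parameters."""
    raw = sql + "|" + repr(sorted(params.items()))
    return hashlib.sha1(raw.encode()).hexdigest()

def run_query(sql, params):
    """Stand-in for the expensive MySQL query."""
    return [("row-for", tuple(sorted(params.items())))]

def cached_query(sql, params):
    key = cache_key(sql, params)
    if key in cache:                     # fast path: cache hit
        return cache[key]
    result = run_query(sql, params)      # slow path: hit MySQL
    cache[key] = result                  # write back so the next call is fast
    return result
```

In a real deployment the cache entries would also carry a TTL or explicit invalidation, since stale results are the usual cost of this pattern; the sketch omits that to keep the fast-path/slow-path shape clear.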
More videos to follow
This is the second video in a series we are producing about identifying and solving common web app performance problems. If there is a performance problem you would like us to cover, leave a comment below.