Filed under: Performance Monitoring
Eric Waller – Application Performance Management Guest Blogger
We’ve had great success at SeatGeek moving more and more of our software into independent services. Clear service boundaries have allowed us to improve our code quality, increase programmer productivity/happiness, open source a few things, and in some cases, drastically improve performance.
The flip-side is that with 3 or 4 different languages connecting to 3 or 4 different data stores, with the network between them, and all writing to different log files, it has become a lot more difficult to reason about performance. When you can render a page with a single (albeit complex) SQL query, it’s easy to know where to look when you want to improve response times. If rendering that same page means a simpler SQL query, plus an HTTP call to an external service which may in turn communicate with Redis, it’s a little bit tougher to figure out where you’re spending most of your time.
StatsD and Graphite
Our first solution to this problem was a combination of StatsD and Graphite. Etsy has an interesting post about their usage of this combo. In short, StatsD + Graphite gives you a very simple way to time and count things, and then graph the results in real-time. Here’s an example chart of our tickets feed load times over the past 2 months or so. You can see the result of some performance improvements we made in mid-September.
This is a pretty powerful setup. We’re measuring a ton of stuff using StatsD and Graphite, and it’s a great tool for letting you know where to start looking when investigating performance issues.
Improving Application Performance with TraceView
Recently we were lucky enough to take part in the beta of a new product called TraceView. While StatsD + Graphite can help you figure out where to look, TraceView comes right out and tells you exactly what’s wrong. TraceView is organized around the concept of a “trace”, which is a detailed snapshot of a specific request. A trace contains details about every layer of software involved in handling a request, even when those layers are separated by the network.
I can’t open TraceView without stumbling across a glaring performance issue, and I mean that literally. I just popped TraceView open to grab some screenshots for this post and realized that we were requesting the same object from s3 multiple times in a single request – not only did we have some bad logic which was grabbing the object over and over again, but because of incorrectly configured permissions, our per-server file cache for s3 objects wasn’t writable and so wasn’t working at all. Check it out (click to enlarge):
What you’re seeing here is the details view for a single trace. The timeline at the top shows timing information for various request layers. The bottom left is showing details about the currently selected layer (the darkest blue rectangle on the timeline), which happens to be an HTTP request to s3. I clicked on each of the other blue rectangles, and saw that they too were requests to s3. Not good. Now as if that weren’t enough information to get started on a fix, I can scroll down and get an exact stack trace for each call:
Remember the “performance improvements” from mid-September that were illustrated in the Graphite screenshot above? Well that problem was actually uncovered and diagnosed with TraceView. I apologize for the crappy screenshot, but I took it just to throw in an email back in September (you can’t view historical data past a week in the TraceView interface yet – the data is retained, it’s just not exposed in the interface yet). Here is a heatmap view of the performance of a specific SQL query (click to enlarge):
This a very simple query pulling tickets out of a single table. We had indexes in the right places, but the query pulls a lot of data and the table has grown a lot, and probably wasn’t residing completely in memory anymore. We ended up upgrading our DB server to the next instance size and tuning some MySQL config parameters and were able to knock the average time for that query down from ~500ms to ~5ms.