Using TraceView to Identify and Solve Webserver Queueing Problems
February 26, 2013

Filed under: Application Performance Management, Performance Management Tech


We’re kicking off a series of videos on how to find and attack common problems in your complex web applications. Today’s topic is one we’ve seen at a number of customers: Webserver Queueing. Check out the video below or, if you don’t have three minutes or a pair of headphones handy, read on for a (liberally edited) transcript.

Queueing occurs when one part of the stack has lower throughput than the layer that calls into it. It can manifest as queueing proper, where the request simply waits for the layer below to signal that it’s ready, or it can manifest as timeouts. In the case of webservers, the timeout can be fairly long (five minutes in Apache’s default mod_proxy!), so any queue time shorter than that gets added directly to the latency of the request; the config sketch below shows where that default comes from. Let’s look at how this might manifest in a Ruby on Rails application.
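To make the timeout concrete, here’s a minimal sketch of the relevant Apache directives. The backend address and port are hypothetical, but the directives are standard mod_proxy: in the Apache versions of this era, mod_proxy waits on the server-wide Timeout (300 seconds by default) unless ProxyTimeout overrides it, which is exactly how minutes of queue time can hide inside a single request’s latency.

    <VirtualHost *:80>
        # Hypothetical single app server listening on port 8000.
        ProxyPass        / http://127.0.0.1:8000/
        ProxyPassReverse / http://127.0.0.1:8000/

        # Without this, mod_proxy inherits the global Timeout
        # (300 seconds by default), so a request can sit in the
        # queue for minutes before anything fails outright.
        ProxyTimeout 30
    </VirtualHost>

Capping ProxyTimeout like this doesn’t fix queueing, but it converts multi-minute waits into fast, visible errors instead of silent latency.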

[Image: A breakdown of latency, by layer, from our Rails application.]

Our Rails app is a single mongrel server behind Apache. Our first clue (besides the fact that we’re running only a single app server) is front and center in this breakdown. Apache seems to be responsible for more than 50% of the request latency. Ideally, this should be closer to 10 ms, not the “X” ms we’re seeing here.

[Image: A heatmap of time spent in Apache. Darker squares mean more requests.]

Apache’s performance can vary with a number of things, such as client connection speed: a mobile user takes more time to download content than a desktop user. Looking at this graph, though, that doesn’t seem to be the case. Our Apache response times are universally high, something we wouldn’t have been able to see by looking only at the average line. Consistently high latency from Apache like this points to a systemic problem such as queueing, rather than one triggered by particular requests or data.

[Image: A single request, with 2 seconds (!!) of queueing before the app server.]

To really confirm this problem, we should drill all the way down into a single request. Not only does this request spend most of its time in Apache, but it does so before the app server even receives the request. The connection speed of this client seems fine; a slow client would show up as a block of time spent in Apache after Rails returned. The webserver is simply waiting for the app server to become free to handle this request, adding an inordinate delay to what would otherwise be a quick pageload.

[Image: A trace of the same request after adding app server capacity.]

Once we’ve found this problem, one solution is simply to add more capacity at the application server layer, either by tweaking the settings to run more processes or, if this were a larger environment, by adding more servers. The trace above is what our requests look like after making that change. The response time has dropped from two seconds to 250 ms, and virtually all the time is spent in Rails actually doing work … all because of an easy tweak to our application’s config.
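For reference, here’s a hedged sketch of what that capacity change might look like on the Apache side. The ports and the three-member count are illustrative, not from the original setup: instead of proxying straight to one mongrel, mod_proxy_balancer spreads requests across several, with the matching Rails-side change being more mongrel processes (for example, raising the servers count in a mongrel_cluster config).

    # Hypothetical: three mongrels instead of one, load-balanced by Apache.
    <Proxy balancer://mongrels>
        BalancerMember http://127.0.0.1:8000
        BalancerMember http://127.0.0.1:8001
        BalancerMember http://127.0.0.1:8002
    </Proxy>

    ProxyPass        / balancer://mongrels/
    ProxyPassReverse / balancer://mongrels/

With more backend capacity available, requests spend their time in Rails doing work instead of waiting in Apache’s queue.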

More videos to follow

Over the next few weeks, I’ll publish similar videos about common web performance problems, how to identify them using TraceView, and what to do to solve them. Stay tuned to the AppNeta blog for future video releases.