Monitoring Background Jobs in Ruby’s Resque

by TR Jordan on January 22, 2013

Tracing Resque

Here at AppNeta, we get to see a lot about how people build their web applications. From simple PHP scripts to heavily service-oriented Java clouds to monolithic Django apps, everybody’s product is architected a little differently. We’re still out to trace everything, and today I want to talk how to get visibility into an important component of any complex system: the messaging queue. Specifically, let’s look at how to trace a job from Rails using Resque.

Messaging Queues

If you haven’t used a messaging queue in your app, the idea is simple. Instead of forcing all the work to happen during the request, while the user is waiting, you can delay some of the more time-consuming tasks. You can do anything in these tasks, ranging from a simple insert to kicking off a series of user analytics that touch all parts of your infrastructure. The advantage is that you can return a speedy response to the user, or, if they are actually waiting on the task results, give them a better loading interface than a white screen and browser loading bar.

A Quick Resque Tutorial

In Ruby, Resque is a task runner, which by default stores the task descriptions in Redis(though other options are available). Resque jobs are just Ruby classes, with a single mandatory method perform. Resque will call perform with the arguments given in the task description. Let’s look at a minimal task, that takes a single argument and prints it. (Useless, I know.)

1
2
3
4
5
6
7
8
class PrintArgument
  @queue = :useless
  def perform(arg)
    puts 'Hooray, ', arg
  end
end

The @queue variable defines a name that a worker can bind to, in case you want to spread different types of jobs across different machines. To create a task that this worker could run, we just call it from our request:

Tracing a Resque Task

Now that we’ve added this to our system, we should have monitoring around it. The easiest way to do this would be to just measure the time each task takes, and log that information:

1
2
3
4
5
6
7
8
9
10
11
class PrintArgument
  @queue = :useless
  def perform(arg)
    start_time = XXX
    puts 'Hooray, ', arg
    end_time = XXX
    print 'Time: ', end_time - start_time
  end
end

Unfortunately, the data presentation here leaves a bit to be desired, so I’m going to use TraceView to log this information instead. This also has the benefit of logging any SQL queries, cache accesses, or service calls that we might do in a more complex task, as well as reporting errors. To start a trace fresh, we can wrap this call in the start_trace block:

And that’s our job! Maybe not the most interesting job, and probably not prone to performance issues, but we don’t know that yet. So let’s measure it!

1
2
3
4
5
6
7
8
9
10
class PrintArgument
  @queue = :useless
  def perform(arg)
    Oboe::API.start_trace('print_argument') do
      puts 'Hooray, ', arg
    end
  end
end

That’s a start! We’ve now got some visibility into our Resque jobs, and we can rest easy knowing that this is running smoothly in production.

Tracing a Resque Job (with multiple tasks!)

For cron-style jobs, the approach of tracing each task individually works fine. For reference, let’s look at the events we’re generating with that code:

Ruby Tracing

Pretty straightforward. Now let’s consider a more complicated set of tasks: a document-processing pipeline. That code might look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# Worker one
class ProcessDoc
  @queue = :file_serve
  @task_name = :process_doc
  def self.perform(doc)
    # Make sure to continue the task here
    Oboe::API.start_trace(@task_name) do
      # Do work here
      doc.process
      puts "did some work on", arg1.name
    end
    # Relinquish control of current task
    Resque.enqueue(Archive, doc)
  end
end
# Worker two
class Archive
  @queue = :file_serve
  @task_name = :archive_doc
  def self.perform(doc)
    # Make sure to continue the task here
    Oboe::API.start_trace(@task_name) do
      # Do work here
      doc.store_in_s3
    end
  end
end

In this case, our first task takes a document, and the second one archives it. If we have multiple tasks, each one gets logged separately, and we can figure out same statistics for each — average, std. dev., percentiles, and the like. But what if you have a job that spans multiple tasks? We can further aggregate the stats, but we might be starting to miss things, like large inputs that cause the entire pipeline to slow down.

What we’d really like is to correlate the related tasks, so instead of timing the each task, we’re timing the entire job. Under the hood, TraceView generates a token for each request. If we pass this ID (generally stored in xtrace, after the X-Trace header it’s passed around in) to each task, we can correlate those timings before storing them, and retrieve them all together. To do this, we can modify each task to take this token, and trace using that ID. ProcessDoc then becomes:

1
2
3
4
5
6
7
8
9
10
11
12
13
def self.perform(xtrace, doc)
  # Make sure to continue the task here
  result, xtrace = Oboe::API.start_trace(@task_name, xtrace) do
    # Do work here
    doc.process
    puts "did some work on", doc.name
  end
  # Relinquish control of current task, and pass ID to second worker
  Resque.enqueue(Archive, xtrace, doc)
end

Now we need to start the trace somewhere, but we’re not doing it in the job. We could start it in the first task, or we could link this one step further up the chain and tie it back to the web request that started it in the first place. In a default rails stack, that request generates the following events:

Ruby Tracing 2

To add in the task queue call to the logged request, we can call the following function.

1
2
3
4
5
6
7
8
9
10
def start_job(*args)
  # Log the fact that we're starting a job, and send the job with the current
  # task id
  ctx = Oboe::Context.toString
  Oboe::API.log_entry('job', { :Async => true })
  xtrace = Oboe::Context.toString unless not Oboe::Context.isValid
  Oboe::Context.fromString(ctx) unless not ctx
  Resque.

We have to force a fork in the execution path to indicate that we’re running an asynchronous task, possibly in parallel, with the rest of the web request, which is done with the call tofromString. Aside from that, this is the same underlying call as is done by start_traceabove — log that we’re entering a named block of code, and start timing it.

When we put it all together, we get a secondary execution path attached to the web request, and the logged events look like this:

Ruby Tracing 3

Now we’ve got everything: the original request, all tasks, individual timing information, and a global view of how the process performed. Not that we now have an additional timing measurement here: the delay before starting the task at all. In this case, we waited a full 500ms between queuing the job actually executing it! Once we were in the pipeline, the tasks happened much faster (only 25ms between processing and archiving).

Caveats

Lest you think that everything was easy, there’s a couple things to keep in mind when you use this in your own application.

  • Because we’re starting the timing in the web request and ending it in a task queue, we’re relying on those two processes to have an identical clock. If they’re on the same machine, it won’t be a problem, but on different machines, any clock skew will effect the timing.
  • I’ve quietly assumed everything in this system is reliable, which is almost certainly wrong. Whatever your error handling is, make sure you always log the exit event for ‘job’, or you may never know that you have errors!
Summary
Article Name
Monitoring Background Jobs in Ruby’s Resque
Author
Description
Using Ruby's Resque is a great way to optimize web performance if you're being diligent about monitoring the performance of background jobs.
TwitterFacebookLinkedInRedditEmail

Get the most awesome monitoring for your entire site.

Create your free account of TraceView and AppView today!

Create your free account