Python and gevent
November 9, 2012

The easiest way to make your code run faster is to do less. At some point, though, you don’t want to do less. Maybe you want to do more, without it being any slower. Maybe you want to make what you have fast, without cutting out any of the work. What then? In this enlightened age, the answer is easy — parallelize it! Threads are always a good choice, but without careful consideration, it’s easy to create all manner of strange race conditions or out-of-order data access. So today, let’s talk about a different route: event loops.

Event whats?

If you’re not familiar with evented code, it’s a way to parallelize execution, similar to threading or multiprocessing. Unlike threads, though, evented code is typically cooperative — each execution path must voluntarily give up control. Each of these execution units actually runs in serial, and when finished, returns control to the main loop. The parallelization gain comes from cleverly dividing the work so that when a unit makes a blocking call (e.g. a DB call, HTTP request, or disk access), it gives up control, letting the main event loop run other functions while it waits for the call to return. This is perfect for cases where you do a lot of I/O and relatively little work in the evented thread itself.
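
To make that concrete, here’s a minimal sketch (not from the post’s actual code) using gevent, which we’ll get to shortly: two tasks take turns, each yielding to the event loop at its “blocking” call, so their steps interleave even though only one runs at a time.

import gevent

def worker(name):
    for i in range(3):
        print('%s: step %d' % (name, i))
        # gevent.sleep() stands in for a blocking call (DB, HTTP, disk);
        # it hands control back to the event loop until the wait is over
        gevent.sleep(0.1)

# Both greenlets make progress, switching at each blocking point
gevent.joinall([gevent.spawn(worker, 'a'), gevent.spawn(worker, 'b')])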

In my case, I’m interested in doing a number of heterogeneous, but related, data lookups in response to a web request. We’re running all of this behind our data access layer, which is an evented Thrift server. I have a number of functions (with a common API) I’m interested in running, and a naive implementation would look like this:

import oboe

def get_app_stats(app, start, end):
    # Each function takes (app, start, end), and returns a dictionary
    data_funcs = {'errors': get_num_errors,
                  'alerts': get_triggered_alerts,
                  'latency': get_latency_series,
                  'volume': get_volume_series,
                  'urls': get_top_urls,
                  'controllers': get_top_controllers,
                  'queries': get_top_queries}
    results = []
    for key, func in data_funcs.items():
        # Let's profile each of these function calls with TraceView
        with oboe.profile_block(key):
            results.append(func(app, start, end))
    return results

What does that do?

If we run this on a machine with TraceView installed, we’ll see the following request structure:

[Image: Event Loops 1 (the serial request structure in TraceView)]

Pretty predictable. We called into each function serially, which is exactly what we said we’d do. We can also look at the raw events TraceView collected, and they tell a similar story:

[Image: Event Loops 2 (the raw events collected by TraceView)]

All together now!

This seems like a good baseline, but let’s see what happens when we parallelize it. Let’s use Python’s gevent, which has two major selling points. First, it implements an event loop based on libevent, which means we won’t have to worry about actually implementing the event loop. We can just spawn separate greenlet (i.e. non-native) tasks, and gevent will handle all the scheduling. The other big advantage is that gevent knows about and can monkeypatch existing synchronous libraries to cede control when they block. This means that outside of the actual event spawning, we can leave our existing code untouched, and our external calls will just do the right thing.
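
For instance, a single early call to gevent’s monkey module swaps in cooperative versions of the standard socket, time, and thread modules, so ordinary blocking code starts yielding to the event loop:

# Do this as early as possible, before other modules grab
# references to the blocking versions of socket, time, etc.
from gevent import monkey
monkey.patch_all()

import urllib2  # now uses gevent-aware sockets, and yields while it waits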

That just leaves the question of how to break up the work into parallel coroutines. It seems natural to give each of these functions its own, and then have our “main thread” wait for them all to finish. We do this by firing off each function in a separate task and collecting them in a list. We then wait for all of those tasks to finish and collect the results. Easy! Here’s what the same function as above looks like, but evented:

import gevent
import gevent.pool
import oboe

def get_app_stats(app, start, end):
    # Each function takes (app, start, end), and returns a dictionary
    data_funcs = {'errors': get_num_errors,
                  'alerts': get_triggered_alerts,
                  'latency': get_latency_series,
                  'volume': get_volume_series,
                  'urls': get_top_urls,
                  'controllers': get_top_controllers,
                  'queries': get_top_queries}
    event_pool = gevent.pool.Pool(size=GEVENT_POOL_SIZE)
    results = []
    tasks = []
    for key, func in data_funcs.items():
        # Wrap each function in a layer with {Async: True} in TraceView
        wrapped_func = oboe.log_method(key, entry_kvs={'Async': True})(func)
        # Spin off a new green thread for each task, in turn
        tasks.append(event_pool.spawn(wrapped_func, app, start, end))
    # Block here, and wait for them all to finish
    gevent.joinall(tasks)
    # Collect the results of the successful tasks
    for task in tasks:
        if task.successful():
            results.append(task.value)
        else:
            pass  # Error handling ...
    return results

This calling change is all* that’s necessary to parallelize these functions! The next question is, did that help? Let’s look at the same graphs we had before, but now for the evented case:

[Image: Event Loops 3 (the evented request structure in TraceView)]

Definitely different! Instead of running everything sequentially, we can see all seven functions running at the same time. As we’d hoped, this has a major impact on our total response time, as well. It’s 500ms faster — a speedup of 2x!

*Caveats

OK, so it’s not quite as simple as this example makes it look. There are a few “gotchas” worth bearing in mind when you start using this in a real application.

The first is that gevent mimics separate threads for each coroutine. This means that if you’re storing global state that’s thread-aware, gevent may discard it. Notably, Pylons/Pyramid uses thread-safe object proxies to store global request state, which means that new coroutines will hide that information from you. In our production version of this code, we explicitly pass that state from caller to callee, then set it in the global “pylons.request” object before running the function. This lets us seamlessly mix evented and non-evented functions, while only thinking about the details of gevent in one place.
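
Here’s a rough sketch of that pattern. The with_request helper is hypothetical, but the proxy methods are the standard paste.registry API that backs pylons.request:

from pylons import request

def with_request(func, req):
    # Hypothetical helper: the proxy is keyed by thread, so each
    # greenlet re-registers the caller's request object before
    # calling through, and cleans up afterwards
    def wrapped(*args, **kwargs):
        request._push_object(req)
        try:
            return func(*args, **kwargs)
        finally:
            request._pop_object(req)
    return wrapped

# At spawn time, capture the real request from the calling thread:
# event_pool.spawn(with_request(func, request._current_obj()), app, start, end)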

The second big gotcha is error handling. Since these aren’t normal function calls, exceptions don’t propagate to the caller. They must be explicitly checked for on the task and re-thrown, if appropriate. This sort of error-case checking is familiar to any C programmer, but it’s different from the normal Python idiom, so it’s worth thinking about.
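
For example, assuming the tasks list from the evented version above, you could surface the first failure like this; gevent stores the exception that killed a greenlet on the greenlet itself:

gevent.joinall(tasks)
for task in tasks:
    if not task.successful():
        # Re-raise the stored exception so it propagates to our
        # caller like a normal Python error
        raise task.exception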

Another caveat is that spawning multiple events doesn’t actually get you code-level parallelization. It runs blocking calls in parallel, but you still only get one interpreter thread to run your Python (no magic GIL sidestep here!). If you’re looking to speed up heavy computations or other CPU-intensive work, check out the multiprocessing module. Eventing really shines when the majority of your work is database calls, file access, or other blocking, out-of-process work.
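
As a contrived contrast (not from the original code), a CPU-bound function like the one below never blocks, so greenlets can’t help it, but multiprocessing spreads it across cores:

from multiprocessing import Pool

def crunch(n):
    # Pure computation: a greenlet would never yield in here
    return sum(i * i for i in xrange(n))

if __name__ == '__main__':
    pool = Pool(processes=4)
    print(pool.map(crunch, [1000000] * 8))  # four worker processes, four cores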

Finally, if you’re looking to trace these kinds of calls with TraceView (like we did here), it’s pretty straightforward. The only thing to remember is to wrap your evented function calls using “oboe.log_method”, and pass “entry_kvs={'Async': True}”. This ensures that we calculate the timing information properly for all your parallel work.

But that’s it! You can use this technique to speed up existing projects, or build something entirely new with gevent at the core. What are you planning on doing with it?