Slow and Inconsistent: Web APIs, PART 1
July 29, 2013

Filed under: Performance Monitoring

With contributions by Bobby Fitzgerald

No man is an island, and neither is a web app. The time needed to build web applications continues to approach zero, but only because the tools to build them are so much better. Talking about frameworks is almost boring; who wants to talk about Play or Rails when you can skip everything those frameworks provide and use external tools to provision your servers, authenticate, store data, handle support, or deal with APIs (woah…). Unlike downloading sketchy, half-maintained open-source libraries, though, integrating with these external APIs is a breeze! They all have great documentation, and everything works from copy-pasting that first line of code. Most of the time. But what should you expect when that actually goes to production?

Web API Tutorial

What to expect depends on how your application actually calls that API. Many applications, especially ones that deal with sensitive data, tend to use a server-side client library. Just include it and, after the inevitable fight with OAuth, make a function call; the client lib will handle all the messy details of REST-ful HTTP (or SOAP…). For instance, getting a list of invoices from Recurly, the payment provider, looks like this:

[code language="python"]
import recurly

account = recurly.Account.get('frobnozzerinc')
invoices = account.invoices()  # First page of invoices

# Pull out only the fields we need for our template
def format_invoice(invoice):
    return {"id": invoice.invoice_number,
            "state": invoice.state,  # at the time of the invoice: active, expired, etc.
            "total": invoice.total_in_cents,
            "subtotal": invoice.subtotal_in_cents,
            "time": invoice.created_at}  # Has a timezone — maybe we should convert that?

print([format_invoice(i) for i in invoices])
# Prints out a bunch of dictionaries. Also good for templates.
[/code]

Easy enough. What does that mean for the user?

That green span? That's both the application AND the user being bored, waiting for their plan to get changed.

Practically, if this happens as part of a request, the entire request is blocked until the call returns! On page load, the user is stuck staring at the previous screen until BOTH of the HTTP requests they’ve unwittingly invoked return. If the request is fast, that’s not a problem. If it’s slow, though, there’s nothing to be done. Let’s measure API calls to Recurly from our application:

Heatmap of request times to Recurly's API, from TraceView

Unsurprisingly, the answer isn’t entirely clear. Most of the time, this API is fast enough that if there isn’t a lot of other work done in the same request, and there’s only one API call per request, there’s no problem. Unfortunately, we do see a number of calls that are upwards of 1.5 seconds! It turns out this is pretty typical of external calls like this. The average and majority of calls are fast, but the 95th / 99th percentile tends to be much slower, sometimes even an order of magnitude slower!
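To see why averages hide this, here’s a quick sketch (with synthetic latencies, not our real measurements) of how the mean can look perfectly healthy while the tail percentiles do not:

```python
import random

# Illustrative latency samples (seconds): mostly fast, with a slow tail,
# roughly the shape of the heatmap above. Synthetic data, not real traces.
random.seed(42)
samples = [random.uniform(0.2, 0.4) for _ in range(94)]
samples += [random.uniform(1.0, 1.6) for _ in range(6)]

def percentile(values, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    ordered = sorted(values)
    index = int(round(pct / 100.0 * len(ordered))) - 1
    return ordered[max(0, index)]

mean = sum(samples) / len(samples)
p95 = percentile(samples, 95)
p99 = percentile(samples, 99)
print("mean: %.2fs  p95: %.2fs  p99: %.2fs" % (mean, p95, p99))
```

With only 6% of calls landing in the slow tail, the mean stays near 350ms while the 95th and 99th percentiles sit above a full second — the same shape we see in production.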

Dealing with the Unknown

If performance is important, variability can really ruin your day, especially in calls outside of the application itself. Fortunately, there are a couple of workarounds for dealing with slow or inconsistent APIs.

The first thing to realize is that not all API calls actually change state on the remote server. In a number of places, the application above reads state about a user’s account from Recurly’s servers. For instance, on the account page, Recurly is the canonical data store for the active state of the account, the current plan, and the date of the next charge. Building that page looks like this:

Page load for the account details page

Almost all of our time is spent in calls to Recurly! This page was built with this tradeoff explicitly in mind — the account page isn’t commonly used, compared to something like our real-time dashboard. In the initial iteration, a 3-second pageload isn’t the worst. The downside, as we saw above, is that we now have 6 opportunities to hit a few slow API calls, and the page load could balloon out to 10 seconds or more! Because all that state is read-only and slow-changing, we can improve this by caching it in our DB. That replaces all those 350ms (best case!) calls with 10ms calls to our local DB, into a table with as many rows as we have paying customers. (It would be lovely if that were a Big Data problem…) Of course, this introduces additional complexity, such as invalidating that state on account transitions, but that sort of optimization is definitely worth it for a v2 or v3 of an API integration.
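The caching layer can be quite small. Here’s a minimal sketch — the names, the dict standing in for a DB table, and the TTL are all illustrative assumptions, not Recurly’s API:

```python
import time

# Stand-in for a local DB table keyed by account code. In a real app this
# would be a table (or memcached/Redis), not an in-process dict.
ACCOUNT_CACHE = {}
CACHE_TTL = 15 * 60  # seconds; tune to how stale the account page may be

def fetch_account_state(account_code):
    """Hypothetical wrapper around the slow (~350ms best case) Recurly calls."""
    return {"state": "active", "plan": "pro", "next_charge": "2013-08-29"}

def get_account_state(account_code, now=None):
    """Return cached account state, refetching only when the entry is stale."""
    now = now if now is not None else time.time()
    entry = ACCOUNT_CACHE.get(account_code)
    if entry is None or now - entry["fetched_at"] > CACHE_TTL:
        entry = {"fetched_at": now, "state": fetch_account_state(account_code)}
        ACCOUNT_CACHE[account_code] = entry
    return entry["state"]

def invalidate_account(account_code):
    """Call on account transitions (upgrade, cancel) so the next read refetches."""
    ACCOUNT_CACHE.pop(account_code, None)
```

The explicit `invalidate_account` hook is the extra complexity mentioned above: every code path that changes the account must remember to call it.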

But what if the state does actually change, such as tracking a user’s behavior? We can’t just cache that, since we need to actually send a request to the remote server. Fortunately, the user doesn’t need to know about the result of that API call. To handle this style of request, the app can move these calls out of the critical path, by putting them in another thread or a message queue (like Resque or Celery) to be handled later. This allows the application to spend as much time as necessary talking to the external API, without impacting the direct user experience. Like above, this introduces additional complexity, like handling failure of the API call separately from failure of the application, but it can be a powerful tool for streamlining complex requests.
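In Python, the thread-based version of this pattern fits in a few lines. This is a bare-bones sketch using the standard library — a real deployment would use Resque or Celery, and `send_to_tracking_api` is a stand-in for the slow external HTTP call:

```python
import queue
import threading
import time

events = queue.Queue()
processed = []  # for illustration; a real worker would just log successes

def send_to_tracking_api(event):
    """Stand-in for the slow (and occasionally very slow) external call."""
    time.sleep(0.05)

def worker():
    while True:
        event = events.get()
        if event is None:  # shutdown sentinel
            events.task_done()
            break
        try:
            send_to_tracking_api(event)
            processed.append(event)
        except Exception:
            pass  # log/retry separately; never bubble up to the user's request
        events.task_done()

thread = threading.Thread(target=worker)
thread.daemon = True
thread.start()

# In the request handler: enqueue and return to the user immediately.
events.put({"user": "frobnozzerinc", "action": "plan_changed"})
```

The request handler’s cost drops to a single `Queue.put`, and a slow or failed tracking call burns time in the worker thread instead of in front of the user.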

Finally, it’s worth noting that applications can’t always avoid a fully synchronous API call. Checking out should actually charge the customer’s credit card immediately, and clicking “Post to Facebook” should probably do what it advertises. In that case, all you can do is measure the result, and if the experience is truly unacceptable, make plans to move off that provider in the future.
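Even without a full tracing tool, measuring those unavoidable calls can be as simple as a timing wrapper — a minimal sketch, with a hypothetical label and `time.sleep` standing in for the real charge call:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, log):
    """Record the wall-clock duration of the wrapped block into `log`."""
    start = time.time()
    try:
        yield
    finally:
        log.append((label, time.time() - start))

timings = []
with timed("recurly.charge", timings):
    time.sleep(0.01)  # the synchronous external call would go here

print(timings)
```

Ship those timings to your logs or metrics store and the slow outliers — the ones worth escalating to the provider — show up on their own.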