Last night, the TraceView processing timeline experienced a minor disruption, resulting in slightly slower processing times for some customers. By and large, this issue wasn’t noticeable in any of the dashboards, and no data was lost. The incident did, however, give us an interesting look into how Amazon handles S3 load issues from a performance perspective.
Our data pipeline collects billions of data points per day, with a goal of real-time updates to customer dashboards. Since "real-time" isn't a precise number, we aim for less than 5 seconds between initial data collection on customer machines and availability in the dashboard. There are a lot of moving pieces, but at the core is a RabbitMQ server that farms out the processing of captured requests. Each task rebuilds the request (the data may be collected from multiple machines; service-oriented architectures are real!) and extracts information like database query timing, errors, and API call duration.
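To make the per-task work concrete, here's a minimal sketch of the rebuild-and-extract step. The event schema, field names, and helper functions are all hypothetical, not our actual pipeline code:

```python
from collections import defaultdict

def rebuild_requests(events):
    """Group raw events by request ID; events for a single request
    may have been collected on multiple machines (assumed schema)."""
    requests = defaultdict(list)
    for ev in events:
        requests[ev["request_id"]].append(ev)
    return requests

def extract_timings(events):
    """Sum per-layer durations (ms) for one rebuilt request."""
    timings = defaultdict(float)
    for ev in sorted(events, key=lambda e: e["start_ms"]):
        timings[ev["layer"]] += ev["end_ms"] - ev["start_ms"]
    return dict(timings)

events = [
    {"request_id": "r1", "layer": "db",   "start_ms": 0,  "end_ms": 12},
    {"request_id": "r1", "layer": "http", "start_ms": 12, "end_ms": 150},
    {"request_id": "r2", "layer": "db",   "start_ms": 5,  "end_ms": 9},
]
reqs = rebuild_requests(events)
print(extract_timings(reqs["r1"]))  # {'db': 12.0, 'http': 138.0}
```

In the real pipeline each worker pulls a task from RabbitMQ and does this grouping and extraction before the results land in the dashboard.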
Normally, a single request processes in about 400 ms. Last night at 9:30, we weren’t so lucky.
We dogfood our own monitoring, so this is a graph from TraceView. The top line is the overall processing time, and the grey spike implies an issue with an HTTP service. Code bottlenecks are one kind of problem; slow API calls are an entirely different one.
We use a number of different HTTP services, but there was only one that was having problems:
S3 is down??
Although S3 advertises 10 9s (yes, 99.99999999%) of durability, it only promises 99.99% availability. However, this wasn't downtime; it was just slowness. Interestingly, the data we were storing in S3 didn't slow down uniformly. Here's a heatmap (like a scatterplot) of all those S3 calls.
While a few calls went from tens of milliseconds to a few hundred, there are two distinct bands: one at 2 seconds and one at 4 seconds. We're not privy to the internals of S3, but it looks like, under load, S3 has internal timeouts and will back off and retry if the first operation is not successful.
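A timeout-and-retry policy like the one we suspect would reproduce exactly those bands. Here's a small simulation; the 2-second timeout, failure probability, and retry count are pure assumptions, not S3 internals:

```python
import random

def simulated_call(fail_prob, timeout_s=2.0, base_ms=20, attempts=3):
    """Return the total latency (s) of one call under a retry-on-timeout
    policy: each stalled attempt burns the full timeout before a retry."""
    total = 0.0
    for _ in range(attempts):
        if random.random() < fail_prob:
            total += timeout_s            # attempt stalls until the timeout
        else:
            return total + base_ms / 1000.0
    return total                          # every attempt timed out

random.seed(1)
latencies = [simulated_call(0.3) for _ in range(10000)]
# latencies cluster near 0.02 s, 2.02 s, and 4.02 s: the bands in the heatmap
```

With a 30% stall rate per attempt, roughly 70% of calls finish on the fast path, ~21% eat one 2-second timeout, and ~6% eat two, which is the same banded shape we saw in the heatmap.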
Fortunately for us, because these were slowdowns rather than failures, we were able to increase the number of workers consuming from RabbitMQ and absorb the added latency.
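Little's law explains why adding workers is the right lever here: the number of in-flight tasks equals arrival rate times service time, so when service time jumps from 400 ms to 4 seconds you need roughly 10x the concurrency to keep the queue from backing up. A back-of-envelope sketch, with an entirely illustrative arrival rate:

```python
import math

def workers_needed(arrival_rate_per_s, service_time_s, utilization=0.8):
    """Little's law: in-flight work = rate x service time.
    Divide by a target utilization to leave headroom."""
    return math.ceil(arrival_rate_per_s * service_time_s / utilization)

rate = 500                          # requests/s (assumed, not our real traffic)
print(workers_needed(rate, 0.4))    # normal 400 ms tasks -> 250 workers
print(workers_needed(rate, 4.0))    # 4 s S3-stalled tasks -> 2500 workers
```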
Netflix was also briefly unavailable during the same time. Coincidence?