This week’s Amazon outage: Alerting to catch infrastructure problems by Dan Kuebrich October 26, 2012
Filed under: Case Studies
You don’t have to be a pre-cog to find and deal with infrastructure and application problems; you just need good monitoring. We had quite a day Monday during the EC2 EBS availability incident. Thanks to some early alerts—which started coming in about 2.5 hours before AWS started reporting problems—our ops team was able to intervene and make sure that our customers’ data was safe and sound. I’ll start with screenshots of what we saw and experienced, then get into what metrics to watch and alert on in your environment, as well as how to do so in TraceView.
10:30 AM EST: Increased disk latency, data pipeline backup
Around 10:30am, we started to notice that writes weren't moving through our pipeline as smoothly as usual. Sure enough, pretty soon we started seeing alerts about elevated DB load and disk latency. Here's what it looked like:
12:30 PM EST: Diverting pipeline to S3 instead of EBS, pulling out hair
1:30 PM EST: Frontend offline, AWS incident report
Our workload is very write-heavy, so we noticed performance problems there first, but pretty soon reads made by our frontend were affected as well, as a growing fraction of our customers' data sat on troubled EBS volumes. At a certain point, any file I/O to an affected EBS volume would put the process into uninterruptible sleep, causing our MySQL servers to hang. Here's a view of the impact on our query sharding service:
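Processes blocked on I/O to a hung volume show up in `D` (uninterruptible sleep) state on Linux, which is why the MySQL servers couldn't simply be killed and restarted. As a rough illustration (not our actual tooling), a sketch like the following can scan `/proc` for stuck processes; the parsing follows the `/proc/[pid]/stat` format documented in proc(5):

```python
import os

def dstate_processes():
    """Scan /proc for processes in uninterruptible sleep (state 'D').

    Processes blocked on I/O to a hung EBS volume land here; they
    can't be killed until the I/O completes or the device recovers.
    Returns a list of (pid, command_name) tuples.
    """
    try:
        pids = [p for p in os.listdir("/proc") if p.isdigit()]
    except OSError:
        return []  # no /proc filesystem (not Linux)
    stuck = []
    for pid in pids:
        try:
            with open("/proc/%s/stat" % pid) as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # The command name is parenthesized and may contain spaces,
        # so locate the state character relative to the closing paren.
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        state = stat[stat.rindex(")") + 2]
        if state == "D":
            stuck.append((int(pid), comm))
    return stuck

print(dstate_processes())
```

On a healthy box this prints an empty (or nearly empty) list; a sudden pile-up of `D`-state processes is a strong hint that a block device underneath you has gone away.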
6 PM EST: Debate whether backup restore or AWS EBS recovery will finish faster
9 PM EST: Back online
AWS started bringing volumes back online that evening. During the downtime we continued to collect customer performance data, diverting the pipeline to S3 until our databases recovered. Once the disks were back, we brought the frontend servers back online and spun up additional pipeline workers to plow through the queued trace backlog as we replayed it from S3.
Latency: the functional test of performance metrics
You might be surprised by this, but monitoring latency is often the easiest and surest way to catch serious problems. It's the functional test of your system: if any of the gears in the system being monitored start getting jammed, it will likely manifest as increased latency. However, latency can be noisy: how can we make this measurement more controlled, or, to extend the testing metaphor, closer to a unit test? Using TraceView, you can set alerts not only on the latency of your application, but also on individual layers of the stack, or on particular URLs/controllers. The performance of a predictable query load over time is a great way to detect aberrant database performance, for instance.
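To make the idea concrete, here's a hedged sketch of per-layer latency alerting; the shapes of `samples` and `thresholds` are assumptions for illustration, not TraceView's API:

```python
def latency_alerts(samples, thresholds):
    """Compare recent per-layer latency against per-layer thresholds.

    samples:    {layer_name: [latency_ms, ...]} from recent requests
    thresholds: {layer_name: limit_ms}
    Alerting on one layer (say, just MySQL) keeps the signal closer
    to a 'unit test' than whole-request latency, which is noisier.
    """
    alerts = []
    for layer, limit in thresholds.items():
        xs = sorted(samples.get(layer, []))
        if not xs:
            continue
        # Use the 95th percentile so one slow outlier doesn't page anyone.
        p95 = xs[min(len(xs) - 1, int(0.95 * len(xs)))]
        if p95 > limit:
            alerts.append((layer, p95))
    return alerts
```

With a steady, predictable query load, the MySQL layer's p95 barely moves day to day, so a threshold on it catches disk trouble well before whole-request latency looks alarming.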
Alerting on All of the Metrics
When it comes to infrastructure degradation, host-level metrics are where the buck stops. Configuration is usually a pain, though: installing agents on each machine and setting thresholds by hand. We think the best alerts sit at the intersection of easy and actionable. With TraceView, you can set up a single alert and have it cover every host in an app.
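The appeal of an app-level alert is that it fires once per app but still tells you which hosts tripped it, and new hosts are covered automatically. A minimal sketch of that evaluation, assuming a simple `{host: latest_metric_value}` snapshot (illustrative, not TraceView's implementation):

```python
def app_alert(host_metrics, limit):
    """Evaluate one app-wide alert over every host's latest metric.

    host_metrics: {hostname: latest_value}, e.g. disk latency in ms
    limit:        threshold shared by all hosts in the app
    Returns the offending hosts and their values; empty dict means OK.
    Hosts added to the app later are covered with zero extra config.
    """
    return {host: v for host, v in host_metrics.items() if v > limit}

# One rule covers the whole fleet:
print(app_alert({"db1": 12.0, "db2": 480.0, "db3": 9.5}, 100.0))
# -> {'db2': 480.0}
```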