Filed under: Company News
On June 21st at 7:04 AM EST we experienced an outage on one of our customer shard databases. This led to availability issues in the TraceView dashboard for a small subset of customers.
Our ops team found that one database instance had hung and needed an EC2 reboot. While the instance rebooted, affected customers were unable to log in, and incoming data for those customers was queued rather than processed immediately.
Fortunately, the scope of the disruption was limited by our architecture, which splits groups of accounts across a large number of completely isolated database instances. As we worked to bring the few dozen affected accounts back to sunny Traceland, the bulk of our customers were still happily tracing away without any issues, while other engineers were teaching the product to new users.
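For illustration, the kind of account-to-shard isolation described above can be sketched as a stable hash over account IDs, so an outage on one database instance touches only the accounts assigned to it. The names and shard count below (`assign_shard`, `NUM_SHARDS`) are hypothetical, not our actual implementation.

```python
import hashlib

NUM_SHARDS = 64  # illustrative; the post only says "a large number" of isolated instances

def assign_shard(account_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map an account to one isolated database shard."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def affected_accounts(accounts, down_shard: int):
    """Only accounts living on the failed shard are impacted by its outage."""
    return [a for a in accounts if assign_shard(a) == down_shard]
```

Because the mapping is deterministic and shards share nothing, one hung instance can only ever affect the fixed slice of accounts hashed onto it.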
Continued investigation identified a kernel bug as the culprit, and we quickly upgraded. However, as we began the restore process after the kernel upgrade and reboot, a small error in our RAID configuration caused the wrong volume to be mounted where we expected our database files to be. The database was trying to recover from the wrong data.
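A pre-start guard could catch a mis-mount like this before recovery begins: compare the filesystem UUID of the device actually mounted at the data directory against the UUID recorded when the volume was provisioned. This is a sketch, not our tooling; the paths and function names are hypothetical, though `findmnt` and `blkid` are standard Linux utilities.

```python
import subprocess

def mounted_device(mount_point: str) -> str:
    """Return the block device backing a mount point (via findmnt)."""
    return subprocess.check_output(
        ["findmnt", "-n", "-o", "SOURCE", mount_point], text=True
    ).strip()

def device_uuid(device: str) -> str:
    """Return the filesystem UUID of a block device (via blkid)."""
    return subprocess.check_output(
        ["blkid", "-s", "UUID", "-o", "value", device], text=True
    ).strip()

def verify_data_volume(actual_uuid: str, expected_uuid: str) -> bool:
    """Refuse to start the database if the wrong volume is mounted."""
    return actual_uuid == expected_uuid
```

Running `verify_data_volume(device_uuid(mounted_device("/var/lib/mysql")), EXPECTED_UUID)` before starting the database would abort recovery on a mismatch instead of letting it proceed against the wrong data.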
Once we retried with the correct volumes mounted, a restore from backup was not required: we were live and handling transactions right away. All users were immediately able to see current trace data with no latency, though affected accounts may see small gaps in data from the window of ops intervention. Some stored data remains to be replayed to affected accounts; that replay has been running in the background and is nearly complete.
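The queue-and-replay behavior described above amounts to draining the backlog in arrival order once the shard is healthy, while live traffic is served normally. A minimal sketch, with hypothetical names:

```python
from collections import deque

def replay_backlog(backlog: deque, apply_fn) -> int:
    """Drain queued writes in arrival order, applying each to the database.

    backlog:  writes buffered while the shard was unavailable
    apply_fn: callable that commits one write (assumed idempotent here)
    """
    replayed = 0
    while backlog:
        apply_fn(backlog.popleft())
        replayed += 1
    return replayed
```

Replaying in FIFO order preserves the original sequence of each customer's data, so the gaps close without reordering traces.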
We strive to provide a reliable and highly available service. Since this morning we have worked to improve the isolation of databases, keeping errors like this one from affecting more customers than necessary, and to improve our system's ability to automatically identify and respond to problems.