TraceView Downtime Post-Mortem

On Monday afternoon, TraceView experienced two brief data outages that affected a fraction of our customers. Affected users may notice intermittent gaps in their data (see how to know if your account was affected).

During this time, several of our collector servers (trace data collection endpoints) experienced connection problems while attempting to forward data along to the rest of our trace-processing system, resulting in partial incoming data loss during the periods 2:15-2:45pm EST and 4:05-4:25pm EST. The affected servers were behind load balancers (ELBs) assigned to some, but not all, accounts opened before March 2013. The majority of collector servers and accounts were unaffected.

In the interest of transparency, here’s an explanation of the chain of events leading to that data loss, as well as the steps we are taking to make the system robust in the face of events like it in the future.

The incident began when one of several Memcached servers used during trace analysis suddenly went offline. This caused connection attempts to hang, resulting in delays in our processing pipeline. In turn, these delays added extra load and memory pressure to the entire collector infrastructure, finally growing large enough to prevent three collector servers from forwarding and archiving trace data.
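To illustrate why a hanging connection attempt is so costly, here is a minimal Python sketch of a cache request with a bounded wait. It is not TraceView's actual code: the function name `fetch_with_deadline`, the default timeout, and the raw-bytes request are all assumptions made for the example. The point is only that a dead or silent server should cost the caller a small, known amount of time rather than stalling the pipeline.

```python
import socket


def fetch_with_deadline(host, port, request, timeout_s=0.5):
    """Send request bytes and return the first reply chunk, or None on failure.

    Hypothetical helper for illustration. The timeout passed to
    create_connection is set on the socket itself, so it bounds the
    connect as well as each later send and recv call.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(request)
            return sock.recv(4096)
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        # The caller can treat None as a cache miss and keep processing.
        return None
```

A caller that treats `None` as a cache miss degrades to slower processing when a cache server disappears, instead of queueing work behind a connection that never completes.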

To make our system more robust against these kinds of events, we are first re-evaluating the connection timeouts used throughout the system. We are also adding monitoring checks to warn us of exceptional conditions like those seen during this incident, and modifying our collector server to preserve trace data more robustly when it encounters errors. Finally, we’re evaluating Amazon’s new memory-optimized R3 instances, which should generally handle unexpected load better.
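One common way to preserve data when forwarding fails is to spool it to local disk and replay it later. The Python sketch below shows that general idea only; it is not a description of how the TraceView collector works. The names `forward_or_spool` and `replay_spool`, the `send` callable, the JSON record format, and the spool directory layout are all hypothetical.

```python
import json
import os
import tempfile


def forward_or_spool(record, send, spool_dir):
    """Try to forward one record; on failure, persist it for later replay.

    `send` is a hypothetical callable that raises on failure.
    Returns True if the record was forwarded, False if it was spooled.
    """
    try:
        send(record)
        return True
    except Exception:
        # Write to a temporary file first, then rename, so a crash
        # mid-write never leaves a half-written .json file behind.
        fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, tmp_path[:-len(".tmp")] + ".json")
        return False


def replay_spool(send, spool_dir):
    """Resend spooled records; stop at the first failure. Returns count sent.

    Arrival order is not preserved, because spool file names are random.
    """
    sent = 0
    for name in os.listdir(spool_dir):
        if not name.endswith(".json"):
            continue
        path = os.path.join(spool_dir, name)
        with open(path) as f:
            record = json.load(f)
        try:
            send(record)
        except Exception:
            break  # Downstream is still unhealthy; try again later.
        os.remove(path)
        sent += 1
    return sent
```

The trade-off in a design like this is disk usage and possible duplicate delivery (a record may be sent and then the process may die before the file is removed), in exchange for not dropping data while the downstream system is unavailable.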

We apologize for the downtime, and strive to provide a continuously-available monitoring service that is robust, responsive, and real-time. Thanks for using TraceView and let us know at traceviewsupport@appneta.com if you have any questions!